Method Article
The purpose of this protocol is to use a combination of computational and bench research to find novel sequences that cannot be easily separated from a co-purifying sequence, which may be only partially known.
Subtractive genomics can be used in any research where the goal is to identify the sequence of a gene, protein, or general region that is embedded in a larger genomic context. It enables a researcher to isolate a target sequence of interest (T) by comprehensive sequencing followed by subtraction of known genetic elements (reference, R). The method can be used to identify novel sequences such as those of mitochondria, chloroplasts, viruses, or germline-restricted chromosomes, and is particularly useful when T cannot be easily isolated from R. Beginning with the comprehensive genomic data (R + T), the method uses the Basic Local Alignment Search Tool (BLAST) against one or more reference sequences to remove the matching known sequences (R), leaving behind the target (T). Subtraction works best when R is a relatively complete draft that is missing T. However, because the sequences remaining after subtraction are tested by quantitative polymerase chain reaction (qPCR), R does not need to be complete for the method to work. Here we link computational steps with experimental steps into a cycle that can be iterated as needed, sequentially removing multiple reference sequences and refining the search for T. The advantage of subtractive genomics is that a completely novel target sequence can be identified even in cases in which physical purification is difficult, impossible, or expensive. A drawback of the method is the need to find a suitable reference for subtraction and to obtain T-positive and T-negative samples for qPCR testing. We describe our implementation of the method in the identification of the first gene from the germline-restricted chromosome of zebra finch. In that case, computational filtering involved three references (R), sequentially removed over three cycles: an incomplete genomic assembly, raw genomic data, and transcriptomic data.
The purpose of this method is to identify a novel target (T) genomic sequence, either DNA or RNA, from a genomic context, or reference (R) (Figure 1). The method is most useful if the target cannot be physically separated, or if doing so would be expensive. Only a few organisms have perfectly finished genomes for subtraction, so a key innovation of our method is the combination of computational and bench methods into a cycle that enables researchers to isolate target sequences when the reference is imperfect, for example a draft genome from a non-model organism. At the end of a cycle, qPCR testing is used to determine whether more subtraction is needed. A validated candidate T sequence will show statistically greater detection in known T-positive samples than in T-negative samples by qPCR.
Incarnations of the method have been implemented in discovery of new bacterial drug targets that do not have host homologs1,2,3,4 and identification of novel viruses from infected hosts5,6. In addition to identification of T, the method can improve R: we recently used the method to identify 936 missing genes from the zebra finch reference genome and a new gene from a germline-only chromosome (T)7. Subtractive genomics is particularly valuable when T is likely to be extremely divergent from known sequences, or when the identity of T is broadly undefined, as in the zebra finch germline-restricted chromosome7.
By not requiring positive identification of T beforehand, a key advantage of subtractive genomics is that it is unbiased. In a recent study, Readhead et al. examined the relationship between Alzheimer's disease and viral abundance in four brain regions. For viral identification, Readhead et al. created a database of 515 viruses8, severely limiting the viral agents that their study could identify. Subtractive genomics could have been used to compare the healthy and Alzheimer's genomes in order to isolate possible novel viruses associated with the disease, regardless of their similarity to known infectious agents. While there are 263 known human-targeting viruses, it has been estimated that approximately 1.67 million undiscovered viral species exist, with 631,000-827,000 of them having a potential to infect humans9.
Isolation of novel viruses is an area in which subtractive genomics is particularly effective, but some studies may not need such a stringent method. For example, studies identifying novel viruses have used unbiased high-throughput sequencing followed by reverse transcription and BLASTx for viral sequences5 or enriching of viral nucleic acids to extract and reverse transcribe viral sequences6. While these studies employed de novo sequencing and assembly, subtraction was not used because the target sequences were positively identified through BLAST. If the viruses were completely novel and not related (or distantly related) to other viruses, subtractive genomics would have been a useful technique. The benefit of subtractive genomics is that sequences that are completely new can be obtained. If the organism's genome is known, it can be subtracted out to leave any viral sequences. For example, in our published study we isolated a novel viral sequence from zebra finch through subtractive genomics, though it was not our original intent7.
Subtractive genomics has also proved useful in the identification of bacterial vaccine targets, motivated by the dramatic rise in antibiotic resistance1,2,3,4. To minimize the risk of autoimmune reaction, researchers narrowed down the potential vaccine targets by subtracting any proteins that have homologs in the human host. One study of Corynebacterium pseudotuberculosis subtracted vertebrate host genomes from several bacterial genomes so that possible drug targets would not affect host proteins and cause side effects1. The basic workflow of these studies is to download the bacterial proteome, determine vital proteins, remove redundant proteins, use BLASTp to isolate the essential proteins, and run BLASTp against the host proteome to remove any proteins with host homologs1,2,3,4. In these studies, subtractive genomics is intended to ensure that the vaccines developed will not have off-target effects in the host1,2,3,4.
We used subtractive genomics to identify the first protein-coding gene on a germline-restricted chromosome (GRC) (in this case, T), which is found in germlines but not somatic tissue of both sexes10. Before this study, the only genomic information known about the GRC was a repetitive region11. De novo assembly was performed on RNA sequenced from ovary and testis tissues (R+T) from adult zebra finches. The computational elimination of sequences was performed using the published somatic (muscle) genome sequence (R1)12, its raw (Sanger) read data (R2), and a somatic (brain) transcriptome (R3)13. The sequential use of three references was driven by the qPCR testing at step 5 of each cycle (Figure 2A), which showed that additional filtering was required. The discovered α-SNAP gene was confirmed through qPCR from DNA and RNA, and through cloning and sequencing. Our example shows that this method is flexible: it does not depend on matching nucleic acid types (DNA vs RNA), and subtraction can be performed with references (R) that are composed of assemblies or raw reads.
1. De novo Assemble Starting Sequence
NOTE: Any Next-Generation Sequencing (NGS) data can be used, as long as an assembly can be produced from those data. Suitable input data include Illumina, PacBio, or Oxford Nanopore reads assembled into a fasta file. For concreteness, this section describes an Illumina-based transcriptomic assembly specific to the zebra finch study we performed7; however, be aware that the specifics will vary by project. For our example project, raw data were derived from a MiSeq and approximately 10 million paired reads were obtained from each sample.
2. BLAST the Assembly against the Reference Sequence
NOTE: Use this step when the reference is an assembly or long reads like Sanger; if it is composed of raw Illumina reads, see step 3 below for mapping reads to the query. All BLAST steps were completed with version 2.2.29+ though the commands should work on any recent BLAST version.
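As a rough illustration of this step, the sketch below builds the two BLAST+ command lines that are typically involved: formatting the reference (R) as a nucleotide database and querying it with the assembly. The file names, the database name, and the E-value cutoff are placeholders assumed for illustration and are not taken from the protocol; the functions only assemble the argument lists and do not run BLAST.

```python
from typing import List


def makeblastdb_command(reference_fasta: str, db_name: str) -> List[str]:
    """Build the command that formats a nucleotide reference (R) as a BLAST database."""
    return ["makeblastdb", "-in", reference_fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_command(query_fasta: str, db_name: str, out_tsv: str,
                   evalue: str = "1e-20") -> List[str]:
    """Build a blastn command with tabular output (-outfmt 6).

    The default E-value cutoff is an assumed placeholder, not a value from the protocol.
    """
    return ["blastn", "-query", query_fasta, "-db", db_name,
            "-outfmt", "6", "-evalue", evalue, "-out", out_tsv]


if __name__ == "__main__":
    # Hypothetical file names; print the commands so they can be reviewed before running.
    for cmd in (makeblastdb_command("reference.fasta", "reference_db"),
                blastn_command("assembly.fasta", "reference_db", "hits.tsv")):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once BLAST+ is installed. Tabular output is chosen here because its first column is the query ID, which makes the later subtraction step straightforward.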
3. Map Reads onto the Assembly
NOTE: Use this method if the reference dataset consists of raw genomic reads. If the reference instead consists of assembled sequences or Sanger sequences, use BLAST (step 2.1).
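Once reference reads have been mapped onto the assembly, the contigs that attracted reads are the ones to subtract. The sketch below is one possible way to list them from SAM output, relying only on the standard SAM columns (FLAG in column 2, with bit 0x4 marking an unmapped read, and the reference name in column 3). The function name and the `min_reads` threshold are assumptions for illustration, not part of the published scripts.

```python
from typing import Dict, Iterable, Set


def contigs_with_mapped_reads(sam_lines: Iterable[str], min_reads: int = 1) -> Set[str]:
    """Return the assembly contigs to which at least `min_reads` reference reads mapped."""
    counts: Dict[str, int] = {}
    for line in sam_lines:
        if line.startswith("@"):  # SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        flag, contig = int(fields[1]), fields[2]
        if flag & 0x4 or contig == "*":  # read is unmapped
            continue
        counts[contig] = counts.get(contig, 0) + 1
    return {contig for contig, n in counts.items() if n >= min_reads}
```

Raising `min_reads` makes the subtraction less stringent, since a contig is then removed only when several reference reads support it.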
4. Use Python Script to Remove any Matching Sequences
NOTE: Provided scripts work with Python 2.7.
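For readers who want to see the logic of the subtraction itself, here is a minimal sketch written for Python 3. It is not the provided Python 2.7 script; it assumes that BLAST was run with tabular output (query ID in the first column) and that each fasta header begins with that ID. All function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def blast_hit_ids(tabular_lines: Iterable[str]) -> Set[str]:
    """Collect query IDs (first column) from BLAST tabular output."""
    return {line.strip().split("\t")[0]
            for line in tabular_lines
            if line.strip() and not line.startswith("#")}


def subtract(fasta_lines: Iterable[str], matched_ids: Set[str]) -> Iterator[Tuple[str, str]]:
    """Yield only the sequences whose ID did not match the reference (candidate T)."""
    for header, seq in read_fasta(fasta_lines):
        if header.split()[0] not in matched_ids:
            yield header, seq
```

With files on disk, the same functions can be called as `subtract(open("assembly.fasta"), blast_hit_ids(open("hits.tsv")))`, where both file names are placeholders.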
5. Design Primers for the Sequence that Remains
NOTE: At this point there is a fasta file containing candidate T sequences. This section describes qPCR to experimentally test whether they come from T or from previously unknown regions of R. If the subtraction in step 4 removed all sequences, then either the initial assembly failed to include T, or the subtraction may have been too stringent.
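Before ordering primers, candidate oligos are often sanity-checked for GC content and approximate melting temperature. The sketch below uses the Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rough estimate for short oligos; it is an illustrative assumption and does not reproduce the primer design procedure of this protocol, for which dedicated design software is normally used.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    seq = primer.upper()
    if not seq:
        raise ValueError("primer is empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> int:
    """Approximate melting temperature in degrees C by the Wallace rule."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence, used to derive the reverse primer."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
```

The reverse primer is taken from the reverse complement of the 3' end of the intended amplicon, which is why `reverse_complement` is included alongside the two checks.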
6. qPCR Validation of the Remaining Sequence
NOTE: This step requires primers validated and PCR conditions established in step 5.
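One way to quantify "greater detection in T-positive samples" is to compare Ct values between T-positive and T-negative samples, since a lower Ct means earlier detection. The sketch below computes Welch's t statistic and its approximate degrees of freedom using only the standard library. The choice of Welch's test is an assumption made for illustration; the protocol does not prescribe a particular statistical test here.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(ct_positive: Sequence[float], ct_negative: Sequence[float]) -> Tuple[float, float]:
    """Return (t, degrees of freedom) comparing Ct values of two sample groups.

    A positive t means Ct is lower (detection is stronger) in the T-positive group.
    """
    if len(ct_positive) < 2 or len(ct_negative) < 2:
        raise ValueError("need at least two Ct values per group")
    se_pos = variance(ct_positive) / len(ct_positive)  # squared standard error
    se_neg = variance(ct_negative) / len(ct_negative)
    if se_pos + se_neg == 0:
        raise ValueError("Ct values have zero variance in both groups")
    t = (mean(ct_negative) - mean(ct_positive)) / sqrt(se_pos + se_neg)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_pos + se_neg) ** 2 / (
        se_pos ** 2 / (len(ct_positive) - 1) + se_neg ** 2 / (len(ct_negative) - 1)
    )
    return t, df
```

The resulting t and degrees of freedom can be looked up against a t distribution (or passed to a statistics package) to obtain a p-value.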
7. Repeat with a New Reference to Pare Down the Data
NOTE: If step 6 validated the identified sequences from T, end the cycle here (Figure 2A). However, a variety of considerations may motivate a continuation of the cycle, for example if many R sequences remain in the file or if none of the candidate T sequences were validated by qPCR in step 6.
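The computational side of the cycle can be pictured as repeated set subtraction, one reference per cycle. The sketch below tracks how many candidate sequences remain after each reference is removed. The function name and the sequence IDs in the usage example are hypothetical, and the qPCR decision between cycles, which is made at the bench, is not modeled.

```python
from typing import Iterable, List, Set, Tuple


def sequential_subtraction(
    candidate_ids: Iterable[str],
    matches_per_reference: Iterable[Tuple[str, Iterable[str]]],
) -> Tuple[Set[str], List[Tuple[str, int]]]:
    """Remove the IDs matched by each reference in turn.

    Returns the remaining candidate IDs and, for each reference, the number of
    candidates left after that cycle.
    """
    remaining = set(candidate_ids)
    history: List[Tuple[str, int]] = []
    for reference_name, matched_ids in matches_per_reference:
        remaining -= set(matched_ids)
        history.append((reference_name, len(remaining)))
    return remaining, history


if __name__ == "__main__":
    # Hypothetical IDs; R1, R2, R3 stand for three successive references.
    left, log = sequential_subtraction(
        ["c1", "c2", "c3", "c4", "c5"],
        [("R1", ["c1", "c2"]), ("R2", ["c2", "c3"]), ("R3", ["c5"])],
    )
    print(sorted(left), log)
```

Keeping the per-cycle counts makes it easy to see which reference removed the most sequences and whether a further cycle is likely to be worthwhile.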
After running BLAST, the output file will have a list of sequences from the query that match the database. After Python subtraction, a number of nonmatching sequences will be obtained, and tested by qPCR. The results of this, and next steps, are discussed below.
Negative result. There are two possible negative results that can be seen after BLAST to the reference sequence. There may be no BLAST results, meaning ...
While subtractive genomics is powerful, it is not a cookie-cutter approach: it requires customization at several key steps and careful selection of reference sequences and test samples. If the query assembly is of poor quality, filtering steps might only isolate assembly artifacts. Therefore, it is important to thoroughly validate the de novo assembly using a validation protocol appropriate to the specific project. For RNA-seq, guidelines are provided on the Trinity website18 and for DNA,...
The authors have nothing to disclose.
The authors acknowledge Michelle Biederman, Alyssa Pedersen, and Colin J. Saldanha for their assistance with the zebra finch genomics project at various stages. We also acknowledge Evgeny Bisk for computing cluster system administration and NIH grant 1K22CA184297 (to J.R.B.) and NIH NS 042767 (to C.J.S).
Name | Company | Catalog Number | Comments |
Accustart II Taq DNA Polymerase | Quanta Bio | 95141 | |
Basic Local Alignment Search Tool (BLAST) | https://blast.ncbi.nlm.nih.gov/Blast.cgi | | |
Bowtie 2 | https://github.com/BenLangmead/bowtie2 | | |
BWA-MEM v. 0.7.12 | https://sourceforge.net/projects/bio-bwa/files/ | | |
Geneious | Biomatters | http://www.geneious.com/ | |
PEAR v. 0.9.6 | https://sco.h-its.org/exelixis/web/software/pear/ | | |
Personal Computer | | | |
PowerSYBR qPCR mix | ThermoFisher | 4367659 | |
Python v. 2.7 | https://www.python.org/download/releases/2.7/ | | |
Reptile v. 1.1 | https://alurulab.cc.gatech.edu/reptile ; http://www.mybiosoftware.com/reptile-1-1-short-read-error-correction.html | | |
Stratagene Mx3005P | Agilent Technologies | 401456 | |
TransDecoder v. 3.0.1 | https://github.com/TransDecoder/TransDecoder/wiki | | |
Trinity v. 2.4.0 | https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment | | |
Copyright © 2025 MyJoVE Corporation. All rights reserved.