Our protocol is significant because it can be used to identify genomic sequences that cannot be isolated from co-purifying sequences, which may themselves be only partially known. The main advantages of this technique are that it is inexpensive, relying mostly on free, downloadable software, and that it is flexible. You can apply it to many biological questions.
Potential applications include identifying bacterial vaccine targets and identifying viruses whose sequences are extremely different from those of known microbes. This method can be applied to any system in which the unknown cannot be experimentally separated, as long as a reference sequence without the genomic target is available. This technique takes some trial and error, so patience is important.
You may need to troubleshoot the programs, and you may use programs different from the ones we describe here. Consult the user manuals as much as possible.
Visual demonstration of this method is helpful because the computational work depends on a basic understanding of how command-line programs are structured. To begin, use Trimmomatic 0.32 to remove Illumina adapters and low-quality bases. Use PEAR version 0.9.11 with default parameters to create high-quality merged reads from the paired reads output by Trimmomatic.
Then use Reptile version 1.1 to error-correct the reads produced by PEAR. Finally, use Trinity version 2.4.0 in default mode to assemble the corrected sequences. For strand-specific libraries, use the SS_lib_type parameter.
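As a rough illustration of how these command lines are structured, the sketch below builds the argument lists for Trimmomatic, PEAR, and Trinity in Python. It is not the exact set of commands used in this protocol: the file names, the adapter file `TruSeq3-PE.fa`, the quality-trimming settings (taken from the Trimmomatic manual's example), the memory limit, and the output directory name are all assumptions to adapt to your data. Reptile is driven by its own configuration file and is left out here.

```python
from typing import List, Optional


def trimmomatic_cmd(r1: str, r2: str, prefix: str,
                    adapters: str = "TruSeq3-PE.fa",
                    jar: str = "trimmomatic-0.32.jar") -> List[str]:
    """Paired-end adapter and quality trimming (settings are assumptions)."""
    return [
        "java", "-jar", jar, "PE", r1, r2,
        # Output order: forward paired, forward unpaired,
        # reverse paired, reverse unpaired.
        f"{prefix}_1P.fq", f"{prefix}_1U.fq",
        f"{prefix}_2P.fq", f"{prefix}_2U.fq",
        f"ILLUMINACLIP:{adapters}:2:30:10",
        "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36",
    ]


def pear_cmd(forward: str, reverse: str, out_prefix: str) -> List[str]:
    """Merge overlapping read pairs with PEAR default parameters."""
    return ["pear", "-f", forward, "-r", reverse, "-o", out_prefix]


def trinity_cmd(reads: str, seq_type: str = "fq",
                ss_lib_type: Optional[str] = None,
                max_memory: str = "10G",
                output: str = "trinity_output") -> List[str]:
    """Assemble merged (single) reads; add --SS_lib_type if stranded."""
    cmd = ["Trinity", "--seqType", seq_type, "--single", reads,
           "--max_memory", max_memory, "--output", output]
    if ss_lib_type is not None:
        cmd += ["--SS_lib_type", ss_lib_type]
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected.
    print(" ".join(trimmomatic_cmd("raw_R1.fq", "raw_R2.fq", "trimmed")))
    print(" ".join(pear_cmd("trimmed_1P.fq", "trimmed_2P.fq", "merged")))
    print(" ".join(trinity_cmd("corrected.fq", ss_lib_type="F")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths have been checked against your own installation.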
The output is a FASTA file that will be placed in a new directory called trinity_output. Make a BLAST database of the reference sequence, nucleotide_reference.fasta, at the command line.
Use BLAST to match the assembled query to the reference database. To obtain an output file, direct the results to BLAST_results.txt. To generate the tabular output required by the Python scripts in subsequent processing steps, use outfmt 6. For increased stringency, use protein sequences from the assembly as the BLAST query with translated nucleotide BLAST, tblastn, which performs six-frame translation of the nucleotide database.
To obtain protein sequences for the query, run the TransDecoder.LongOrfs command to identify the longest open reading frames from the assembled query sequences. Now run tblastn.
If necessary, ensure that the correct genetic code is selected for the organism being studied by using the db_gencode option with the appropriate code. If a high-quality protein reference is available, use protein-protein matching with blastp rather than tblastn. In that case, make a BLAST database of the protein reference. Make sure to save the results as a file for downstream processing, and use tabular output so that the Python scripts can parse them correctly.
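The BLAST steps above can be sketched as argument lists built in Python. This is an illustrative outline rather than the exact commands used in the protocol: the helper names are hypothetical, and the query file names in the usage example are placeholders. The options shown (`-in`, `-dbtype`, `-query`, `-db`, `-out`, `-outfmt`, `-db_gencode`, and `-t` for TransDecoder.LongOrfs) are standard, but check them against the manuals for your installed versions.

```python
from typing import List, Optional


def makeblastdb_cmd(fasta: str, dbtype: str) -> List[str]:
    """Build a BLAST database; dbtype is 'nucl' or 'prot'."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    # Without -out, the database is named after the input FASTA file.
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype]


def longorfs_cmd(transcripts: str) -> List[str]:
    """Extract the longest open reading frames from the assembly."""
    return ["TransDecoder.LongOrfs", "-t", transcripts]


def blast_cmd(program: str, query: str, db: str,
              out: str = "BLAST_results.txt",
              db_gencode: Optional[int] = None) -> List[str]:
    """Run blastn, tblastn, or blastp with tabular (outfmt 6) output."""
    if program not in ("blastn", "tblastn", "blastp"):
        raise ValueError("unsupported BLAST program: " + program)
    cmd = [program, "-query", query, "-db", db, "-out", out, "-outfmt", "6"]
    if db_gencode is not None:
        # The genetic code of the database only matters when the
        # database is translated, which is the tblastn case.
        if program != "tblastn":
            raise ValueError("db_gencode applies to tblastn only")
        cmd += ["-db_gencode", str(db_gencode)]
    return cmd


if __name__ == "__main__":
    # Placeholder file names; print rather than execute.
    print(" ".join(makeblastdb_cmd("nucleotide_reference.fasta", "nucl")))
    print(" ".join(blast_cmd("blastn", "Trinity.fasta",
                             "nucleotide_reference.fasta")))
    print(" ".join(longorfs_cmd("Trinity.fasta")))
    print(" ".join(blast_cmd("tblastn", "longest_orfs.pep",
                             "nucleotide_reference.fasta", db_gencode=1)))
```

Keeping the output format fixed at 6 in one place makes it harder to produce a results file that the downstream scripts cannot parse.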
Now, use the subtractive Python script to remove any matching sequences. Next, use BWA-MEM version 0.7.12 or Bowtie 2 to map the downloaded raw reads onto the query assembly. First, index the assembly, then map the reads.
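To make the subtraction step concrete, here is a minimal sketch of the idea: read the query names from the first column of the tabular BLAST output and drop those sequences from the assembly. It is not the subtractive script distributed with this protocol, and the function names are hypothetical. It also assumes that the BLAST query names are identical to the FASTA header names, which may not hold when the query is a set of TransDecoder protein sequences.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def matched_query_ids(blast_lines: Iterable[str]) -> Set[str]:
    """Collect query names (column 1) from tabular BLAST output."""
    ids = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.split("\t")[0])
    return ids


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs; name is the first word of the header."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def subtract(fasta_lines: Iterable[str],
             blast_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only assembled sequences with no BLAST hit to the reference."""
    matched = matched_query_ids(blast_lines)
    return [(n, s) for n, s in read_fasta(fasta_lines) if n not in matched]


if __name__ == "__main__":
    fasta = [">contig1 len=4", "ACGT", ">contig2", "GGGG", "CCCC"]
    blast = ["contig1\tref_chr1\t99.0\t4\t0\t0\t1\t4\t10\t13\t1e-5\t8.0"]
    for name, seq in subtract(fasta, blast):
        print(">" + name)
        print(seq)
```

With real files, pass open file handles in place of the example lists, since both functions accept any iterable of lines.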
The output will be in SAM format. Run the Python script removeUnmapped.py using the SAM file as input.
This identifies the names of query sequences without any matching reads and saves them to a new text file. The output of the previous step is a list of sequence names in a text file. Extract a FASTA file containing these sequences.
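The logic of these two steps can be sketched as follows. This is an approximation of what removeUnmapped.py is described as doing, not the script itself, and the function names are hypothetical. It assumes the SAM file contains `@SQ` header lines, as written by BWA-MEM and Bowtie 2, and treats a sequence as unmatched when no alignment record with the unmapped flag (0x4) cleared names it.

```python
from typing import Iterable, List


def sequences_without_reads(sam_lines: Iterable[str]) -> List[str]:
    """Return assembly sequence names that have no mapped reads."""
    all_names = []   # every sequence listed in the @SQ header lines
    covered = set()  # sequences with at least one mapped read
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            if line.startswith("@SQ"):
                for field in line.split("\t")[1:]:
                    if field.startswith("SN:"):
                        all_names.append(field[3:])
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            continue  # not a valid alignment record
        flag, rname = int(cols[1]), cols[2]
        # Bit 0x4 set means the read itself is unmapped.
        if not (flag & 4) and rname != "*":
            covered.add(rname)
    return [name for name in all_names if name not in covered]


def extract_fasta(fasta_lines: Iterable[str],
                  wanted: Iterable[str]) -> List[str]:
    """Return the FASTA lines of the records whose names are wanted."""
    wanted = set(wanted)
    keep, out = False, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            keep = bool(parts) and parts[0] in wanted
        if keep and line:
            out.append(line)
    return out


if __name__ == "__main__":
    sam = [
        "@SQ\tSN:contig1\tLN:10",
        "@SQ\tSN:contig2\tLN:10",
        "read1\t0\tcontig1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    names = sequences_without_reads(sam)
    fasta = [">contig1", "ACGTACGTAC", ">contig2", "GGGGCCCCAA"]
    print("\n".join(extract_fasta(fasta, names)))
```

Reading the names from the header rather than from the alignment records is what lets the sketch report sequences that attracted no reads at all.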
The output will be a FASTA file. Use Geneious to determine optimal primer sequences manually. Highlight a candidate sequence of 21 to 28 base pairs for the forward primer, avoiding runs of four or more of any base.
Try to target a region with a fairly uniform mix of all four bases. A single G or C at the three prime end is beneficial, helping to anchor the primer. With the candidate region highlighted, click on the Statistics tab on the right-hand side of the screen to view that sequence's estimated melting temperature.
Aim for a melting temperature between 55 and 60 degrees Celsius, while avoiding repeats and long runs of G and C. Choose a reverse primer in the same manner, situated 150 to 250 base pairs three prime of the forward primer. While the primer lengths do not need to match, the predicted melting temperature should be as close as possible to that of the forward primer. Be sure to reverse complement the sequence by right-clicking in Geneious while the sequence is highlighted and choosing the corresponding menu option.
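Some of the manual checks above are easy to script as a cross-check on candidates chosen by eye. The sketch below tests primer length, runs of four or more identical bases, and a G or C at the three prime end, and computes the reverse complement and GC fraction. The function names are hypothetical, and melting temperature is deliberately left out: the estimate should still come from the Statistics tab, since a different formula would give different values.

```python
import re
from typing import Dict

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    if not seq:
        raise ValueError("empty sequence")
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def has_long_run(seq: str, n: int = 4) -> bool:
    """True if any base is repeated n or more times in a row."""
    return re.search(r"(.)\1{%d,}" % (n - 1), seq.upper()) is not None


def check_primer(seq: str) -> Dict[str, bool]:
    """Apply the length, run, and three prime end guidelines."""
    s = seq.upper()
    return {
        "length_21_to_28": 21 <= len(s) <= 28,
        "no_run_of_four": not has_long_run(s, 4),
        "ends_in_g_or_c": s[-1:] in ("G", "C"),
    }


if __name__ == "__main__":
    candidate = "ACGTACGATCGATGCTAGCTAG"  # made-up 22-mer for illustration
    print(check_primer(candidate))
    print(round(gc_fraction(candidate), 2))
    print(reverse_complement(candidate))
```

For the reverse primer, run the checks on the reverse-complemented sequence, since that is the oligonucleotide that will be ordered.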
An alternative method is to use the Primer Design function, which is found in the top toolbar of the Sequence window. Insert the region to amplify under Target Region. Under the Characteristics tab, insert the desired size, melting temperature, and percent GC, then click OK to have primers generated.
To perform quantitative PCR validation of the remaining sequences, first prepare a reaction mixture for each template in triplicate, with SYBR Green Master Mix, forward and reverse primers, and water, for a total volume of 25 microlitres. Run a qPCR program informed by the previously validated annealing temperature and extension time. Final denaturation curves should be generated at least the first time the primers are employed in qPCR, to validate the amplification of a single DNA product.
Measure the qPCR SYBR Green signals relative to Actin by delta Ct. For all cases, calculate the average and standard deviation of two to the negative delta Ct relative to Actin. Perform endpoint gel electrophoresis to confirm that the product detected by qPCR is the correct size. Here, run 25 microlitres of the qPCR product mixed with five microlitres of 6X glycerol dye on a 2% TAE agarose gel at 200 volts for 20 minutes.
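A worked version of the relative-signal calculation is sketched below. It assumes the quantity intended is two raised to the negative Ct difference between the target and Actin, and that each target replicate is paired with an Actin replicate from the same template. Both are assumptions, as are the function name and the example Ct values, which are invented for illustration and are not results from this study.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def relative_signal(target_cts: Sequence[float],
                    actin_cts: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of 2 ** -(Ct_target - Ct_actin)."""
    if len(target_cts) != len(actin_cts):
        raise ValueError("need one Actin Ct per target Ct")
    if len(target_cts) < 2:
        raise ValueError("need at least two replicates for a deviation")
    values = [2.0 ** -(t - a) for t, a in zip(target_cts, actin_cts)]
    return mean(values), stdev(values)


if __name__ == "__main__":
    # Invented triplicate Ct values for one template.
    target = [24.8, 25.1, 25.0]
    actin = [24.9, 25.0, 25.1]
    avg, sd = relative_signal(target, actin)
    print(f"relative signal: {avg:.3f} +/- {sd:.3f}")
```

A value near 1 means the target is detected at about the same level as Actin, while values far below 1 indicate depletion in that template.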
If your qPCR shows that you have not identified the target sequences, repeat the whole cycle with a new reference, which may be obtained from an online database. The subtractive project in this case started with sequencing the RNA from germline tissue of male and female adult zebra finches. Ultimately, 935 somatic genes that were not previously included in the whole genome annotation were identified.
After computational filtering, quantitative PCR may yield a negative result, in which there is no difference in detection across bird tissues. Conversely, a positive result, representing the identification of a true target sequence, is confirmed when genomic DNA qPCR shows statistically greater detection in the tissue of interest relative to the reference. Here, the alpha-SNAP gene was validated as germline-restricted because it was depleted in somatic tissue relative to testis DNA, where it was present at levels equivalent to Actin.
It is important to remember to use the correct inputs during each of the computational steps. You may have to go through the subtraction cycle multiple times in order to obtain the target sequence or sequences. A variety of phylogenetic, structural, and functional analyses can be performed on the discovered genes.
These additional methods give insight into the evolutionary and functional roles of the genes. We identified the first gene on a germline-restricted songbird chromosome, which broadened interest in this surprising genomic element. Subsequent work showed similar chromosomes in many songbird species containing many additional genes.