The overall goal of this procedure is to assess, assemble, annotate and compare differential gene expression through de novo transcriptomics, starting from raw FASTQ files. This method can help answer questions in comparative and molecular biology, including which transcripts are present in an organism, what those transcripts are doing, and what the differences are between experimental conditions. The main advantage of this technique is that it provides an interactive environment.
It provides on-demand computational resources and it allows researchers to start analyzing their RNA-Seq data immediately. This method is particularly useful for researchers comparing experiments within a single organism that involve multiple tissues, conditions, or time points to understand how biological systems change. This method is focused on non-model organisms without genomes, but can also be applied to organisms with genome assemblies available, even those with tens or hundreds of thousands of scaffolds in their assembly.
To begin, get access to Atmosphere in the Discovery Environment. Request a free CyVerse account by navigating to the registration page. Use an institutional email to register for the account.
Next, navigate to the apps and services tab and request access to Atmosphere. Access to the Discovery Environment is automatically granted. Log in to the Discovery Environment, abbreviated as DE. Then, select the Data tab to bring up a menu containing all of the folders in the data store.
Create a main project folder that will house all of the data associated with the project. Find the toolbar at the top of the data window and select File, New Folder. Do not use spaces or special characters in the folder names or in any input or output file names.
Instead, use underscores or dashes where appropriate. Upload the raw FASTQ sequence files into the folder 1_Raw_Sequence, in a subfolder titled A_Raw_Reads. For files under two gigabytes, use the data store's simple upload feature: navigate to the data window toolbar by clicking on the Data button in the main DE desktop.
Select Upload, Simple Upload from desktop. Then, select the Browse button to navigate to the raw FASTQ sequencing files on the local computer. Assess uploaded raw sequencing reads using the FastQC app in the DE. Select the Apps button on the main DE desktop to open a window containing all of the analysis apps available in the DE. Search for the FastQC tool using the search toolbar at the top of the window.
Open the multi-file version if there is more than one FASTQ file. Select File and create a new folder, then select this folder as the output folder. Load the FASTQ read files into the tool window called Select Input Data and select Launch Analysis.
Search for the programmable Trimmomatic app in the DE and open it. Upload the folder of raw FASTQ read files into the settings section. Select whether the sequencing files are single or paired-end.
Use the standard control file provided by selecting the Browse button and pasting the file path into the viewing box. Select the Trimmomatic control file and launch the analysis. For quality trimming sequence reads, search for and open the Sickle app in the DE. Select the trimmed FASTQ reads as input reads and rename the output files.
Include quality settings in the options. Open the most current version of the Atmosphere instance by navigating to the wiki page. Select the link for the most recent version of the Trinity and Trinotate image.
Select the Login To Launch button and then name the Atmosphere instance. Select an instance size of either medium3 or large3. Launch the instance and wait for it to build.
If an Atmosphere image fails to spin up, you can try to apply for a smaller instance or you can apply to Jetstream for a larger allocation. All of the details are on the companion wiki. Move the Trinity output files into the folder, 3_Assembly, in the DE and label the folder, A_Trinity_de_novo_assembly.
Running Trinity requires command line knowledge and several days or possibly weeks to complete large analyses. There are free resources available that are linked on the wiki to help understand the command line. Give each transcriptome that was assembled a sub folder inside the A_Trinity_de_novo_assembly folder.
Use unique names including the scientific names of organisms and treatments associated with each transcriptome, then create another subfolder called B_rnaQUAST_Output in the 3_Assembly folder. Open the app titled De Novo rnaQUAST. Name the analysis and select B_rnaQUAST_Output as the output folder.
Search for TransDecoder and run it on the de novo Trinity assembly output FASTA file in the Discovery Environment. Open the DESeq2 app in the DE. Name the analysis and select the output folder as 4_Differential_Expression. In the Input section, select the counts table file from the Trinity assembly run.
Also, select the column where the contig names can be found. Input the column headers from the counts data table file to determine which columns are compared. Include the commas between each of the conditions.
Do not include the first column header that contains the contig names. For replicates, repeat the same name. In the second line, provide the names of the two conditions to be compared.
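The two input lines described here can be sketched in Python. This is only an illustration of the rules as narrated (one condition name per data column, replicates sharing a name, the contig column left out, and a second line naming the two conditions to compare). The function name, the sample headers, and the use of a comma in the second line are assumptions made for the example, not the app's documented format.

```python
def build_condition_lines(header, condition_by_column, contrast, contig_column=0):
    """Return (first_line, second_line) for a two-condition comparison.

    header: column headers of the counts table, in file order.
    condition_by_column: maps each sample column header to a condition name;
        replicates map to the same name.
    contrast: the two condition names to compare.
    contig_column: index of the column holding contig names (left out of line 1).
    """
    # Skip the contig-name column, keep every data column in file order.
    sample_columns = [h for i, h in enumerate(header) if i != contig_column]
    first_line_names = [condition_by_column[h] for h in sample_columns]

    if len(contrast) != 2 or contrast[0] == contrast[1]:
        raise ValueError("contrast must name two different conditions")

    # The names on the second line have to match names used on the first line.
    missing = [c for c in contrast if c not in first_line_names]
    if missing:
        raise ValueError(f"contrast names not found in first line: {missing}")

    return ",".join(first_line_names), ",".join(contrast)


# Hypothetical counts-table header: one contig column, two replicates per condition.
header = ["contig", "ctrl_rep1", "ctrl_rep2", "heat_rep1", "heat_rep2"]
condition_by_column = {
    "ctrl_rep1": "control",
    "ctrl_rep2": "control",
    "heat_rep1": "heat",
    "heat_rep2": "heat",
}
line1, line2 = build_condition_lines(header, condition_by_column, ("control", "heat"))
print(line1)
print(line2)
```

Building the lines from the table header rather than typing them by hand makes it harder to miscount columns or misspell a condition name.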
Match the column header names provided in the first line. Shown here is a systematic comparison of sequencing reads after each pre-processing step. After trimming, the reads should have less skewed GC content and sequence content, and a greater proportion of reads with a high quality score.
High quality reads are necessary to assemble de novo transcriptomes. Results from FastQC depend on the organisms and samples being sequenced. But uniformity across all samples that will be compared downstream is the primary goal of pre-processing reads.
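For a rough sense of what the quality and GC content comparison measures, here is a minimal Python sketch that summarizes a FASTQ stream. It is not how FastQC works internally. It assumes four-line records and Phred+33 quality encoding, and the cutoff of 20 for a "high quality" read is an illustrative choice only.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            return  # end of file (or a blank line) stops the iteration
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield name[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def gc_fraction(seq):
    """Fraction of G and C bases in one read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def summarize(handle, min_mean_quality=20):
    """Count reads, the fraction at or above the quality cutoff, and mean GC."""
    reads = 0
    high_quality = 0
    gc_total = 0.0
    for _, seq, qual in read_fastq(handle):
        if not seq or not qual:
            continue  # skip empty records
        reads += 1
        gc_total += gc_fraction(seq)
        if mean_phred(qual) >= min_mean_quality:
            high_quality += 1
    if reads == 0:
        raise ValueError("no reads found")
    return {
        "reads": reads,
        "fraction_high_quality": high_quality / reads,
        "mean_gc": gc_total / reads,
    }


# Two made-up reads: one high quality and GC-rich, one low quality and AT-rich.
example = io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\n####\n")
summary = summarize(example)
print(summary)
```

Running the same summary on each sample before and after trimming gives a quick check on whether the samples have become more uniform.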
rnaQUAST leverages BUSCO to generate summary statistics about assemblies based on known core genes in taxonomic clades. The accuracy of assemblers is revealed by the number of mismatches per transcript, and by how many transcripts match canonical genes. The last four subplots presented here provide summary statistics of contig and isoform length, as well as the coverage of expected isoforms.
NAx represents the percentage of contigs with a length longer than the length shown on the y-axis. Assembled fraction is the longest single assembled transcript divided by its length, whereas covered fraction is the percentage of completely assembled transcript isoforms, as expected from the core prokaryotic or eukaryotic genes in BUSCO.
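To make the length statistic more concrete, the sketch below computes the plain Nx value (for example N50) from a list of contig lengths, using the common definition: the largest length N such that contigs of length N or longer cover at least x percent of the total assembled length. Whether the NAx plot in the app follows exactly this formula is an assumption, and the contig lengths are made up.

```python
def nx(lengths, x=50):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the largest length N such that contigs of length >= N
    together cover at least x percent of the total length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")

    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating point rounding at the threshold.
        if running * 100 >= total * x:
            return length


# Made-up contig lengths for illustration.
contig_lengths = [100, 200, 300, 400, 500]
n50 = nx(contig_lengths, 50)
n90 = nx(contig_lengths, 90)
print(n50, n90)
```

Evaluating the function over a range of x values gives the points of an Nx curve, which is one way to compare assemblies of the same sample.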
After watching this video, you should have a good understanding of how to assemble and annotate transcriptomes. Additionally, this protocol will allow you to detect differential gene expression between two conditions. Generally, individuals struggle with bioinformatic packages because there are so many of them, there are a lot of settings and variables associated with them, and usually you have to have knowledge of the command line to actually execute them.
It's important to label and organize your data inputs and analysis outputs so that other researchers can understand what was done. You should include the order steps were completed, program versions and sample information. Also, omit any spaces in the folder or file names.
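The naming rule can be applied before upload with a small helper. The sketch below is one possible approach, not part of the protocol: it replaces runs of whitespace with underscores and drops everything except letters, digits, underscores, dashes and periods. Keeping periods (so that file extensions survive) is an assumption, and the function names are hypothetical.

```python
import re


def safe_name(name):
    """Return a folder or file name without spaces or special characters."""
    # Replace each run of whitespace with a single underscore.
    cleaned = re.sub(r"\s+", "_", name.strip())
    # Drop anything that is not a letter, digit, underscore, dash or period.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "", cleaned)
    if not cleaned:
        raise ValueError(f"nothing left after cleaning {name!r}")
    return cleaned


def is_safe(name):
    """True if the name already follows the rule."""
    return bool(name) and safe_name(name) == name


# Made-up file name with spaces and parentheses.
print(safe_name("Raw Reads (run 1).fastq"))
print(is_safe("A_Trinity_de_novo_assembly"))
```

Checking names with `is_safe` before starting an analysis avoids having to rename inputs after outputs already refer to them.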
New tools and new versions of the tools are being integrated constantly, but old versions of the tools are also being kept. All changes will be recorded on the companion wiki. Following this procedure, other bioinformatic methods like network analysis, GO enrichment, and metabolic pathway identification can be performed to help answer questions like phenotype variation, conditions that change expression profiles, and identifying genes of interest for functional genomics.