The overall goal of this protocol is to help C. elegans researchers without bioinformatics expertise conduct an RNA sequencing experiment and analyze the data using the open-access platform Galaxy. This method can help analyze complex high-throughput sequencing data to provide insights into the transcriptomic signatures behind phenotypes in C. elegans. The main advantage of this technique is that it enables scientists with no prior bioinformatics training to analyze raw sequencing data and produce a list of differentially expressed genes, along with their associated gene ontology terms.
Though this method is specifically geared towards the nuances of C. elegans sequencing data, it can also be applied to other organisms and biological contexts where genome-scale transcription changes need to be examined. This video deals with the web-based usage of the Galaxy pipeline. If feasible, run the workflow locally.
Download and install Galaxy according to the instructions on the wiki page. After working through the Start Here tutorial provided on the Galaxy homepage, follow this video. It is critical for the user to become familiar with the Galaxy user interface and the tools used in the workflow.
To begin, click on Analyze Data in the header panel to access the analysis home view. The progress bar in the upper right shows how much disk space has been used. Next, access the Tools menu on the left panel and click on NGS: RNA Analysis.
This section provides access to all the tools required for analysis of RNA sequencing data. Now, start a new analysis history. Go to the History panel on the right.
Click on the gear icon and choose the Create New option from the pop-up menu. Then, provide a name under History to identify the analysis. To proceed, go to the Tools menu, and under Get Data, click the Upload File function to upload raw FASTQ files.
After the task opens in the analysis interface, click on Choose Local File or Choose FTP File to navigate to and select the appropriate sequence data. By default, Galaxy will automatically detect the file type. Then select the organism from the pull-down menu, which in this case is C. elegans.
Next, click on Start to initiate the data upload. After the file is uploaded, the action is saved in the History panel, from which the data can be selected and accessed. Now, convert the files one at a time from the FASTQ format to the FASTQ Sanger format: access the NGS: QC and Manipulation menu, select FASTQ Groomer, and choose the file under File to Groom.
Select the appropriate input FASTQ quality scores type and run the tool using the default parameters. The data file can now be analyzed. Pay special attention to the file formats and test parameters used throughout the protocol.
This knowledge is valuable for troubleshooting failed tests and other issues. Quality control tests can be used before proceeding. Details on running them are provided in the text protocol.
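As a rough illustration of what such quality control checks and the FASTQ quality scores type refer to, the following Python sketch summarizes read lengths, GC content, and a guess at the quality encoding for a FASTQ stream. It is not part of the Galaxy workflow. The function names and the two example reads are invented for illustration, and the encoding thresholds are a common heuristic rather than a guarantee.

```python
import io


def read_fastq(handle):
    """Yield (sequence, quality) pairs, assuming four well-formed lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def guess_encoding(min_ascii):
    """Heuristic guess at the quality offset from the lowest quality character seen."""
    if min_ascii < 59:
        return "phred33"   # Sanger-style offset, the target of the grooming step
    if min_ascii >= 64:
        return "phred64"   # older Illumina-style offset
    return "solexa64"      # Solexa scores can dip below the +64 offset


def summarize(handle):
    """Return the set of read lengths, percent GC, and the guessed quality encoding."""
    lengths, gc, total, min_ascii = set(), 0, 0, 255
    for seq, qual in read_fastq(handle):
        lengths.add(len(seq))
        upper = seq.upper()
        gc += upper.count("G") + upper.count("C")
        total += len(seq)
        if qual:
            min_ascii = min(min_ascii, min(ord(c) for c in qual))
    percent_gc = 100.0 * gc / total if total else 0.0
    return lengths, percent_gc, guess_encoding(min_ascii)


# Two made-up reads, used only to exercise the functions.
EXAMPLE = "@r1\nGATTACA\n+\nIIIIIII\n@r2\nGGCCATA\n+\n!!!!III\n"
lengths, percent_gc, encoding = summarize(io.StringIO(EXAMPLE))
print(lengths, round(percent_gc, 1), encoding)
```

On real data, a single read length and a low minimum quality character would be consistent with fixed-length reads already in the Sanger-style encoding, but the tools in Galaxy remain the reference for these checks.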
Once a file is ready to analyze, map the sequencing data by first opening the NGS: RNA Analysis section and then clicking on the TopHat tool. From the drop-down menu, answer the question "Is this single or paired-end data?" Next, choose the appropriate FASTQ file.
Select Use a Built-in Genome in the next drop-down menu and choose the reference C. elegans genome data. Select the default for all other parameter options, and then click Execute. To estimate the relative abundance of transcripts in the data set, select the Cufflinks tool in the NGS: RNA Analysis section, and from the first drop-down menu choose the mapped accepted hits BAM format file obtained from the TopHat analysis.
From the second drop-down menu, set the reference annotation to the GTF file containing the current genome data. When presented with the Perform Bias Correction option, select Yes and run the task using the default settings. Next, from the NGS: RNA Analysis menu, open the Cuffmerge tool to merge the assembled transcripts produced for all the RNA sequencing samples.
The first box in the tool lists all the GTF files produced by Cufflinks in the previous step. Now, select the assembled transcripts file for each of the strains or conditions tested, including the biological replicates of the same strain or condition. Select Yes for Use Reference Annotation, and choose the reference genome data file.
Next, select Yes for the Use Sequence Data option. This will automatically detect and choose the appropriate reference genome. Leave all other parameters at their default settings and click Execute.
A single GTF file will be produced. To compare multiple strains or conditions, go back to the NGS: RNA Analysis section and select the Cuffdiff tool. Then, from the Transcripts menu in the Cuffdiff tool, select the merged output file from Cuffmerge.
Then enter labels for the two conditions. For each condition, go to Replicates and select the individual accepted hits output files from TopHat that correspond to the different biological replicates of that condition. To select multiple files simultaneously, hold down the Command or Control key.
After selecting the files, use the default parameter settings and click Execute to run the task. Download the differential expression data by clicking the save icon in the gene differential testing box generated in the History panel. To begin the functional analysis, access DAVID from its website.
From the header of the webpage, choose Start Analysis. Copy the list of genes obtained from Galaxy into box A, and in this example select WormBase gene ID as the gene identifier. Then, under List Type in question three, choose Gene List and click on the Submit List icon. The analysis wizard homepage will now open, from which DAVID tasks can be selected.
This video segment describes a few of these options. First, choose functional annotation clustering to go to the summary page. Leave the annotation categories on their default settings and click on functional annotation clustering.
This option generates clusters of similar annotation terms ranked by their enrichment score. Now, return to the analysis wizard and select the functional annotation chart option to identify significantly overrepresented biological terms associated with the gene list. A valuable DAVID feature is the option to make a functional annotation table.
This table lists all the annotations associated with the genes without showing any statistical calculations. It can be useful for gene-by-gene analysis and for finding specific genes of interest. Another useful DAVID tool to review is the gene functional classification.
This option segregates the genes into a list of functionally related groups ranked by their enrichment score. The described protocol was used to identify genes whose expression is modulated by tcer-1 following germline loss. The transcriptome of long-lived, germline-less glp-1 mutants was compared with that of tcer-1; glp-1 double mutants, which are germline-less but do not exhibit lifespan extension.
A quality check of the sequences found no poor-quality reads, 48 to 49% GC content, and a constant sequence read length of 51 base pairs. Genome coverage of the samples was estimated to be between 7-fold and 11-fold. Galaxy made it possible to combine the NGS data from the two replicates of each strain and perform differential analysis, generating gene lists that highlight the genome-wide expression profile.
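For orientation, a fold-coverage figure like the one quoted above is commonly approximated as the number of reads times the read length, divided by the genome size. The short Python sketch below shows that arithmetic. The genome size of roughly 100 million base pairs for C. elegans and the read count of 20 million are illustrative assumptions, not values taken from this experiment; only the 51 base pair read length comes from the quality check described here.

```python
def fold_coverage(n_reads, read_length_bp, genome_size_bp):
    """Approximate fold coverage as total sequenced bases over genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_fold, read_length_bp, genome_size_bp):
    """Invert the same relation to estimate reads needed for a target coverage."""
    return target_fold * genome_size_bp / read_length_bp


GENOME_SIZE_BP = 100_000_000   # assumed approximate C. elegans genome size
READ_LENGTH_BP = 51            # read length reported in the quality check
HYPOTHETICAL_READS = 20_000_000  # made-up read count for illustration

coverage = fold_coverage(HYPOTHETICAL_READS, READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"about {coverage:.1f}-fold coverage")
print(f"about {reads_needed(7, READ_LENGTH_BP, GENOME_SIZE_BP):,.0f} reads for 7-fold")
```

This is a back-of-the-envelope estimate against the whole genome. For RNA sequencing, coverage of the transcribed regions will differ, so the number is best read as a rough sanity check on sequencing depth.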
Overall, 835 genes were differentially expressed between the two strains using a P value cutoff of 0.05. Functional annotation analysis of the up-regulated genes revealed four annotation clusters with high enrichment scores. The highest-scoring cluster included cytochrome P450 and xenobiotic response genes, followed by a cluster of genes implicated in lipid modifications.
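The step from the downloaded gene differential testing table to separate up-regulated and down-regulated gene lists can be sketched in Python as follows. This is a minimal illustration, assuming the table is the tab-separated Cuffdiff gene differential expression output with its usual column names, that a positive log2 fold change means higher expression in the second sample, and that the 0.05 cutoff is applied to the p_value column. The gene names and numbers in the embedded table are made up so the example runs on its own.

```python
import csv
import io

COLUMNS = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
           "status", "value_1", "value_2", "log2(fold_change)",
           "test_stat", "p_value", "q_value", "significant"]

# Hypothetical rows, for illustration only.
ROWS = [
    ["g1", "g1", "geneA", "I:1-100", "c1", "c2", "OK", "5", "20", "2.0", "3.1", "0.001", "0.01", "yes"],
    ["g2", "g2", "geneB", "I:200-300", "c1", "c2", "OK", "10", "11", "0.14", "0.2", "0.6", "0.9", "no"],
    ["g3", "g3", "geneC", "II:1-100", "c1", "c2", "OK", "40", "10", "-2.0", "-2.9", "0.004", "0.03", "yes"],
    ["g4", "g4", "geneD", "II:200-300", "c1", "c2", "NOTEST", "0", "0", "0", "0", "1", "1", "no"],
]
TABLE = "\n".join("\t".join(r) for r in [COLUMNS] + ROWS) + "\n"


def split_by_direction(handle, p_cutoff=0.05):
    """Return (up, down) gene name lists for successfully tested rows below the cutoff."""
    up, down = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "OK":
            continue  # skip genes that could not be tested
        if float(row["p_value"]) >= p_cutoff:
            continue
        change = float(row["log2(fold_change)"])  # also parses "inf" and "-inf"
        if change > 0:
            up.append(row["gene"])
        elif change < 0:
            down.append(row["gene"])
    return up, down


up, down = split_by_direction(io.StringIO(TABLE))
print("up-regulated:", up)
print("down-regulated:", down)
```

Each resulting list can then be pasted into DAVID separately. Filtering on the q_value column instead would apply a cutoff corrected for multiple testing, which is a stricter choice than the one described here.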
Functional annotation clustering of the down-regulated targets also identified a variety of annotation clusters. These included clusters enriched for cytoskeletal function, positive regulation of growth, reproduction, and aging. The Galaxy pipeline has paved the way for researchers in a wide range of biological disciplines to analyze large-scale gene expression changes rapidly and efficiently.
After watching this video, you should be able to conduct an RNA-seq experiment and analyze the raw high-throughput sequencing data using the Galaxy pipeline. You should also be able to extract biologically relevant information from the Galaxy data using the DAVID platform.