Method Article
Galaxy and DAVID have emerged as popular tools that allow investigators without bioinformatics training to analyze and interpret RNA-Seq data. We describe a protocol for C. elegans researchers to perform RNA-Seq experiments, access and process the dataset using Galaxy and obtain meaningful biological information from the gene lists using DAVID.
Next-generation sequencing (NGS) technologies have revolutionized the nature of biological investigation. Of these, RNA Sequencing (RNA-Seq) has emerged as a powerful tool for gene-expression analysis and transcriptome mapping. However, handling RNA-Seq datasets requires sophisticated computational expertise and poses inherent challenges for biology researchers. This bottleneck has been mitigated by the open-access Galaxy project, which allows users without bioinformatics skills to analyze RNA-Seq data, and by the Database for Annotation, Visualization, and Integrated Discovery (DAVID), a Gene Ontology (GO) term analysis suite that helps derive biological meaning from large data sets. However, for first-time users and bioinformatics novices, self-learning and familiarization with these platforms can be time-consuming and daunting. We describe a straightforward workflow that will help C. elegans researchers to isolate worm RNA, conduct an RNA-Seq experiment and analyze the data using the Galaxy and DAVID platforms. This protocol provides stepwise instructions for using the various Galaxy modules for accessing raw NGS data, quality-control checks, alignment, and differential gene expression analysis, guiding the user with parameters at every step to generate a gene list that can be screened for enrichment of gene classes or biological processes using DAVID. Overall, we anticipate that this article will provide information to C. elegans researchers undertaking RNA-Seq experiments for the first time as well as frequent users running a small number of samples.
The first sequencing of the human genome, performed using Fred Sanger's dideoxynucleotide-sequencing method, took 10 years and cost an estimated US $3 billion1,2. However, in a little over a decade since its inception, Next-Generation Sequencing (NGS) technology has made it possible to sequence the entire human genome within two weeks and for US $1,000. New NGS instruments that collect sequencing data ever faster and more efficiently, along with sharp reductions in cost, are transforming modern biology, and genome sequencing projects are rapidly becoming commonplace. In addition, these developments have galvanized progress in many other areas such as gene-expression analysis through RNA-Sequencing (RNA-Seq), study of genome-wide epigenetic modifications, DNA-protein interactions, and screening for microbial diversity in human hosts. NGS-based RNA-Seq in particular has made it possible to identify and map transcriptomes comprehensively with accuracy and sensitivity, and has replaced microarray technology as the method of choice for expression profiling. While microarray technology has been used extensively, it is limited by its reliance on pre-existing arrays with known genomic information, and by other drawbacks such as cross-hybridization and the restricted range of expression changes that can be measured reliably. RNA-Seq, on the other hand, can be used to detect both known and unknown transcripts while producing low background noise, because reads can be mapped unambiguously to DNA sequence. RNA-Seq, together with the numerous genetic tools offered by model organisms such as yeast, flies, worms, fish and mice, has served as the foundation for many important recent biomedical discoveries. However, significant challenges remain that make NGS inaccessible to the wider scientific community, including limitations of storage, processing, and, most of all, meaningful bioinformatics analysis of large volumes of sequencing data.
The rapid advances in sequencing technologies and exponential data accumulation have created a great need for computational platforms that will allow researchers to access, analyze and understand this information. Early systems were heavily dependent upon computer programming knowledge, whereas genome browsers such as NCBI, which allowed non-programmers to access and visualize data, did not permit sophisticated analyses. The web-based, open-access platform Galaxy (https://galaxyproject.org/) has filled this void and proven to be a valuable pipeline that enables researchers to process NGS data and perform a spectrum of simple-to-complex bioinformatics analyses. Galaxy was initially established, and is maintained, by the laboratories of Anton Nekrutenko (Penn State University) and James Taylor (Johns Hopkins University)3. Galaxy offers a wide range of computational tasks, making it a 'one-stop shop' for innumerable bioinformatics needs, including all the steps involved in an RNA-Seq study. It allows users to perform data processing either on its servers or locally on their own machines. Data and workflows can be reproduced and shared. Online tutorials, a help section, and a wiki page (https://wiki.galaxyproject.org/Support) dedicated to the Galaxy Project provide consistent support. However, for first-time users, especially those with no bioinformatics training, the pipeline can appear daunting and the process of self-learning and familiarization can be time-consuming. In addition, the biological system studied, and the specifics of the experiment and methods used, impact the analytical decisions at several steps, and these can be difficult to navigate without instruction.
The overall RNA-Seq Galaxy workflow consists of data upload and quality check followed by analysis using the Tuxedo Suite4,5,6,7,8,9, which is a collection of tools required for the different stages of RNA-Seq data analysis10,11,12,13,14. A typical RNA-Seq experiment consists of the experimental part (sample preparation, mRNA isolation and cDNA library preparation), the NGS, and the bioinformatics data analysis. An overview of these sections, and the steps involved in the Galaxy pipeline, is shown in Figure 1.
Figure 1: Overview of an RNA-Seq Workflow. Illustration of the experimental and computational steps involved in an RNA-Seq experiment to compare the gene-expression profiles of two worm strains (A and B, orange and green lines and arrows, respectively). The different modules of Galaxy utilized are shown in boxes with the corresponding step in our protocol indicated in red. The outputs of various operations are written in grey with the file formats shown in blue. Please click here to view a larger version of this figure.
The first tool in the Tuxedo Suite is an alignment program called 'TopHat'. It breaks down the NGS input reads into smaller fragments and then maps them to a reference genome. This two-step process ensures that reads spanning intronic regions, whose alignment can otherwise be disrupted or missed, are accounted for and mapped. This increases coverage and facilitates the identification of novel splice junctions. TopHat output is reported as two files, a BED file (with information about splice junctions that includes genomic location) and a BAM file (with mapping details of each read). Next, the BAM file is aligned against a reference genome to estimate the abundance of individual transcripts within each sample using the subsequent tool in the Tuxedo Suite, called 'Cufflinks'. Cufflinks functions by scanning the alignment to report full-length transcript fragments or 'transfrags' that span all the possible splice variants in the input data for every gene. Based on this, it generates a 'transcriptome' (an assembly of all the transcripts generated per gene, for every gene) for each sample being sequenced. These Cufflinks assemblies are then collapsed or merged together along with the reference genome to produce a single annotation file for downstream differential analysis using the next tool, 'Cuffmerge'. Finally, the 'Cuffdiff' tool measures differential gene expression between samples by comparing the TopHat outputs of each of the samples to the final Cuffmerge output file (Figure 1). Cufflinks uses FPKM/RPKM (Fragments/Reads Per Kilobase of transcript per Million mapped reads) values to report transcript abundances. These values reflect the normalization of the raw NGS data for depth (average number of reads from a sample that align to the reference genome) and gene length (genes have different lengths, so counts have to be normalized for the length of a gene to compare levels between genes).
FPKM and RPKM are essentially the same: RPKM is used for single-end RNA-Seq, where every read corresponds to a single fragment, whereas FPKM is used for paired-end RNA-Seq, as it accounts for the fact that two reads can correspond to the same fragment. Ultimately, the outcome of these analyses is a list of genes differentially expressed between the conditions and/or strains tested.
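The normalization described above can be illustrated with a short calculation. The following is a minimal sketch of the simplified textbook definition of FPKM, not the statistical estimator that Cufflinks actually runs. The function name and the example numbers are hypothetical and chosen only for illustration.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Simplified FPKM: fragments per kilobase of transcript per million mapped fragments.

    Illustrative definition only; Cufflinks itself estimates abundances
    with a more elaborate statistical model.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    # Dividing by (length / 1e3) normalizes for gene length and dividing by
    # (total / 1e6) normalizes for sequencing depth, hence the 1e9 factor.
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000 bp transcript,
# in a sample with 20 million mapped fragments.
example_value = fpkm(500, 2000, 20_000_000)
print(example_value)
```

For single-end data the same arithmetic applied to read counts gives RPKM, since each read then corresponds to one fragment.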
Once a successful Galaxy run is completed and a 'gene list' is generated, the next logical step requires further bioinformatics analyses to deduce meaningful knowledge from the datasets. Many software packages have emerged to cater to this need, including publicly available web-based computational packages such as DAVID (the Database for Annotation, Visualization and Integrated Discovery)15. DAVID facilitates assigning biological meaning to large gene lists from high-throughput studies by comparing the uploaded gene list to its integrated biological knowledgebase and revealing the biological annotations associated with the gene list. This is followed by Enrichment Analysis, i.e., tests to identify whether any biological process or gene class is overrepresented in the gene list(s) in a statistically significant manner. It has become a popular choice because it combines a wide, integrated knowledgebase with powerful analytical algorithms that enable researchers to detect biological themes enriched within genomics-derived 'gene lists'10,16. Additional advantages include its ability to process gene lists created on any sequencing platform and a highly user-friendly interface.
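To make the idea of overrepresentation concrete, the sketch below computes a plain one-sided hypergeometric p-value for a single annotation term. It is a generic illustration of an enrichment test under assumed inputs, not DAVID's own implementation, and the statistic DAVID reports may differ. The function name and the numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, background_hits, background_size):
    """Probability of seeing at least `list_hits` annotated genes in a gene
    list of `list_size`, drawn at random from a background of
    `background_size` genes of which `background_hits` carry the annotation."""
    if not (0 <= list_hits <= list_size <= background_size):
        raise ValueError("inconsistent list counts")
    if not (0 <= background_hits <= background_size):
        raise ValueError("inconsistent background counts")
    total = comb(background_size, list_size)
    upper = min(list_size, background_hits)
    # Sum the hypergeometric probabilities for k = list_hits .. upper.
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    return tail / total


# Hypothetical toy example: 10 background genes, 4 of them annotated with
# a term; a gene list of 3 genes contains 2 annotated genes.
# By hand: (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40/120 = 1/3.
p = enrichment_p_value(2, 3, 4, 10)
print(p)
```

A small p-value suggests the term appears in the gene list more often than chance would predict; with many terms tested at once, a multiple-testing correction would also be needed.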
The nematode Caenorhabditis elegans is a genetic model system, well known for its many advantages such as small size, transparent body, simple body plan, ease of culture and great amenability to genetic and molecular dissection. Worms have a small, simple and well-annotated genome that includes up to 40% conserved genes with known human homologs17. Indeed, C. elegans was the first metazoan whose genome was completely sequenced18, and one of the first species where RNA-Seq was used to map an organism's transcriptome19,20. Early worm studies involved experimentation with different methods for high-throughput RNA capture, library preparation and sequencing, as well as bioinformatics pipelines that contributed to the advancement of the technology21,22. In recent years, RNA-Seq-based experimentation in worms has become commonplace. However, for traditional worm biologists, the challenges posed by computational analysis of RNA-Seq data remain a major impediment to greater and better utilization of the technique.
In this article, we describe a protocol for using the Galaxy platform to analyze high-throughput RNA-Seq data generated from C. elegans. For many first-time and small-scale users, the most cost-efficient and straightforward way to undertake an RNA-Seq experiment is to isolate RNA in the lab and utilize a commercial (or in-house) NGS facility for preparation of sequencing cDNA libraries and the NGS itself. Hence, we have first detailed the steps involved in isolation, quantification and quality assessment of C. elegans RNA samples for RNA-Seq. Next, we provide step-by-step instructions for using the Galaxy interface for analyses of the NGS data, beginning with post-sequencing quality-control checks followed by alignment, assembly, and differential quantification of gene expression. In addition, we have included directions to scrutinize the gene lists resulting from Galaxy for biological enrichment studies using DAVID. As a final step in the workflow, we provide instructions for uploading RNA-Seq data onto public servers such as the Sequence Read Archive (SRA) on NCBI (http://www.ncbi.nlm.nih.gov/sra) to make it freely accessible to the scientific community. Overall, we anticipate that this article will provide comprehensive and sufficient information to worm biologists undertaking RNA-Seq experiments for the first time as well as frequent users running a small number of samples.
1. RNA Isolation
2. RNA-Seq Data Analysis
Figure 2: Layout of the Galaxy User Interface Panel and Key RNA-Seq Functions. Key features of the page are expanded and highlighted. (A) highlights the 'Analyze data' function in the webpage header used to access Analysis Home View. (B) is the 'Progress bar' that indicates the space on the Galaxy server utilized by the operation. (C) is the 'Tools Section' that lists all the tools that can be run on the Galaxy interface. (D) shows the 'NGS: RNA Analysis' tool section used for RNA-Seq analysis. (E) depicts the 'History' panel that lists all the files generated using Galaxy. (F) shows an example of the dialogue box that opens up when clicking on any file in the History section. Within (F), the blue box highlights icons that can be used to view, edit the attributes of or delete the dataset; the purple box highlights icons that can be used to 'edit' the dataset tags or annotation; and the red box indicates icons to download the data, view details of the task performed or rerun the operation. Please click here to view a larger version of this figure.
3. Gene Ontology (GO) Term Analysis using DAVID
Figure 3: Layout of the DAVID Analysis Wizard Webpage and Examples of Operation Outputs. The 'Analysis Wizard' web user-interface lists the tools used to analyze uploaded gene list for enrichment based on various parameters. Clicking on these tools reports the analyzed data in a new web page. Examples of the tabular reports generated from 'Gene Functional Classification', 'Functional Annotation Chart' and 'Functional Annotation Clustering' are shown as insets (arrows). Please click here to view a larger version of this figure.
4. Uploading RAW Data onto the NCBI Sequence Read Archive (SRA)
In C. elegans, elimination of the germline stem cells (GSCs) extends lifespan, enhances stress resilience, and elevates body fat24,28. Loss of GSCs, either brought about by laser-ablation or by mutations such as glp-1, causes lifespan extension through activation of a network of transcription factors29. One such factor, TCER-1, encodes the worm homolog of the human transcription elongation...
Significance of the Galaxy Sequencing Platform in Modern Biology
The Galaxy Project has become instrumental in helping biologists without bioinformatics training to process and analyze high-throughput sequencing data in a fast and efficient manner. Running complex bioinformatics algorithms to analyze NGS data was once considered a herculean task; this publicly available platform has made it a straightforward, reliable, and easy process. Apart from hosting a wide range of bioinformatics tools, the key to...
The authors have nothing to disclose.
The authors would like to express their gratitude to the laboratories, groups and individuals who have developed Galaxy and DAVID, and thus made NGS widely accessible for the scientific community. The help and advice provided by colleagues at the University of Pittsburgh during our bioinformatics training is acknowledged. This work was supported by an Ellison Medical Foundation New Scholar in Aging award (AG-NS-0879-12) and a grant from the National Institutes of Health (R01AG051659) to AG.
Name | Company | Catalog Number | Comments |
RNase spray | Fisher Scientific | 21-402-178 | |
Trizol | Ambion | 15596026 | |
Sonicator | Sonics Vibra Cell | VCX130 | |
Centrifuge | Eppendorf | 5415C | |
chloroform | Sigma Aldrich | 288306 | |
2-propanol | Fisher Scientific | A416P-4 | |
Ethanol | Decon Labs | 2705HC | |
RNase-free water | Fisher Scientific | BP561-1 | |
Bioanalyzer | Agilent | G2940CA | |
Mac/PC | | | |
Copyright © 2025 MyJoVE Corporation. All rights reserved.