DiCoExpress is a script-based tool implemented in R to perform RNA-Seq analysis from quality control to co-expression. It handles complete and unbalanced designs with up to two biological factors. This video tutorial guides the user through the different features of DiCoExpress.
The proper use of statistical modeling in NGS data analysis requires an advanced level of expertise. There is a growing consensus in favor of generalized linear models for the differential analysis of RNA-Seq data and of mixture models for co-expression analysis. To offer a managed setting for these modeling approaches, we developed DiCoExpress, which provides a standardized R pipeline for RNA-Seq analysis. Without any particular knowledge of statistics or R programming, beginners can perform a complete RNA-Seq analysis, from quality control to co-expression, through differential analysis based on contrasts inside a generalized linear model. Enrichment analysis is offered for both the lists of differentially expressed genes and the co-expressed gene clusters. This video tutorial is conceived as a step-by-step protocol to help users take full advantage of DiCoExpress and its potential to support the biological interpretation of an RNA-Seq experiment.
Next-generation RNA sequencing (RNA-Seq) technology is now the gold standard of transcriptome analysis [1]. Since the early days of the technology, the combined efforts of bioinformaticians and biostatisticians have resulted in the development of numerous methods tackling all the essential steps of transcriptomic analyses, from mapping to transcript quantification [2]. Most of the tools available today to the biologist are developed within the R software environment for statistical computing and graphics [3], and many packages for biological data analysis are available in the Bioconductor repository [4]. These packages offer total control and customization of the analysis, but at the cost of extensive use of a command-line interface. Because many biologists are more comfortable with a "point and click" approach [5], the democratization of RNA-Seq analyses requires the development of more user-friendly interfaces or protocols [6]. For example, it is possible to build web interfaces for R packages using Shiny [7], and command-line data analysis is made more intuitive by the RStudio [8] interface. Dedicated, step-by-step tutorials can also help the new user. In particular, a video tutorial supplements a classic text tutorial, leading to a deeper understanding of every step of the procedure.
We recently developed DiCoExpress [9], a tool for analyzing multifactorial RNA-Seq experiments in R using methods that performed best in neutral comparison studies [10,11,12]. Starting from a count table, DiCoExpress proposes a data quality control step followed by a differential gene expression analysis (edgeR package [13]) using a generalized linear model (GLM) and the generation of co-expression clusters using Gaussian mixture models (coseq package [12]). DiCoExpress handles complete and unbalanced designs with up to two biological factors (e.g., genotype and treatment) and one technical factor (e.g., replicate). The originality of DiCoExpress lies in its directory architecture, which stores and organizes data, scripts, and results, and in the automated writing of the contrasts, which allows the user to investigate numerous questions within the same statistical model. An effort was also made to provide graphical outputs illustrating the statistical results.
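To make the idea of contrasts inside a single model more concrete, here is a minimal base-R sketch. It assumes a toy design with hypothetical labels (two genotypes, two treatments, three replicates) and an additive replicate effect plus a genotype-by-treatment interaction; it is not necessarily the parametrization that DiCoExpress writes internally. The contrast for "treatment effect within Genotype2" is obtained as the difference between the average design-matrix rows of the two conditions being compared.

```r
# Toy design (hypothetical labels): 2 genotypes x 2 treatments x 3 replicates.
design_df <- expand.grid(
  Replicate = c("rep1", "rep2", "rep3"),
  Treatment = c("control", "treated"),
  Genotype  = c("Genotype1", "Genotype2"),
  stringsAsFactors = TRUE
)

# Design matrix: replicate effect, two biological factors, and their interaction.
X <- model.matrix(~ Replicate + Genotype + Treatment + Genotype:Treatment,
                  data = design_df)

# Average design row for one experimental condition.
cell_mean <- function(genotype, treatment) {
  rows <- design_df$Genotype == genotype & design_df$Treatment == treatment
  colMeans(X[rows, , drop = FALSE])
}

# Contrast: treated versus control, within Genotype2.
contrast <- cell_mean("Genotype2", "treated") - cell_mean("Genotype2", "control")
print(round(contrast, 10))
```

The resulting vector has non-zero weights only on the treatment coefficient and the interaction coefficient, which is what one would write by hand for this question. Deriving the weights from the design matrix rather than typing them is one simple way to avoid sign or indexing mistakes when many contrasts are needed.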
The DiCoExpress workspace is available at https://forgemia.inra.fr/GNet/dicoexpress. It contains four directories, two PDF files, and two text files. The Data/ directory contains the input datasets; for this protocol, we will use the "tutorial" dataset. The Sources/ directory contains the seven R functions necessary to perform the analysis; these must not be modified by the user. The analysis is run using scripts stored in the Template_scripts/ directory. The one used in this protocol is called DiCoExpress_Tutorial_JoVE.R and can be easily adapted to any transcriptomic project. All the results are written in the Results/ directory and stored in a subdirectory named according to the project. The README.md file contains useful installation information, and any specific details concerning the method and its use can be found in the DiCoExpress_Reference_Manual.pdf file.
This video tutorial guides the user through the different features of DiCoExpress with the aim of overcoming the reluctance some biologists feel toward command-line-based tools. We present here the analysis of an artificial RNA-Seq dataset describing gene expression in three biological replicates of four genotypes, with or without treatment. We will now go through the different steps of the DiCoExpress workflow illustrated in Figure 1. The script described in the Protocol section and the input files are available at https://forgemia.inra.fr/GNet/dicoexpress
Prepare data files
The four csv files stored in the Data/ directory should be named according to the project name. In our example, all the names therefore begin with "Tutorial", and we will set Project_Name = "Tutorial" in Step 4 of the protocol. The separator used in the csv files must be indicated in the Sep variable in Step 4. In our "tutorial" dataset, the separator is a tab character. For advanced users, the full dataset can be reduced to a subset by providing a list of instructions and a new Project_Name through the Filter variable. This option avoids redundant copies of the input files and complies with the FAIR principles [14].
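The role of the Sep variable can be illustrated with a short base-R sketch. Project_Name and Sep are the variables named in the protocol; the temporary directory, the toy count table, and its Gene_ID column header are assumptions made for the example and are not taken from the tutorial dataset.

```r
# Variables named in the protocol (Step 4).
Project_Name <- "Tutorial"
Sep <- "\t"  # the tutorial dataset is tab-separated

# Write a tiny stand-in count table to a temporary directory (illustration only).
data_dir <- tempfile("Data_")
dir.create(data_dir)
counts_path <- file.path(data_dir, paste0(Project_Name, "_COUNTS.csv"))
toy <- data.frame(
  Gene_ID = c("gene1", "gene2"),
  Genotype1_control_rep1 = c(10L, 0L),
  Genotype1_treated_rep1 = c(25L, 3L)
)
write.table(toy, counts_path, sep = Sep, quote = FALSE, row.names = FALSE)

# Reading with the wrong separator collapses each line into a single column;
# reading with the declared separator recovers the table.
wrong <- read.table(counts_path, header = TRUE, sep = ",")
right <- read.table(counts_path, header = TRUE, sep = Sep)
cat("columns with ',':", ncol(wrong), "\n")
cat("columns with Sep:", ncol(right), "\n")
```

Checking the number of columns after import is a quick way to confirm that Sep matches the files before running the rest of the pipeline.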
Among the four csv files, only the COUNTS and TARGET files are mandatory. They contain the raw counts for every gene (here Tutorial_COUNTS.csv) and the experimental design description (here Tutorial_TARGET.csv). The TARGET.csv file describes every sample (one sample per row) with a modality for each biological or technical factor (in the columns). We strongly recommend that the names chosen for the modalities start with a letter, not a number. The name of the last column ("Replicate") cannot be changed. Finally, the sample names (first column) must match the names in the headings of the COUNTS.csv file (Genotype1_control_rep1 in our example). The Enrichment.csv file, in which every line contains one Gene_ID and one annotation term, is only required if the user plans to run the enrichment analysis. If a gene has several annotations, each must be written on a separate line. The Annotation.csv file is optional and is used to add a short description of every gene in the output files. The best way to get an annotation file is to retrieve the information from dedicated databases (e.g., Thalemine: https://bar.utoronto.ca/thalemine/begin.do for Arabidopsis).
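The constraints on the TARGET and COUNTS files can be checked before launching the analysis. The sketch below uses base R on toy in-memory tables; check_design() is a hypothetical helper, not a DiCoExpress function, and it assumes that the first COUNTS column holds the gene identifiers and that the first TARGET column is headed "Sample".

```r
# Toy stand-ins for Tutorial_TARGET.csv and Tutorial_COUNTS.csv (illustration only).
target <- data.frame(
  Sample    = c("Genotype1_control_rep1", "Genotype1_treated_rep1"),
  Genotype  = c("Genotype1", "Genotype1"),
  Treatment = c("control", "treated"),
  Replicate = c("rep1", "rep1"),
  stringsAsFactors = FALSE
)
counts <- data.frame(
  Gene_ID = c("gene1", "gene2"),
  Genotype1_control_rep1 = c(10L, 0L),
  Genotype1_treated_rep1 = c(25L, 3L)
)

# Hypothetical helper: returns a character vector of problems (empty if none).
check_design <- function(target, counts) {
  problems <- character(0)
  # The last TARGET column must be named "Replicate".
  if (names(target)[ncol(target)] != "Replicate") {
    problems <- c(problems, "last TARGET column is not 'Replicate'")
  }
  # Sample names (first TARGET column) must match the COUNTS headings.
  samples <- as.character(target[[1]])
  if (!setequal(samples, names(counts)[-1])) {
    problems <- c(problems, "sample names differ between TARGET and COUNTS")
  }
  # Modality names should start with a letter, not a number.
  modalities <- unlist(lapply(target[-1], as.character), use.names = FALSE)
  bad <- unique(modalities[!grepl("^[A-Za-z]", modalities)])
  if (length(bad) > 0) {
    problems <- c(problems, paste("modalities not starting with a letter:",
                                  paste(bad, collapse = ", ")))
  }
  problems
}

print(check_design(target, counts))
```

An empty result means the toy tables satisfy the three rules stated above; any message points to the rule that is violated.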
Installation of DiCoExpress
DiCoExpress requires specific R packages. Run the command source("../Sources/Install_Packages.R") in the R console to check the installation status of the required packages. For users on Linux, another solution is to install the container dedicated to DiCoExpress, available at https://forgemia.inra.fr/GNet/dicoexpress/container_registry. This container bundles DiCoExpress with all the libraries and other dependencies it needs.
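For readers curious about what such an installation check can look like, here is a minimal base-R sketch. It is not the content of Install_Packages.R; the package list is an assumption limited to the two packages named in this article (edgeR and coseq), and the real script may check more.

```r
# Packages named in the article; the actual installer may cover a longer list.
required <- c("edgeR", "coseq")

# requireNamespace() returns TRUE when a package is installed and loadable.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

status <- data.frame(package = required, installed = unname(is_installed))
print(status)

not_installed <- required[!is_installed]
if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "),
          "\nBioconductor packages are typically installed with BiocManager::install().")
}
```

The check only reports status and installs nothing, so it is safe to run on any R session.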
1. DiCoExpress
All the DiCoExpress outputs are saved in the Tutorial/ directory, itself placed within the Results/ directory. We provide here some guidance for assessing the overall quality of the analysis.
Quality Control
The quality control output, located in the Quality_Control/ directory, is essential to verify that the RNA-Seq analysis results are reliable. The Data_Quality_Control.pdf file contains several plots obtained with raw and normalized data that can be used to identify a...
Because RNA-Seq has become a ubiquitous method in biological studies, there is a constant need to develop versatile and user-friendly analytical tools. A critical step in most analytical workflows is to identify with confidence the genes differentially expressed between biological conditions and/or treatments [15]. The production of reliable results requires proper statistical modeling, which has been the motivation for the development of DiCoExpress.
DiCo...
The authors have nothing to disclose.
This work was mainly supported by the ANR PSYCHE (ANR-16-CE20-0009). The authors thank F. Desprez for the construction of the DiCoExpress container. KB's work is supported by the Investment for the Future ANR-10-BTBR-01-01 Amaizing program. The GQE and IPS2 laboratories benefit from the support of Saclay Plant Sciences-SPS (ANR-17-EUR-0007).