Analyzing Multifactorial RNA-Seq Experiments with DiCoExpress

Kevin Baudry; Christine Paysant-Le Roux; Stefano Colella; Benoît Castandet; Marie-Laure Martin

doi:10.3791/62566

Summary

DiCoExpress is a script-based tool implemented in R to perform an RNA-Seq analysis from quality control to co-expression. DiCoExpress handles complete and unbalanced designs with up to two biological factors. This video tutorial guides the user through the different features of DiCoExpress.

Abstract

The proper use of statistical modeling in NGS data analysis requires an advanced level of expertise. There has recently been a growing consensus on using generalized linear models for differential analysis of RNA-Seq data and on the advantage of mixture models for co-expression analysis. To offer a managed setting in which to use these modeling approaches, we developed DiCoExpress, which provides a standardized R pipeline to perform an RNA-Seq analysis. Without any particular knowledge of statistics or R programming, beginners can perform a complete RNA-Seq analysis from quality controls to co-expression through differential analysis based on contrasts inside a generalized linear model. An enrichment analysis is offered for both the lists of differentially expressed genes and the co-expressed gene clusters. This video tutorial is conceived as a step-by-step protocol to help users take full advantage of DiCoExpress and its potential in empowering the biological interpretation of an RNA-Seq experiment.

Introduction

Next-generation RNA sequencing (RNA-Seq) technology is now the gold standard of transcriptome analysis¹. Since the early days of the technology, the combined efforts of bioinformaticians and biostatisticians have resulted in the development of numerous methods tackling all the essential steps of transcriptomic analyses, from mapping to transcript quantification². Most of the tools available today to the biologist are developed within the R software environment for statistical computing and graphics³, and many packages for biological data analysis are available in the Bioconductor repository⁴. These packages offer total control and customization of the analysis, but they come at the cost of extensive use of a command-line interface. Because many biologists are more comfortable with a "point and click" approach⁵, the democratization of RNA-Seq analyses requires the development of more user-friendly interfaces or protocols⁶. For example, it is possible to build web interfaces of R packages using Shiny⁷, and command-line data analysis is made more intuitive with the RStudio⁸ interface. The development of dedicated, step-by-step tutorials can also help the new user. In particular, a video tutorial supplements a classic text one, leading to a deeper understanding of all the procedure steps.

We recently developed DiCoExpress⁹, a tool for analyzing multifactorial RNA-Seq experiments in R using methods considered to be the best ones based on neutral comparison studies¹⁰,¹¹,¹². Starting from a count table, DiCoExpress proposes a data quality control step followed by a differential gene expression analysis (edgeR package¹³) using a generalized linear model (GLM) and the generation of co-expression clusters using Gaussian mixture models (coseq package¹²). DiCoExpress handles complete and unbalanced designs with up to two biological factors (e.g., genotype and treatment) and one technical factor (i.e., replicate). The originality of DiCoExpress lies in its directory architecture, which stores and organizes data, scripts, and results, and in the automated writing of the contrasts, which allows the user to investigate numerous questions within the same statistical model. An effort was also made to provide graphical outputs illustrating the statistical results.

The DiCoExpress workspace is available at https://forgemia.inra.fr/GNet/dicoexpress. It contains four directories, two PDF files, and two text files. The Data/ directory contains the input datasets; for this protocol, we will use the "tutorial" dataset. The Sources/ directory contains seven R functions necessary to perform the analysis and must not be modified by the user. The analysis is run using scripts stored in the Template_scripts/ directory. The one used in this protocol is called DiCoExpress_Tutorial_JoVE.R and can be easily adapted to any transcriptomic project. All the results are written in the Results/ directory and stored in a subdirectory named according to the project. The README.md file contains useful installation information, and any specific details concerning the method and its use can be found in the DiCoExpress_Reference_Manual.pdf file.

This video tutorial guides the user through the different features of DiCoExpress with the aim of overcoming the reluctance felt by biologists toward command-line-based tools. We present here the analysis of an artificial RNA-Seq dataset describing gene expression in three biological replicates of four genotypes, with or without treatment. We will now go through the different steps of the DiCoExpress workflow illustrated in Figure 1. The script described in the Protocol section and the input files are available at https://forgemia.inra.fr/GNet/dicoexpress.

Prepare data files
The four csv files stored in the Data/ directory should be named according to the project name. In our example, all the names therefore begin with "Tutorial", and we will set Project_Name = "Tutorial" in Step 4 of the protocol. The separator used in the csv files must be indicated in the Sep variable in Step 4. In our "tutorial" dataset, the separator is a tab. For advanced users, the full dataset can be reduced to a subset by providing a list of instructions and a new Project_Name through the Filter variable. This option avoids redundant copies of the input files and complies with the FAIR principles¹⁴.

Among the four csv files, only the COUNTS and TARGET files are mandatory. They contain the raw counts for every gene (here Tutorial_COUNTS.csv) and the experimental design description (here Tutorial_TARGET.csv). The TARGET.csv file describes every sample (one sample per row) with a modality for each biological or technical factor (in the columns). We strongly recommend that the names chosen for the modalities start with a letter, not a number. The name of the last column ("Replicate") cannot be changed. Finally, the sample names (first column) must match the names in the headings of the COUNTS.csv file (Genotype1_control_rep1 in our example). The Enrichment.csv file, in which every line contains one Gene_ID and one annotation term, is only required if the user plans to run the enrichment analysis. If one gene has several annotations, each must be written on a separate line. The Annotation.csv file is optional and is used to add a short description of every gene in the output files. The best way to get an annotation file is to retrieve the information from dedicated databases (e.g., Thalemine: https://bar.utoronto.ca/thalemine/begin.do for Arabidopsis).
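The constraints on the TARGET and COUNTS files described above can be checked before running the pipeline. The following is a minimal base R sketch, not part of DiCoExpress: the function check_design, the toy data frames, the "treated" modality, and the assumption that the first column of the COUNTS file holds the gene identifiers are all hypothetical and used only for illustration.

```r
# Toy stand-ins for Tutorial_TARGET.csv and Tutorial_COUNTS.csv (hypothetical values).
target <- data.frame(
  Sample    = c("Genotype1_control_rep1", "Genotype1_control_rep2",
                "Genotype1_treated_rep1", "Genotype1_treated_rep2"),
  Genotype  = rep("Genotype1", 4),
  Treatment = c("control", "control", "treated", "treated"),
  Replicate = c("rep1", "rep2", "rep1", "rep2"),
  stringsAsFactors = FALSE
)
counts <- data.frame(
  Gene_ID                = c("gene1", "gene2"),  # assumed: gene IDs in column 1
  Genotype1_control_rep1 = c(10L, 0L),
  Genotype1_control_rep2 = c(12L, 1L),
  Genotype1_treated_rep1 = c(30L, 0L),
  Genotype1_treated_rep2 = c(28L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one TRUE/FALSE per rule stated in the text.
check_design <- function(target, counts) {
  factor_cols <- target[, -1, drop = FALSE]
  c(
    last_column_is_replicate     = names(target)[ncol(target)] == "Replicate",
    samples_match_counts         = setequal(target[[1]], names(counts)[-1]),
    modalities_start_with_letter = all(grepl("^[A-Za-z]", unlist(factor_cols)))
  )
}

checks <- check_design(target, counts)
print(checks)
```

Any FALSE entry points to the rule that the input files do not satisfy; the real files would be read with read.table() or read.csv() using the separator given in Sep.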

Installation of DiCoExpress
DiCoExpress requires specific R packages. Run the command source("../Sources/Install_Packages.R") in the R console to check the installation status of the required packages. For users on Linux, another solution is to install the container dedicated to DiCoExpress, available at https://forgemia.inra.fr/GNet/dicoexpress/container_registry. This container includes DiCoExpress together with everything it needs, such as libraries and other dependencies.
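For readers who want to see what such an installation check amounts to, here is a minimal base R sketch. It does not reproduce the content of Install_Packages.R; it only tests for the two packages named in this article (edgeR and coseq), and the complete list of dependencies should be taken from the README.md file.

```r
# Packages named in this article; the full dependency list is an assumption
# left to README.md and Install_Packages.R.
required_packages <- c("edgeR", "coseq")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
installed <- vapply(required_packages, requireNamespace,
                    logical(1), quietly = TRUE)

missing_packages <- names(installed)[!installed]
if (length(missing_packages) > 0) {
  message("Not installed: ", paste(missing_packages, collapse = ", "))
} else {
  message("All listed packages are installed.")
}
```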

Protocol

1. DiCoExpress

Open an RStudio session and set the working directory to Template_scripts/.
Open the DiCoExpress_Tutorial_JoVE.R script in RStudio.
Load DiCoExpress functions in the R session with the following commands:
> source("../Sources/Load_Functions.R")
> Load_Functions()
> Data_Directory = "../Data"
> Results_Directory = "../Results/"
Load data files in the R session with the following commands:
> Project_Name = "Tutorial"
> Filter = NULL
> Sep="\t"
> Data_Files = Load_Data_Files(Data_Directory, Project_Name, Filter, Sep)
Split the object Data_Files into several objects to manipulate them easily:
> Project_Name = Data_Files$Project_Name
> Target = Data_Files$Target
> Raw_Counts = Data_Files$Raw_Counts
> Annotation = Data_Files$Annotation
> Reference_Enrichment = Data_Files$Reference_Enrichment
Choose a strategy among "NbConditions", "NbReplicates", or "filterByExpr" and a threshold to filter out lowly expressed genes. Here we choose
> Filter_Strategy = "NbReplicates"
> CPM_Cutoff = 1
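NOTE: The idea behind a CPM-based filter can be sketched in base R as follows. This is an illustration under stated assumptions, not the DiCoExpress implementation: the exact rule applied by each strategy is not detailed here, and the sketch assumes that a gene is kept when its counts per million (CPM) exceed CPM_Cutoff in at least as many samples as there are replicates per condition. The count matrix, the "background" row standing for the rest of the library, and the variable n_min are hypothetical.

```r
# Toy count matrix: 3 control and 3 treated samples (hypothetical values).
counts <- matrix(
  c(      0,       1,       0,       2,       1,       0,   # gene1: low everywhere
        150,     180,     160,       3,       2,       4,   # gene2: expressed in control only
        500,     450,     520,     480,     510,     495,   # gene3: expressed everywhere
    5000000, 5000000, 5000000, 5000000, 5000000, 5000000),  # rest of the library
  nrow = 4, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "gene3", "background"),
                  c("ctrl_rep1", "ctrl_rep2", "ctrl_rep3",
                    "trt_rep1", "trt_rep2", "trt_rep3"))
)

CPM_Cutoff <- 1
n_min <- 3  # assumed: number of replicates per condition

# Counts per million: scale each column by its library size.
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Keep genes whose CPM exceeds the cutoff in at least n_min samples.
keep <- rowSums(cpm > CPM_Cutoff) >= n_min
print(keep)
filtered_counts <- counts[keep, , drop = FALSE]
```

In this toy example gene1 is removed, while gene2 is retained because it passes the cutoff in the three control samples.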
Specify group colors with the command
> Color_Group = NULL
NOTE: When it is set to NULL, R automatically assigns colors to the biological conditions. Otherwise, enter a vector indicating one color per biological group.
Choose a normalization method among those accepted by the function calcNormFactors of edgeR. For example,
> Normalization_Method = "TMM"
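NOTE: The calcNormFactors function of edgeR accepts the methods "TMM", "TMMwsp", "RLE", "upperquartile", and "none". The sketch below shows, on simulated counts, what this normalization step computes when edgeR is called directly; it is an illustration only, since the way DiCoExpress calls edgeR internally is not shown in this protocol. The simulated matrix and variable names are hypothetical, and the code is skipped when edgeR is not installed.

```r
# Simulated counts: 100 genes x 4 samples (hypothetical data).
set.seed(1)
counts <- matrix(rpois(400, lambda = 50), nrow = 100,
                 dimnames = list(paste0("gene", 1:100),
                                 paste0("sample", 1:4)))

Normalization_Method <- "TMM"
norm_factors <- NULL

if (requireNamespace("edgeR", quietly = TRUE)) {
  dge <- edgeR::DGEList(counts = counts)
  dge <- edgeR::calcNormFactors(dge, method = Normalization_Method)
  # One scaling factor per sample; edgeR scales them so their product is 1.
  norm_factors <- dge$samples$norm.factors
  # Effective library sizes used downstream by edgeR.
  effective_lib_sizes <- dge$samples$lib.size * norm_factors
  print(norm_factors)
}
```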
Perform the quality control by executing the following function
> Quality_Control(Data_Directory, Results_Directory, Project_Name, Target, Raw_Counts, Filter_Strategy, Color_Group, CPM_Cutoff, Normalization_Method)
State Replicate = TRUE if data are paired according to the replicate factor, FALSE otherwise.
Assign Interaction = TRUE to consider an interaction between the two biological factors, FALSE otherwise.
Specify the statistical model with the following commands
> Model = GLM_Contrasts(Results_Directory, Project_Name, Target, Replicate, Interaction)
> GLM_Model = Model$GLM_Model
> Contrasts = Model$Contrasts
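NOTE: The effect of the Replicate and Interaction settings on the model can be pictured with a design matrix built in base R. This is a generic sketch, assuming an additive replicate effect and the factor names Genotype and Treatment; the formula and the contrasts actually written by GLM_Contrasts are not reproduced here, and build_design is a hypothetical helper.

```r
# Toy design matching the tutorial layout: 4 genotypes x 2 treatments x 3 replicates.
target <- expand.grid(
  Replicate = paste0("rep", 1:3),
  Treatment = c("control", "treated"),
  Genotype  = paste0("Genotype", 1:4),
  stringsAsFactors = TRUE
)

# Hypothetical helper: builds the design matrix for the chosen options.
build_design <- function(target, Replicate, Interaction) {
  bio <- if (Interaction) "Genotype * Treatment" else "Genotype + Treatment"
  rhs <- if (Replicate) paste("Replicate +", bio) else bio
  model.matrix(as.formula(paste("~", rhs)), data = target)
}

design_full     <- build_design(target, Replicate = TRUE, Interaction = TRUE)
design_additive <- build_design(target, Replicate = TRUE, Interaction = FALSE)

# The interaction adds one coefficient per (genotype, treatment) combination
# beyond the reference levels.
print(colnames(design_full))
print(ncol(design_full) - ncol(design_additive))
```

With four genotypes and two treatments, the interaction contributes three additional coefficients, which is why the number of contrasts available grows when Interaction = TRUE.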
Define the threshold of the False Discovery Rate, here 0.05
> Alpha_DiffAnalysis = 0.05
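NOTE: Controlling the False Discovery Rate means that the threshold is applied to adjusted p-values, not to raw ones. The base R sketch below assumes the Benjamini-Hochberg adjustment, which is the default in edgeR's topTags function; this protocol does not state which adjustment DiCoExpress applies, and the p-values are invented for illustration.

```r
# Hypothetical raw p-values for six genes.
p_values <- c(gene1 = 0.001, gene2 = 0.008, gene3 = 0.039,
              gene4 = 0.041, gene5 = 0.20,  gene6 = 0.74)

Alpha_DiffAnalysis <- 0.05

# Benjamini-Hochberg adjustment (assumed method).
adjusted <- p.adjust(p_values, method = "BH")

# A gene is declared differentially expressed when its adjusted p-value
# does not exceed the chosen threshold.
is_deg <- adjusted <= Alpha_DiffAnalysis
print(round(adjusted, 4))
print(names(is_deg)[is_deg])
```

In this example gene3 and gene4 have raw p-values below 0.05 but are not retained after adjustment.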
Perform the differential analysis with the following commands
> Index_Contrast=1:nrow(Contrasts)
> NbGenes_Profiles = 20
> NbGenes_Clustering = 50
> DiffAnalysis.edgeR(Data_Directory, Results_Directory, Project_Name, Target, Raw_Counts, GLM_Model, Contrasts, Index_Contrast, Filter_Strategy, Alpha_DiffAnalysis, NbGenes_Profiles, NbGenes_Clustering, CPM_Cutoff, Normalization_Method)
Fix a threshold for the enrichment analysis, here 0.01
> Alpha_Enrichment = 0.01
Perform the enrichment analysis of differentially expressed genes (DEG) lists
> Title = NULL
> Enrichment(Results_Directory, Project_Name, Title, Reference_Enrichment, Alpha_Enrichment)
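NOTE: An over-representation test of the kind commonly used for enrichment analysis can be sketched in base R with the hypergeometric distribution. The protocol does not state which test the Enrichment function uses, so the hypergeometric test is an assumption here, and the function enrichment_pvalue and the gene numbers are hypothetical.

```r
# Hypothetical helper: probability of observing at least n_deg_with_term
# annotated genes in a DEG list of size n_deg, given that n_ref_with_term
# of the n_ref reference genes carry the annotation term.
enrichment_pvalue <- function(n_deg_with_term, n_deg, n_ref_with_term, n_ref) {
  phyper(n_deg_with_term - 1,
         n_ref_with_term,
         n_ref - n_ref_with_term,
         n_deg,
         lower.tail = FALSE)
}

Alpha_Enrichment <- 0.01

# Invented example: 12 of 100 DEGs carry a term found in 200 of 10000 genes.
p <- enrichment_pvalue(n_deg_with_term = 12, n_deg = 100,
                       n_ref_with_term = 200, n_ref = 10000)
print(p)
print(p < Alpha_Enrichment)
```

Two annotated genes would be expected by chance in this example, so observing twelve yields a p-value well below the threshold.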
Choose the DEG lists to be compared. For example,
> Groups = Contrasts$Contrasts[24:28]
Provide a name for the list comparison. This name is used for the directory where the output files will be saved
> Title = "Interaction_with_Genotypes_1_and_2"
Specify the action to be done on the DEG lists by setting the parameter Operation to union or intersection. We choose
> Operation = "Union"
Compare the DEG lists
> Venn_IntersectUnion(Data_Directory, Results_Directory, Project_Name, Title, Groups, Operation)
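NOTE: The two operations correspond to ordinary set operations on the gene identifiers of the selected DEG lists. The base R sketch below illustrates them on invented lists; it does not reproduce Venn_IntersectUnion, and the list names and gene identifiers are hypothetical.

```r
# Hypothetical DEG lists, one per contrast.
deg_lists <- list(
  contrast_A = c("gene1", "gene2", "gene3"),
  contrast_B = c("gene2", "gene3", "gene4"),
  contrast_C = c("gene3", "gene5")
)

# Union: genes differentially expressed in at least one contrast.
deg_union <- Reduce(union, deg_lists)

# Intersection: genes differentially expressed in every contrast.
deg_intersection <- Reduce(intersect, deg_lists)

print(deg_union)
print(deg_intersection)
```

The union gives the broadest gene set for a subsequent co-expression analysis, whereas the intersection keeps only the genes shared by all the compared contrasts.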
Perform a co-expression analysis with the function
> Coexpression_coseq(Data_Directory, Results_Directory, Project_Name, Title, Target, Raw_Counts, Color_Group)
Perform enrichment analysis of the co-expression clusters
> Enrichment(Results_Directory, Project_Name, Title, Reference_Enrichment, Alpha_Enrichment)
Generate two log files containing all the necessary information to reproduce the analysis
> Save_Parameters()
NOTE: Command lines used in this protocol are shown in Figure 2. Lines that have to be modified to analyze another dataset are highlighted.

Results

All the DiCoExpress outputs are saved in the Tutorial/ directory, itself placed within the Results/ directory. We provide here some guidance for assessing the overall quality of the analysis.

Quality Control
The quality control output, located in the Quality_Control/ directory, is essential to verify that the RNA-Seq analysis results are reliable. The Data_Quality_Control.pdf file contains several plots obtained with raw and normalized data that can be used to identify a...

Discussion

Because RNA-Seq has become a ubiquitous method in biological studies, there is a constant need to develop versatile and user-friendly analytical tools. A critical step within most of the analytical workflows is often to identify with confidence the genes differentially expressed between biological conditions and/or treatments¹⁵. The production of reliable results requires proper statistical modeling, which has been the motivation for the development of DiCoExpress.

DiCo...

Disclosures

The authors have nothing to disclose.

Acknowledgements

This work was mainly supported by the ANR PSYCHE (ANR-16-CE20-0009). The authors thank F. Desprez for the construction of the DiCoExpress container. KB's work is supported by the Investment for the Future ANR-10-BTBR-01-01 Amaizing program. The GQE and IPS2 laboratories benefit from the support of Saclay Plant Sciences-SPS (ANR-17-EUR-0007).

References

Wang, Z., Gerstein, M., Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics. 10 (1), 57-63 (2009).
Yang, I. S., Kim, S. Analysis of Whole Transcriptome Sequencing Data: Workflow and Software. Genomics & Informatics. 13 (4), 119-125 (2015).
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. (2020).
Huber, W., et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods. 12 (2), 115-121 (2015).
Smith, D. R. The battle for user-friendly bioinformatics. Frontiers in Genetics. 4, 187 (2013).
Pavelin, K., Cham, J. A., de Matos, P., Brooksbank, C., Cameron, G., Steinbeck, C. Bioinformatics Meets User-Centred Design: A Perspective. PLoS Computational Biology. 8 (7), 1002554 (2012).
Shiny: web application framework. Available from: https://rdrr.io/cran/shiny/ (2021).
Lambert, I., Paysant-Le Roux, C., Colella, S., Martin-Magniette, M.-L. DiCoExpress: a tool to process multifactorial RNAseq experiments from quality controls to co-expression analysis through differential analysis based on contrasts inside GLM models. Plant Methods. 16 (1), 68 (2020).
Dillies, M.-A., et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics. 14 (6), 671-683 (2012).
Rigaill, G. Synthetic data sets for the identification of key ingredients for RNA-seq differential analysis. Briefings in Bioinformatics. 19 (1), (2016).
Rau, A., Maugis-Rabusseau, C. Transformation and model choice for RNA-seq co-expression analysis. Briefings in Bioinformatics. 19 (3), (2017).
Robinson, M. D., McCarthy, D. J., Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 26 (1), 139-140 (2009).
Wilkinson, M. D., et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 3 (1), 160018 (2016).
Stark, R., Grzelak, M., Hadfield, J. RNA sequencing: the teenage years. Nature Reviews Genetics. 20 (11), 631-656 (2019).
