Three Differential Expression Analysis Methods for RNA Sequencing: limma, EdgeR, DESeq2

Shiyi Liu; Zitao Wang; Ronghui Zhu; Feiyan Wang; Yanxiang Cheng; Yeqiang Liu

doi:10.3791/62528


Method Article


September 18th, 2021

Shiyi Liu*¹, Zitao Wang*¹, Ronghui Zhu¹, Feiyan Wang², Yanxiang Cheng¹, Yeqiang Liu²

¹Department of Obstetrics and Gynecology, Renmin Hospital of Wuhan University, ²Department of Pathology, Shanghai Skin Disease Hospital, Tongji University School of Medicine

* These authors contributed equally to this work.


Summary

This article provides a detailed protocol for three differential expression analysis methods for RNA sequencing: limma, edgeR, and DESeq2.

Abstract

RNA sequencing (RNA-seq) is one of the most widely used technologies in transcriptomics: it can reveal the relationship between genetic alterations and complex biological processes, and it has great value in the diagnosis, prognosis, and treatment of tumors. Differential analysis of RNA-seq data is crucial for identifying aberrant transcription, and limma, edgeR and DESeq2 are efficient tools for this analysis. However, RNA-seq differential analysis requires some proficiency in the R language and the ability to choose an appropriate method, both of which are lacking in the medical education curriculum.

Herein, we provide a detailed protocol to identify differentially expressed genes (DEGs) between cholangiocarcinoma (CHOL) and normal tissues using limma, DESeq2 and edgeR, with the results shown in volcano plots and Venn diagrams. The three protocols are similar but differ in several steps of the analysis. For example, limma uses a linear model for its statistics, while edgeR and DESeq2 use the negative binomial distribution. Additionally, an explicit normalization step on the RNA-seq count data is necessary for edgeR and limma but not for DESeq2, which normalizes the counts internally.

Here, we provide a detailed protocol for three differential analysis methods: limma, edgeR and DESeq2. The results of the three methods partly overlap. Each method has its own advantages, and the choice of method depends on the data.

Introduction

RNA sequencing (RNA-seq) is one of the most widely used technologies in transcriptomics, with many advantages (e.g., high data reproducibility), and it has dramatically increased our understanding of the functions and dynamics of complex biological processes¹,². Identifying aberrant transcripts under different biological contexts, known as differentially expressed genes (DEGs), is a key step in RNA-seq analysis. RNA-seq makes it possible to gain a deep understanding of the molecular mechanisms and biological functions related to pathogenesis. Therefore, differential analysis has been regarded as valuable for the diagnosis, prognosis and treatment of tumors³,⁴,⁵. Currently, many open-source R/Bioconductor packages have been developed for RNA-seq differential expression analysis, particularly limma, DESeq2 and edgeR¹,⁶,⁷. However, differential analysis requires some proficiency in the R language and the ability to choose the appropriate method, both of which are lacking in the medical education curriculum.

In this protocol, based on the cholangiocarcinoma (CHOL) RNA-seq count data extracted from The Cancer Genome Atlas (TCGA), three of the best-known methods (limma⁸, edgeR⁹ and DESeq2¹⁰) were each run in R¹¹ to identify the DEGs between CHOL and normal tissues. The three protocols are similar but differ in several steps of the analysis. For example, an explicit normalization step on the RNA-seq count data is necessary for edgeR and limma⁸,⁹, whereas DESeq2 corrects for differences between libraries internally rather than requiring a separate normalization step¹⁰. Furthermore, edgeR is designed specifically for RNA-seq data, while limma is used for both microarrays and RNA-seq. limma adopts a linear model to assess the DEGs¹², while the statistics in edgeR are based on the negative binomial distribution and include empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests⁹.
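The contrast between the two model families can be stated compactly. The following is only a sketch, and the notation is an assumption introduced here for illustration rather than taken from the protocol: Y_{gi} is the raw count of gene g in sample i, y_{gi} is its log-CPM value, x_i is the design-matrix row of sample i, and \phi_g is the gene-wise dispersion.

```latex
% limma (limma-trend): linear model fitted to log-CPM values
\mathrm{E}[y_{gi}] = x_i^{\top}\beta_g, \qquad \mathrm{Var}(y_{gi}) = \sigma_g^{2}

% edgeR and DESeq2: negative binomial model fitted to raw counts
Y_{gi} \sim \mathrm{NB}(\mu_{gi}, \phi_g), \qquad
\mathrm{Var}(Y_{gi}) = \mu_{gi} + \phi_g\,\mu_{gi}^{2}
```

When \phi_g = 0 the negative binomial variance reduces to the Poisson variance \mu_{gi}; the additional term accounts for the variability between biological replicates.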

In summary, we provide detailed protocols for RNA-seq differential expression analysis using limma, DESeq2 and edgeR. By referring to this article, users can readily perform RNA-seq differential analysis and choose the appropriate method for their data.

Protocol

NOTE: Open the RStudio program and load the R file "DEGs.R", which can be acquired from Supplementary files/Scripts.

1. Downloading and pre-processing of data

Download the high-throughput sequencing (HTSeq) count data of cholangiocarcinoma (CHOL) from The Cancer Genome Atlas (TCGA). This step can be easily achieved by the following R code.
1. Click Run to install R packages.
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install(c("TCGAbiolinks", "SummarizedExperiment"))
2. Click Run to load R packages.
  library(TCGAbiolinks)
  library(SummarizedExperiment)
3. Set the working directory.
  setwd("C:/Users/LIUSHIYI/Desktop")
4. Choose the cancer type.
  cancer <- "TCGA-CHOL"
5. Run the R code from the "GDCquery.R" file to download the data. The file "GDCquery.R" can be acquired from Supplementary files/Scripts:
  source("Supplementary files/Scripts/GDCquery.R")
  head(cnt)
  ##TCGA-3X-AAVA-01A-11R-A41I-07
  ##ENSG00000000003 4262
  ##ENSG00000000005 1
  ##ENSG00000000419 1254
  ##ENSG00000000457 699
  ##ENSG00000000460 239
  ##ENSG00000000938 334
  NOTE: After execution, the CHOL HTSeq count data will be downloaded and stored in "cnt", where rows represent Ensembl gene IDs and columns represent sample IDs. Note the digits at positions 14-15 of each sample ID: values from 01 to 09 indicate tumors and values from 10 to 19 indicate normal tissues.
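The sample-type rule in the NOTE above can be sketched in base R as follows. This is a minimal illustration, not part of the supplementary scripts: the helper name classify_tcga_sample is hypothetical, and the second barcode is made up to show a normal-tissue code.

```r
# Hypothetical helper: read the two-digit sample-type code at positions 14-15
# of a TCGA barcode and map it to a tissue label.
classify_tcga_sample <- function(barcodes) {
  code <- as.integer(substring(barcodes, 14, 15))
  ifelse(code >= 1 & code <= 9, "Cancer",
         ifelse(code >= 10 & code <= 19, "Normal", NA_character_))
}

barcodes <- c("TCGA-3X-AAVA-01A-11R-A41I-07",  # sample ID shown in the protocol output
              "TCGA-XX-0000-11A-11R-A41I-07")  # made-up ID with a normal-tissue code
sample_type <- classify_tcga_sample(barcodes)
print(table(sample_type))
```

Codes outside 01-19 are returned as NA so that unexpected sample types are not silently assigned to either group.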
Convert Ensembl gene IDs to gene symbols.
1. Import the annotation file into R according to its storage path. The annotation file (gencode.v22.annotation.gtf) can be acquired from Supplementary files.
  gtf_v22 <- rtracklayer::import('Supplementary files/gencode.v22.annotation.gtf')
2. Run the R code from the "gtf_v22.R" file, which can be acquired from Supplementary files/Scripts:
  source("Supplementary files/Scripts/gtf_v22.R")
3. Apply the function "ann" to convert the Ensembl gene IDs to gene symbols.
  cnt=ann(cnt,gtf_v22)
Filter low-expressed genes.
1. Click Run to install the R package "edgeR".
  BiocManager::install("edgeR")
2. Click Run to load the R package "edgeR".
  library(edgeR)
3. Run the following R code to keep genes with counts per million (CPM) values greater than one in at least two samples.
  keep <- rowSums(cpm(cnt)>1)>=2
  cnt <- as.matrix(cnt[keep,])
  NOTE: The counts per million (CPM) value is used instead of the read count to eliminate the deviation caused by different sequencing depths.
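To make the CPM filter concrete, the following base R sketch computes CPM by hand on a made-up count matrix and applies the same rule (CPM > 1 in at least two samples). The toy counts and the name cpm_manual are assumptions for illustration; in the protocol itself the values come from the "cpm()" function of "edgeR".

```r
# Made-up counts: 3 genes (rows) x 3 samples (columns), filled column by column.
counts <- matrix(c(10, 0, 999990,
                   20, 1, 1999979,
                   0,  0, 500000),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2", "s3")))

# Library size = total count per sample.
lib_size <- colSums(counts)

# CPM = count / library size * 1e6, computed column-wise.
cpm_manual <- sweep(counts, 2, lib_size, "/") * 1e6

# Keep genes with CPM > 1 in at least two samples.
keep <- rowSums(cpm_manual > 1) >= 2
print(round(cpm_manual, 1))
print(keep)
```

Here geneA has only 10 and 20 reads, yet it passes the filter because both values correspond to 10 CPM once the different library sizes are taken into account, whereas geneB never exceeds 1 CPM and is removed.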

2. Differential expression analysis through "limma"

Click Run to install the R package "limma".
BiocManager::install("limma")
Click Run to load the R packages "limma", "edgeR".
library(limma)
library(edgeR)
Run the following R code to create the design matrix.
group <- substring(colnames(cnt),14,15) # Extract group information
group [group %in% "01"] <- "Cancer" # set '01' as tumor tissue
group [group %in% "11"] <- "Normal" # set '11' as normal tissue
group <- factor (group, levels = c("Normal","Cancer"))
1. Create the design matrix.
  design <- model.matrix (~group)
  rownames(design) <- colnames(cnt)
2. Create the DGEList object.
  dge <- DGEList(counts = cnt, group = group)
3. Normalize the data.
  dge <- calcNormFactors(dge, method = "TMM")
4. Inspect the DGEList object before running the limma-trend differential expression analysis in the following steps.
  dge
  ##An object of class "DGEList"
  ##$counts
  ##TCGA-3X-AAVA-01A-11R-A41I-07
  ##TSPAN6 4262
  ##DPM1 1254
  ##SCYL3 699
  ##C1orf112 239
  ##FGR 334
5. Calculate the CPM value.
  logdge <- cpm(dge, log=TRUE, prior.count=3)
6. Click Run to fit a linear model for each gene.
  fit <- lmFit (logdge, design)
7. Compute the moderated t-statistics, moderated F-statistics and log-odds of differential expression by empirical Bayes moderation.
  fit <- eBayes(fit, trend=TRUE)
8. Extract the result table.
  res_limma<- as.data.frame(topTable(fit,n=Inf))
  
  head(res_limma)
  ## logFC AveExpr t P.Value adj.P.Val B
  ##RP11-252E2.2 -4.899493 -2.488589 -20.88052 2.386656e-25 4.931786e-21 47.28823
  ##BX842568.1 -4.347930 -2.595205 -20.14532 1.082759e-24 1.118706e-20 45.83656
  ##CTC-537E7.3 -5.154894 -2.143292 -19.59571 3.452354e-24 2.216114e-20 44.72001
  ##RP11-468N14.3 -6.532259 -2.029714 -19.49409 4.289807e-24 2.216114e-20 44.51056
  ##AP006216.5 -4.507051 -2.670915 -19.25649 7.153356e-24 2.956339e-20 44.01704
  ##RP11-669E14.4 -4.107204 -2.828311 -18.93246 1.448209e-23 4.987633e-20 43.33543
  #The result of the differential expression analysis is saved in "res_limma", which includes the gene ID, the log2 fold change (logFC), the average log2 expression level of the gene in the experiment (AveExpr), the moderated t-statistic (t), the corresponding p value (P.Value), the false discovery rate (FDR) corrected p value (adj.P.Val) and the log-odds that the gene is differentially expressed (B)
  NOTE: The function "calcNormFactors()" of "edgeR" is used to normalize the data to reduce the influence of sample preparation, library construction and sequencing. When constructing the design matrix, the experimental design (e.g., tissue type: normal or tumor) must be matched to the sample IDs of the count matrix. limma-trend is suitable when the sequencing depth is similar across samples, while limma-voom is preferable when (i) the library sizes differ between samples; (ii) the data are not normalized by TMM; or (iii) the data are noisy. A positive logFC means that the gene is up-regulated in tumor relative to normal tissue, while a negative logFC means that it is down-regulated.
9. Identify the DEGs.
  res_limma$sig <- as.factor(
  ifelse(res_limma$adj.P.Val < 0.05 & abs(res_limma$logFC) > 2,
  ifelse(res_limma$logFC > 2 ,'up','down'),'not')) # DEG thresholds: adj.P.Val < 0.05 and |log2FC| > 2
  summary(res_limma$sig)
  ##down not up
  ##1880 17341 1443
10. Output the result table to a file.
  write.csv(res_limma, file = 'result_limma.csv')
11. Click Run to install the R package "ggplot2".
  install.packages("ggplot2")
12. Click Run to load the R package "ggplot2".
  library(ggplot2)
13. Run the R code from the "volcano.R" file to create the volcano plot. The file "volcano.R" can be acquired from Supplementary files/Scripts.
  source("Supplementary files/Scripts/volcano.R")
  volcano(res_limma,"logFC","adj.P.Val",2,0.05)
  NOTE: Genes are positioned according to their log2FC and adjusted p values. The up-regulated DEGs (log2FC > 2, adj p < 0.05) are colored in red, the down-regulated DEGs (log2FC < -2, adj p < 0.05) are colored in green, and non-DEGs are colored in grey.
14. Click Export to save the volcano plot.
  NOTE: The volcano plots can be generated and downloaded in different formats (e.g., PDF, TIFF, PNG, JPEG).
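The NOTE in step 8 distinguishes limma-trend from limma-voom by how much the library sizes differ. A quick check of this can be sketched in base R as below. The library sizes are made up, and the 3-fold cut-off is an assumed rule of thumb in the spirit of the limma User's Guide, not a threshold stated in this protocol.

```r
# Made-up library sizes (total reads per sample), for illustration only.
lib_size <- c(sample1 = 21e6, sample2 = 30e6, sample3 = 48e6)

# Ratio of the largest to the smallest library size.
size_ratio <- max(lib_size) / min(lib_size)

# Assumed rule of thumb: up to about 3-fold variation favors limma-trend,
# larger variation favors limma-voom.
suggested_method <- if (size_ratio <= 3) "limma-trend" else "limma-voom"

cat(sprintf("Library-size ratio: %.2f -> %s\n", size_ratio, suggested_method))
```

With the DGEList object created in step 2, the real library sizes are stored in dge$samples$lib.size and can be substituted for the made-up vector.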

3. Differential expression analysis through "edgeR"

Click Run to load the R package "edgeR".
library(edgeR)
Run the following R code to create design matrix.
group <-substring(colnames(cnt),14,15)
group [group %in% "01"] <- "Cancer"
group [group %in% "11"] <- "Normal"
group=factor(group, levels = c("Normal","Cancer"))
design <-model.matrix(~group)
rownames(design) = colnames(cnt)
Click Run to create the DGEList object.
dge <- DGEList(counts=cnt)
Normalize the data.
dge <- calcNormFactors(dge, method = "TMM")
Click Run to estimate the dispersion of gene expression values.
dge <- estimateDisp(dge, design, robust = TRUE)
Click Run to fit model to count data.
fit <- glmQLFit(dge, design)
Conduct a statistical test.
fit <- glmQLFTest(fit)
Extract the result table. The result is saved in "res_edgeR", which includes the log fold change value, log CPM, F, p value and FDR corrected p value.
res_edgeR=as.data.frame(topTags(fit, n=Inf))
head(res_edgeR)
## logFC logCPM F PValue FDR
##GCDH -3.299633 5.802700 458.5991 1.441773e-25 2.979280e-21
##MSMO1 -3.761400 7.521111 407.0416 1.730539e-24 1.787993e-20
##RCL1 -3.829504 5.319641 376.5043 8.652474e-24 5.516791e-20
##ADI1 -3.533664 8.211281 372.6671 1.067904e-23 5.516791e-20
##KCNN2 -5.583794 3.504017 358.6525 2.342106e-23 9.679455e-20
##GLUD1 -3.287447 8.738080 350.0344 3.848408e-23 1.194406e-19
Identify the DEGs.
res_edgeR$sig = as.factor(
ifelse(res_edgeR$FDR < 0.05 & abs(res_edgeR$logFC) > 2,
ifelse(res_edgeR$logFC > 2 ,'up','down'),'not'))
summary(res_edgeR$sig)
##down not up
##1578 15965 3121
Output the result table to a file.
write.csv(res_edgeR, file = 'res_edgeR.csv')
Create the volcano plot.
volcano(res_edgeR,"logFC","FDR",2,0.05)
Click Export to save the volcano plot.
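The direction of logFC depends on the order of the factor levels used to build the design matrix. The base R sketch below, using a made-up four-sample grouping, shows that with levels = c("Normal","Cancer") the last design column is "groupCancer"; "glmQLFTest()" tests the last column by default, so a positive logFC means higher expression in tumor than in normal tissue.

```r
# Made-up grouping for four samples; "Normal" is the reference (first) level.
group <- factor(c("Cancer", "Normal", "Cancer", "Normal"),
                levels = c("Normal", "Cancer"))

# Same design formula as in the protocol.
design <- model.matrix(~group)
print(design)

# The last column is the coefficient tested by default: Cancer versus Normal.
tested_coef <- colnames(design)[ncol(design)]
print(tested_coef)
```

Reversing the level order would flip the sign of every logFC, so it is worth printing the design matrix once before interpreting the up- and down-regulated gene lists.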

4. Differential expression analysis through "DESeq2"

Click Run to install the R package "DESeq2".
BiocManager::install("DESeq2")
Click Run to load the R package "DESeq2".
library(DESeq2)
Run the following R code to determine the grouping factor.
group <-substring(colnames(cnt),14,15)
group [group %in% "01"] <- "Cancer"
group [group %in% "11"] <- "Normal"
group=factor(group, levels = c("Normal","Cancer"))
Create the DESeqDataSet object.
dds <-DESeqDataSetFromMatrix (cnt, DataFrame(group), design = ~group)
dds
##class: DESeqDataSet
##dim: 20664 45
##metadata(1): version
##assays(1): counts
##rownames(20664): TSPAN6 DPM1 ... RP11-274B21.13 LINC01144
##rowData names(0):
##colnames(45): TCGA-3X-AAVA-01A-11R-A41I-07 ...
##colData names(1): group
Perform the analysis.
dds <- DESeq(dds)
Generate the result table.
res_DESeq2 <- data.frame(results(dds))

head(res_DESeq2)
## baseMean log2FoldChange lfcSE stat pvalue padj
##TSPAN6 4704.9243 -0.8204515 0.3371667 -2.433370 1.495899e-02 2.760180e-02
##DPM1 1205.9087 -0.3692497 0.1202418 -3.070894 2.134191e-03 4.838281e-03
##SCYL3 954.9772 0.2652530 0.2476441 1.071106 2.841218e-01 3.629059e-01
##C1orf112 277.7756 0.7536911 0.2518929 2.992109 2.770575e-03 6.101584e-03
##FGR 345.8789 -0.6423198 0.3712729 -1.730047 8.362180e-02 1.266833e-01
##CFH 27982.3546 -3.8761382 0.5473363 -7.081823 1.422708e-12 1.673241e-11
NOTE: The result is saved in "res_DESeq2", which includes the mean of the normalized read counts (baseMean), the log2 fold change (log2FoldChange), the standard error of the log2 fold change (lfcSE), the Wald statistic (stat), the raw p value (pvalue) and the adjusted p value (padj).
Identify DEGs.
res_DESeq2$sig = as.factor(
ifelse(res_DESeq2$padj < 0.05 & abs(res_DESeq2$log2FoldChange) > 2,
ifelse(res_DESeq2$log2FoldChange > 2 ,'up','down'),'not'))
summary(res_DESeq2$sig)
##down not up
##1616 16110 2938
Output the result table to a file.
write.csv(res_DESeq2, file = 'res_DESeq2.csv')
Create the volcano plot.
volcano(res_DESeq2,"log2FoldChange","padj",2,0.05)
Click Export to save the volcano plot.
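DESeq2 can report NA in the padj column (for example, for genes removed by its independent filtering), in which case the nested ifelse() used above returns NA for those genes. The sketch below is one possible variant, run on a made-up result table: the helper name classify_deg is hypothetical, and labelling NA genes as "not" is an assumption made here, not a choice stated in the protocol.

```r
# Hypothetical helper: label genes as 'up', 'down' or 'not', treating
# missing values as not differentially expressed.
classify_deg <- function(lfc, padj, lfc_cut = 2, p_cut = 0.05) {
  significant <- !is.na(padj) & !is.na(lfc) & padj < p_cut & abs(lfc) > lfc_cut
  sig <- ifelse(significant, ifelse(lfc > 0, "up", "down"), "not")
  factor(sig, levels = c("down", "not", "up"))
}

# Made-up result table with the DESeq2 column names used in the protocol.
toy <- data.frame(
  log2FoldChange = c(3.1, -2.5, 0.4, 4.0, -3.0),
  padj           = c(0.001, 0.01, 0.20, NA, 0.30),
  row.names      = c("geneA", "geneB", "geneC", "geneD", "geneE")
)

toy$sig <- classify_deg(toy$log2FoldChange, toy$padj)
print(summary(toy$sig))
```

The same helper could be applied to the limma and edgeR tables by passing their own fold-change and adjusted p value columns.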

5. Venn diagram

Click Run to install the R package "VennDiagram".
install.packages("VennDiagram")
Click Run to load the R package "VennDiagram".
library (VennDiagram)
Make a Venn diagram of up-regulated DEGs.
grid.newpage()
grid.draw(venn.diagram(list(Limma=rownames(res_limma[res_limma$sig=="up",]),
edgeR=rownames(res_edgeR[res_edgeR$sig=="up",]),
DESeq2=rownames(res_DESeq2[res_DESeq2$sig=="up",])),
filename = NULL,height = 3,width = 3,units = "in",
col="black",lwd=0.3,fill=c("#FF6666","#FFFF00","#993366"),
alpha=c(0.5, 0.5, 0.5),main = "Up-regulated DEGs"))
Click Export to save the Venn diagram.
Make a Venn diagram of down-regulated DEGs.
grid.newpage()
grid.draw(venn.diagram(list(Limma=rownames(res_limma[res_limma$sig=="down",]),
edgeR=rownames(res_edgeR[res_edgeR$sig=="down",]),
DESeq2=rownames(res_DESeq2[res_DESeq2$sig=="down",])),
filename = NULL,height = 3,width = 3,units = "in",
col="black",lwd=0.3,fill=c("#FF6666","#FFFF00","#993366"),
alpha=c(0.5, 0.5, 0.5),main = "Down-regulated DEGs"))
Click Export to save the Venn diagram.
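The numbers shown in a Venn diagram can also be obtained directly with base R set operations, which is convenient for reporting the overlap in text. The gene lists below are made up for illustration; in practice each vector would be the row names of the "up" (or "down") genes from the corresponding result table.

```r
# Made-up lists of up-regulated genes from the three methods.
deg_sets <- list(
  limma  = c("geneA", "geneB", "geneC"),
  edgeR  = c("geneB", "geneC", "geneD", "geneE"),
  DESeq2 = c("geneB", "geneC", "geneE", "geneF")
)

# Genes reported by all three methods and by at least one method.
shared_by_all <- Reduce(intersect, deg_sets)
found_by_any  <- Reduce(union, deg_sets)

# Number of genes reported by one method only.
unique_counts <- sapply(names(deg_sets), function(method) {
  others <- unlist(deg_sets[names(deg_sets) != method])
  length(setdiff(deg_sets[[method]], others))
})

print(shared_by_all)
print(length(found_by_any))
print(unique_counts)
```

The shared set is a natural candidate list for follow-up, since those genes are supported by all three statistical approaches.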

Results

There are various approaches to visualizing the results of differential expression analysis, among which the volcano plot and Venn diagram are commonly used. limma identified 3323 DEGs between the CHOL and normal tissues with |logFC| > 2 and adj.P.Val < 0.05 as thresholds, of which 1880 were down-regulated in CHOL tissues and 1443 were up-regulated (Figure 1a). Meanwhile, edgeR identified 1578 down-regulated DEGs and 3121 up-regulated DEGs (Figure 1b

Discussion

Abundant aberrant transcripts in cancers can be easily identified by RNA-seq differential analysis⁵. However, the application of RNA-seq differential expression analysis is often restricted because it requires some proficiency in the R language and the capacity to choose appropriate methods. To address this problem, we provide a detailed introduction to the three best-known methods (limma, edgeR and DESeq2) and tutorials for applying RNA-seq differential expression analysis. This will facilitate the un...

Disclosures

The manuscript has not been published before and is not being considered for publication elsewhere. All authors have contributed to the creation of this manuscript for important intellectual content and read and approved the final manuscript. We declare there is no conflict of interest.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 81860276) and Key Special Fund Projects of National Key R&D Program (Grant No. 2018YFC1003200).

Materials

Name	Company	Catalog Number	Comments
R		version 3.6.2	free software
RStudio			free software

References

1. Tambonis, T., Boareto, M., Leite, V. B. P. Differential Expression Analysis in RNA-seq Data Using a Geometric Approach. Journal of Computational Biology. 25, 1257-1265 (2018).
2. Wang, Z., Gerstein, M., Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 10, 57-63 (2009).
3. Anders, S., et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols. 8, 1765-1786 (2013).
4. McDermaid, A., Monier, B., Zhao, J., Liu, B., Ma, Q. Interpretation of differential gene expression results of RNA-seq data: review and integration. Briefings in Bioinformatics. 20, 2044-2054 (2019).
5. Costa-Silva, J., Domingues, D., Lopes, F. M. RNA-Seq differential expression analysis: An extended review and a software tool. PLoS One. 12, e0190152 (2017).
6. Law, C. W., et al. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Research. 5, (2016).
7. Varet, H., Brillet-Guéguen, L., Coppée, J. Y., Dillies, M. A. SARTools: A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data. PLoS One. 11, e0157022 (2016).
8. Ritchie, M. E., et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 43, e47 (2015).
9. Robinson, M. D., McCarthy, D. J., Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 26, 139-140 (2010).
10. Love, M. I., Huber, W., Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 15, 550 (2014).
11. Gentleman, R. C., et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. 5, R80 (2004).
12. Law, C. W., Chen, Y., Shi, W., Smyth, G. K. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology. 15, R29 (2014).
13. Smyth, G. K. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 3, (2004).
14. Lund, S. P., Nettleton, D., McCarthy, D. J., Smyth, G. K. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology. 11, (2012).
15. Reeb, P. D., Steibel, J. P. Evaluating statistical analysis models for RNA sequencing experiments. Frontiers in Genetics. 4, 178 (2013).
16. Rocke, D. M., et al. Excess False Positive Rates in Methods for Differential Gene Expression Analysis using RNA-Seq Data. bioRxiv. (2015).
17. Agarwal, A., et al. Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays. BMC Genomics. 11, 383 (2010).
18. Leng, N., et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 29, 1035-1043 (2013).


Keywords: Differential Expression Analysis, RNA Sequencing, Limma, EdgeR, DESeq2, High-throughput Sequencing, Cholangiocarcinoma, Cancer Genome Atlas, RStudio, Data Pre-processing, Gene Symbols, Ensembl Gene IDs, Annotation File, Low-expressed Genes, DGEList Object, Design Matrix, Linear Model
