This protocol provides a comprehensive understanding of gene isoforms generated by alternative splicing and polyadenylation by providing a step-by-step workflow to identify differential splicing sites, differentially-expressed exons, and poly(A)sites. The main advantage of this protocol is that it evaluates both exon-based and event-based methods for studying alternative splicing. It also applies exon-based method to study alternative polyadenylation.
The R Markdown files that include the codes and notes for AS and AP analysis have been provided. It would be advisable to follow the steps in the R Markdown file and reach the note for each step carefully. To identify differential splicing using diffSplice from limma, follow the R notebook file.
Prepare the input files as described in the text manuscript. Ensure steps one through three in the manuscript have been followed sequentially to prepare input files before proceeding further. Begin by loading the necessary libraries.
To perform non-specific filtering, first extract the matrix of read counts obtained previously and create a list of features using the DGEList function from the edgeR package, where rows represent genes and columns represent samples. Then, transform the data from raw scale to counts per million using the CPM function from the edgeR package, and keep exons with counts greater than a settable threshold. This data set contains six samples.
Hence, the CPM is set at greater than one and at least three samples out of six. Normalize the counts across samples with the calcNormFactors function from the edgeR package using Trimmed Mean of M values. This function will compute scaling factors to adjust library sizes.
Use the previously-generated sample table to create the design matrix to define the experimental conditions for each sample. Run the voom function of the limma package to process RNA sequencing data to estimate the variance. This function will generate precision weights to correct for Poisson count noise and transform the exon level counts to log two counts per million or logCPM.
Run the lmfit function to fit linear models to the expression data for each exon. Then run the eBayes function to compute empirical-based statistics for the fitted model to detect differential exon expression. Define a contrast matrix for the experimental comparisons of interest.
Use the contrasts. fit function to obtain coefficients and standard errors for each pair of comparisons. Run diffSplice on the fitted model to test the differences in exon usage of genes between wild type and knockout.
Explore the top-ranked results using the topSplice function where a test equal to t gives a ranking of AS exons and test equal to simes gives a ranking of genes. Run the plotSplice function to plot the results. In putting the gene of interest in the gene ID argument, the red points show the differentially-expressed exons.
Generate a volcano plot using EnhancedVolcano bioconductor package to exhibit the differentially-expressed exons. To use rMATS, ensure the latest version of rMATS version 4.1.1 is installed either using conda or GitHub in the working directory. Go to the folder containing bam files obtained after mapping.
Prepare text files as required by rMATS for the two conditions of copying the name of bam files and their path separated by a comma. Run rmas. py using the two generated input text files describing the path of the bam files and the annotation.
gtf file obtained previously. This generates an output folder rmats_out containing text files describing statistics including P-values and inclusion levels for each splicing event separately. Use the bioconductor package maser to explore the rMATS results.
Load the junction and exon counts text files with the extension JCEC into the maser object and include at least five average reads per splicing event to filter the result based on coverage. To visualize the rMATS results, first run the topEvents function from the maser package, selecting the significant splicing events at a false discovery rate of 10%and a minimum 10%change in percent spliced in or PSI. Check the gene events for individual genes of interest and plot PSI values for each splicing event of that gene.
Generate a volcano plot by specifying the event type. Use the splicing events results obtained with rMATS in the form of text files to generate sashimi plots using the rmats2sashimiplot package. The sashimi plot shows a skipped exon event in the Wnk1 gene.
Each row represents an RNA-seq sample, three replicates of wild type and Mbnl1 knockout. The height shows read coverage in RPKM and the connecting arcs depict junction reads across exons. The bottom part shows annotated gene model alternative isoforms.
A substantial fold change and strong statistical evidence of genuine differences can be observed in the genes located at the top left or right quadrants of the volcano plots obtained using diffSplice and DEXSeq. A cassette exon was found to be varying between different conditions for the gene Wnk1. The differential exon usage plot showed evidence of differential splicing at five exon sites near Wnk1.6.45, with the exons highlighted in pink likely to be spliced out in Mbnl1 knockout samples compared to wild type.
The volcano plot of genes that are alternatively spliced helped to distinguish between the genes that were excluded from wild type and those that were included in wild type. The types of splicing events SE, A5SS, A3SS, MXE, and RI were visualized using sashimi plots of the top significant genes of those events. The differential APA activity in three prime untranslated regions of genes was observed using volcano plots.
The significantly-differential PA site usage results acquired from different pipelines were visualized using event plot. A significant distal to proximal shift of PA site usage in double knockouts can be observed in both genes FOSL1 and Papola. The mean coverage in flanking regions anchored at known PA cleavage sites on the genome-wide level was determined using a diagnostic plot.
Ensure the parameters such as transpecific information and allow multi overlap are correctly used when generating count metrics. Linear model fitting and generating contrast pairs is important for proper comparison. For rMATS, ensure all the parameters are set correctly according to your data before running the command.
The genes obtained from differential splicing activity could be used to perform gene set enrichment analysis. Another tool called MISO could be used for further event-based analysis.