The samples used in this protocol were approved by the ethics committees from both the Department of Microbiology of the Institute of Biomedical Sciences at the University of São Paulo and the Federal University of Sergipe (Protocols: 54937216.5.0000.5467 and 54835916.2.0000.5546, respectively).
1. Docker desktop installation
NOTE: The steps to prepare the Docker environment differ among operating systems (OSs). Therefore, Mac users must follow the steps listed in 1.1, Linux users must follow the steps listed in 1.2, and Windows users must follow the steps listed in 1.3.
- Install Docker Desktop on macOS.
- Access the Get Docker website (Table of Materials), click on Docker Desktop for Mac and then click on the Download from Docker Hub link.
- Download the installation file by clicking on the Get Docker button.
- Execute the Docker.dmg file to open the installer, and then drag the icon to the Applications folder. Locate and execute Docker.app in the Applications folder to start the program.
NOTE: The software-specific menu in the top status bar indicates that the software is running and that it is accessible from a terminal.
- Install the container program on the Linux OS.
- Access the Get Docker Linux website (Table of Materials) and follow the instructions in the Install using the repository section, available through the Docker Linux Repository link.
- Update all Linux packages using the command line:
sudo apt-get update
- Install the packages required by Docker:
sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release
- Create a software archive keyring file:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
- Add the Docker deb repository information to the sources.list.d directory:
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
- Update all the packages again, including the ones recently added:
sudo apt-get update
- Install the Docker Engine (the docker-ce packages):
sudo apt-get install docker-ce docker-ce-cli containerd.io
- Select the geographic area and time zone to finish the installation process.
- Install the container program on the Windows OS.
- Access the Get Docker website (Table of Materials) and click on Get Started. Find the installer for Docker Desktop for Windows. Download the files and install them locally on the computer.
- After the download, start the installation file (.exe) and keep the default parameters. Make sure that the two options Install Required Windows Components for WSL 2 and Add Shortcut to Desktop are checked.
NOTE: In some cases, when the software tries to start the service, it shows the error: WSL installation is incomplete. To resolve this error, access the WSL2-Kernel website (Table of Materials).
- Download and install the latest WSL2 Linux kernel.
- Access PowerShell terminal as Administrator and execute the command:
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
- Ensure that the software Docker Desktop is installed successfully.
- Download the image from the CSBL repository on Docker Hub (Table of Materials).
- Open the Docker Desktop and verify that the status is "running" at the bottom left of the toolbar.
- Go to the Windows PowerShell terminal command line. Download the Linux container image for this protocol from the CSBL repository on Docker Hub. Execute the following command to download the image:
docker pull csblusp/transcriptome
NOTE: After downloading, the image can be seen in Docker Desktop. To create the container, Windows users must follow step 1.5, while Linux users must follow step 1.6.
- Initialize the server container on the Windows OS.
- In the Docker Desktop app manager, access the Images page from the toolbar to view the downloaded image.
NOTE: If the pipeline image was downloaded successfully, there will be a csblusp/transcriptome image available.
- Initiate the container from the csblusp/transcriptome image by clicking on the Run button. Expand the Optional Settings to configure the container.
- Define the Container Name (e.g., server).
- Associate a folder on the local computer with a folder inside the Docker container. To do this, set the Host Path to a folder on the local machine that will store the processed data to be downloaded at the end. Then set the Container Path, which links the local machine path to a folder inside the csblusp/transcriptome container (use "/opt/transferdata" as the Container Path).
- After this, click on Run to create the csblusp/transcriptome container.
- To access the Linux terminal from the csblusp/transcriptome container, click on the CLI button.
- Switch to the bash shell for a better experience. For this, execute the command:
bash
- After executing the bash command, ensure that the terminal shows (root@<containerID>:/#):
root@ac12c583b731:/#
- Initialize the server container on the Linux OS.
- Execute this command to create the Docker container based on the image:
docker run -d -it --rm --name server -v <Host Path>:/opt/transferdata csblusp/transcriptome
NOTE: Replace <Host Path> with the path of a folder on the local machine (e.g., /home/user/transferdata).
- Execute this command to access the command terminal of the Docker container:
docker exec -it server bash
- Ensure that a Linux terminal is available to execute programs/scripts from the command line.
- After executing the bash command, ensure that the terminal shows (root@<containerID>:/#):
root@ac12c583b731:/#
NOTE: The root password is "transcriptome" by default. If desired, the root password can be changed by executing the command:
passwd
- First, source the addpath.sh script to ensure that all tools are available. Execute the command:
source /opt/addpath.sh
- Check the structure of the RNA sequencing folder.
- Access the transcriptome pipeline scripts folder and ensure all data from RNA sequencing are stored inside the folder: /home/transcriptome-pipeline/data.
- Ensure all the results obtained from the analysis are stored inside the folder of the path /home/transcriptome-pipeline/results.
- Ensure the genome and annotation reference files are stored inside the folder of the path /home/transcriptome-pipeline/datasets. These files support all downstream analyses.
- Ensure all scripts are stored in the folder of the path /home/transcriptome-pipeline/scripts and separated by each step as described below.
- Download the annotation and the human genome.
- Access the scripts folder:
cd /home/transcriptome-pipeline/scripts
- Execute this command to download the reference human genome:
bash downloadGenome.sh
- To download the annotation, execute the command:
bash downloadAnnotation.sh
- Change the annotation or the version of the reference genome.
- Open downloadAnnotation.sh and downloadGenome.sh to change the URL of each file.
- Copy the downloadAnnotation.sh and downloadGenome.sh files to the transfer area to edit them in the local OS:
cd /home/transcriptome-pipeline/scripts
cp downloadAnnotation.sh downloadGenome.sh /opt/transferdata
- Open the Host Path folder selected in step 1.5.4 to link the host and the Docker container.
- Edit the files using the preferred editor software and save them. Finally, put the modified files back into the scripts folder. Execute the commands:
cd /opt/transferdata
cp downloadAnnotation.sh downloadGenome.sh /home/transcriptome-pipeline/scripts
NOTE: These files can also be edited directly using the vim or nano Linux editors.
- Next, configure the fastq-dump tool with the command line:
vdb-config --interactive
NOTE: This allows the sequencing files of the example data to be downloaded.
- Navigate to the Tools page using the Tab key and select the current folder option. Navigate to the Save option and click on OK. Then, exit the configuration tool.
- Initiate the download of the reads from the previously published paper7. The SRA accession number of each sample is required. Obtain the SRA numbers from the SRA NCBI website (Table of Materials).
NOTE: To analyze RNA-Seq data available on public databases, follow step 1.12. To analyze private RNA-seq data, follow step 1.13.
- Analyze specific public data.
- Access the National Center for Biotechnology Information (NCBI) website and seek keywords for a specific subject.
- Click on the Result link for BioProject in the Genomes section.
- Choose and click on a specific study. Click on the SRA Experiments. A new page opens, which shows all the samples available for this study.
- Click on the "Send to:" above accession number. In the "Choose Destination" option select File and Format option, select RunInfo. Click on "Create File" to export all library information.
- Save the SraRunInfo.csv file in the Host Path defined in step 1.5.4 and execute the download script:
cp /opt/transferdata/SraRunInfo.csv /home/transcriptome-pipeline/data
cd /home/transcriptome-pipeline/scripts
bash downloadAllLibraries.sh
- Analyze private and unpublished sequencing data.
- Organize the sequencing data in a folder named Reads.
NOTE: Inside the Reads folder, create one folder for each sample; each folder must have the same name as its sample. Add the data of each sample inside its directory. For paired-end RNA-Seq, each sample directory should contain two FASTQ files, named according to the patterns {sample}_1.fastq.gz and {sample}_2.fastq.gz for the forward and reverse sequences, respectively. For example, a sample named "Healthy_control" must have a directory with the same name containing FASTQ files named Healthy_control_1.fastq.gz and Healthy_control_2.fastq.gz. If the library was sequenced with a single-end strategy, only one reads file is saved for downstream analysis; the same sample, "Healthy_control", would then have a unique FASTQ file named Healthy_control.fastq.gz.
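For illustration, a minimal paired-end layout with two samples (the "Infected_patient" sample name is hypothetical) would be:
reads/
├── Healthy_control/
│   ├── Healthy_control_1.fastq.gz
│   └── Healthy_control_2.fastq.gz
└── Infected_patient/
    ├── Infected_patient_1.fastq.gz
    └── Infected_patient_2.fastq.gz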
- Create a phenotypic file containing all the sample names: name the first column 'Sample' and the second column 'Class'. Fill the Sample column with the sample names, which must match the names of the sample directories, and fill the Class column with the phenotypic group of each sample (e.g., control or infected). Finally, save the file with the name "metadata.tsv" and send it to the /home/transcriptome-pipeline/data/ directory. Check the existing metadata.tsv to understand the format of the phenotypic file (a minimal example is shown below).
cp /opt/transferdata/metadata.tsv /home/transcriptome-pipeline/data/metadata.tsv
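For illustration, a minimal metadata.tsv matching the hypothetical samples above would contain (columns separated by tabs):
Sample	Class
Healthy_control	control
Infected_patient	infected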
- Access the Host Path directory defined in step 1.5.4 and copy the newly structured sample directories into it. Finally, move the samples from /opt/transferdata to the pipeline data directory:
cp -rf /opt/transferdata/reads/* /home/transcriptome-pipeline/data/reads/
- Observe that all reads are stored in the folder /home/transcriptome-pipeline/data/reads.
2. Quality control of the data
NOTE: Evaluate, graphically, the probability of errors in the sequencing reads. Remove all the technical sequences, e.g., adapters.
- Assess the sequencing quality of the libraries with the FastQC tool.
- To generate the quality graphs, run the fastqc program. Execute the command:
bash FastQC.sh
NOTE: The results will be saved in the /home/transcriptome-pipeline/results/FastQC folder. Since sequence adapters are used for library preparation and sequencing, in some cases fragments of adapter sequences can interfere with the mapping process.
- Remove the adapter sequences and low-quality reads. Access the scripts folder and execute the command for the Trimmomatic tool:
cd /home/transcriptome-pipeline/scripts
bash trimmomatic.sh
NOTE: The parameters used for the sequencing filter are: remove leading low-quality or N bases (below quality 3) (LEADING:3); remove trailing low-quality or N bases (below quality 3) (TRAILING:3); scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 20 (SLIDINGWINDOW:4:20); and drop reads below 36 bases long (MINLEN:36). These parameters can be altered by editing the Trimmomatic script file.
- Ensure that the results are saved in the following folder: /home/transcriptome-pipeline/results/trimreads. Execute the command:
ls /home/transcriptome-pipeline/results/trimreads
3. Mapping and annotation of samples
NOTE: After obtaining good-quality reads, these need to be mapped to the reference genome. For this step, the STAR mapper was used to map the example samples. The STAR mapper requires 32 GB of RAM to load the genome and map the reads. Users who do not have 32 GB of RAM can use already mapped reads; in such cases, jump to step 3.3 or use the Bowtie2 mapper. This section has scripts for STAR (results shown in all figures) and Bowtie2 (a mapper with lower memory requirements).
- First index the reference genome for the mapping process:
- Access the Scripts folder using the command line:
cd /home/transcriptome-pipeline/scripts
- For STAR mapper, execute:
bash indexGenome.sh
- For Bowtie mapper, execute:
bash indexGenomeBowtie2.sh
- Map the filtered reads (obtained from step 2) to the reference genome (GRCh38 version) by executing the following commands. Both the STAR and Bowtie2 mappers are run using default parameters.
- For STAR mapper, execute:
bash mapSTAR.sh
- For Bowtie2 mapper, execute:
bash mapBowtie2.sh
NOTE: The final results are Binary Alignment Map (BAM) files for each sample, stored in /home/transcriptome-pipeline/results/mapreads.
- Annotate mapped reads using the FeatureCounts tool to obtain raw counts for each gene. Run the scripts that annotate the reads.
NOTE: The FeatureCounts tool assigns the mapped sequencing reads to genomic features. The most important aspects of the annotation that can be changed according to the biological question are the detection of isoforms, multiple mapped reads, and exon-exon junctions; these correspond to the parameters GTF.attrType="gene_name" for gene-level counting (or leaving the parameter unspecified for meta-feature-level counting), allowMultiOverlap=TRUE, and juncCounts=TRUE, respectively (a minimal example call is sketched after this step).
- Access the scripts folder using command line:
cd /home/transcriptome-pipeline/scripts
- To annotate the mapped reads to obtain raw counts per gene, execute the command line:
Rscript annotation.R
NOTE: The parameters used for the annotation process were: return the gene short name (GTF.attrType="gene_name"); allow multiple overlaps (allowMultiOverlap=TRUE); and indicate that the library is paired-end (isPairedEnd=TRUE). For a single-end strategy, use the parameter isPairedEnd=FALSE. The results will be saved in the /home/transcriptome-pipeline/countreads folder.
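The annotation.R script provided in the container performs this step; purely as a reference, a minimal sketch of an Rsubread::featureCounts call with the parameters above (the GTF file name here is an assumption for illustration) is:

library(Rsubread)

# BAM files produced by the mapping step (path as used in this pipeline)
bam_files <- list.files("/home/transcriptome-pipeline/results/mapreads",
                        pattern = "\\.bam$", full.names = TRUE)

counts <- featureCounts(
    files               = bam_files,
    annot.ext           = "annotation.gtf",  # hypothetical name for the GTF downloaded in step 1.9
    isGTFAnnotationFile = TRUE,
    GTF.attrType        = "gene_name",       # return gene short names
    allowMultiOverlap   = TRUE,              # count reads overlapping multiple features
    isPairedEnd         = TRUE               # use FALSE for single-end libraries
)

# counts$counts is a gene x sample matrix of raw counts
write.table(counts$counts, "raw_counts.tsv", sep = "\t", quote = FALSE)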
- Normalize gene expression.
NOTE: Normalizing gene expression is essential to compare results between conditions (e.g., healthy and infected samples). Normalization is also required to perform the co-expression and molecular degree of perturbation analyses.
- Access the Scripts folder using the command line:
cd /home/transcriptome-pipeline/scripts
- Normalize the gene expression. For this, execute the command line:
Rscript normalizesamples.R
NOTE: The raw count expression in this experiment was normalized using the Trimmed Mean of M-values (TMM) and Counts Per Million (CPM) methods. This step aims to remove differences in gene expression due to technical influences by performing library-size normalization. The results will be saved in the /home/transcriptome-pipeline/countreads folder.
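The normalizesamples.R script provided in the container performs this step; as a reference, a minimal sketch of the equivalent edgeR calls (the counts file name is an assumption for illustration) is:

library(edgeR)

# Gene x sample matrix of raw counts from the annotation step
raw_counts <- as.matrix(read.delim("raw_counts.tsv", row.names = 1))

dge <- DGEList(counts = raw_counts)
dge <- calcNormFactors(dge, method = "TMM")  # Trimmed Mean of M-values

# Counts per million, scaled by the TMM normalization factors
tmm_cpm <- cpm(dge, normalized.lib.sizes = TRUE)
write.table(tmm_cpm, "tmm_expression.tsv", sep = "\t", quote = FALSE)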
4. Differentially expressed genes and co-expressed genes
- Identify differentially expressed genes using the open-source edgeR package. This involves finding genes whose expression is higher or lower compared with the control.
- Access the Scripts folder using the command line:
cd /home/transcriptome-pipeline/scripts
- To identify the differentially expressed genes, execute the DEG_edgeR R script using the command line:
Rscript DEG_edgeR.R
NOTE: The results containing the differentially expressed genes will be saved in the /home/transcriptome-pipeline/results/degs folder. Data can be transferred to a personal computer.
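The DEG_edgeR.R script provided in the container performs this comparison; as a reference, a minimal sketch of a standard edgeR workflow (the group labels, file names, and quasi-likelihood test here are assumptions for illustration) is:

library(edgeR)

# Raw counts and phenotypic groups (Class column of metadata.tsv)
raw_counts <- as.matrix(read.delim("raw_counts.tsv", row.names = 1))
group <- factor(c("control", "control", "infected", "infected"))  # illustrative labels, one per sample

dge <- DGEList(counts = raw_counts, group = group)
dge <- dge[filterByExpr(dge), , keep.lib.sizes = FALSE]  # remove weakly expressed genes
dge <- calcNormFactors(dge, method = "TMM")

design <- model.matrix(~ group)
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = 2)      # infected vs. control

degs <- topTags(qlf, n = Inf)$table   # logFC, p-value, and FDR per gene
write.table(degs, "degs.tsv", sep = "\t", quote = FALSE)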
- Download data from the csblusp/transcriptome container.
- Transfer the processed data from /home/transcriptome-pipeline to the /opt/transferdata folder (shared with the local computer).
- Copy all files to the local computer by executing the command line:
cp -rf /home/transcriptome-pipeline/results /opt/transferdata/pipeline
cp -rf /home/transcriptome-pipeline/data /opt/transferdata/pipeline
NOTE: Now, go to the local computer to ensure all the results, datasets, and data are available to download in the Host Path.
- Identify co-expression modules.
- Access the Co-Expression Modules Identification Tool (CEMiTool) website (Table of Materials). This tool identifies co-expression modules from expression datasets provided by the users. On the main page, click on Run at the top right. This will open a new page to upload the expression file.
- Click on Choose File below the Expression File section and upload the normalized gene expression matrix 'tmm_expression.tsv' from the Host Path.
NOTE: Step 4.4 is optional.
- Explore the biological meaning of co-expression modules.
- Click on Choose File in the Sample Phenotypes section and upload the file with sample phenotypes, metadata_cemitool.tsv, obtained in the download data step 4.2.2, to perform a gene set enrichment analysis (GSEA).
- Press Choose File in the Gene Interactions section to upload a file with gene interactions (cemitool-interactions.tsv). It is possible to use the file of gene interactions provided as an example by webCEMiTool. The interactions can be protein-protein interactions, transcription factors and their transcribed genes, or metabolic pathways. This step produces an interaction network for each co-expression module.
- Click on Choose File in the Gene Sets section to upload a list of functionally related genes in a Gene Matrix Transposed (GMT) format file. The Gene Sets file enables the tool to perform an enrichment analysis for each co-expression module, i.e., an over-representation analysis (ORA).
NOTE: This list of genes can encompass pathways, GO terms, or miRNA-target genes. The researcher can use the Blood Transcription Modules (BTM) file (BTM_for_GSEA.gmt) as the gene sets for this analysis.
- Set parameters for performing co-expression analyses and obtain its results.
- Next, expand the Parameters section by clicking on the plus sign to display the default parameters. If necessary, change them. Check the Apply VST box.
- Optionally, write an e-mail address in the Email section to receive the results by e-mail.
- Press the Run CEMiTool button.
- Download the full analysis report by clicking on the Download Full Report at the top right. It will download a compressed file cemitool_results.zip.
- Extract the contents of the cemitool_results.zip with WinRAR.
NOTE: The folder with the extracted contents encompasses several files with all results of the analysis and their established parameters.
5. Determination of the molecular degree of perturbation of samples
- Molecular Degree of Perturbation (MDP) web version.
- To run MDP, access the MDP website (Table of Materials). MDP calculates the molecular distance of each sample from the reference. Click on the Run button.
- On the Choose File link, upload the expression file tmm_expression.tsv. Then, upload the phenotypic data file metadata.tsv obtained in the download data step 4.2.2. It is also possible to submit a pathway annotation file in GMT format to calculate the perturbation score of the pathways associated with the disease.
- Once the data are uploaded, define the Class column that contains the phenotypic information used by the MDP. Then, define the control class by selecting the label that corresponds to the control class.
NOTE: There are some optional parameters that affect how the sample scores are calculated. If necessary, the user can change the averaging statistic, the standard deviation, and the percentage of top perturbed genes.
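In brief, and as a simplified description of the published MDP method, each gene is standardized against the control class and the absolute z-scores are averaged: for gene g in sample s, z(g,s) = |expression(g,s) − mean_control(g)| / sd_control(g), and the MDP score of sample s is the average of z(g,s) over the genes considered (all genes, or only the top percentage of most perturbed genes).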
- After that, press the Run MDP button and the MDP results will be shown. Download the figures by clicking on Download Plot in each plot, and the MDP scores via the Download MDP Score File button.
NOTE: In case of questions about how to submit the files or how MDP works, just go through the Tutorial and About webpages.
6. Functional enrichment analysis
- Create one list of down-regulated DEGs and another of up-regulated DEGs. Gene names must follow the Entrez gene symbols. Each gene of the list must be placed on its own line.
- Save the gene lists in the txt or tsv format.
- Access the Enrichr website (Table of Materials) to perform the functional analysis.
- Select the list of genes by clicking on the Choose File. Select one of the DEGs list and press the Submit button.
- Click on Pathways at the top of the webpage to perform functional enrichment analysis with the ORA approach.
- Choose a pathway database. The "Reactome 2016" pathway database is broadly used to obtain the biological meaning of human data.
- Click on the name of the pathway database again. Select Bar Graph and check whether it is sorted by p-value ranking. If not, click on the bar graph until it is sorted by p-value. This bar graph includes the top 10 pathways according to p-values.
- Press the Configuration button and select the red color for the up-regulated genes analysis or blue color for the down-regulated genes analysis. Save the bar graph in several formats by clicking on svg, png, and jpg.
- Select Table and click on Export Entries to the Table at the bottom left of the bar graph to obtain the functional enrichment analysis results in a txt file.
NOTE: This functional enrichment results file encompasses in each line the name of one pathway, the number of overlapped genes between the submitted DEG list and the pathway, the p-value, adjusted p-value, odds ratio, combined score, and the gene symbol of genes present in the DEG list that participate in the pathway.
- Repeat the same steps with the other DEGs list.
NOTE: The analysis with down-regulated DEGs provides pathways enriched for down-regulated genes and the analysis with up-regulated genes provides pathways enriched for up-regulated genes.