Welcome to this protocol for high-throughput transcriptome analysis to investigate host-pathogen interactions. The protocol is divided into the following steps. Quality control, to filter low-quality reads and remove adapter sequences. Mapping and annotation, where you map the reads to a reference genome and assign them to genes.
Statistical and co-expression analysis, which identifies the differentially expressed genes and finds the co-expression modules. Molecular degree of perturbation analysis, to find potential outlier samples. And finally, functional analysis, to determine the biological functions of the differentially expressed genes.
All the tools used in this pipeline were pre-installed on a Linux system and encapsulated in a Docker container. The samples used in this protocol come from a paper published by our group in PLOS Pathogens. The samples comprise 20 healthy people and 39 patients infected with Chikungunya virus.
The blood samples were collected, and RNA sequencing was performed. To install Docker on a Windows system, follow these steps. Go to the official Docker webpage and click Get Started.
Find the installer for Docker Desktop for Windows. Download the file. Install it locally on your machine.
Make sure that these two options are checked. After installing the program, download the Docker image for this protocol. Go to the Windows terminal.
Execute the commands to download the image. After downloading the image, you can see it in Docker Desktop, and from this image you can start the container. After you click the Run button, expand the optional settings to define the name of the container and to associate a folder on your local computer with the folder inside Docker.
After this, click Run to start the container. You can then access the terminal, which runs on the Linux system inside Docker. Type the bash command, and then you can execute all the commands of this protocol.
First, execute the source command to make all the tools of this protocol available. Then go to the scripts directory. To perform a transcriptomic analysis, you first have to download the reference genome.
For this, execute the following commands. After the genome is downloaded, you have to download the gene annotation. To do this, type the following commands.
Next, you have to configure fastq-dump. This allows you to download the sequencing files of the example. After typing the following commands, use the Tab key to go to the Tools option and select the Current Directory option.
Use the Tab key to select Save, and then OK. Then exit the fastq-dump tool. Now you can start downloading the reads by typing the following commands.
Quality control consists of graphically evaluating the probability of errors in the sequencing reads. In this step, you also have to remove technical sequences such as adapters. To generate the quality control graphs, run the FastQC program.
To remove the adapter sequences and the low-quality sequences, type the following commands. With the good-quality reads, you now have to map the reads to the reference genome. After the mapping, you have to annotate the reads according to the human genes and then count the number of reads that match each human gene.
The first step is to index the reference genome by typing the following command. Then type these commands to map the reads to the human genome. Next, run the scripts that annotate the reads.
After mapping and annotation, you can perform the differential expression analysis, which consists of finding the genes whose expression is higher or lower in one group compared to another. To identify the differentially expressed genes, or DEGs, run the following commands. After this, you can transfer the results from Docker to your local computer.
For this, go to the terminal and type the following commands to save all the results to a local folder. To perform the remaining analyses, you also have to copy all the files in the data directory to a directory on your local computer. On your local computer, you will be able to see the directories where you saved the data from Docker.
As you can see, you can access all the libraries. You can also open the HTML file containing the quality control reports. You can also access a directory containing the differentially expressed genes.
And inside this directory, you will find the volcano plots, where you can see the genes that are up- or downregulated in one group versus another, in this case, patients infected with Chikungunya virus versus healthy controls. All the remaining steps of this protocol are executed in web tools using your browser. Let's start with CEMiTool.
Go to the browser and type the following address. CEMiTool identifies co-expression modules from expression data sets provided by the user. On the main page, go to the menu and click the Run button.
This will open a new page where you can upload the expression file. This file is in the data directory on your local computer. You will see that there are three expression files, and the one that we are going to use for CEMiTool is the one normalized with TMM.
Then select the phenodata file, do the same for the file containing the protein-protein interactions, and finally, upload the file containing the gene sets or pathways. The gene sets file enables CEMiTool to perform enrichment analysis for each of the co-expression modules. Next, expand the parameter section and click Apply VST.
After that, you can just click Run CEMiTool. After you run CEMiTool, you will see that 12 co-expression modules were identified. By clicking here, you can download all the results of this analysis.
Another tool that we are going to use in this protocol is MDP, or Molecular Degree of Perturbation. Just type mdp.sysbio.tools in your browser. MDP calculates the molecular distance of each sample from a reference group of samples, in this case the healthy controls, in order to find not only potential outliers but also how perturbed each sample is compared to this group.
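To make the idea of a distance from the control group more concrete, here is a minimal sketch in Python. It is a simplified illustration and not the actual MDP implementation: it assumes that each gene is standardized against the mean and standard deviation of the control samples and that a sample's score is the average absolute z-score across genes. The function name perturbation_scores and the toy values are hypothetical.

```python
from statistics import mean, stdev


def perturbation_scores(expression, control_samples):
    """Return an illustrative perturbation score per sample.

    expression maps gene -> {sample: expression value}.
    control_samples lists the samples of the reference (control) group.
    Assumption: score = mean absolute z-score relative to the controls.
    """
    totals = {}
    counts = {}
    for gene, values in expression.items():
        control_values = [values[s] for s in control_samples]
        mu = mean(control_values)
        sd = stdev(control_values)
        if sd == 0:
            # A gene that does not vary in the controls cannot be standardized.
            continue
        for sample, value in values.items():
            z = abs(value - mu) / sd
            totals[sample] = totals.get(sample, 0.0) + z
            counts[sample] = counts.get(sample, 0) + 1
    return {sample: totals[sample] / counts[sample] for sample in totals}


# Toy data (hypothetical): three controls and one infected sample.
toy_expression = {
    "geneA": {"c1": 1, "c2": 2, "c3": 3, "p1": 5},
    "geneB": {"c1": 10, "c2": 12, "c3": 14, "p1": 12},
}
toy_scores = perturbation_scores(toy_expression, ["c1", "c2", "c3"])
for sample, score in sorted(toy_scores.items()):
    print(sample, round(score, 2))
```

In this toy example, the infected sample gets the highest score because geneA lies far from the control distribution. A control sample with an unusually high score would be a candidate outlier.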
On the Run page, upload the expression file by clicking the button and selecting the file. Then upload the phenodata file. Then define which column contains the information about the group or class, and which class or group corresponds to the control group.
After this, you can just run MDP. The bar graph shows the molecular degree of perturbation score of each sample as a bar, and the colors represent the different groups. The box plot is another way of visualizing the same results, where each dot is a different sample, separated into the two groups.
To perform the functional analysis, we are going to use the Enrichr tool. For this, select the list of genes that were differentially expressed, either up- or downregulated, and use it as the input gene list in Enrichr. You will see that there are different tabs.
All the results can also be downloaded to your local computer. The computing environment for transcriptome analysis has been placed on the Docker platform. This approach allows users with no prior experience with Linux systems to use a terminal.
In this container, there is a predefined folder structure for the data sets and scripts that are necessary for all the analyses. In the pipeline, users work with blood transcriptome data from 20 healthy individuals and 39 patients acutely infected with Chikungunya virus.
The sequencing platform returns a set of FASTQ files containing the DNA sequences, i.e. the reads, and the associated quality for each nucleotide base. The Phred quality scale indicates the probability of an incorrect reading for each base. Tools identify and remove low-quality reads from the samples to increase the probability of mapping the reads.
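The relation between a Phred score and the error probability can be sketched in a few lines of Python. This is only an illustration, not part of the protocol's scripts: it uses the standard definition P = 10^(-Q/10) and assumes the common Phred+33 ASCII encoding of FASTQ quality strings. The function names and the example quality string are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score Q into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_quality_string(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    Assumption: Phred+33 encoding, where the character '!' means Q = 0.
    """
    return [ord(ch) - offset for ch in quality]


def mean_error_probability(quality, offset=33):
    """Average per-base error probability of one read."""
    scores = decode_quality_string(quality, offset)
    return sum(phred_to_error_probability(q) for q in scores) / len(scores)


# Hypothetical quality string for a five-base read.
example_quality = "IIII5"
print(decode_quality_string(example_quality))
print(mean_error_probability(example_quality))
```

A score of 20 corresponds to a 1 in 100 chance that the base is wrong, and a score of 30 to 1 in 1,000, which is why trimming tools usually work with thresholds in this range.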
In this step, the mapping module, the recovered high-quality reads are used as input and aligned against the human reference genome. CEMiTool identifies and analyzes co-expression modules. Genes within the same module are co-expressed, which means that they exhibit similar patterns of expression across the samples of the data set.
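As a rough illustration of what similar patterns of expression means, the sketch below computes the Pearson correlation between the expression profiles of two genes across the same samples. This is an assumption made for illustration only: CEMiTool uses its own procedure to build modules, and the function name pearson and the toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression of three genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 8.0]   # rises together with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # moves in the opposite direction

print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c))
```

Genes whose profiles correlate strongly with each other, like gene_a and gene_b here, are the kind of genes that end up grouped in the same module.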
The network analysis provides information about the most connected genes, i.e. the hubs. The names of those genes are shown in the network.
The size of each node is proportional to its degree of connectivity. The results obtained from the DEG analysis are summarized in the volcano plots. The analysis of the molecular degree of perturbation permits the identification of perturbed samples from healthy and infected individuals.
MDP suggests which samples are potential biological outliers. Removing those samples will impact the downstream results. A functional enrichment analysis using ORA (over-representation analysis) can be performed with the Enrichr tool.
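The logic behind this kind of enrichment test can be sketched with a hypergeometric calculation in Python. This is a generic illustration under stated assumptions and not taken from the protocol. The statistics that Enrichr reports may be computed differently. The function name enrichment_p_value and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(universe_size, set_size, list_size, overlap):
    """Probability of seeing at least `overlap` genes of a gene set in the input list.

    universe_size: number of genes that could have been selected
    set_size:      number of genes in the pathway or gene set
    list_size:     number of genes in the input list (e.g. the DEGs)
    overlap:       number of input genes that belong to the gene set
    Assumption: genes are drawn at random without replacement
    (upper tail of the hypergeometric distribution).
    """
    total = comb(universe_size, list_size)
    upper = min(set_size, list_size)
    tail = sum(
        comb(set_size, i) * comb(universe_size - set_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes, a pathway of 100 genes,
# 200 differentially expressed genes, 10 of them in the pathway.
print(enrichment_p_value(20000, 100, 200, 10))
```

A small p-value means that the overlap between the gene list and the gene set is larger than expected by chance, which is the basis for ranking the enriched gene sets.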
These steps help to interpret the results by revealing common functional roles of several genes that were differentially expressed. The biological processes shown in the bar graphs are the top 10 enriched gene sets based on their p-value ranking. In conclusion, this protocol covers all steps of RNA-Seq analysis.
The pipeline was developed, encapsulated in an image on the non-commercial Docker system, and made available to the scientific community. Thanks to the container system, all scripts and tools are kept at the same specific versions to guarantee reproducibility.
Furthermore, parts of the bioinformatics analysis were performed with free, user-friendly web tools.