The overall goal of this procedure is to identify the appropriate animal model for translational research questions. This method can help answer a key question in translational research: how can I select an appropriate animal model for a specific human disease that I want to investigate? The main advantage of this technique is that whole-genome data are compared between animal models and human disease studies.
So, this approach avoids the biased interpretation that can result from single-gene comparisons. Though this method provides insight into the suitability of mouse models for inflammatory diseases, it can also be applied to other disease models. GSEA helps to investigate pathway regulation in gene expression studies.
The output for every animal model and disease study is the basis for further comparisons. Begin this procedure with software and data download followed by data handling and formatting as described in the text protocol. Then open the GSEA software tool.
Click the Load Data button on the left side of the main window. A new tab will open for importing the required data files. In the new tab, browse to the gene expression data file and the phenotype file.
If GSEA cannot connect to the internet, also load the downloaded Molecular Signatures Database files and the DNA chip annotation files. Successfully imported data appear in the Load Data section. Click the Run GSEA button on the left side of the main window.
A new tab will open in order to set the parameters for the analysis. The tab is subdivided into three parts: required fields, basic fields, and advanced fields. In the required fields, first choose the expression data set.
Then choose the gene sets database either from the connected website or from the manually imported gene set file. Edit the phenotype labels to select the groups of samples that are supposed to be compared to each other. For example, disease group versus healthy control group.
Next, set "Collapse dataset to gene symbols" to true in order to translate the probe identifiers in the expression dataset to the official HUGO gene symbols used in the gene sets database. Select false if the expression dataset already contains HUGO gene symbols. Set the number of permutations to the default setting of 1,000.
Change the permutation type to gene set, since phenotype permutation is only recommended when there are more than seven samples in every phenotype. Finally, select the chip platform used for generating the gene expression data, either from the connected website or from the manually imported DNA chip annotation file. In the basic fields, edit the analysis name and the "Save results in this folder" field.
In addition, further statistical parameters can be changed. For further details on the parameters and the advanced fields section, please go to the GSEA user guide. If externally calculated group metrics for gene expression data are applied, use the GseaPreranked tool.
This analysis is conducted based on a simple list of genes pre-assigned to pre-calculated group metrics that are used for ranking the genes. After loading the alternative gene expression file, go to the main navigation bar and click on Tools, GseaPreranked. Similarly, a new tab will open for setting the parameters for the analysis.
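The pre-ranked gene list described above can be prepared with a short script. The following is a minimal Python sketch, assuming the ranked list is a two-column, tab-delimited text file (gene identifier, ranking metric) as commonly used for GseaPreranked. The gene symbols, the metric values, and the function name write_rnk are hypothetical and only serve as illustration. Python is used here instead of R purely for the example.

```python
import csv
import io


def write_rnk(gene_scores, handle):
    """Write genes and their pre-calculated metric as a tab-delimited ranked list.

    gene_scores: dict mapping gene symbol -> ranking metric (hypothetical input).
    Returns the gene order that was written, highest metric first.
    """
    ranked = sorted(gene_scores.items(), key=lambda item: item[1], reverse=True)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene, score in ranked:
        writer.writerow([gene, f"{score:.4f}"])
    return [gene for gene, _ in ranked]


# Hypothetical metric values, e.g. externally calculated test statistics.
example_scores = {"TNF": 3.2, "IL6": 2.7, "ALB": -1.9, "ACTB": 0.1}

buffer = io.StringIO()  # replace with open("study.rnk", "w", newline="") to write a file
order = write_rnk(example_scores, buffer)
rnk_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example free of side effects; pass a real file handle to create a file that can be loaded in the Load Data tab.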
Next, click the Run button at the bottom right of the window. Click on the successfully completed analysis in the GSEA report section to open the analysis results. Next, click on the detailed enrichment results in Excel format to export the analysis results to a spreadsheet.
Export the results separately for both phenotypes. In Excel, combine the results in one spreadsheet file. For subsequent comparison between gene expression data of several studies, keep at least the name of the gene set, its normalized enrichment score, and its FDR value.
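As an alternative to combining the two exports by hand, the step above can be scripted. This is a minimal Python sketch, assuming the exported reports are tab-delimited text with column headers named NAME, NES, and FDR q-val; check the headers of your own export, since they are an assumption here. The gene set names and values are invented for illustration.

```python
import csv
import io

# Column headers assumed to be present in the exported report.
KEEP = ("NAME", "NES", "FDR q-val")


def combine_reports(handles):
    """Merge report tables into one dict: gene set name -> NES and FDR."""
    combined = {}
    for handle in handles:
        for row in csv.DictReader(handle, delimiter="\t"):
            combined[row[KEEP[0]]] = {
                "NES": float(row[KEEP[1]]),
                "FDR": float(row[KEEP[2]]),
            }
    return combined


# Hypothetical exports for the two phenotypes of one study.
report_pos = "NAME\tSIZE\tNES\tFDR q-val\nSET_A\t50\t2.10\t0.001\n"
report_neg = "NAME\tSIZE\tNES\tFDR q-val\nSET_B\t40\t-1.80\t0.020\n"

combined = combine_reports([io.StringIO(report_pos), io.StringIO(report_neg)])
```

With real exports, open the two files and pass the file handles instead of the in-memory strings.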
Repeat the gene set enrichment analysis for the second study and for all further studies that are meant to be compared to each other. Include as many human clinical studies and different mouse models as possible to identify the optimal mouse model for the translational research question. To identify the optimal animal model for mimicking the human situation, compare the GSEA results of all studies to each other.
Use the enrichment scores and the FDR values to classify the pathways as activated, inhibited, or neither. When many studies are to be compared, the use of R scripts is recommended. For each comparison of two studies, count the number of realizations of the nine possible combinations of pathway regulation as indicated by a three-by-three contingency table.
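The classification and counting step can be sketched in a few lines. The example below is in Python rather than R and rests on two assumptions that are not given in the narration: an FDR cutoff of 0.05, and the rule that a pathway is activated if it passes the cutoff with a positive normalized enrichment score and inhibited if it passes with a negative one. The study data are invented.

```python
from collections import Counter

FDR_CUTOFF = 0.05  # assumption: choose the cutoff used in your own analysis
STATES = ("activated", "neither", "inhibited")


def classify(nes, fdr, cutoff=FDR_CUTOFF):
    """Classify one pathway from its normalized enrichment score and FDR."""
    if fdr < cutoff and nes > 0:
        return "activated"
    if fdr < cutoff and nes < 0:
        return "inhibited"
    return "neither"


def contingency(study_a, study_b):
    """Count the nine combinations for the pathways present in both studies.

    Rows follow STATES for study_a, columns follow STATES for study_b.
    """
    shared = sorted(set(study_a) & set(study_b))
    counts = Counter(
        (classify(*study_a[name]), classify(*study_b[name])) for name in shared
    )
    return [[counts[(row, col)] for col in STATES] for row in STATES]


# Hypothetical results: gene set name -> (NES, FDR).
mouse = {"SET_A": (2.1, 0.001), "SET_B": (-1.8, 0.02),
         "SET_C": (0.4, 0.8), "SET_D": (1.9, 0.01)}
human = {"SET_A": (1.7, 0.01), "SET_B": (-2.0, 0.003),
         "SET_C": (-0.2, 0.9), "SET_D": (-1.6, 0.04)}

table = contingency(mouse, human)
```

In this toy example SET_D is activated in the mouse study but inhibited in the human study, so it is counted in the first row, third column.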
Assess the correlation between two studies by calculating the positive predictive value and the negative predictive value, which are by definition the fractions of pathways that show the same regulation in two studies. Also, estimate the fraction of pathways that would be expected to correlate just by chance, represented by PPV chance and NPV chance. Then calculate the gain of information.
All calculations can be done by using spreadsheet programs, but the use of R functions is recommended. Use the contingency table of a pair of studies to calculate the P value with the chi-squared test, for example by using the R function chisq.test or spreadsheet programs. Store the data of the contingency table in a matrix X. Next, compare the GSEA results for all combinations of the studies that were selected for the analysis.
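The narration does not spell out the formulas, so the following Python sketch shows only one plausible reading and should be checked against the text protocol. The assumptions are: rows of the table belong to the animal model and columns to the human study, both ordered activated, neither, inhibited; the positive predictive value is the fraction of pathways regulated in the model that show the same regulation in the human study; the negative predictive value is the fraction of pathways unregulated in the model that are also unregulated in the human study; the chance values are what independent row and column totals would give; and the gain of information is the observed value minus the chance value. The counts are invented.

```python
def predictive_values(table):
    """Compute PPV, NPV, their chance expectations, and the gains.

    table: 3x3 counts, rows = model study, columns = human study,
    both ordered (activated, neither, inhibited). All formulas here are
    assumptions made for illustration, not the protocol's definitions.
    """
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    total = sum(rows)
    regulated = rows[0] + rows[2]
    if regulated == 0 or rows[1] == 0:
        raise ValueError("need regulated and unregulated pathways in the model")

    ppv = (table[0][0] + table[2][2]) / regulated
    npv = table[1][1] / rows[1]
    # Expected agreement if the two studies were independent.
    ppv_chance = (rows[0] * cols[0] + rows[2] * cols[2]) / (total * regulated)
    npv_chance = cols[1] / total
    return {
        "ppv": ppv, "npv": npv,
        "ppv_chance": ppv_chance, "npv_chance": npv_chance,
        "gain_ppv": ppv - ppv_chance, "gain_npv": npv - npv_chance,
    }


# Hypothetical contingency table for one model/human pair.
example_table = [[30, 15, 5], [10, 100, 10], [5, 15, 10]]
values = predictive_values(example_table)
```

A gain close to zero would mean that the model predicts the human regulation no better than chance under these assumed definitions.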
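For readers who want to see what the chi-squared test on the contingency table computes, here is a minimal Python sketch using only the standard library. It assumes a three-by-three table with no empty row or column, so that the test has four degrees of freedom; for that case the P value has the closed form exp(-x/2) * (1 + x/2). It is an illustration only and not a replacement for a statistics package. The matrix X holds invented counts.

```python
import math


def chi_squared_3x3(table):
    """Pearson chi-squared test of independence for a 3x3 table of counts.

    Returns (statistic, p_value). Assumes every row and column total is
    greater than zero, which gives (3 - 1) * (3 - 1) = 4 degrees of freedom.
    """
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    total = sum(rows)
    if len(rows) != 3 or len(cols) != 3 or 0 in rows or 0 in cols:
        raise ValueError("expected a 3x3 table without empty rows or columns")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / total
            statistic += (observed - expected) ** 2 / expected

    # Upper tail of the chi-squared distribution with 4 degrees of freedom.
    p_value = math.exp(-statistic / 2) * (1 + statistic / 2)
    return statistic, p_value


# Hypothetical contingency table stored as matrix X.
X = [[30, 15, 5], [10, 100, 10], [5, 15, 10]]
statistic, p_value = chi_squared_3x3(X)

# A table whose rows are proportional shows no association at all.
null_statistic, null_p = chi_squared_3x3([[1, 2, 3], [2, 4, 6], [3, 6, 9]])
```

If a row or column of the table is empty, the degrees of freedom change and the closed form above no longer applies, which is why the sketch rejects such tables.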
Sort all combinations by the gain of information. For the comparison of many datasets, use a matrix and visualize the findings by use of a colored heat map. Select the animal model with the highest gain of information.
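The sorting and matrix step can be sketched as follows in Python. The study names and gain values are invented, and averaging the gain over all human studies to rank the models is an assumption made for this example; the narration only says to select the model with the highest gain of information. The resulting matrix is the kind of input that a heat map function would take.

```python
def rank_models(gains):
    """Arrange pairwise gains as a matrix and rank the animal models.

    gains: dict mapping (human_study, mouse_model) -> gain of information.
    Returns (human_studies, mouse_models, matrix, ranking), where matrix has
    one row per human study and one column per mouse model.
    """
    humans = sorted({human for human, _ in gains})
    mice = sorted({mouse for _, mouse in gains})
    matrix = [[gains[(human, mouse)] for mouse in mice] for human in humans]

    # Assumption: rank each model by its mean gain across the human studies.
    mean_gain = {
        mouse: sum(row[col] for row in matrix) / len(humans)
        for col, mouse in enumerate(mice)
    }
    ranking = sorted(mean_gain, key=mean_gain.get, reverse=True)
    return humans, mice, matrix, ranking


# Hypothetical gains for two human studies and two mouse models.
example_gains = {
    ("human_1", "mouse_A"): 0.30, ("human_1", "mouse_B"): 0.05,
    ("human_2", "mouse_A"): 0.20, ("human_2", "mouse_B"): 0.10,
}

humans, mice, matrix, ranking = rank_models(example_gains)
```

Here mouse_A would be selected, since it has the higher gain against both hypothetical human studies.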
In order to assess the significance of the gain of information, also take the chi-squared test into account. A correlation matrix of pathway comparisons between human and mouse studies is shown here. The overlap of pathway regulation is shown as the gain of information that can be obtained from one study to predict the effects in another study.
Blue means low correlation and red means high correlation between the data. The comparison of human with murine datasets revealed a subgroup of murine models that were highly correlated with the human studies. Therefore, these mouse models are best suited for mimicking the human situation.
In contrast, studies seven, eight, and nine showed no correlation with the human disease studies. Once mastered, this technique can be done in several days if it is performed properly. The exact time depends on the amount of data and the data availability.
While attempting this procedure, it is important to remember that the validity of the results depends on the quality and relevance of the chosen data. After watching this video, you should have a good understanding of how to select an appropriate animal model for a specific human disease based on transcript data.