This computational protocol is significant because it allows researchers to investigate associations between cellular components (for example, mitochondrial proteins) and diseases, as reported in biomedical publications. CaseOLAP LIFT empowers investigators to extract and integrate information from biomedical reports and knowledge bases. Organized as a knowledge graph, these results can be leveraged to predict new relationships.
These research findings support hypothesis generation by highlighting a prioritized list of identified and predicted protein-disease associations, useful for uncovering new insights into disease pathology and therapeutics. This highly customizable workflow can be applied to any cellular component via its GO term and to any list of diseases via their MeSH terms, within any publication date range. This user-friendly protocol minimizes the computational expertise required for analysis.
The software is released as a Docker container, requiring only sufficient computational storage and resources to execute. To begin, open a terminal window and download the CaseOLAP LIFT Docker container by typing docker pull caseolap/caseolap_lift:latest. Create a directory that will store all the program data and output.
Start the Docker container with the command shown on the screen, replacing PATH_TO_FOLDER with the full file path of the folder. To start Elasticsearch within the container, open a new terminal window and type the command shown on the screen. Navigate to the CaseOLAP_LIFT folder.
Make sure that the download links in config/knowledge_base_links.json are up to date and accurate for the latest version of each knowledge base resource. To determine the Gene Ontology (GO) terms, go to the website geneontology.org and find the identifiers for all the GO terms.
Similarly, find the disease categories via Medical Subject Headings (MeSH) identifiers from the website shown on the screen. To execute the pre-processing module, indicate the user-defined studied GO terms using the -c flag, the disease MeSH tree numbers using the -d flag, and specify abbreviations with the -a flag.
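Because the pre-processing module receives these identifiers as plain strings, a quick format check beforehand can catch typos. The following is a minimal sketch and not part of CaseOLAP LIFT: the function names are hypothetical, and the patterns assume the usual identifier shapes, namely GO identifiers written as GO: followed by seven digits and MeSH tree numbers written as a letter and two digits followed by dot-separated three-digit groups.

```python
import re

# Assumed identifier shapes; these helpers are illustrative only
# and are not part of CaseOLAP LIFT.
GO_ID = re.compile(r"GO:\d{7}")
MESH_TREE = re.compile(r"[A-Z]\d{2}(\.\d{3})*")


def is_go_id(term: str) -> bool:
    """Return True if the string looks like a GO identifier, e.g. GO:0005739."""
    return GO_ID.fullmatch(term) is not None


def is_mesh_tree_number(term: str) -> bool:
    """Return True if the string looks like a MeSH tree number, e.g. C14.280.238."""
    return MESH_TREE.fullmatch(term) is not None


def invalid_identifiers(go_terms, mesh_tree_numbers):
    """Collect every identifier that fails its format check, GO terms first."""
    bad = [t for t in go_terms if not is_go_id(t)]
    bad += [t for t in mesh_tree_numbers if not is_mesh_tree_number(t)]
    return bad
```

An empty list from invalid_identifiers suggests that the inputs are at least well formed; it does not confirm that the terms exist in the current GO or MeSH releases.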
To execute the text mining module, type python CaseOLAP_LIFT.py text_mining, adding the -l flag to impute the topics of uncategorized documents and the -t flag to download the full text of the disease-relevant documents. Ensure that the text mining results are in the result folder.
Indicate the text mining results to use for the analysis by specifying either analyze all proteins, to include all the functionally related proteins, or analyze core proteins, to include only the GO term-related proteins. To identify the top proteins and pathways for each disease, the CaseOLAP scores are Z-score transformed within each disease category. Specify the -z flag to set the threshold score above which proteins will be considered significant.
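The within-category Z-score transformation and thresholding can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: the function names, the category labels, and the scores are made up, and the use of the population standard deviation is an assumption, since the narration does not say which estimator is used.

```python
from statistics import mean, pstdev


def z_scores_by_category(scores):
    """Z-score transform protein scores separately within each disease category.

    `scores` maps category -> {protein: score}. Using the population standard
    deviation is an assumption; the pipeline may use a different estimator.
    """
    result = {}
    for category, protein_scores in scores.items():
        values = list(protein_scores.values())
        mu = mean(values)
        sigma = pstdev(values)
        # If every score in a category is identical, report 0.0 for all proteins.
        result[category] = {
            protein: (score - mu) / sigma if sigma > 0 else 0.0
            for protein, score in protein_scores.items()
        }
    return result


def significant_proteins(z_by_category, threshold):
    """Return, per category, the proteins whose Z-score exceeds the threshold."""
    return {
        category: sorted(p for p, z in zs.items() if z > threshold)
        for category, zs in z_by_category.items()
    }


# Made-up scores for two hypothetical categories, purely for illustration.
toy_scores = {
    "CM": {"P1": 0.0, "P2": 0.0, "P3": 0.0, "P4": 4.0},
    "ARR": {"P1": 1.0, "P2": 1.0, "P3": 1.0, "P4": 1.0},
}
toy_z = z_scores_by_category(toy_scores)
toy_hits = significant_proteins(toy_z, threshold=1.5)
```

Because the transformation is applied per category, a protein is judged against the other proteins scored for the same disease category rather than against a global distribution.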
Review the analysis results and adjust as necessary. Open the file z_score_cutoff_table.csv to view the generated Z-score table, which contains the number of proteins significant to each disease category.
This helps the user select an appropriate Z-score threshold. Open the results folder and ensure that the required files, including the folder generated from pre-processing, are present. Check for the all proteins and core proteins folders.
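To see how a Z-score cutoff table can guide the choice of threshold, the sketch below counts, for several candidate thresholds, how many proteins in each category score above it. It is a hedged illustration: the function names, the column layout, and the example values are hypothetical and may not match the actual contents of z_score_cutoff_table.csv.

```python
import csv
import io


def cutoff_table(z_by_category, thresholds):
    """Count, for each threshold, the proteins per category scoring above it."""
    rows = []
    for t in thresholds:
        row = {"threshold": t}
        for category, zs in z_by_category.items():
            row[category] = sum(1 for z in zs.values() if z > t)
        rows.append(row)
    return rows


def to_csv(rows):
    """Render the table as CSV text (the column layout here is hypothetical)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up Z-scores for two hypothetical disease categories.
example_z = {
    "CM": {"P1": 3.4, "P2": 2.1, "P3": 0.2},
    "ARR": {"P1": 1.2, "P2": 2.6, "P3": -0.5},
}
table = cutoff_table(example_z, thresholds=[1.0, 2.0, 3.0])
```

Reading down a column shows how quickly the set of significant proteins shrinks as the threshold rises, which is the trade-off the user weighs when picking a value.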
To design the knowledge graph, include the MeSH disease tree with the include MeSH flag, the protein-protein interactions from STRING with the include PPI flag, the shared Reactome pathways with the include PW flag, and the transcription factor dependence from GRNdb/GTEx with the include TFD flag. Run the knowledge graph construction module, specifying analyze core proteins to include only the GO term-related proteins.
To scale the edge weights, use scale Z-score to apply non-negative Z-scores instead of the default CaseOLAP scores. Check the output and ensure that the knowledge graph files merged_edges.tsv and merged_nodes.tsv are present.
Finally, type the command shown on the screen to run the knowledge graph prediction script for predicting the protein-disease associations. This figure presents the mitochondrial proteins significant to each disease category.
The Z-score transformation was applied to the CaseOLAP scores within each category to identify significant proteins using a threshold of three. The total number of proteins significant to each disease category is shown above each violin plot. The Reactome pathway analysis of these proteins revealed 12 pathways significant to all the diseases.
An example of applying deep learning to a disease-specific knowledge graph is presented in this figure. The hidden relationships between the proteins and the disease are predicted, and the computed probabilities for both predictions are displayed here with values ranging from zero to one, where one indicates a strong prediction. The specified sequence is crucial for the execution of this protocol, particularly the pre-processing and text mining modules.
These two steps directly influence the identification of top proteins and pathways for each disease, as well as the construction of the disease-specific knowledge graph. The resulting knowledge graph can be visualized with graph tools such as Neo4j and Cytoscape, and can be leveraged for advanced deep learning predictions of new relationships. CaseOLAP LIFT enables the study of associations between any cellular component and disease categories.
The resulting knowledge graph and ranked protein-disease associations support natural language processing and follow-up graph-based analyses.
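As one illustration of a follow-up graph-based analysis, the sketch below reads a tab-separated edge table and ranks the nodes connected to a chosen disease node by total edge weight. The column names head, relation, tail, and weight, along with the toy rows, are hypothetical; the actual layout of merged_edges.tsv may differ and should be checked before adapting this.

```python
import csv
import io
from collections import defaultdict


def load_edges(handle):
    """Read a tab-separated edge table into a list of dicts with float weights."""
    edges = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["weight"] = float(row["weight"])
        edges.append(row)
    return edges


def rank_neighbors(edges, node):
    """Rank the nodes linked to `node` by total edge weight, strongest first."""
    totals = defaultdict(float)
    for e in edges:
        if e["head"] == node:
            totals[e["tail"]] += e["weight"]
        elif e["tail"] == node:
            totals[e["head"]] += e["weight"]
    # Sort by descending weight, breaking ties alphabetically by node name.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy edge table with hypothetical column names and made-up values.
toy_tsv = (
    "head\trelation\ttail\tweight\n"
    "P1\tassociated_with\tCM\t0.9\n"
    "P2\tassociated_with\tCM\t0.4\n"
    "P1\tinteracts_with\tP2\t0.7\n"
)
ranked = rank_neighbors(load_edges(io.StringIO(toy_tsv)), "CM")
```

In practice, the io.StringIO object would be replaced by an open file handle on the edge file, after adjusting the column names to match it.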