Our protocol provides a step-by-step method for building a cloud-based phrase mining platform for user-defined entity-category association, to evaluate the association of proteins, genomes or chemicals with specific diseases. The main advantages of this technique are its improved efficiency over manual evaluation of entity-category associations, and the enhanced accessibility and use of phrase mining tools for widespread biomedical research applications. Users can select entities and categories of interest within biomedical publications, or within text documents associated with specific keywords.
New users can follow our protocol and the references provided in the manuscript, and they can raise technical issues within our GitHub repository. Visual demonstration of this method clarifies how to perform the protocol, and encourages the implementation of novel text mining tools. To create a text-cube, first download the latest available Medical Subject Headings (MeSH) tree.
The file for the 2018 MeSH tree is MESHTree2018.bin, and it should be placed in the input directory. Define the categories of interest using one or more MeSH descriptors, and collect the MeSH IDs for each category.
Save the names of the categories in the textcube_config.json file in the config directory, and add the collected MeSH IDs of each category on a single line, separated by spaces. Save the category file as categories.txt in the input directory.
This algorithm automatically selects all descendant MeSH descriptors. Make sure that mesh2pmid.json is in the data directory.
If the MeSH tree has been updated with a different name in the input directory, make sure that this is properly represented in the input data path in the run_textcube.py file.
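The descendant selection mentioned above can be pictured with a minimal sketch. MeSH tree numbers encode the hierarchy with dots, so a descendant's tree number begins with its ancestor's tree number followed by a dot. The tree numbers and names below are made up for illustration, the function name is hypothetical, and this is not the code inside run_textcube.py; the protocol does not state whether categories.txt holds tree numbers or descriptor IDs.

```python
# Illustrative sketch only: the tree numbers and names are made up,
# and descendants() is a hypothetical helper, not part of the pipeline.

def descendants(root_tree_numbers, tree):
    """Return the names of descriptors at or below any given root tree number."""
    selected = set()
    for name, number in tree:
        for root in root_tree_numbers:
            # A descendant repeats the root's number and continues after a dot.
            if number == root or number.startswith(root + "."):
                selected.add(name)
    return selected


toy_tree = [
    ("Category A", "X01"),
    ("Child of A", "X01.100"),
    ("Grandchild of A", "X01.100.200"),
    ("Lookalike", "X011"),  # shares a prefix with X01 but is not a descendant
    ("Category B", "X02"),
]

selected = descendants(["X01"], toy_tree)
print(sorted(selected))
```

Matching on the root followed by a dot, rather than on the bare prefix, keeps a sibling such as X011 out of the X01 category.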
To create the document structure called a text-cube, enter python run_textcube.py in the terminal. This creates a collection of documents for each category. A single document may fall under multiple categories.
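As a rough picture of what a text-cube holds, the sketch below groups PMIDs by cell and then inverts that mapping, which shows how one document can belong to several cells. The dictionary layouts, the toy terms and PMIDs, and the function name are assumptions for illustration; the actual JSON structure written by run_textcube.py may differ.

```python
from collections import defaultdict


def build_textcube(mesh2pmid, meshterms_per_cat):
    """Hypothetical helper: map each cell to its PMIDs, and each PMID to its cells."""
    cell2pmid = {}
    for cell, terms in meshterms_per_cat.items():
        pmids = set()
        for term in terms:
            pmids.update(mesh2pmid.get(term, []))
        cell2pmid[cell] = sorted(pmids)

    # Invert the mapping so each document lists every cell it falls under.
    pmid2cell = defaultdict(list)
    for cell, pmids in cell2pmid.items():
        for pmid in pmids:
            pmid2cell[pmid].append(cell)
    return cell2pmid, dict(pmid2cell)


# Toy inputs (made up): MeSH term -> PMIDs, and cell -> descendant terms.
mesh2pmid = {"term_a": ["1", "2"], "term_b": ["2", "3"], "term_c": ["3", "4"]}
meshterms_per_cat = {"cell_0": ["term_a", "term_b"], "cell_1": ["term_c"]}

cell2pmid, pmid2cell = build_textcube(mesh2pmid, meshterms_per_cat)
print(cell2pmid)
print(pmid2cell["3"])  # PMID 3 falls under both cells
```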
Once the text-cube creation step has been completed, make sure that a cell-to-PMID mapping table is saved in the data directory as textcube_cell2pmid.json. A PMID-to-cell mapping table is saved in the data directory as textcube_pmid2cell.json. A collection of all descendant MeSH terms for each cell is saved in the data directory as meshterms_per_cat.json.
The text-cube data statistics are saved in the data directory as textcube_stat.txt. Then, go to the log directory to read the log messages in textcube_log.txt, in case this process fails.
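The check for the four output files can also be scripted. The sketch below is a small convenience, not part of the pipeline: the file names come from the protocol, while the helper name is hypothetical and the demonstration uses a temporary directory in place of the real data directory.

```python
import tempfile
from pathlib import Path

# Output files named in the protocol for the text-cube creation step.
EXPECTED_OUTPUTS = [
    "textcube_cell2pmid.json",
    "textcube_pmid2cell.json",
    "meshterms_per_cat.json",
    "textcube_stat.txt",
]


def missing_outputs(data_dir, expected=EXPECTED_OUTPUTS):
    """Hypothetical helper: list the expected files that are absent from data_dir."""
    base = Path(data_dir)
    return [name for name in expected if not (base / name).is_file()]


# Demonstration in a temporary directory holding only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "textcube_cell2pmid.json").write_text("{}")
    (Path(tmp) / "textcube_stat.txt").write_text("")
    missing = missing_outputs(tmp)

print(missing)
```

In practice, the call would point at the data directory of the repository, for example missing_outputs("data"), assuming the script is run from the repository root.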
If the process is completed successfully, the debugging messages of the text-cube creation will be printed out in the log file. For an entity count, create the user-defined entity list, placing each entity and its abbreviations on a single line, separated by the vertical line symbol (|).
Save the entity file as entities.txt in the input directory, and make sure that the Elasticsearch server is running. If an indexed database called PubMed is present in the Elasticsearch server, confirm the presence of the textcube_pmid2cell.json file in the data directory, and enter python run_entitycount.py in the terminal to perform an entity count operation.
When all of the documents from the indexed database have been processed, the number of entities in each document has been counted, and the PMIDs in which entities were found have been collected, confirm that the final results are saved as entitycount.txt and entityfound_pmid2cell.json in the data directory.
Then, open the log directory to read the log messages in the entitycount_log.txt file, in case this process fails.
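To make the entity file format concrete, the sketch below parses lines in which an entity and its abbreviations are separated by the vertical line symbol, and counts whole-word mentions in a piece of text. It uses plain regular expressions on a made-up sentence rather than the Elasticsearch index, so it only illustrates the idea; the entity names and function names are hypothetical.

```python
import re


def parse_entities(lines):
    """Map the first name on each line to all names listed on that line."""
    entities = {}
    for line in lines:
        names = [name.strip() for name in line.split("|") if name.strip()]
        if names:
            entities[names[0]] = names
    return entities


def count_entities(text, entities):
    """Count case-insensitive whole-word mentions of each entity or abbreviation."""
    counts = {}
    for entity, names in entities.items():
        total = 0
        for name in names:
            # The lookarounds prevent matches inside longer words.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            total += len(re.findall(pattern, text, flags=re.IGNORECASE))
        counts[entity] = total
    return counts


# Made-up entity lines in the format described for entities.txt.
entity_lines = ["protein alpha|PA", "protein beta|PB|p-beta"]
entities = parse_entities(entity_lines)
counts = count_entities("Protein alpha (PA) binds PB; PA is abundant.", entities)
print(counts)
```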
If the process is completed successfully, the debugging messages of the entity count will be printed out in the log file. Make sure that all input data are in the data directory. These are the input data for the metadata update.
To prepare a collection of metadata, enter python run_metadata_update.py in the terminal. Once the metadata update is complete, make sure that the metadata_pmid2pcount.json and metadata_cell2pmid.json files are saved in the data directory.
Go to the log directory to read the log messages in the metadata_update_log.txt file, in case this process fails.
If the process is completed successfully, the debugging messages of the metadata update will be printed out in the log file. For context-aware semantic online analytical processing (CaseOLAP) score calculation, confirm the presence of the metadata_pmid2pcount.json and metadata_cell2pmid.json files in the data directory.
These are the input data for the score calculation.
Enter python run_caseolap_score.py in the terminal to perform a context-aware semantic online analytical processing score calculation of the entities based on user-defined categories. The score is the product of integrity, popularity and distinctiveness.
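The final multiplication step can be sketched as follows. The protocol states only that the score is the product of the three components, so the component values below are made-up numbers for two hypothetical entities, and how run_caseolap_score.py computes each component is not shown here.

```python
def caseolap_score(integrity, popularity, distinctiveness):
    """Combine the three components into one score by multiplication."""
    return integrity * popularity * distinctiveness


# Made-up component values for two hypothetical entities in one cell.
components = {
    "entity_x": {"integrity": 1.0, "popularity": 0.5, "distinctiveness": 0.5},
    "entity_y": {"integrity": 1.0, "popularity": 0.25, "distinctiveness": 0.5},
}

scores = {name: caseolap_score(**parts) for name, parts in components.items()}

# Rank the entities from the highest score to the lowest.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Because the components are multiplied, an entity needs a nonzero value in all three to receive a nonzero score.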
Once the score computation is complete, confirm that the results are saved in the result directory. Then, access the log directory to read the log messages in the caseolap_score_log.txt file, in case this process fails.
If the process is completed successfully, the debugging messages of the CaseOLAP score calculation will be printed out in the log file. Using the obtained metadata and statistics from the four infant, child, adolescent and adult age group subcategories, a comparison of the number of documents among the text-cube cells can be displayed. Here, the adult subcategory contains the highest number of documents across all cells, with the adult and adolescent subcategories having the highest number of shared documents, which contain the entity of interest for this representative analysis.
By assessing the protein-age group association as a context-aware semantic online analytical processing score, the top 10 proteins associated with the infant, child, adolescent and adult subcategories could be determined. Here, the obtained metadata and statistics for the nutritional and metabolic diseases subcategories are shown. The metabolic disease subcategory contains almost three times as many documents as the nutritional disorders subcategory.
The metabolic disease and nutritional disorders subcategories have 7,101 shared documents. Notably, these documents included the entity of interest for the representative study. More than half of all of the proteins are shared between the subcategories, with almost half of all of the associated proteins in the metabolic disease subcategory unique to that subcategory, and with the nutritional disorders subcategory exhibiting only a few unique proteins.
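Shared-document counts of this kind amount to set intersections between cells. The sketch below assumes a cell-to-PMID dictionary with made-up cell names and PMIDs; the function name is hypothetical and the numbers are unrelated to the results described here.

```python
from itertools import combinations


def shared_document_counts(cell2pmid):
    """Count the PMIDs shared by each pair of cells."""
    counts = {}
    for cell_a, cell_b in combinations(sorted(cell2pmid), 2):
        shared = set(cell2pmid[cell_a]) & set(cell2pmid[cell_b])
        counts[(cell_a, cell_b)] = len(shared)
    return counts


# Toy cell-to-PMID mapping (made up).
toy_cell2pmid = {
    "cell_a": ["1", "2", "3", "4"],
    "cell_b": ["3", "4", "5"],
    "cell_c": ["6"],
}

shared_counts = shared_document_counts(toy_cell2pmid)
print(shared_counts)
```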
Independent and distinct categories, and a collection of all of the synonyms and abbreviations of an entity, will provide the best results. Since entity-category association is presented as a numerical value, this opens the door to implementing machine learning techniques such as clustering and principal component analysis. This technique facilitates the discovery of hidden or previously unidentified relationships within these associations, paving the way for a deeper understanding of biological processes.
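As a small illustration of principal component analysis on such scores, the sketch below finds the first principal component of a toy entity-by-category score matrix using power iteration in plain Python. The scores are made up and the function name is hypothetical; in practice an established numerical library would normally be used, and the sign of the returned component is arbitrary.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the leading principal direction and each row's projection onto it."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Sample covariance matrix of the centered columns.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration; the start vector must not be orthogonal to the answer.
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


# Made-up scores: rows are entities, columns are two categories.
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
component, projections = first_principal_component(toy_scores)
print(component)
print(projections)
```

In this toy matrix the first two entities project to one side of the component and the last two to the other, which is the kind of grouping a clustering step could then pick up.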