Here, we introduce a protocol for converting transcriptomic data into a mqTrans view, enabling the identification of dark biomarkers. While not differentially expressed in conventional transcriptomic analyses, these biomarkers exhibit differential expression in the mqTrans view. The approach serves as a complementary technique to traditional methods, unveiling previously overlooked biomarkers.
The transcriptome represents the expression levels of many genes in a sample and has been widely used in biological research and clinical practice. Researchers usually focus on transcriptomic biomarkers with differential representations between a phenotype group and a control group of samples. This study presents a multitask graph-attention network (GAT) learning framework that learns the complex inter-genic interactions of the reference samples. A demonstrative reference model was pre-trained on healthy samples (HealthModel) and can be directly used to generate the model-based quantitative transcriptional regulation (mqTrans) view of independent test transcriptomes. The generated mqTrans view of transcriptomes was demonstrated by prediction tasks and dark biomarker detection. The term "dark biomarker" stems from its definition: a dark biomarker shows differential representation in the mqTrans view but no differential expression at its original expression level. Dark biomarkers are overlooked in traditional biomarker detection studies due to the absence of differential expression. The source code and the manual of the pipeline HealthModelPipe can be downloaded from http://www.healthinformaticslab.org/supp/resources.php.
The transcriptome consists of the expression levels of all the genes in a sample and may be profiled by high-throughput technologies like microarray and RNA-seq1. The expression levels of one gene across a dataset constitute a transcriptomic feature, and the differential representation of a transcriptomic feature between the phenotype and control groups defines this gene as a biomarker of this phenotype2,3. Transcriptomic biomarkers have been extensively utilized in investigations of disease diagnosis4, biological mechanisms5, and survival analysis6,7, etc.
Gene activity patterns in healthy tissues carry crucial information about life processes8,9. These patterns offer invaluable insights and act as ideal references for understanding the complex developmental trajectories of benign disorders10,11 and lethal diseases12. Genes interact with each other, and transcriptomes represent the final expression levels after their complicated interactions. Such patterns are formulated as transcriptional regulation networks13 and metabolism networks14, etc. The expression of messenger RNAs (mRNAs) can be transcriptionally regulated by transcription factors (TFs) and long intergenic non-coding RNAs (lincRNAs)15,16,17. Conventional differential expression analysis ignores such complex gene interactions by assuming inter-feature independence18,19.
Recent advancements in graph neural networks (GNNs) demonstrate extraordinary potential in extracting important information from OMIC-based data for cancer studies20, e.g., identifying co-expression modules21. The innate capacity of GNNs renders them ideal for modeling the intricate relationships and dependencies among genes22,23.
Biomedical studies often focus on accurately predicting a phenotype against the control group. Such tasks are commonly formulated as binary classifications24,25,26. Here, the two class labels are typically encoded as 1 and 0, true and false, or even positive and negative27.
This study aimed to provide an easy-to-use protocol for generating the transcriptional regulation (mqTrans) view of a transcriptome dataset based on the pre-trained graph-attention network (GAT) reference model. The multitask GAT framework from a previously published work26 was used to transform transcriptomic features to the mqTrans features. A large dataset of healthy transcriptomes from the University of California, Santa Cruz (UCSC) Xena platform28 was used to pre-train the reference model (HealthModel), which quantitatively measured the transcription regulations from the regulatory factors (TFs and lincRNAs) to the target mRNAs. The generated mqTrans view could be used to build prediction models and detect dark biomarkers. This protocol utilizes the colon adenocarcinoma (COAD) patient dataset from The Cancer Genome Atlas (TCGA) database29 as an illustrative example. In this context, patients in stages I or II are categorized as negative samples, while those in stages III or IV are considered positive samples. The distributions of dark and traditional biomarkers across the 26 TCGA cancer types are also compared.
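The stage-to-label mapping described above (stages I/II negative, stages III/IV positive) can be sketched in Python. The stage strings below are illustrative assumptions about how tumor stage annotations appear; real TCGA clinical files may use sub-stage suffixes such as "Stage IIA".

```python
# Map tumor-stage annotations to binary labels for the COAD task:
# stages I/II -> 0 (negative), stages III/IV -> 1 (positive).
# Stage strings are illustrative stand-ins for TCGA clinical annotations.

def stage_to_label(stage: str) -> int:
    """Return 0 for early-stage (I/II) and 1 for late-stage (III/IV)."""
    s = stage.upper().replace("STAGE", "").strip()
    base = s.rstrip("ABC")          # drop sub-stage suffixes, e.g., "IIA" -> "II"
    if base in ("I", "II"):
        return 0
    if base in ("III", "IV"):
        return 1
    raise ValueError(f"Unrecognized stage: {stage}")

labels = [stage_to_label(s) for s in ["Stage I", "Stage IIA", "Stage III", "Stage IV"]]
print(labels)  # [0, 0, 1, 1]
```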
Description of the HealthModel pipeline
The methodology employed in this protocol is based on the previously published framework26, as outlined in Figure 1. To commence, users are required to prepare the input dataset, feed it into the proposed HealthModel pipeline, and obtain mqTrans features. Detailed data preparation instructions are provided in section 2 of the protocol. Subsequently, users have the option to combine the mqTrans features with the original transcriptomic features or proceed with the generated mqTrans features only. The produced dataset is then subjected to a feature selection process, and users have the flexibility to choose their preferred value of k in the k-fold cross-validation for classification. The primary evaluation metric utilized in this protocol is accuracy.
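The evaluation step described above (k-fold cross-validation scored by accuracy) can be sketched with scikit-learn. This is a minimal stand-in, not the pipeline's own code: the synthetic matrix X plays the role of the mqTrans (and/or original) feature table, and the classifier choice is arbitrary.

```python
# Minimal sketch of the evaluation step: k-fold cross-validation with
# accuracy as the metric, run on a synthetic stand-in dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                 # 100 samples x 20 features
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

k = 5                                          # user-chosen k for k-fold CV
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=k, scoring="accuracy")
print(scores.mean())
```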
HealthModel26 categorizes the transcriptomic features into three distinct groups: TF (Transcription Factor), lincRNA (long intergenic non-coding RNA), and mRNA (messenger RNA). The TF features are defined based on the annotations available in the Human Protein Atlas30,31. This work utilizes the annotations of lincRNAs from the GTEx dataset32. Genes belonging to the third-level pathways in the KEGG database33 are considered as mRNA features. It is worth noting that if an mRNA feature exhibits regulatory roles for a target gene as documented in the TRRUST database34, it is reclassified into the TF class.
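The three-way categorization with the TRRUST-based reclassification rule can be sketched as follows. The tiny gene sets here are hypothetical stand-ins for the Human Protein Atlas (TFs), GTEx (lincRNAs), KEGG (mRNAs), and TRRUST (regulators) annotation resources.

```python
# Sketch of the TF / lincRNA / mRNA feature categorization.
# The annotation sets are illustrative stand-ins, not real resource dumps.

hpa_tfs = {"TP53", "STAT3"}
gtex_lincrnas = {"LINC00115"}
kegg_mrnas = {"EGFR", "MYC", "GAPDH"}
trrust_regulators = {"MYC"}        # mRNAs with documented regulatory roles

def categorize(gene: str) -> str:
    if gene in hpa_tfs:
        return "TF"
    if gene in gtex_lincrnas:
        return "lincRNA"
    if gene in kegg_mrnas:
        # An mRNA with a regulatory role in TRRUST is reclassified as TF.
        return "TF" if gene in trrust_regulators else "mRNA"
    return "unclassified"

print({g: categorize(g) for g in ["TP53", "LINC00115", "MYC", "GAPDH"]})
```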
This protocol also provides two manually generated example files for the gene IDs of the regulatory factors (regulatory_geneIDs.csv) and the target mRNAs (target_geneIDs.csv). The pairwise distance matrix among the regulatory features (TFs and lincRNAs) is calculated from the Pearson correlation coefficients and clustered by the popular tool weighted gene co-expression network analysis (WGCNA)36 (adjacent_matrix.csv). Users can directly utilize the HealthModel pipeline together with these example configuration files to generate the mqTrans view of a transcriptomic dataset.
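The pairwise correlation matrix underlying adjacent_matrix.csv can be sketched as below. This only prepares the Pearson correlation matrix; the WGCNA clustering itself is not reproduced, and the matrix dimensions and file name are illustrative.

```python
# Sketch of building the pairwise Pearson correlation matrix among
# regulatory features (rows: samples, columns: TFs/lincRNAs), written
# to a CSV as in the example configuration file adjacent_matrix.csv.
import numpy as np

rng = np.random.default_rng(1)
expr = rng.normal(size=(50, 6))            # 50 samples x 6 regulatory features
corr = np.corrcoef(expr, rowvar=False)     # 6 x 6 Pearson correlation matrix

np.savetxt("adjacent_matrix.csv", corr, delimiter=",")
print(corr.shape)
```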
Technical details of HealthModel
HealthModel represents the intricate relationships among TFs and lincRNAs as a graph, with the input features serving as the vertices denoted by V and an inter-vertex edge matrix designated as E. Each sample is characterized by K regulatory features, symbolized as V_{K×1}. Specifically, the dataset encompassed 425 TFs and 375 lincRNAs, resulting in a sample dimensionality of K = 425 + 375 = 800. To establish the edge matrix E, this work employed the popular tool WGCNA35. The pairwise weight linking two vertices, v_i and v_j, is determined by the Pearson correlation coefficient. The gene regulatory network exhibits a scale-free topology36, characterized by the presence of hub genes with pivotal functional roles. The correlation between two features or vertices, v_i and v_j, is computed using the topological overlap measure (TOM) as follows:

a_ij = |cor(v_i, v_j)|^β (1)

TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 − a_ij), where l_ij = Σ_{u≠i,j} a_iu a_uj (2)

The soft threshold β is calculated using the pickSoftThreshold function from the WGCNA package. The power exponential function in Equation (1) defines the adjacency a_ij, where u in Equation (2) indexes any gene other than i and j, and k_i = Σ_{u≠i} a_iu represents the vertex connectivity. WGCNA clusters the expression profiles of the transcriptomic features into multiple modules using the commonly employed dissimilarity measure d_ij = 1 − TOM_ij37.
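The WGCNA-style adjacency and TOM computation can be sketched numerically as follows. This is a simplified illustration: β is fixed by hand (WGCNA chooses it with pickSoftThreshold to approximate a scale-free topology), and the data are synthetic.

```python
# Numeric sketch of soft-thresholded adjacency and the topological
# overlap measure (TOM) used to define inter-feature dissimilarity.
import numpy as np

rng = np.random.default_rng(2)
expr = rng.normal(size=(40, 5))                 # samples x features
corr = np.corrcoef(expr, rowvar=False)

beta = 6                                        # illustrative soft threshold
A = np.abs(corr) ** beta                        # adjacency a_ij = |cor|^beta
np.fill_diagonal(A, 0.0)

k = A.sum(axis=1)                               # vertex connectivity k_i
L = A @ A                                       # l_ij = sum_u a_iu * a_uj (diag of A is 0)
TOM = (L + A) / (np.minimum.outer(k, k) + 1.0 - A)
np.fill_diagonal(TOM, 1.0)

dissTOM = 1.0 - TOM                             # dissimilarity used for clustering
print(TOM.shape)
```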
The HealthModel framework was originally designed as a multitask learning architecture26. This protocol only utilizes the model pre-training task for the construction of the transcriptomic mqTrans view. The user may choose to further refine the pre-trained HealthModel under the multitask graph attention network with additional task-specific transcriptomic samples.
Technical details of feature selection and classification
The feature selection pool implements eleven feature selection (FS) algorithms. Among them, three are filter-based FS algorithms: selecting K best features using the Maximal Information Coefficient (SK_mic), selecting K features based on the FPR of MIC (SK_fpr), and selecting K features with the highest false discovery rate of MIC (SK_fdr). Additionally, three tree-based FS algorithms assess individual features using a decision tree with the Gini index (DT_gini), adaptive boosted decision trees (AdaBoost), and random forest (RF_fs). The pool also incorporates two wrapper methods: Recursive feature elimination with the linear support vector classifier (RFE_SVC) and recursive feature elimination with the logistic regression classifier (RFE_LR). Finally, two embedding algorithms are included: linear SVC classifier with the top-ranked L1 feature importance values (lSVC_L1) and logistic regression classifier with the top-ranked L1 feature importance values (LR_L1).
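A small version of this pool can be sketched with scikit-learn, one selector per family (filter, wrapper, embedded). Note that MIC is not part of scikit-learn, so mutual information is used here as a stand-in score function, and the data are synthetic.

```python
# Sketch of a feature-selection pool with one filter, one wrapper,
# and one embedded method, echoing the families described above.
import numpy as np
from sklearn.feature_selection import (SelectKBest, mutual_info_classif,
                                       RFE, SelectFromModel)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 30))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

selectors = {
    "SK_mi":   SelectKBest(mutual_info_classif, k=5),                      # filter
    "RFE_LR":  RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=5),                                # wrapper
    "lSVC_L1": SelectFromModel(LinearSVC(penalty="l1", dual=False,
                                         max_iter=5000), max_features=5),  # embedded
}
results = {name: sel.fit_transform(X, y).shape for name, sel in selectors.items()}
print(results)
```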
The classifier pool employs seven different classifiers to build classification models. These classifiers comprise linear support vector machine (SVC), Gaussian Naïve Bayes (GNB), logistic regression classifier (LR), k-nearest neighbor, with k set to 5 by default (KNN), XGBoost, random forest (RF), and decision tree (DT).
The random split of the dataset into train and test subsets can be configured on the command line. The demonstrated example uses the ratio train:test = 8:2.
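The 8:2 split and a few classifiers from the pool can be sketched together as follows. The data are synthetic stand-ins, and only three of the seven classifiers are shown.

```python
# Sketch of the train:test = 8:2 split and a subset of the classifier
# pool, each evaluated by accuracy on held-out test samples.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)

# train:test = 8:2, as in the demonstrated example
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = {
    "LR":  LogisticRegression(max_iter=1000),
    "GNB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),   # k = 5 by default
}
accs = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
        for name, clf in classifiers.items()}
print(accs)
```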
NOTE: The following protocol describes the details of the informatics analytic procedure and the Python commands of the major modules. Figure 2 illustrates the three major steps with example commands utilized in this protocol; refer to previously published works26,38 for more technical details. Perform the following protocol under a normal user account in the computer system and avoid using the administrator or root account. This is a computational protocol and has no biomedical hazards.
1. Prepare Python environment
2. Use the pre-trained HealthModel to generate the mqTrans features
3. Select mqTrans Features
Evaluation of the mqTrans view of the transcriptomic dataset
The test code uses eleven feature selection (FS) algorithms and seven classifiers to evaluate how the generated mqTrans view of the transcriptomic dataset contributes to the classification task (Figure 6). The test dataset consists of 317 colon adenocarcinoma (COAD) samples from The Cancer Genome Atlas (TCGA) database29. The COAD patients at stages I or II are regarded as the negative samples,...
Section 2 (Use the pre-trained HealthModel to generate the mqTrans features) is the most critical step within this protocol. After preparing the computational working environment in section 1, section 2 generates the mqTrans view of a transcriptomic dataset based on the pre-trained large reference model. Section 3 is a demonstrative example of selecting the generated mqTrans features for biomarker detection and prediction tasks. The users can conduct other transcriptomic analyses on this mqTrans dataset ...
The authors have nothing to disclose.
This work was supported by the Senior and Junior Technological Innovation Team (20210509055RQ), Guizhou Provincial Science and Technology Projects (ZK2023-297), the Science and Technology Foundation of Health Commission of Guizhou Province (gzwkj2023-565), Science and Technology Project of Education Department of Jilin Province (JJKH20220245KJ and JJKH20220226SK), the National Natural Science Foundation of China (U19A2061), the Jilin Provincial Key Laboratory of Big Data Intelligent Computing (20180622002JC), and the Fundamental Research Funds for the Central Universities, JLU. We extend our sincerest appreciation to the review editor and the three anonymous reviewers for their constructive critiques, which have been instrumental in substantially enhancing the rigor and clarity of this protocol.
| Name | Company | Catalog Number | Comments |
| --- | --- | --- | --- |
| Anaconda | Anaconda | version 2020.11 | Python programming platform |
| Computer | N/A | N/A | Any general-purpose computer satisfies the requirement |
| GPU card | N/A | N/A | Any general-purpose GPU card with the CUDA computing library |
| PyTorch | PyTorch | version 1.13.1 | Software |
| torch-geometric | PyTorch | version 2.2.0 | Software |