Aby wyświetlić tę treść, wymagana jest subskrypcja JoVE. Zaloguj się lub rozpocznij bezpłatny okres próbny.
Method Article
Existing algorithms generate one solution for a biomarker detection dataset. This protocol demonstrates the existence of multiple similarly effective solutions and presents a user-friendly software to help biomedical researchers investigate their datasets for the proposed challenge. Computer scientists may also provide this feature in their biomarker detection algorithms.
Biomarker detection is one of the more important biomedical questions for high-throughput 'omics' researchers, and almost all existing biomarker detection algorithms generate one biomarker subset with the optimized performance measurement for a given dataset. However, a recent study demonstrated the existence of multiple biomarker subsets with similarly effective or even identical classification performances. This protocol presents a simple and straightforward methodology for detecting biomarker subsets with binary classification performances, better than a user-defined cutoff. The protocol consists of data preparation and loading, baseline information summarization, parameter tuning, biomarker screening, result visualization and interpretation, biomarker gene annotations, and result and visualization exportation at publication quality. The proposed biomarker screening strategy is intuitive and demonstrates a general rule for developing biomarker detection algorithms. A user-friendly graphical user interface (GUI) was developed using the programming language Python, allowing biomedical researchers to have direct access to their results. The source code and manual of kSolutionVis can be downloaded from http://www.healthinformaticslab.org/supp/resources.php.
Binary classification, one of the most commonly investigated and challenging data mining problems in the biomedical area, is used to build a classification model trained on two groups of samples with the most accurate discrimination power1,2,3,4,5,6,7. However, the big data generated in the biomedical field has the inherent "large p small n" paradigm, with the number of features usually much larger than the number of samples6,8,9. Therefore, biomedical researchers have to reduce the feature dimension before utilizing the classification algorithms to avoid the overfitting problem8,9. Diagnosis biomarkers are defined as a subset of detected features separating patients of a given disease from healthy control samples10,11. Patients are usually defined as the positive samples, and the healthy controls are defined as the negative samples12.
Recent studies have suggested that there exists more than one solution with identical or similarly effective classification performances for a biomedical dataset5. Almost all the feature selection algorithms are deterministic algorithms, producing only one solution for the same dataset. Genetic algorithms may simultaneously generate multiple solutions with similar performances, but they still try to select one solution with the best fitness function as the output for a given dataset13,14.
Feature selection algorithms can be roughly grouped as either filters or wrappers12. A filter algorithm chooses the top-k features ranked by their significant individual association with the binary class labels based on the assumption that features are independent of each other15,16,17. Although this assumption does not hold true for almost all real-world datasets, the heuristic filter rule performs well in many cases, for instance, the mRMR (Minimum Redundancy and Maximum Relevance) algorithm, the Wilcoxon test based feature filtering (WRank) algorithm, and the ROC (Receiver operating characteristic) plot based filtering (ROCRank) algorithm. mRMR, is an efficient filter algorithm because it approximates the combinatorial estimation problem with a series of much smaller problems, comparing to the maximum-dependency feature selection algorithm, each of which only involves two variables, and therefore uses pairwise joint probabilities which are more robust18,19. However, mRMR may underestimate the usefulness of some features as it does not measure the interactions between features which can increase relevancy, and thus misses some feature combinations that are individually useless but are useful only when combined. The WRank algorithm calculates a non-parametric score of how discriminative a feature is between two classes of samples, and is known for its robustness for outliers20,21. Furthermore, the ROCRank algorithm evaluates how significant the Area Under the ROC Curve (AUC) of a particular feature is for the investigated binary classification performance22,23.
On the other hand, a wrapper evaluates the pre-defined classifier's performance of a given feature subset, iteratively generated by a heuristic rule, and creates the feature subset with the best performance measurement24. A wrapper generally outperforms a filter in the classification performance but runs slower25. For example, the Regularized Random Forest (RRF)26,27 algorithm uses a greedy rule, by evaluating the features on a subset of the training data at each random forest node, whose feature importance scores are evaluated by the Gini index. The choice of a new feature will be penalized if its information gain does not improve that of the chosen features. Additionally, the Prediction Analysis for Microarrays (PAM)28,29 algorithm, also a wrapper algorithm, calculates a centroid for each of the class labels, and then selects features to shrink the gene centroids toward the overall class centroid. PAM is robust for outlying features.
Multiple solutions with the top classification performance may be necessary for any given dataset. Firstly, the optimization goal of a deterministic algorithm is defined by a mathematical formula, e.g., minimum error rate30, which is not necessarily ideal for biological samples. Secondly, a dataset may have multiple, significantly different, solutions with similar effective or even identical performances. Almost all existing feature selection algorithms will randomly select one of these solutions as the output31.
This study will introduce an informatics analytic protocol for generating multiple feature selection solutions with similar performances for any given binary classification dataset. Considering that most biomedical researchers are not familiar with informatic techniques or computer coding, a user-friendly graphical user interface (GUI) was developed to facilitate the rapid analysis of biomedical binary classification datasets. The analytic protocol consists of data loading and summarizing, parameter tuning, pipeline execution, and result interpretations. With a simple click, the researcher is able to generate the biomarker subsets and publication-quality visualization plots. The protocol has been tested using the transcriptomes of two binary classification datasets of Acute Lymphoblastic Leukemia (ALL), i.e., ALL1 and ALL212. The datasets of ALL1 and ALL2 were downloaded from the Broad Institute Genome Data Analysis Center, available at http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. ALL1 contains 128 samples with 12,625 features. Of these samples, 95 are B-cell ALL and 33 are T-cell ALL. ALL2 includes 100 samples with 12,625 features as well. Of these samples, there are 65 patients that suffered relapse and 35 patients that did not. ALL1 was an easy binary classification dataset, with a minimum accuracy of four filters and four wrappers being 96.7%, and 6 of the 8 feature selection algorithms achieving 100%12. While ALL2 was a more difficult dataset, with the above 8 feature selection algorithms achieving no better than 83.7% accuracy12. This best accuracy was achieved with 56 features detected by the wrapper algorithm, Correlation-based Feature Selection (CFS).
NOTE: The following protocol describes the details of the informatics analytic procedure and pseudo-codes of the major modules. The automatic analysis system was developed using Python version 3.6.0 and the Python modules pandas, abc, numpy, scipy, sklearn, sys, PyQt5, sys, mRMR, math and matplotlib. The materials used in this study are listed in the Table of Materials.
1. Prepare the Data Matrix and Class Labels
2. Load the Data Matrix and Class Labels
3. Summarize and Display the Baseline Statistics of the Dataset
4. Determine the Class Labels and the Number of Top-ranked Features
5. Tune System Parameters for Different Performances
6. Run the Pipeline and Produce the INTERACTIVE VISUALIZED RESULTS
7. Interpret the 3D Scatter Plots-Visualize and Interpret the Feature Subsets with Similarly Effective Binary Classification Performances Using 3D Scatter Plots
8. Find Gene Annotations and Their Associations with Human Diseases
NOTE: Steps 8 to 10 will illustrate how to annotate a gene from the sequence level of both DNA and protein. Firstly, the gene symbol of each biomarker ID from the above steps will be retrieved from the database DAVID32, and then two representative web servers will be used to analyze this gene symbol from the levels of DNA and protein, respectively. The server GeneCard provides a comprehensive functional annotation of a given gene symbol, and the Online Mendelian Inheritance in Man database (OMIM) provides the most comprehensive curation of disease-gene associations. The server UniProtKB is one of the most comprehensive protein database, and the server Group-based Prediction System (GPS) predicts the signaling phosphorylation’s for a very large list of kinases.
9. Annotate the Encoded Proteins and the Post-Translational Modifications
10. Annotate Protein-Protein Interactions and Their Enriched Functional Modules
11. Export the Generated Biomarker Subsets and the Visualization Plots
The goal of this workflow (Figure 6) is to detect multiple biomarker subsets with similar efficiencies for a binary classification dataset. The whole process is illustrated by two example datasets ALL1 and ALL2 extracted from a recently-published biomarker detection study12,48. A user may install kSolutionVis by following the instructions in the supplementary materials.
This study presents an easy-to-follow multi-solution biomarker detection and characterization protocol for a user-specified binary classification dataset. The software puts an emphasis on user-friendliness and flexible import/export interfaces for various file formats, allowing a biomedical researcher to investigate their dataset easily using the GUI of the software. This study also highlights the necessity of generating more than one solution with similarly effective modeling performances, previously ignored by many exi...
We have no conflicts of interest related to this report.
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040400) and the startup grant from Jilin University. Anonymous reviewers and biomedical testing users were appreciated for their constructive comments on improving the usability and functionality of kSolutionVis.
Name | Company | Catalog Number | Comments |
Hardware | |||
laptop | Lenovo | X1 carbon | Any computer works. Recommended minimum configuration: 1GB extra hard disk space, 1 GB memory, 2.0MHz CPU |
Name | Company | Catalog Number | Comments |
Software | |||
Python 3.0 | WingWare | Wing Personal | Any python programming and running environments support Python version 3.0 or above |
Zapytaj o uprawnienia na użycie tekstu lub obrazów z tego artykułu JoVE
Zapytaj o uprawnieniaThis article has been published
Video Coming Soon
Copyright © 2025 MyJoVE Corporation. Wszelkie prawa zastrzeżone