This protocol describes a tool for identifying significant molecular changes in cancer, which may support the development of new diagnostic and therapeutic approaches for esophageal squamous cell carcinoma.
Esophageal cancer (EC) is the eighth most common malignancy worldwide, and its treatment remains challenging due to the lack of biomarkers facilitating early detection. EC manifests in two major histological forms, adenocarcinoma (EAD) and squamous cell carcinoma (ESCC), both exhibiting variations in incidence across geographically distinct populations. High-throughput technologies are transforming the understanding of diseases, including cancer. A significant challenge for the scientific community is dealing with data scattered across the literature. To address this, a simple pipeline is proposed for the analysis of publicly available microarray datasets and the collection of differentially regulated molecules between cancer and normal conditions. The pipeline can serve as a standard approach for differential gene expression analysis, identifying genes differentially expressed between cancer and normal tissues or among different cancer subtypes. The pipeline involves several steps: data preprocessing (quality control and normalization of raw gene expression data to remove technical variation between samples), differential expression analysis (identifying genes differentially expressed between two or more groups of samples using statistical tests such as t-tests, ANOVA, or linear models), functional analysis (using bioinformatics tools to identify biological pathways and functions enriched among the differentially expressed genes), and validation (using independent datasets or experimental methods such as qPCR or immunohistochemistry). Using this pipeline, a collection of differentially expressed molecules (DEMs) can be generated for any type of cancer, including esophageal cancer. This compendium can be utilized to identify potential biomarkers and drug targets for cancer and enhance understanding of the molecular mechanisms underlying the disease.
Additionally, population-specific screening of esophageal cancer using this pipeline will help identify specific drug targets for distinct populations, leading to personalized treatments for the disease.
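As a rough illustration of the differential expression step, the sketch below computes a log2 fold change and a Welch's t statistic for a single probe. It assumes the intensities are already normalized and on the log2 scale; the function names and the sample values are hypothetical and are not taken from any dataset in this protocol. A full analysis would also need a p-value and a multiple-testing correction, which are omitted here.

```python
import math
from statistics import mean, variance


def log2_fold_change(tumor, normal):
    """Difference of group means; inputs are assumed to be log2 intensities."""
    return mean(tumor) - mean(normal)


def welch_t(tumor, normal):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    v1 = variance(tumor) / len(tumor)    # squared standard error, tumor group
    v2 = variance(normal) / len(normal)  # squared standard error, normal group
    t = (mean(tumor) - mean(normal)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (
        v1 ** 2 / (len(tumor) - 1) + v2 ** 2 / (len(normal) - 1)
    )
    return t, df


# Hypothetical log2 intensities for one probe (illustrative only, not real data).
tumor = [9.0, 9.5, 10.0, 9.5]
normal = [7.0, 7.5, 8.0, 7.5]

lfc = log2_fold_change(tumor, normal)
t_stat, df = welch_t(tumor, normal)
print(f"log2FC={lfc:.2f}, t={t_stat:.2f}, df={df:.1f}")
```

Welch's test is used here only because it does not assume equal variances in the two groups; it is one of several reasonable choices and stands in for the t-tests, ANOVA, or linear models mentioned above.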
EC is the eighth most common cancer worldwide and the sixth leading cause of cancer-related death. China, India, and Iran have particularly high incidence and mortality rates. There are two main types of EC: esophageal adenocarcinoma (EAC or EAD) and esophageal squamous cell carcinoma (ESCC)1. EAC is more common in the Western world, whereas ESCC is more common in Eastern countries, especially China and Iran2. Several risk factors are associated with EC, including tobacco and alcohol use, obesity, and gastroesophageal reflux disease (GERD). Additionally, dietary factors such as low intake of fruits and vegetables and consumption of hot drinks and foods are associated with ESCC risk in high-risk areas. Early diagnosis and treatment are important for improving the outcomes of patients with EC3,4. Therefore, it is important to raise awareness of the risk factors, signs, and symptoms of EC, and to encourage regular screening of high-risk individuals. Furthermore, efforts to address modifiable risk factors, such as tobacco and alcohol use and unhealthy dietary habits, may help reduce the incidence of EC. EAD arises in the cells of mucus-producing glands in the lower part of the esophagus, near the stomach. It is often associated with GERD, in which stomach acid and contents flow back into the esophagus. In contrast, ESCC arises from the flat, thin cells that line the upper part of the esophagus5. It is more common in areas where tobacco and alcohol use are widespread, such as China and Iran.
Among various conditions related to the esophagus, Barrett's esophagus (BE), a condition in which the lining of the esophagus is replaced by glandular cells, is a known precursor of EAC6. It is worth noting that BE can develop without GERD, but the presence of GERD increases the risk of developing BE by 3- to 5-fold. Additionally, the presence of BE increases the risk of developing EAC by 50- to 100-fold7. Furthermore, hot or spicy foods and liquids have been linked to ESCC, but not to EAC. Understanding the risk factors for EC is important for its prevention and early detection. Efforts to address modifiable risk factors, such as tobacco use, alcohol consumption, obesity, and unhealthy dietary habits, may help reduce the incidence of EC. Furthermore, routine screening and surveillance of high-risk individuals, such as those with dysphagia or BE, may improve outcomes by enabling early detection and treatment.
Omics-driven studies, including genomics, transcriptomics, proteomics, methylomics, miRNAomics, and metabolomics, have contributed greatly to our understanding of EC, especially ESCC8,9,10,11,12,13. These studies have allowed the identification of novel biomarkers, potential therapeutic targets, and new pathways involved in the development and progression of ESCC. However, the data generated from these studies are scattered throughout the literature, making it difficult for the scientific community to access and use this information. Therefore, it is important to create a repository or database that compiles data obtained from high- or low-throughput studies on specific cancers. Building such a compendium can be streamlined by following some basic guidelines. These guidelines include selecting relevant studies, extracting and organizing data from these studies, and ensuring data quality and consistency. In addition, the compendium should be updated regularly to include new studies and data as they become available. By creating a compendium or database that combines data from different studies, researchers can use a single platform to retrieve and analyze data on a specific cancer. This will help accelerate research efforts and ultimately lead to more effective treatments and better outcomes for cancer patients.
The development of the cancer compendium incorporates data from both low-throughput and high-throughput studies. This compendium will be a valuable resource for researchers looking to identify potential diagnostic or therapeutic targets for cancer. One way to build this collection is by reviewing microarray studies available in publicly accessible repositories such as Gene Expression Omnibus (GEO). Microarray studies can provide information about gene expression levels in cancer cells, and these data can be used to identify differentially expressed genes (DEGs) that may play a role in cancer development and progression.
However, different studies might have used different methods to analyze their data, which may have led to the identification of different DEGs. Therefore, it is important to carefully review each study and consider any potential bias or limitations when pooling data for the compendium. Once the data are gathered on a common platform, researchers can use them to identify potential molecular targets for further study. Such follow-up includes examining the expression of a particular gene in clinical samples or conducting mechanistic studies to understand how a particular gene or protein is involved in cancer development and progression. Overall, the creation of a cancer compendium will be a valuable resource for cancer researchers and help identify new targets for diagnosis and therapeutic interventions.
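Because different studies can report different DEGs, one simple way to pool them is to keep only genes that several studies report in the same direction. The sketch below illustrates that idea under stated assumptions: the study names, gene symbols, and the `min_studies` threshold are hypothetical placeholders, and each study is assumed to have been reduced to a mapping of gene symbol to "up" or "down".

```python
from collections import Counter, defaultdict

# Hypothetical per-study DEG calls (gene symbol -> direction); not real results.
studies = {
    "study_A": {"GENE1": "up", "GENE2": "down", "GENE3": "up"},
    "study_B": {"GENE1": "up", "GENE2": "up", "GENE4": "down"},
    "study_C": {"GENE1": "up", "GENE3": "up"},
}


def concordant_genes(studies, min_studies=2):
    """Keep genes reported in one direction only, by at least min_studies studies."""
    tallies = defaultdict(Counter)
    for calls in studies.values():
        for gene, direction in calls.items():
            tallies[gene][direction] += 1

    result = {}
    for gene, counts in tallies.items():
        if len(counts) == 1:  # no study disagrees on the direction
            (direction, n), = counts.items()
            if n >= min_studies:
                result[gene] = direction
    return result


shared = concordant_genes(studies)
print(shared)
```

Genes with conflicting directions (GENE2 in this toy input) are dropped rather than resolved by majority vote, which is a deliberately conservative choice; such genes may still deserve manual review.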
1. Manual curation of the differentially regulated molecules in ESCC
2. Finding relevant studies using PubMed
3. Finding relevant studies using Gene Expression Omnibus (GEO)
NOTE: Gene Expression Omnibus (GEO) is a freely available repository for storing DNA microarray data. The large volume of data available in GEO is a good resource for data mining to identify molecules that are differentially regulated between cancer or other disease states and normal conditions.
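Searches of PubMed or GEO typically combine synonyms for the disease with Boolean operators. The sketch below shows one way to assemble such a query string for pasting into a search box. The helper `build_query` and the extra term "microarray" are illustrative assumptions, not part of the protocol; multi-word terms are quoted so that they are searched as phrases.

```python
def build_query(synonyms, required=(), excluded=()):
    """Join synonyms with OR, then append AND terms and NOT terms."""

    def quote(term):
        # Quote multi-word terms so they are treated as a phrase.
        return f'"{term}"' if " " in term else term

    query = "(" + " OR ".join(quote(s) for s in synonyms) + ")"
    for term in required:
        query += f" AND {quote(term)}"
    for term in excluded:
        query += f" NOT {quote(term)}"
    return query


query = build_query(
    [
        "esophageal squamous cell carcinoma",
        "ESCC",
        "oesophageal squamous cell carcinoma",
    ],
    required=["microarray"],  # hypothetical additional filter term
)
print(query)
```

Grouping the synonyms in parentheses before adding AND or NOT terms keeps the operator precedence explicit, so the filter applies to every spelling variant.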
4. Microarray analysis using GEO2R
NOTE: The first step is to find relevant studies by combining the keywords 'esophageal squamous cell carcinoma', 'ESCC', or 'oesophageal squamous cell carcinoma' with Boolean operators (AND, OR, NOT). GEO2R (see Table of Materials) is a freely available, R-based web tool integrated with GEO that enables users to analyze data from microarray studies in a user-friendly manner. It works with GEO accession IDs and provides an interface for performing complex R-based analysis to identify DEGs, using Bioconductor R packages on the back end. The tool not only transforms the GEO data but also presents its output in the form of .txt tables, which can be further modified according to the users' needs16. GEO2R presents genes in order of statistical significance based on p-value, but the results can also be sorted by log2-fold change. Additionally, users can view gene expression profiles as GEO profile images. Unlike GEO's other analysis tools, GEO2R does not depend on curated DataSet records and can directly interrogate the original data submitted by the investigators. More than 90% of GEO studies can be analyzed using this method17. The GEO2R workflow for analyzing microarray data is shown in Figure 1.
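Once a GEO2R result table has been saved as a tab-delimited .txt file, it can be filtered and re-sorted offline. The following is a minimal sketch, assuming the export contains columns named `adj.P.Val`, `logFC`, and `Gene.symbol`; these column names, the cutoffs, and the sample rows are assumptions for illustration and should be checked against the actual file.

```python
import csv
import io

# Hypothetical excerpt in the tab-delimited layout assumed for a GEO2R export.
SAMPLE = (
    "ID\tadj.P.Val\tP.Value\tlogFC\tGene.symbol\n"
    "probe_1\t0.001\t0.00001\t2.5\tGENE1\n"
    "probe_2\t0.20\t0.05\t1.8\tGENE2\n"
    "probe_3\t0.01\t0.0004\t-1.6\tGENE3\n"
    "probe_4\t0.03\t0.002\t0.4\tGENE4\n"
)


def select_degs(handle, max_adj_p=0.05, min_abs_logfc=1.0):
    """Keep rows passing both cutoffs, sorted by absolute log2 fold change."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        adj_p = float(row["adj.P.Val"])
        logfc = float(row["logFC"])
        if adj_p <= max_adj_p and abs(logfc) >= min_abs_logfc:
            rows.append((row["Gene.symbol"], logfc, adj_p))
    return sorted(rows, key=lambda r: abs(r[1]), reverse=True)


degs = select_degs(io.StringIO(SAMPLE))
for symbol, logfc, adj_p in degs:
    print(symbol, logfc, adj_p)
```

To run this on a real export, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("results.txt", newline="")`, where the file name is a placeholder.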
5. Finding alias for a gene/protein
6. Finding official gene symbol for the DEGs
7. Finding gene locus of the DEGs
8. Finding information about DEGs on the OMIM page
9. Finding protein localization, domain, and motif, and secretory nature of the protein encoded by the gene
10. Cherry-picking proteins for validation and further assessment for diagnosis or prognosis of the malignancy of interest
NOTE: Once unique molecules are identified, the biggest challenge is validating them. Microarray studies usually report expression at the mRNA level, but for disease diagnosis or prognosis, a readout of protein levels is crucial. To obtain this, patient samples, patient-derived samples, or cell lines of the same cancer must be screened to determine whether the molecule is actually expressed there and whether it can discriminate between cancer and normal tissue, between good and bad prognosis, or between early and late stages of the disease. Western blot, enzyme-linked immunosorbent assay (ELISA), immunoprecipitation, immunohistochemistry, immunocytochemistry, or similar assays are useful techniques for validating a candidate molecule18,19,20. All of these assays require antibodies to detect the antigen present in the samples. Antibodies are costly, so it is always better to select them based on the following points:
As an example, GEO accession GSE161533 was used to study differentially expressed genes in ESCC. The representative results of the analysis are shown in Figure 3. GEO2R generates a volcano plot that is useful for identifying events that differ significantly between two groups of experimental subjects. The volcano plot presents the overall gene distribution with -log10 transformed significance (p-value) on the y-axis, and fold changes (with log2 transformed fol...
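The coordinates of a volcano plot can be computed directly from a fold change and a p-value. The sketch below assumes the conventional layout, with log2 fold change on the x-axis and -log10(p-value) on the y-axis, and labels each point as up, down, or not significant. The cutoffs (|log2FC| of at least 1, p below 0.05) and the function name are illustrative assumptions, not the thresholds used for GSE161533.

```python
import math


def volcano_point(log2_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot."""
    y = -math.log10(p_value)  # larger y means smaller p-value
    if p_value < p_cutoff and log2_fc >= fc_cutoff:
        label = "up"
    elif p_value < p_cutoff and log2_fc <= -fc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return log2_fc, y, label


# Hypothetical (log2FC, p-value) pairs; not taken from any dataset.
examples = [(2.0, 0.001), (-1.5, 0.01), (0.3, 0.0001), (3.0, 0.2)]
points = [volcano_point(fc, p) for fc, p in examples]
for x, y, label in points:
    print(f"x={x:+.2f}  y={y:.2f}  {label}")
```

Genes of interest sit in the upper left and upper right corners of the plot: they combine a large fold change with a small p-value, whereas a small p-value alone (third example) or a large fold change alone (fourth example) is not enough.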
Since the introduction of high-throughput omics techniques in cancer biology, the rate of data generation has increased significantly. This poses a challenge for researchers, especially those without computational expertise. To overcome this, bioinformaticians have over the years developed databases that provide data in an organized manner. These have been well received by researchers, especially those who are less comfortable with computational tools. Furthermore, scattered OMICS data here and there in the...
The authors have nothing to disclose.
MKK is the recipient of the TARE fellowship (Grant # TAR/2018/001054) and an extramural grant (Grant # 5/13/55/2020/NCD-III) from the Science and Engineering Research Board (SERB), Department of Science and Technology, and the Indian Council of Medical Research (ICMR), Government of India, New Delhi, respectively.
Name | Company | Catalog Number | Comments |
NCBI-PUBMED | NCBI | https://ncbi.nlm.nih.gov/pubmed | Referring to section 1. required for searching the literature |
A laptop/MacBook or personal computer with internet access and a web browser | | | |
g:Profiler | ELIXIR infrastructure | https://biit.cs.ut.ee/gprofiler/gost | Referring to section 4.10. required for enrichment of GO:MF, GO:BP, and GO:CC |
Gene expression omnibus | NCBI | https://www.ncbi.nlm.nih.gov/geo/ | Referring to section 3.1. required for searching the microarray study database |
GEO2R | NCBI | https://www.ncbi.nlm.nih.gov/geo/geo2r/ | Referring to section 3.2. required for analyzing the data using GEO2R tool |
Google | | https://www.google.com | Referring to section 1.1. required for searching the literature |
HGNC | HGNC is a committee of the Human Genome Organisation (HUGO) | https://www.genenames.org | Referring to section 6.1 required to know the official gene symbol of the DEGs |
HPRD | Institute of Bioinformatics, Bengaluru | http://hprd.org | Referring to section 5.1 required for information about protein architecture |
OMIM | Johns Hopkins University, Baltimore | http://www.omim.org/entry | Referring to section 8.1 required to know the OMIM ID of a particular gene / DEG |
Pangloss Program | Developed by Chris Seidel | http://www.pangloss.com/seidel/Protocols/venn.cgi | Referring to section 4.9. required for generating the Venn diagram |
PANTHER | Thomas lab at the University of Southern California | http://www.pantherdb.org/geneListAnalysis.do | Referring to section 4.10. required for enrichment of GO:MF, GO:BP, and GO:CC |
ShinyGO | South Dakota State University | http://bioinformatics.sdstate.edu/go | Referring to section 4.10. required for allocation of DEGs on the chromosomes |
Copyright © 2025 MyJoVE Corporation. All rights reserved