Method Article
The protocol described here provides detailed instructions on how to analyze genomic regions of interest for microprotein-coding potential using PhyloCSF on the user-friendly UCSC Genome Browser. Additionally, several tools and resources are recommended to further investigate sequence characteristics of identified microproteins to gain insight into their putative functions.
Next-generation sequencing (NGS) has propelled the field of genomics forward and produced whole genome sequences for numerous animal species and model organisms. However, despite this wealth of sequence information, comprehensive gene annotation efforts have proven challenging, especially for small proteins. Notably, conventional protein annotation methods were designed to intentionally exclude putative proteins encoded by short open reading frames (sORFs) less than 300 nucleotides in length to filter out the exponentially higher number of spurious noncoding sORFs throughout the genome. As a result, hundreds of functional small proteins called microproteins (<100 amino acids in length) have been incorrectly classified as noncoding RNAs or overlooked entirely.
Here we provide a detailed protocol to leverage free, publicly available bioinformatic tools to query genomic regions for microprotein-coding potential based on evolutionary conservation. Specifically, we provide step-by-step instructions on how to examine sequence conservation and coding potential using Phylogenetic Codon Substitution Frequencies (PhyloCSF) on the user-friendly University of California Santa Cruz (UCSC) Genome Browser. Additionally, we detail steps to efficiently generate multiple species alignments of identified microprotein sequences to visualize amino acid sequence conservation and recommend resources to analyze microprotein characteristics, including predicted domain structures. These powerful tools can be used to help identify putative microprotein-coding sequences in noncanonical genomic regions or to rule out the presence of a conserved coding sequence with translational potential in a noncoding transcript of interest.
The identification of the complete set of coding elements in the genome has been a major goal since the initiation of the Human Genome Project, and remains a central objective toward the understanding of biological systems and the etiology of genetic-based diseases1,2,3,4. Advances in NGS techniques have led to the production of whole genome sequences for an extensive number of organisms, including vertebrates, invertebrates, yeast, and plants5. Additionally, high-throughput transcriptional sequencing methods have further revealed the complexity of the cellular transcriptome, and identified thousands of novel RNA molecules with both protein-coding and noncoding functions6,7. Decoding this vast amount of sequence information is an ongoing process, and challenges remain with comprehensive gene annotation efforts8.
The recent development of translational profiling methods, including ribosome profiling9,10 and poly-ribosome sequencing11, has provided evidence that hundreds of noncanonical translation events map to currently unannotated sORFs throughout the genome, with the potential to generate small proteins called microproteins or micropeptides12,13,14,15,16,17. Microproteins have emerged as a novel class of versatile proteins previously overlooked by standard gene annotation methods due to their small size (<100 amino acids) and lack of classical protein-coding gene characteristics8,12,18,19,20. Microproteins have been described in virtually all organisms, including yeast21,22, flies17,23,24, and mammals25,26,27,28, and have been shown to play critical roles in diverse processes, including development, metabolism, and stress signaling19,20,29,30,31,32,33,34. Thus, it is imperative to continue to mine the genome for additional members of this long-overlooked class of functional small proteins.
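To make the size cutoff concrete, the following minimal Python sketch scans the forward strand of a nucleotide sequence for ATG-initiated ORFs shorter than 300 nucleotides. It is an illustration only and is not part of the protocol: the function name `find_sorfs`, the forward-strand restriction, and the choice to count the stop codon in the ORF length are all assumptions made for this example. As noted above, most hits from such a scan are expected to be spurious, which is why conservation-based scoring is needed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, max_nt=300):
    """Return (start, end, frame) for ATG-initiated ORFs shorter than max_nt.

    Coordinates are 0-based and end-exclusive, the stop codon is included in
    the length, and only the forward strand is scanned (illustrative choices).
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != "ATG":
                i += 3
                continue
            # Walk codon by codon until an in-frame stop codon is reached.
            j = i + 3
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                j += 3
            if j + 3 > len(seq):
                break  # no in-frame stop codon downstream in this frame
            end = j + 3
            if end - i < max_nt:
                hits.append((i, end, frame))
            i = end  # resume scanning after the stop codon
    return hits


if __name__ == "__main__":
    # Toy sequence made up for the example; not a real transcript.
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print(find_sorfs(toy))
```

A scan like this only enumerates candidates; it says nothing about whether any of them is translated or conserved.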
Despite the widespread recognition of the biological importance of microproteins, this class of genes remains vastly underrepresented in genome annotations, and their accurate identification continues to be an ongoing challenge that has hindered progress in the field. Various computational tools and experimental methods have recently been developed to overcome the difficulties associated with identifying microprotein-coding sequences (discussed extensively in several comprehensive reviews8,35,36,37). Many recent microprotein identification studies38,39,40,41,42,43,44,45,46,47 have relied heavily on the use of one such algorithm called PhyloCSF48,49, a powerful comparative genomics approach that can be leveraged to distinguish conserved protein-coding regions of the genome from those that are noncoding.
PhyloCSF compares codon substitution frequencies (CSF) using multi-species nucleotide alignments and phylogenetic models to detect evolutionary signatures of protein-coding genes. This empirical model-based approach relies on the premise that proteins are primarily conserved at the amino acid level rather than at the nucleotide level. Therefore, synonymous codon substitutions, which encode the same amino acid, and substitutions to amino acids with conserved properties (i.e., charge, hydrophobicity, polarity) are scored positively, while other non-synonymous substitutions, including nonconservative missense and nonsense substitutions, score negatively. PhyloCSF is trained on whole-genome data and has proven to be effective in scoring short portions of a coding sequence (CDS) in isolation from the full sequence, which is necessary when analyzing microproteins or individual exons of standard protein-coding genes48,49.
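The substitution categories named above can be illustrated with a short Python sketch that labels a single codon change as synonymous, missense, or nonsense under the standard genetic code. This is not the PhyloCSF algorithm, which scores whole alignments with empirical codon models over a phylogeny; the function name `classify_substitution` and the category labels are assumptions introduced only for this example, and no scores are computed.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a codon change; a teaching aid, not a PhyloCSF score."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_codon == alt_codon:
        return "identical"
    if ref_aa == alt_aa:
        return "synonymous"  # same amino acid (or stop to stop)
    if alt_aa == "*":
        return "nonsense"    # amino acid replaced by a stop codon
    if ref_aa == "*":
        return "stop-lost"   # stop codon replaced by an amino acid
    return "missense"        # different amino acid


if __name__ == "__main__":
    for ref, alt in [("CTG", "CTA"), ("GAA", "GAC"), ("TGG", "TGA")]:
        print(ref, alt, classify_substitution(ref, alt))
```

In a conserved coding region, aligned codons across species would be expected to fall mostly into the synonymous category, whereas a noncoding region shows no such bias.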
Notably, the recent integration of the PhyloCSF track hubs in the University of California Santa Cruz (UCSC) Genome Browser49,50,51 enables investigators of all backgrounds to easily access a user-friendly interface to query genomic regions of interest for protein-coding potential. The protocol outlined below provides detailed instructions on how to load the PhyloCSF track hubs on the UCSC Genome Browser and subsequently interrogate genomic regions of interest to probe for high-confidence protein-coding regions (or the lack thereof). Additionally, in the case where a positive PhyloCSF score is observed, steps are delineated to further analyze microprotein-coding potential and efficiently generate multiple species alignments of the identified amino acid sequences to illustrate cross-species sequence conservation. Lastly, several additional publicly available resources and tools are introduced in the discussion to survey identified microprotein characteristics, including predicted domain structures and insight into putative microprotein function.
The protocol outlined below details steps to load and navigate the PhyloCSF browser tracks on the UCSC Genome Browser (generated by Mudge et al.49). For general questions regarding the UCSC Genome Browser, an extensive Genome Browser User's Guide can be found here: https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html.
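For readers who prefer to open the browser at a specific region directly, a link can be assembled from an assembly name and a coordinate range. The sketch below assumes the commonly used `db` and `position` query parameters of the `hgTracks` page; the helper name `browser_url` and the coordinates in the example are placeholders, not the location of any particular gene. Loading the PhyloCSF track hub itself is not covered by this sketch.

```python
from urllib.parse import urlencode

UCSC_HGTRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def browser_url(db, chrom, start, end):
    """Build a UCSC Genome Browser link for an assembly and 1-based range."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    query = {"db": db, "position": f"{chrom}:{start}-{end}"}
    # Keep ':' unescaped so the position stays readable in the link.
    return f"{UCSC_HGTRACKS}?{urlencode(query, safe=':')}"


if __name__ == "__main__":
    # Placeholder coordinates chosen only to show the link format.
    print(browser_url("hg38", "chr2", 1000, 2000))
```

The same helper can be pointed at another assembly (for example `mm10`) to compare the corresponding region in a second species.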
1. Loading the PhyloCSF Track Hub to the UCSC Genome Browser
2. Navigating to genes of interest using Gene Identifiers
3. Navigating to genomic regions of interest using sequence information
4. Identifying conserved sORFs using PhyloCSF Track Data
5. Viewing homologous regions in other genomes
6. Generating multi-species sequence alignments for microproteins of interest
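Multiple sequence alignment tools such as Clustal Omega accept input in FASTA format. As a minimal sketch of preparing that input for step 6, the Python helper below writes a set of named amino acid sequences as FASTA text. The function name `to_fasta`, the 60-character line width, and the sequences in the example are assumptions for illustration; the sequences are made-up placeholders and do not represent any real microprotein.

```python
def to_fasta(records, width=60):
    """Format a {name: sequence} mapping as FASTA text.

    Each record becomes a '>' header line followed by the sequence
    wrapped at `width` characters per line.
    """
    lines = []
    for name, seq in records.items():
        seq = "".join(seq.split()).upper()  # drop whitespace, normalize case
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder sequences invented for this example only.
    placeholders = {
        "species_A": "MSTLPKWAVG",
        "species_B": "MSTLPRWAIG",
    }
    print(to_fasta(placeholders), end="")
```

The resulting text can be pasted into the Clustal Omega web form listed in the table of materials, with one record per species.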
Here we will use the validated microprotein mitoregulin (Mtln) as an example to demonstrate how a conserved sORF will generate a positive PhyloCSF score that can be easily visualized and analyzed on the UCSC Genome Browser. Mitoregulin was previously annotated as a noncoding RNA (formerly human gene ID LINC00116 and mouse gene ID 1500011K16Rik). Comparative genomics and sequence conservation analysis methods played a critical role in its initial discovery40,
The protocol presented here provides detailed instructions on how to interrogate genomic regions of interest for microprotein-coding potential using PhyloCSF on the user-friendly UCSC Genome Browser48,49,50,51. As detailed above, PhyloCSF is a powerful comparative genomics algorithm that integrates phylogenetic models and codon substitution frequencies to identify evolutionary signatures that a...
The authors declare that they have no competing financial interests.
This work was supported by grants from the National Institutes of Health (HL-141630 and HL-160569) and Cincinnati Children's Research Foundation (Trustee Award).
Website | Website Address | Requirements | Comments |
Clustal Omega Multiple Sequence Alignment Tool | https://www.ebi.ac.uk/Tools/msa/clustalo/ | Web browser | Multiple sequence alignment program for the efficient alignment of FASTA sequences (i.e. for cross-species comparison of identified microproteins) |
COXPRESdb | https://coxpresdb.jp | Web browser | Provides co-regulated gene relationships to estimate gene functions |
EMBL-EBI Bioinformatics Tools FAQs | https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ | Web browser | Frequently Asked Questions (FAQs) for EMBL-EBI tools. Includes the color coding key for protein sequence alignments |
European Bioinformatics Institute (EMBL-EBI), Tools and Data Resources | https://www.ebi.ac.uk/services/all | Web browser | Comprehensive list of freely available websites, tools and data resources |
Expasy - Swiss Bioinformatics Resource Portal | https://www.expasy.org | Web browser | Suite of bioinformatic tools and resources for protein sequence analysis that is maintained by the Swiss Institute of Bioinformatics (SIB) |
National Center for Biotechnology Information (NCBI) Conserved Domain Search | https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi | Web browser | Search tool to identify conserved domains within protein or coding nucleotide sequences |
Pfam 35 | http://pfam.xfam.org | Web browser | Protein family (Pfam) database, provides alignments and classification of protein families and domains |
PhyloCSF Track Hub Description | https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=1267045267_TEc99h2oW5QedaCd4ir8aZ65ryaD&db=mm10&c=chr2&g=hub_109801_PhyloCSF_smooth | Web browser | Detailed description of the Smoothed PhyloCSF tracks and PhyloCSF Track Hub |
SignalP 6.0 | https://services.healthtech.dtu.dk/service.php?SignalP-6.0 | Web browser | Predicts the presence of signal peptides and the location of their cleavage sites |
TMHMM - 2.0 | https://services.healthtech.dtu.dk/service.php?TMHMM-2.0 | Web browser | Prediction of transmembrane helices in proteins |
UCSC Genome Browser BLAT Search | https://genome.ucsc.edu/cgi-bin/hgBlat | Web browser | Tool used to find genomic regions using DNA or protein sequence information |
UCSC Genome Browser Gateway | https://genome.ucsc.edu/cgi-bin/hgGateway | Web browser | Direct link to the UCSC Genome Browser Gateway |
UCSC Genome Browser Home | https://genome.ucsc.edu/ | Web browser | Home website for the UCSC Genome Browser |
UCSC Genome Browser Track Data Hubs | https://genome.ucsc.edu/cgi-bin/hgHubConnect#publicHubs | Web browser | Direct link to Track Data Hubs/Public Hubs database to search for and load the PhyloCSF Tracks |
UCSC Genome Browser User Guide | https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html | Web browser | Comprehensive user guide detailing how to navigate the UCSC Genome Browser |
WoLF PSORT | https://wolfpsort.hgc.jp | Web browser | Protein subcellular localization prediction tool |
Copyright © 2025 MyJoVE Corporation. All rights reserved