An Integrated Approach for Microprotein Identification and Sequence Analysis

Omar Brito-Estrada; Keira R. Hassel; Catherine A. Makarewich

doi:10.3791/63841


Summary

The protocol described here provides detailed instructions on how to analyze genomic regions of interest for microprotein-coding potential using PhyloCSF on the user-friendly UCSC Genome Browser. Additionally, several tools and resources are recommended to further investigate sequence characteristics of identified microproteins to gain insight into their putative functions.

Abstract

Next-generation sequencing (NGS) has propelled the field of genomics forward and produced whole genome sequences for numerous animal species and model organisms. However, despite this wealth of sequence information, comprehensive gene annotation efforts have proven challenging, especially for small proteins. Notably, conventional protein annotation methods were designed to intentionally exclude putative proteins encoded by short open reading frames (sORFs) shorter than 300 nucleotides in order to filter out the far larger number of spurious noncoding sORFs found throughout the genome. As a result, hundreds of functional small proteins called microproteins (<100 amino acids in length) have been incorrectly classified as noncoding RNAs or overlooked entirely.

Here we provide a detailed protocol to leverage free, publicly available bioinformatic tools to query genomic regions for microprotein-coding potential based on evolutionary conservation. Specifically, we provide step-by-step instructions on how to examine sequence conservation and coding potential using Phylogenetic Codon Substitution Frequencies (PhyloCSF) on the user-friendly University of California Santa Cruz (UCSC) Genome Browser. Additionally, we detail steps to efficiently generate multiple species alignments of identified microprotein sequences to visualize amino acid sequence conservation and recommend resources to analyze microprotein characteristics, including predicted domain structures. These powerful tools can be used to help identify putative microprotein-coding sequences in noncanonical genomic regions or to rule out the presence of a conserved coding sequence with translational potential in a noncoding transcript of interest.

Introduction

The identification of the complete set of coding elements in the genome has been a major goal since the initiation of the Human Genome Project, and remains a central objective toward the understanding of biological systems and the etiology of genetic-based diseases¹,²,³,⁴. Advances in NGS techniques have led to the production of whole genome sequences for an extensive number of organisms, including vertebrates, invertebrates, yeast, and plants⁵. Additionally, high-throughput transcriptional sequencing methods have further revealed the complexity of the cellular transcriptome, and identified thousands of novel RNA molecules with both protein-coding and noncoding functions⁶,⁷. Decoding this vast amount of sequence information is an ongoing process, and challenges remain with comprehensive gene annotation efforts⁸.

The recent development of translational profiling methods, including ribosome profiling⁹,¹⁰ and poly-ribosome sequencing¹¹, has provided evidence that hundreds of noncanonical translation events map to currently unannotated sORFs throughout the genome, with the potential to generate small proteins called microproteins or micropeptides¹²,¹³,¹⁴,¹⁵,¹⁶,¹⁷. Microproteins have emerged as a novel class of versatile proteins previously overlooked by standard gene annotation methods due to their small size (<100 amino acids) and lack of classical protein-coding gene characteristics⁸,¹²,¹⁸,¹⁹,²⁰. Microproteins have been described in virtually all organisms, including yeast²¹,²², flies¹⁷,²³,²⁴, and mammals²⁵,²⁶,²⁷,²⁸, and have been shown to play critical roles in diverse processes, including development, metabolism, and stress signaling¹⁹,²⁰,²⁹,³⁰,³¹,³²,³³,³⁴. Thus, it is imperative to continue to mine the genome for additional members of this long-overlooked class of functional small proteins.

Despite the widespread recognition of the biological importance of microproteins, this class of genes remains vastly underrepresented in genome annotations, and their accurate identification remains a challenge that has hindered progress in the field. Various computational tools and experimental methods have recently been developed to overcome the difficulties associated with identifying microprotein-coding sequences (discussed extensively in several comprehensive reviews⁸,³⁵,³⁶,³⁷). Many recent microprotein identification studies³⁸,³⁹,⁴⁰,⁴¹,⁴²,⁴³,⁴⁴,⁴⁵,⁴⁶,⁴⁷ have relied heavily on one such algorithm, PhyloCSF⁴⁸,⁴⁹, a powerful comparative genomics approach that can be leveraged to distinguish conserved protein-coding regions of the genome from those that are noncoding.

PhyloCSF compares codon substitution frequencies (CSF) using multi-species nucleotide alignments and phylogenetic models to detect evolutionary signatures of protein-coding genes. This empirical model-based approach relies on the premise that proteins are primarily conserved at the amino acid level rather than at the nucleotide level. Therefore, synonymous codon substitutions, which encode the same amino acid, and substitutions to amino acids with conserved properties (e.g., charge, hydrophobicity, polarity) score positively, while other non-synonymous substitutions, including nonconservative missense and nonsense substitutions, score negatively. PhyloCSF is trained on whole-genome data and has proven to be effective in scoring short portions of a coding sequence (CDS) in isolation from the full sequence, which is necessary when analyzing microproteins or individual exons of standard protein-coding genes⁴⁸,⁴⁹.
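To make these substitution categories concrete, the following Python sketch classifies a single codon change under the standard genetic code. It is an illustration only and is not the PhyloCSF algorithm, which scores whole alignments using empirical codon models and a phylogeny. The function name and the coarse amino acid property groups are assumptions introduced for this example.

```python
# Toy classifier for single codon substitutions (standard genetic code).
# This is NOT PhyloCSF; it only illustrates the categories named in the text.

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumption: a deliberately coarse grouping by side-chain property,
# chosen for illustration rather than taken from the protocol.
PROPERTY_GROUPS = {
    "hydrophobic": "AVLIMFWPG",
    "polar": "STNQCY",
    "positive": "KRH",
    "negative": "DE",
}
GROUP_OF = {aa: group for group, members in PROPERTY_GROUPS.items() for aa in members}


def classify_substitution(ref_codon, alt_codon):
    """Return the category of a codon change from ref_codon to alt_codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # a sense codon became a stop codon
    if ref_aa == "*":
        return "stop-loss"  # a stop codon became a sense codon
    if GROUP_OF[ref_aa] == GROUP_OF[alt_aa]:
        return "conservative missense"
    return "nonconservative missense"


if __name__ == "__main__":
    for ref, alt in [("CTG", "TTG"), ("GAA", "GAT"), ("GAA", "AAA"), ("TGG", "TGA")]:
        print(ref, "->", alt, ":", classify_substitution(ref, alt))
```

In a conserved coding region, changes of the first two kinds (synonymous and conservative) are expected to dominate across species, which is the pattern a positive PhyloCSF score reflects.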

Notably, the recent integration of the PhyloCSF track hubs in the University of California Santa Cruz (UCSC) Genome Browser⁴⁹,⁵⁰,⁵¹ enables investigators of all backgrounds to easily access a user-friendly interface to query genomic regions of interest for protein-coding potential. The protocol outlined below provides detailed instructions on how to load the PhyloCSF track hubs on the UCSC Genome Browser and subsequently interrogate genomic regions of interest to probe for high-confidence protein-coding regions (or the lack thereof). Additionally, in the case where a positive PhyloCSF score is observed, steps are delineated to further analyze microprotein-coding potential and efficiently generate multiple species alignments of the identified amino acid sequences to illustrate cross-species sequence conservation. Lastly, several additional publicly available resources and tools are introduced in the discussion to survey identified microprotein characteristics, including predicted domain structures and insight into putative microprotein function.

Protocol

The protocol outlined below details steps to load and navigate the PhyloCSF browser tracks on the UCSC Genome Browser (generated by Mudge et al.⁴⁹). For general questions regarding the UCSC Genome Browser, an extensive Genome Browser User's Guide can be found here: https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html.

1. Loading the PhyloCSF Track Hub to the UCSC Genome Browser

Open an internet browser window and navigate to the UCSC Genome Browser (https://genome.ucsc.edu/).
Under the Our tools heading, select the Track Hubs option.
NOTE: The Track Hubs option can also be found under the My Data tab.
In the Public Hubs tab, type PhyloCSF into the Search terms box. Click on the Search Public Hubs button.
Connect to PhyloCSF by clicking on the Connect button for the Hub Name PhyloCSF (Description: Evolutionary protein-coding potential as measured by PhyloCSF).
NOTE: This Track Hub will load to numerous assemblies, including human (hg19 and hg38) and mouse (mm10 and mm39).
After clicking on connect, wait to be redirected to the UCSC Genome Browser Gateway page (https://genome.ucsc.edu/cgi-bin/hgGateway).

2. Navigating to genes of interest using Gene Identifiers

Select the species and genome assembly to query. To query a different species (e.g., mouse), select the species of interest under the Browse/Select Species heading by clicking on the appropriate icon, or type the species into the text box that says, Enter species, common name or assembly ID.
NOTE: The assembly is listed directly under the Find Position heading. Typically, the default is the Human Assembly (e.g., Dec. 2009 [GRCh37/hg19]).
Choose the assembly to search under the Find Position heading using the dropdown menu.
Enter the position, gene symbol, or search terms in the Position/Search Term box and click on Go to navigate to a gene of interest on the Genome Browser.
If the search resulted in multiple matches, wait to be redirected to a page that requires the selection of a position of interest. Click on the appropriate gene of interest.
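For users who want to return to a region repeatedly, the navigation in this section can also be expressed as a URL. The sketch below assumes that the Genome Browser's hgTracks page accepts `db` and `position` query parameters. The helper name and the coordinates in the usage line are hypothetical placeholders, not the location of any gene discussed here.

```python
from urllib.parse import urlencode

# Assumption: hgTracks accepts "db" (assembly) and "position" (chrom:start-end).
HGTRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def browser_url(assembly, chrom, start, end):
    """Build a Genome Browser link for a region on the given assembly."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Keep ":" unescaped so the position reads as it does in the search box.
    query = urlencode({"db": assembly, "position": f"{chrom}:{start}-{end}"}, safe=":")
    return f"{HGTRACKS}?{query}"


if __name__ == "__main__":
    # Placeholder coordinates for illustration only.
    print(browser_url("hg38", "chr1", 1000, 2000))
```

The assembly identifier passed as `db` corresponds to the assembly chosen under the Find Position heading (e.g., hg19, hg38, mm10, or mm39).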

3. Navigating to genomic regions of interest using sequence information

Navigate to the UCSC Genome Browser (https://genome.ucsc.edu/) and select the BLAST-Like Alignment Tool (BLAT) under the Our tools heading to query a specific DNA or protein sequence. Alternatively, hover the cursor over the Tools tab and select the Blat option or follow this link: https://genome.ucsc.edu/cgi-bin/hgBlat.
Select the species (Genome) and Assembly of interest using the dropdown menus.
Define the Query type using the dropdown menu.
Paste the sequence of interest into the BLAT Search Genome text box and click Submit.
Click on the browser link under the ACTIONS heading to navigate to the genomic region of interest.

4. Identifying conserved sORFs using PhyloCSF Track Data

Visually scan the genomic area of interest for positively scoring PhyloCSF regions (Figure 1).
NOTE: For a detailed explanation of how to visually interpret PhyloCSF scores on the UCSC Genome Browser, see the representative results section below.
Use the zoom feature to magnify regions of interest to examine sequence characteristics and search for start/stop codons. To zoom in manually, hold the shift key and click and hold the mouse button while dragging along the region of interest. Alternatively, use the zoom in and zoom out buttons at the top of the page to navigate (1.5x, 3x, 10x, or base zoom options are available).
NOTE: Before using the zoom in/zoom out buttons, it is necessary to reposition the gene so that the region of interest is in the middle of the screen. To perform this action, click on the image and drag it left or right to move the genomic region horizontally as desired or use the move arrows at the top of the page.
Zoom in until the nucleotide (base) sequence is visible.
NOTE: The nucleotide sequence will appear directly above the +1 Smoothed PhyloCSF score.
Visually scan the nucleotide sequence near the beginning and end of the positively scoring PhyloCSF regions to identify putative start (ATG) and stop (TGA/TAA/TAG) codons.
NOTE: If the gene of interest is on the minus strand of DNA, the start and stop codons will be the reverse complement (i.e., CAT for the start codon and TCA/TTA/CTA for the stop codon).
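The visual scan for start and stop codons can be cross-checked on a copied nucleotide sequence. The following Python sketch is a minimal example under stated assumptions: it searches all three frames of both strands for ATG-to-stop open reading frames encoding fewer than 100 amino acids. The function names and the toy sequence are hypothetical, and coordinates are 0-based positions within the strand-oriented sequence rather than genome coordinates.

```python
# Minimal sORF scan on a pasted DNA sequence (standard genetic code).
# Assumption: simple ATG...stop ORFs only; no splicing, no alternative starts.

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"TGA", "TAA", "TAG"}


def reverse_complement(seq):
    """Return the reverse complement, e.g. ATG -> CAT and TAG -> CTA."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(cds):
    """Translate a coding sequence; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3))


def find_sorfs(seq, max_aa=100):
    """Return (strand, start, end, peptide) for ORFs shorter than max_aa."""
    seq = seq.upper()
    hits = []
    for strand, oriented in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(oriented) - 2, 3):
                codon = oriented[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    peptide = translate(oriented[start:i])
                    if len(peptide) < max_aa:
                        hits.append((strand, start, i + 3, peptide))
                    start = None
    return hits


if __name__ == "__main__":
    toy = "CCATGAAATAGCC"  # hypothetical sequence: ATG AAA TAG in frame 3
    print(find_sorfs(toy))
    print(find_sorfs(reverse_complement(toy)))  # same ORF, found on the minus strand
```

Passing the reverse complement of the toy sequence reports the same peptide on the minus strand, which mirrors the NOTE above: on the displayed strand the codons appear as CAT and CTA.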

5. Viewing homologous regions in other genomes

Hover the mouse over the View heading at the top of the page and click on the In Other Genomes (Convert) option.
Define the genome of interest using the dropdown menu below the New Genome heading.
Select the genomic assembly of interest using the dropdown menu under the New Assembly heading, then click the Submit button.
Once the browser returns a list of regions in the new assembly with similarity, click on the chromosome position link to navigate to the homologous region of interest.
NOTE: The percentage of total bases (nucleotides) and the span that are covered by the region will be defined for each region listed. The higher the percentage of matching bases, the higher the conservation is for the region of interest.
Follow the same navigational strategies detailed in Section 4 to analyze the sequence.

6. Generating multi-species sequence alignments for microproteins of interest

Click on the gene of interest in the GENCODE track on the UCSC Genome Browser (indicated in Figure 1A with a blue box) to navigate to the gene description page.
Under the Sequence and Links to Tools and Databases heading, click on the link in the table that reads Other Species FASTA.
Click on the boxes associated with the species of interest to select them. Click on Submit. Copy and paste the sequences appearing at the bottom of the page in FASTA format into a word processing document.
Open a second browser window and navigate to the Clustal Omega Multiple Sequence Alignment tool⁵² on the European Bioinformatics Institute (EMBL-EBI) website⁵³,⁵⁴: https://www.ebi.ac.uk/Tools/msa/clustalo/.
Paste the sequence files that are still on the clipboard into the box in STEP 1 that reads sequences in any supported format. Scroll to the bottom of the page and click on Submit. Look below the aligned results (in black font) for symbols that indicate the degree of conservation of each amino acid (symbols are defined in Table 1).
NOTE: It may take several minutes to generate the alignment.
To view the amino acid properties in color, click on the Show Colors link directly above the sequences to color the amino acids according to their properties (defined in Table 2).
Copy and paste the sequence alignment into a word processing or slideshow program to generate a figure or illustration file (e.g., Figure 2).
NOTE: Use a monospaced font for the alignment such as Courier.
To view other outputs from the Clustal Omega results page, click on the appropriate tabs (i.e., Guide Tree or Phylogenetic Tree).
Click on the Results Viewers tab for options to view the sequence information using Jalview, a free program that specializes in multiple sequence alignment editing, visualization, and analysis⁵⁵, or to access direct links to MView and Simple Phylogeny⁵⁶.
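As a quick check on an alignment copied from the results page, the conservation line can be partly reproduced offline. The Python sketch below parses aligned sequences in FASTA format and marks fully conserved columns with an asterisk. It is a simplified illustration: Clustal Omega additionally marks strongly and weakly similar residues with ":" and ".", which this example does not attempt to reproduce. The function names and the two toy sequences are hypothetical.

```python
# Mark fully conserved columns in an already-aligned set of sequences.
# Assumption: input is aligned FASTA (equal lengths, "-" for gaps).


def parse_fasta(text):
    """Return {name: sequence} from FASTA text, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def identity_line(aligned):
    """Return '*' for columns where every sequence has the same residue."""
    seqs = list(aligned.values())
    if not seqs or any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*seqs)
    )


def percent_identity(aligned):
    """Percentage of alignment columns that are fully conserved."""
    line = identity_line(aligned)
    return 100.0 * line.count("*") / len(line)


if __name__ == "__main__":
    # Hypothetical toy alignment, not real microprotein sequences.
    toy = ">speciesA\nMKLAV-G\n>speciesB\nMKIAVSG\n"
    alignment = parse_fasta(toy)
    for name, seq in alignment.items():
        print(f"{name:10s}{seq}")
    print(f"{'':10s}{identity_line(alignment)}")
```

Printing the names padded to a fixed width keeps the columns lined up, for the same reason a monospaced font is recommended when preparing figures.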

Results

Here we will use the validated microprotein mitoregulin (Mtln) as an example to demonstrate how a conserved sORF will generate a positive PhyloCSF score that can be easily visualized and analyzed on the UCSC Genome Browser. Mitoregulin was previously annotated as a noncoding RNA (formerly human gene ID LINC00116 and mouse gene ID 1500011K16Rik). Comparative genomics and sequence conservation analysis methods played a critical role in its initial discovery⁴⁰^,

Discussion

The protocol presented here provides detailed instructions on how to interrogate genomic regions of interest for microprotein-coding potential using PhyloCSF on the user-friendly UCSC Genome Browser⁴⁸,⁴⁹,⁵⁰,⁵¹. As detailed above, PhyloCSF is a powerful comparative genomics algorithm that integrates phylogenetic models and codon substitution frequencies to identify evolutionary signatures that a...

Disclosures

The authors declare that they have no competing financial interests.

Acknowledgements

This work was supported by grants from the National Institutes of Health (HL-141630 and HL-160569) and Cincinnati Children's Research Foundation (Trustee Award).

Materials

Website	Website Address	Requirements	Comments
Clustal Omega Multiple Sequence Alignment Tool	https://www.ebi.ac.uk/Tools/msa/clustalo/	Web browser	Multiple sequence alignment program for the efficient alignment of FASTA sequences (i.e. for cross-species comparison of identified microproteins)
COXPRESdb	https://coxpresdb.jp	Web browser	Provides co-regulated gene relationships to estimate gene functions
EMBL-EBI Bioinformatics Tools FAQs	https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Bioinformatics+Tools+FAQ	Web browser	Frequently Asked Questions (FAQs) for EMBL-EBI tools. Includes the color coding key for protein sequence alignments
European Bioinformatics Institute (EMBL-EBI), Tools and Data Resources	https://www.ebi.ac.uk/services/all	Web browser	Comprehensive list of freely available websites, tools and data resources
Expasy - Swiss Bioinformatics Resource Portal	https://www.expasy.org	Web browser	Suite of bioinformatic tools and resources for protein sequence analysis that is maintained by the Swiss Institute of Bioinformatics (SIB)
National Center for Biotechnology Information (NCBI) Conserved Domain Search	https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi	Web browser	Search tool to identify conserved domains within protein or coding nucleotide sequences
Pfam 35	http://pfam.xfam.org	Web browser	Protein family (Pfam) database, provides alignments and classification of protein families and domains
PhyloCSF Track Hub Description	https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=1267045267_TEc99h2oW5QedaCd4ir8aZ65ryaD&db=mm10&c=chr2&g=hub_109801_PhyloCSF_smooth	Web browser	Detailed description of the Smoothed PhyloCSF tracks and PhyloCSF Track Hub
SignalP 6.0	https://services.healthtech.dtu.dk/service.php?SignalP-6.0	Web browser	Predicts the presence of signal peptides and the location of their cleavage sites
TMHMM - 2.0	https://services.healthtech.dtu.dk/service.php?TMHMM-2.0	Web browser	Prediction of transmembrane helices in proteins
UCSC Genome Browser BLAT Search	https://genome.ucsc.edu/cgi-bin/hgBlat	Web browser	Tool used to find genomic regions using DNA or protein sequence information
UCSC Genome Browser Gateway	https://genome.ucsc.edu/cgi-bin/hgGateway	Web browser	Direct link to the UCSC Genome Browser Gateway
UCSC Genome Browser Home	https://genome.ucsc.edu/	Web browser	Home website for the UCSC Genome Browser
UCSC Genome Browser Track Data Hubs	https://genome.ucsc.edu/cgi-bin/hgHubConnect#publicHubs	Web browser	Direct link to Track Data Hubs/Public Hubs database to search for and load the PhyloCSF Tracks
UCSC Genome Browser User Guide	https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html	Web browser	Comprehensive user guide detailing how to navigate the UCSC Genome Browser
WoLF PSORT	https://wolfpsort.hgc.jp	Web browser	Protein subcellular localization prediction tool

References

Collins, F. S., Morgan, M., Patrinos, A. The human genome project: lessons from large-scale biology. Science. 300 (5617), 286-290 (2003).
Lander, E. S., et al. Initial sequencing and analysis of the human genome. Nature. 409 (6822), 860-921 (2001).
Sachidanandam, R., et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 409 (6822), 928-933 (2001).
Venter, J. C., et al. The sequence of the human genome. Science. 291 (5507), 1304-1351 (2001).
Fuentes-Pardo, A. P., Ruzzante, D. E. Whole-genome sequencing approaches for conservation biology: Advantages, limitations and practical recommendations. Molecular Ecology. 26 (20), 5369-5406 (2017).
Carninci, P., et al. The transcriptional landscape of the mammalian genome. Science. 309 (5740), 1559-1563 (2005).
Maeda, N., et al. Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genetics. 2 (4), 62 (2006).
Schlesinger, D., Elsasser, S. J. Revisiting sORFs: overcoming challenges to identify and characterize functional microproteins. The FEBS Journal. 289 (1), 53-74 (2022).
Ingolia, N. T., et al. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Reports. 8 (5), 1365-1379 (2014).
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R., Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 324 (5924), 218-223 (2009).
Aspden, J. L., et al. Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. Elife. 3, 03528 (2014).
Andrews, S. J., Rothnagel, J. A. Emerging evidence for functional peptides encoded by short open reading frames. Nature Reviews Genetics. 15 (3), 193-204 (2014).
Mackowiak, S. D., et al. Extensive identification and analysis of conserved small ORFs in animals. Genome Biology. 16 (1), 1-21 (2015).
Ruiz-Orera, J., Messeguer, X., Subirana, J. A., Alba, M. M. Long non-coding RNAs as a source of new peptides. Elife. 3, 03523 (2014).
Basrai, M. A., Hieter, P., Boeke, J. D. Small open reading frames: beautiful needles in the haystack. Genome Research. 7 (8), 768-771 (1997).
Frith, M. C., et al. The abundance of short proteins in the mammalian proteome. PLoS Genetics. 2 (4), 52 (2006).
Ladoukakis, E., Pereira, V., Magny, E. G., Eyre-Walker, A., Couso, J. P. Hundreds of putatively functional small open reading frames in Drosophila. Genome Biology. 12 (11), 118 (2011).
Makarewich, C. A., Olson, E. N. Mining for Micropeptides. Trends in Cell Biology. 27 (9), 685-696 (2017).
Wright, B. W., Yi, Z., Weissman, J. S., Chen, J. The dark proteome: translation from noncanonical open reading frames. Trends in Cell Biology. , (2021).
Saghatelian, A., Couso, J. P. Discovery and characterization of smORF-encoded bioactive polypeptides. Nature Chemical Biology. 11 (12), 909-916 (2015).
Kastenmayer, J. P., et al. Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Research. 16 (3), 365-373 (2006).
Smith, J. E., et al. Translation of small open reading frames within unannotated RNA transcripts in Saccharomyces cerevisiae. Cell Reports. 7 (6), 1858-1866 (2014).
Lin, M. F., et al. Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. Genome Research. 17 (12), 1823-1836 (2007).
Magny, E. G., et al. Conserved regulation of cardiac calcium uptake by peptides encoded in small open reading frames. Science. 341 (6150), 1116-1120 (2013).
Bazzini, A. A., et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 33 (9), 981-993 (2014).
Ingolia, N. T., Lareau, L. F., Weissman, J. S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 147 (4), 789-802 (2011).
Ma, J., et al. Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J Proteome Res. 13 (3), 1757-1765 (2014).
Slavoff, S. A., et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nature Chemical Biology. 9 (1), 59-64 (2013).
Khitun, A., Ness, T. J., Slavoff, S. A. Small open reading frames and cellular stress responses. Molecular Omics. 15 (2), 108-116 (2019).
Makarewich, C. A. The hidden world of membrane microproteins. Experimental Cell Research. 388 (2), 111853 (2020).
Pueyo, J. I., Magny, E. G., Couso, J. P. New peptides under the s(ORF)ace of the genome. Trends in Biochemical Sciences. 41 (8), 665-678 (2016).
Pauli, A., et al. Toddler: an embryonic signal that promotes cell movement via Apelin receptors. Science. 343 (6172), 1248636 (2014).
Chng, S. C., Ho, L., Tian, J., Reversade, B. ELABELA: a hormone essential for heart development signals via the apelin receptor. Developmental Cell. 27 (6), 672-680 (2013).
Lee, C., et al. The mitochondrial-derived peptide MOTS-c promotes metabolic homeostasis and reduces obesity and insulin resistance. Cell Metabolism. 21 (3), 443-454 (2015).
Pauli, A., Valen, E., Schier, A. F. Identifying (non-)coding RNAs and small peptides: challenges and opportunities. Bioessays. 37 (1), 103-112 (2015).
Plaza, S., Menschaert, G., Payre, F. In search of lost small peptides. Annual Review of Cell and Developmental Biology. 33, 391-416 (2017).
Kiniry, S. J., Michel, A. M., Baranov, P. V. Computational methods for ribosome profiling data analysis. Wiley Interdisciplinary Reviews: RNA. 11 (3), 1577 (2020).
Anderson, D. M., et al. A micropeptide encoded by a putative long noncoding RNA regulates muscle performance. Cell. 160 (4), 595-606 (2015).
Anderson, D. M., et al. Widespread control of calcium signaling by a family of SERCA-inhibiting micropeptides. Science Signaling. 9 (457), (2016).
Makarewich, C. A., et al. MOXI Is a mitochondrial micropeptide that enhances fatty acid beta-oxidation. Cell Reports. 23 (13), 3701-3709 (2018).
Nelson, B. R., et al. A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science. 351 (6270), 271-275 (2016).
Chu, Q., et al. Regulation of the ER stress response by a mitochondrial microprotein. Nat Commun. 10 (1), 4883 (2019).
Senis, E., et al. TUNAR lncRNA encodes a microprotein that regulates neural differentiation and neurite formation by modulating calcium dynamics. Frontiers in Cell and Developmental Biology. 9, 747667 (2021).
Li, M., et al. A putative long noncoding RNA-encoded micropeptide maintains cellular homeostasis in pancreatic beta cells. Molecular Therapy-Nucleic Acids. 26, 307-320 (2021).
Martinez, T. F., et al. Accurate annotation of human protein-coding small open reading frames. Nature Chemical Biology. 16 (4), 458-468 (2020).
van Heesch, S., et al. The translational landscape of the human heart. Cell. 178 (1), 242-260 (2019).
Makarewich, C. A., et al. The cardiac-enriched microprotein mitolamban regulates mitochondrial respiratory complex assembly and function in mice. Proceedings of the National Academy of Sciences of the United States of America. 119 (6), 2120476119 (2022).
Lin, M. F., Jungreis, I., Kellis, M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 27 (13), 275-282 (2011).
Mudge, J. M., et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research. 29 (12), 2073-2087 (2019).
Kent, W. J., et al. The human genome browser at UCSC. Genome Research. 12 (6), 996-1006 (2002).
Raney, B. J., et al. Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser. Bioinformatics. 30 (7), 1003-1005 (2014).
Sievers, F., et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 7 (1), 539 (2011).
Goujon, M., et al. A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Research. 38 (2), 695-699 (2010).
Harte, N., et al. Public web-based services from the European Bioinformatics Institute. Nucleic Acids Research. 32 (2), 3-9 (2004).
Waterhouse, A. M., Procter, J. B., Martin, D. M., Clamp, M., Barton, G. J. Jalview Version 2-a multiple sequence alignment editor and analysis workbench. Bioinformatics. 25 (9), 1189-1191 (2009).
Madeira, F., et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Research. 47 (1), 636-641 (2019).
Friesen, M., et al. Mitoregulin controls beta-oxidation in human and mouse adipocytes. Stem Cell Reports. 14 (4), 590-602 (2020).
Stein, C. S., et al. Mitoregulin: A lncRNA-Encoded microprotein that supports mitochondrial supercomplexes and respiratory efficiency. Cell Reports. 23 (13), 3710-3720 (2018).
Chugunova, A., et al. LINC00116 codes for a mitochondrial peptide linking respiration and lipid metabolism. Proceedings of the National Academy of Sciences of the United States of America. 116 (11), 4940-4945 (2019).
Lin, Y. F., et al. A novel mitochondrial micropeptide MPM enhances mitochondrial respiratory activity and promotes myogenic differentiation. Cell Death and Disease. 10 (7), 528 (2019).
Wang, L., et al. The micropeptide LEMP plays an evolutionarily conserved role in myogenesis. Cell Death and Disease. 11 (5), 357 (2020).
He, S., Liu, S., Zhu, H. The sequence, structure and evolutionary features of HOTAIR in mammals. BMC Evolutionary Biology. 11 (1), 1-14 (2011).
Rinn, J. L., et al. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell. 129 (7), 1311-1323 (2007).
Bhatta, A., et al. A Mitochondrial micropeptide is required for activation of the Nlrp3 inflammasome. Journal of Immunology. 204 (2), 428-437 (2020).
Zhang, D., et al. Functional prediction and physiological characterization of a novel short trans-membrane protein 1 as a subunit of mitochondrial respiratory complexes. Physiological Genomics. 44 (23), 1133-1140 (2012).
Rathore, A., et al. MIEF1 microprotein regulates mitochondrial translation. Biochemistry. 57 (38), 5564-5575 (2018).
Jungreis, I., Sealfon, R., Kellis, M. SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nature Communications. 12 (1), 2642 (2021).
Chen, J., et al. Pervasive functional translation of noncanonical human open reading frames. Science. 367 (6482), 1140-1146 (2020).
Ruiz-Orera, J., Verdaguer-Grau, P., Villanueva-Canas, J. L., Messeguer, X., Alba, M. M. Translation of neutrally evolving peptides provides a basis for de novo gene evolution. Nature Ecology and Evolution. 2 (5), 890-896 (2018).
Blevins, W. R., et al. Uncovering de novo gene birth in yeast using deep transcriptomics. Nature Communications. 12 (1), 604 (2021).
Papadopoulos, C., et al. Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution. Genome Research. , (2021).
Vakirlis, N., Duggan, K. M., McLysaght, A. De novo birth of functional, human-specific microproteins. bioRxiv. , 462744 (2021).
Van Oss, S. B., Carvunis, A. R. De novo gene birth. PLoS Genetics. 15 (5), 1008160 (2019).
Andersson, D. I., Jerlstrom-Hultqvist, J., Nasvall, J. Evolution of new functions de novo and from preexisting genes. Cold Spring Harbor Perspectives in Biology. 7 (6), 017996 (2015).
Ge, Q., et al. Micropeptide ASAP encoded by LINC00467 promotes colorectal cancer progression by directly modulating ATP synthase activity. Journal of Clinical Investigation. 131 (22), (2021).
Sonnhammer, E. L., von Heijne, G., Krogh, A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proceedings. International Conference on Intelligent Systems for Molecular Biology. 6, 175-182 (1998).
Lu, S., et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Research. 48, 265-268 (2020).
Mistry, J., et al. Pfam: The protein families database in 2021. Nucleic Acids Research. 49, 412-419 (2021).
Horton, P., et al. PSORT: protein localization predictor. Nucleic Acids Research. 35 (2), 585-587 (2007).
Obayashi, T., Kagaya, Y., Aoki, Y., Tadaka, S., Kinoshita, K. COXPRESdb v7: a gene coexpression database for 11 animal species supported by 23 coexpression platforms for technical evaluation and evolutionary inference. Nucleic Acids Research. 47, 55-62 (2019).
Teufel, F., et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nature Biotechnology. , 01156 (2022).
