The protocol described here provides detailed instructions for analyzing genomic regions of interest for protein-coding potential using PhyloCSF on the user-friendly UCSC Genome Browser. PhyloCSF can effectively identify conserved short open reading frames with microprotein-coding potential in genomic regions that are currently annotated as non-coding. The methods described here are easy to use and can be implemented by investigators of all backgrounds without prior training or expertise in bioinformatics or comparative genomics.
To begin, open an internet browser window and navigate to the University of California, Santa Cruz, or UCSC, Genome Browser. Under the our tools heading, select the track hubs option. In the public hubs tab, type PhyloCSF into the search terms box.
Then, click on the search public hubs button. Connect to PhyloCSF by clicking on the connect button for the hub named PhyloCSF. After clicking on connect, wait to be redirected to the UCSC Genome Browser gateway page.
To query a different species, select the species of interest under the browse or select species heading by clicking on the appropriate icon, or type the species into the text box that says, enter species, common name or assembly ID. Using the dropdown menu, choose the assembly to search under the find position heading. Then enter the position, gene symbol, or search terms in the position or search term box and click on go to navigate to a gene of interest on the genome browser. If the search results in multiple matches, wait to be redirected to a page that requires the selection of a position of interest, then click on the appropriate gene of interest. After navigating to the UCSC Genome Browser, select the BLAST-like alignment tool, or BLAT, under the our tools heading to query a specific DNA or protein sequence.
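The navigation step above can also be expressed as a link. The following is a minimal sketch, assuming the Genome Browser's `hgTracks` page with its `db` (assembly ID) and `position` (coordinates or gene symbol) URL parameters. The function name `browser_url` is hypothetical, and the link does not by itself connect the PhyloCSF hub, which still has to be connected as described earlier.

```python
from urllib.parse import urlencode

# Base address of the UCSC Genome Browser track display page.
UCSC_HGTRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def browser_url(assembly, position):
    """Build a Genome Browser link for an assembly ID and a position or gene symbol.

    `assembly` is an assembly ID such as "hg38"; `position` is whatever would be
    typed into the position or search term box.
    """
    query = urlencode({"db": assembly, "position": position})
    return UCSC_HGTRACKS + "?" + query


if __name__ == "__main__":
    # HOTAIR is used here only because it appears in the representative results.
    print(browser_url("hg38", "HOTAIR"))
```

Opening the printed link in the same browser session in which the hub was connected is one way to return to a locus without repeating the search.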
Alternatively, hover the cursor over the tools tab and select the BLAT option, or follow the given link. Using the dropdown menu, select the species, genome, and assembly of interest. Then, define the query type, paste the sequence of interest into the BLAT search genome text box, and click submit.
Next, click on the browser link under the actions heading to navigate to the genomic region of interest. Visually scan the genomic area of interest for positively scoring PhyloCSF regions. Use the zoom feature to magnify regions of interest to examine sequence characteristics and search for start and stop codons.
To zoom in manually, hold the shift key, then click and hold the mouse button while dragging along the region of interest. Alternatively, use the zoom in and zoom out buttons at the top of the page to navigate. Zoom in until the nucleotide or base sequence is visible.
Visually scan the nucleotide sequence near the beginning and end of the positively scoring PhyloCSF regions to identify putative start and stop codons. Hover the mouse cursor over the view heading at the top of the page and click on the in other genomes convert option, then define the genome of interest using the dropdown menu below the new genome heading. Select the genomic assembly of interest under the new assembly heading and click the submit button.
Once the browser returns a list of regions in the new assembly with similarity, click on the chromosome position link to navigate to the homologous region of interest. Follow the navigational strategies described earlier to analyze the sequence.
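The visual search for putative start and stop codons can be cross-checked on a copied nucleotide sequence. The following is a minimal sketch, not part of the protocol itself: the function names and the `min_codons` threshold are hypothetical, only ATG is treated as a start codon, and the frame numbers it reports are relative to the pasted sequence, not to the PhyloCSF track names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=10):
    """List ATG-to-stop open reading frames on both strands.

    Each hit is (strand, frame, start, end): frame is 1, 2 or 3 relative to the
    scanned strand, and start/end are 0-based, end-exclusive offsets on that
    strand. The codon count includes the stop codon.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # offset of the first ATG since the last stop codon
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        hits.append((strand, frame, start, i + 3))
                    start = None
    # Report frames as 1..3 rather than 0..2.
    return [(strand, frame + 1, a, b) for strand, frame, a, b in hits]


if __name__ == "__main__":
    # Toy sequence for illustration only, not taken from any gene.
    toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
    print(find_orfs(toy, min_codons=3))
```

A short open reading frame found this way is only a candidate. Its overlap with a positively scoring PhyloCSF region is what suggests conservation of coding potential.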
To navigate to the gene description page, click on the gene of interest in the GENCODE track on the UCSC Genome Browser. Under the sequence and links to tools and databases heading, click on the link in the table that reads other species FASTA. Click on the boxes associated with the species of interest to select them.
Then, click on submit. Copy and paste the sequences appearing at the bottom of the page in FASTA format into a word processing document. Next, open a second browser window and navigate to the Clustal Omega multiple sequence alignment tool on the European Bioinformatics Institute website.
Paste the copied sequences into the box in step one that reads sequences in any supported format. Scroll to the bottom of the page and click on submit. Below the aligned results, look for the symbols that indicate the degree of conservation of each amino acid.
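To make the conservation symbols concrete: in Clustal output an asterisk marks a column in which every sequence has the same residue, while a colon and a period mark strongly and weakly similar residues. The sketch below is a simplified illustration that reproduces only the asterisk line from an already aligned FASTA text. The function names are hypothetical and the toy sequences are invented for the example.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def identity_line(aligned):
    """Return '*' for fully identical, gap-free columns and ' ' otherwise."""
    seqs = list(aligned.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*seqs):
        residues = set(column)
        identical = len(residues) == 1 and "-" not in residues
        marks.append("*" if identical else " ")
    return "".join(marks)


if __name__ == "__main__":
    # Invented toy alignment; gaps are written as '-'.
    toy = ">speciesA\nMKT-L\n>speciesB\nMRT-L\n"
    print(identity_line(parse_fasta(toy)))
```

Columns without an asterisk are not necessarily unconserved. Clustal Omega's colon and period classes, which this sketch leaves blank, capture substitutions between residues with similar properties.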
To color the amino acids according to their properties, click on the show colors link directly above the sequences. Then copy and paste the sequence alignment into a word processing or slide show program to generate a figure or illustration file. To view other outputs from the Clustal Omega results page, click on the guide tree or phylogenetic tree tabs.
Finally, click on the result viewers tab for options to view the sequence information using Jalview or to access direct links to MView and Simple Phylogeny. A representative PhyloCSF analysis of the mitoregulin gene indicates a region of high sequence conservation corresponding to a validated microprotein. The complete mitoregulin coding sequence is contained within exon one and scores very highly on the PhyloCSF minus one track.
A conserved start codon can be observed at the beginning of the positively scoring region in the PhyloCSF minus one track. The positively scoring region in the first exon of mitoregulin begins directly over a start codon and terminates at the stop codon. The multiple sequence alignment of the microprotein mitoregulin for eight different species is shown here.
The analysis of the long noncoding RNA HOTAIR showed a negative score throughout the entire gene across all six tracks, indicating a lack of sequence conservation and supporting that HOTAIR is correctly annotated as a noncoding RNA. PhyloCSF analysis of the mouse 1810058I24Rik gene showed that a conserved open reading frame spans three exons, and the positive PhyloCSF score jumps from the plus two track in exon one to the plus three track in exon two, and then back to the plus two track in exon three. PhyloCSF analysis of the MIEF1 gene locus was also effectively used to identify multiple distinct coding open reading frames within a single RNA molecule.
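The jump between tracks follows from how reading frames relate to chromosome coordinates: when an intron's length is not a multiple of three, the same open reading frame lands in a different genomic frame in the next exon. The sketch below illustrates this arithmetic for a plus-strand gene. The coordinates are invented and do not describe the real locus, and the labeling rule (position modulo three, plus one) is an assumption made for illustration, not a statement of the hub's exact numbering convention.

```python
def exon_frames(exons, cds_start):
    """Return one genomic frame label (1..3) per exon that carries coding bases.

    exons: sorted list of (start, end) plus-strand coordinates, 0-based and
        end-exclusive.
    cds_start: genomic coordinate of the first base of the start codon.
    The label is (genomic position of a codon's first base) % 3 + 1, which is
    an assumed convention. Coding is taken to continue through the last exon
    given, and each exon is assumed long enough to contain a codon start.
    """
    frames = []
    coding_bases = 0  # coding bases consumed in earlier exons
    for start, end in exons:
        if end <= cds_start:
            continue  # exon lies entirely upstream of the start codon
        seg_start = max(start, cds_start)
        # Bases to skip in this exon to finish a codon split by the intron.
        to_codon_start = (-coding_bases) % 3
        frames.append((seg_start + to_codon_start) % 3 + 1)
        coding_bases += end - seg_start
    return frames


if __name__ == "__main__":
    # Hypothetical exons: the introns are 40 and 41 bases long.
    toy_exons = [(100, 160), (200, 260), (301, 360)]
    print(exon_frames(toy_exons, cds_start=100))
```

With these invented coordinates the labels come out as 2, 3, 2, the same pattern as the plus two, plus three, plus two sequence described above. For this reason a single open reading frame should be followed across tracks at each splice junction and not expected to stay on one track.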
It is important to note that while a positive PhyloCSF score is highly suggestive of microprotein-coding capacity, this line of evidence cannot stand alone and must be experimentally validated. Once a putative microprotein has been identified, the amino acid sequence can be analyzed for conserved domains or sequence characteristics to provide insight into its function. PhyloCSF has been effectively used to identify novel microproteins in genomic regions previously thought to be non-coding and will continue to be a helpful tool in future microprotein identification studies.