The purpose of this protocol is to investigate the evolution and expression of candidate genes using RNA sequencing data.
Distilling and reporting large datasets, such as whole genome or transcriptome data, is often a daunting task. One way to break down results is to focus on one or more gene families that are significant to the organism and study. In this protocol, we outline bioinformatic steps to generate a phylogeny and to quantify the expression of genes of interest. Phylogenetic trees can give insight into how genes are evolving within and between species as well as reveal orthology. These results can be enhanced using RNA-seq data to compare the expression of these genes in different individuals or tissues. Studies of molecular evolution and expression can reveal modes of evolution and conservation of gene function between species. The characterization of a gene family can serve as a springboard for future studies and can highlight an important gene family in a new genome or transcriptome paper.
Advances in sequencing technologies have facilitated the sequencing of genomes and transcriptomes of non-model organisms. In addition to the increased feasibility of sequencing DNA and RNA from many organisms, an abundance of data is publicly available to study genes of interest. The purpose of this protocol is to provide bioinformatic steps to investigate the molecular evolution and expression of genes that may play an important role in the organism of interest.
Investigating the evolution of a gene or gene family can provide insight into the evolution of biological systems. Members of a gene family are typically determined by identifying conserved motifs or homologous gene sequences. Gene family evolution was previously investigated using genomes from distantly related model organisms1. A limitation of this approach is that it leaves unclear how these gene families evolve in closely related species and what role different environmental selective pressures play. In this protocol, we include a search for homologs in closely related species. By generating a phylogeny at a phylum level, we can note trends in gene family evolution, such as conserved genes or lineage-specific duplications. At this level, we can also investigate whether genes are orthologs or paralogs. While many homologs likely function similarly to each other, that is not necessarily the case2. Incorporating phylogenetic trees in these studies is important to resolve whether these homologous genes are orthologs. In eukaryotes, many orthologs retain similar functions within the cell, as evidenced by the ability of mammalian proteins to restore the function of yeast orthologs3. However, there are instances where a non-orthologous gene carries out a characterized function4.
Phylogenetic trees begin to delineate relationships between genes and species, yet function cannot be assigned solely based on genetic relationships. Gene expression studies combined with functional annotations and enrichment analysis provide strong support for gene function. Cases where gene expression can be quantified and compared across individuals or tissue types can be more telling of potential function. This protocol follows methods used in investigating opsin genes in Hydra vulgaris7, but the methods can be applied to any species and any gene family. The results of such studies provide a foundation for further investigation into gene function and gene networks in non-model organisms. As an example, the investigation of the phylogeny of opsins, which are proteins that initiate the phototransduction cascade, gives context to the evolution of eyes and light detection8,9,10,11. In this case, non-model organisms, especially basal animal species such as cnidarians or ctenophores, can elucidate conservation or changes in the phototransduction cascade and vision across clades12,13,14. Similarly, determining the phylogeny, expression, and networks of other gene families will inform us about the molecular mechanisms underlying adaptations.
This protocol follows UC Irvine animal care guidelines.
1. RNA-seq library preparation
2. Access a computer cluster
NOTE: RNA-seq analysis requires manipulation of large files and is best done on a computer cluster (Table of Materials).
3. Obtain RNA-seq reads
4. Trim adapters and low-quality reads (optional)
5. Obtain reference assembly
6. Generate a de novo assembly (Alternative to Step 5)
7. Map reads to the genome (7.1) or de novo transcriptome (7.2)
8. Identify genes of interest
NOTE: The following steps can be done with nucleotide or protein FASTA files, but they work best and are more straightforward with protein sequences. Protein-to-protein BLAST searches are more likely to return hits when searching between different species.
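As a rough illustration of how BLAST results might be screened for candidate genes, the sketch below parses BLAST+ tabular output (`-outfmt 6`) and keeps the best-scoring hit per subject sequence. The function name `best_hits`, the e-value cutoff of 1e-5, and the example identifiers are assumptions made for illustration; they are not taken from the protocol.

```python
import csv
import io

# Standard column order of BLAST+ tabular output (-outfmt 6)
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5):
    """Return {subject_id: best_row} for hits at or below max_evalue.

    The default cutoff is an illustrative assumption, not a protocol value.
    """
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines
        if not fields or fields[0].startswith("#"):
            continue
        row = dict(zip(OUTFMT6_COLUMNS, fields))
        if float(row["evalue"]) > max_evalue:
            continue
        previous = best.get(row["sseqid"])
        # Keep only the highest bit score seen for each subject sequence
        if previous is None or float(row["bitscore"]) > float(previous["bitscore"]):
            best[row["sseqid"]] = row
    return best


# Hypothetical hits: two queries against a translated transcriptome
example = io.StringIO(
    "opsin_query\ttranscript_A\t45.2\t300\t150\t4\t10\t309\t5\t304\t2e-60\t210\n"
    "opsin_query\ttranscript_B\t28.0\t120\t80\t6\t50\t169\t20\t139\t0.5\t31.2\n"
    "opsin_query2\ttranscript_A\t40.1\t290\t160\t5\t12\t301\t8\t297\t1e-48\t175\n"
)
hits = best_hits(example)
candidate_ids = sorted(hits)
print(candidate_ids)
```

The resulting identifiers could then be used to pull the matching sequences out of the protein FASTA file before alignment. The appropriate cutoff depends on the gene family and the database size.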
9. Phylogenetic trees
10. Visualize gene expression using TPM
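To make the TPM (transcripts per million) unit concrete, the following minimal sketch shows the arithmetic behind it: each read count is divided by transcript length, and the resulting rates are scaled so that they sum to one million within a sample. RSEM, listed in the Table of Materials, reports TPM values itself, so this is only an illustration of the calculation. The function name `tpm` and the example gene names, counts, and lengths are hypothetical.

```python
def tpm(counts, lengths):
    """Convert per-gene read counts to transcripts per million (TPM).

    counts  -- {gene_id: mapped read count}
    lengths -- {gene_id: transcript length in bases}
    """
    # Length-normalized rate for each gene (reads per base)
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Scale so that the values within one sample sum to one million
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical counts and lengths for three genes in one sample
sample_counts = {"gene_a": 100, "gene_b": 300, "gene_c": 0}
sample_lengths = {"gene_a": 1000, "gene_b": 1500, "gene_c": 2000}
values = tpm(sample_counts, sample_lengths)
for gene, value in sorted(values.items()):
    print(f"{gene}\t{value:.1f}")
```

Because TPM values sum to the same total in every sample, they describe the relative abundance of a gene within each sample, which makes them convenient for plotting expression of a gene family across tissues.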
The methods above are summarized in Figure 1 and were applied to a data set of Hydra vulgaris tissues. H. vulgaris is a freshwater invertebrate that belongs to the phylum Cnidaria, which also includes corals, jellyfish, and sea anemones. H. vulgaris can reproduce asexually by budding and can regenerate its head and foot when bisected. In this study, we aimed to investigate the evolution and expression of opsin genes in Hydra
The purpose of this protocol is to provide an outline of the steps for characterizing a gene family using RNA-seq data. These methods have been proven to work for a variety of species and datasets4,34,35. The pipeline established here has been simplified and should be easy enough to be followed by a novice in bioinformatics. The significance of the protocol is that it outlines all the steps and necessary programs to complete a p...
The authors have nothing to disclose.
We thank Adriana Briscoe, Gil Smith, Rabi Murad, and Aline G. Rangel for advice and guidance in incorporating some of these steps into our workflow. We are also grateful to Katherine Williams, Elisabeth Rebboah, and Natasha Picciani for comments on the manuscript. This work was supported in part by a George E. Hewitt Foundation for Medical Research fellowship to A.M.M.
Name | Company | Catalog Number | Comments |
Bioanalyzer-DNA kit | Agilent | 5067-4626 | wet lab materials |
Bioanalyzer-RNA kit | Agilent | 5067-1513 | wet lab materials |
BLAST+ v. 2.8.1 | | | On computer cluster* https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ |
Blast2GO (on your PC) | | | On local computer https://www.blast2go.com/b2g-register-basic |
boost v. 1.57.0 | | | On computer cluster |
Bowtie v. 1.0.0 | | | On computer cluster https://sourceforge.net/projects/bowtie-bio/files/bowtie/1.3.0/ |
Computing cluster (highly recommended) | | | NOTE: Analyses of genomic data are best done on a high-performance computing cluster because files are very large. |
Cufflinks v. 2.2.1 | | | On computer cluster |
edgeR v. 3.26.8 (in R) | | | In RStudio https://bioconductor.org/packages/release/bioc/html/edgeR.html |
gcc v. 6.4.0 | | | On computer cluster |
Java v. 11.0.2 | | | On computer cluster |
MEGA7 (on your PC) | | | On local computer https://www.megasoftware.net |
MEGAX v. 0.1 | | | On local computer https://www.megasoftware.net |
NucleoSpin RNA II kit | Macherey-Nagel | 740955.5 | wet lab materials |
perl 5.30.3 | | | On computer cluster |
python | | | On computer cluster |
Qubit 2.0 Fluorometer | ThermoFisher | Q32866 | wet lab materials |
R v. 4.0.0 | | | On computer cluster https://cran.r-project.org/src/base/R-4/ |
RNAlater | ThermoFisher | AM7021 | wet lab materials |
RNeasy kit | Qiagen | 74104 | wet lab materials |
RSEM v. 1.3.0 | | | Computer software https://deweylab.github.io/RSEM/ |
RStudio v. 1.2.1335 | | | On local computer https://rstudio.com/products/rstudio/download/#download |
Samtools v. 1.3 | | | Computer software |
SRA Toolkit v. 2.8.1 | | | On computer cluster https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit |
STAR v. 2.6.0c | | | On computer cluster https://github.com/alexdobin/STAR |
StringTie v. 1.3.4d | | | On computer cluster https://ccb.jhu.edu/software/stringtie/ |
Transdecoder v. 5.5.0 | | | On computer cluster https://github.com/TransDecoder/TransDecoder/releases |
Trimmomatic v. 0.35 | | | On computer cluster http://www.usadellab.org/cms/?page=trimmomatic |
Trinity v. 2.8.5 | | | On computer cluster https://github.com/trinityrnaseq/trinityrnaseq/releases |
TRIzol | ThermoFisher | 15596018 | wet lab materials |
TruSeq RNA Library Prep Kit v2 | Illumina | RS-122-2001 | wet lab materials |
TURBO DNA-free Kit | ThermoFisher | AM1907 | wet lab materials |
*Downloads and installation on the computer cluster may require root access. Contact your network administrator.