We thank Adriana Briscoe, Gil Smith, Rabi Murad and Aline G. Rangel for advice and guidance in incorporating some of these steps into our workflow. We are also grateful to Katherine Williams, Elisabeth Rebboah, and Natasha Picciani for comments on the manuscript. This work was supported in part by a George E. Hewitt Foundation for Medical Research fellowship to A.M.M.

This protocol follows UC Irvine animal care guidelines.
1. RNA-seq library preparation
<ol>
	<li>Isolate RNA using the following methods.
	<ol>
		<li>Collect samples. If RNA is to be extracted at a later time, flash freeze the sample or place it in RNA storage solution<sup>15</sup> (Table of Materials).</li>
		<li>Euthanize and dissect the organism to separate tissues of interest.</li>
		<li>Extract total RNA using an extraction kit and purify the RNA using an RNA purification kit (Table of Materials).
		NOTE: Some protocols and kits may work better for particular species and tissue types<sup>16,17</sup>. We have extracted RNA from different body tissues of a butterfly<sup>18</sup> and a gelatinous Hydra<sup>19</sup> (see discussion).</li>
		<li>Measure the concentration and quality of the RNA of each sample (Table of Materials). To construct cDNA libraries, use samples with RNA integrity numbers (RIN) higher than 8, ideally closer to 9<sup>20</sup>.</li>
	</ol>
	</li>
	<li>Construct cDNA library and sequence as follows.
	<ol>
		<li>Build cDNA libraries according to the library prep instruction manual (see discussion).</li>
		<li>Determine cDNA concentration and quality (Table of Materials).</li>
		<li>Multiplex the libraries and sequence them.</li>
	</ol>
	</li>
</ol>
2. Access a computer cluster
NOTE: RNA-seq analysis requires manipulation of large files and is best done on a computer cluster (Table of Materials).
<ol>
	<li>Log in to the computer cluster account by typing ssh username@clusterlocation in a terminal (Mac) or PuTTY (Windows) window.</li>
</ol>
3. Obtain RNA-seq reads
<ol>
	<li>Obtain RNA-seq reads from the sequencing facility or, for data generated in a publication, from the data repository where it was deposited (3.2 or 3.3).</li>
	<li>To download data from repositories such as ArrayExpress do the following:
	<ol>
		<li>Search the site using the accession number.</li>
		<li>Find the link to download the data, then right-click it and select Copy Link.</li>
		<li>In the terminal window, type wget followed by the pasted link to download the data to the directory for analysis.</li>
	</ol>
	</li>
	<li>To download NCBI Sequence Read Archive (SRA) data, follow these alternative steps:
	<ol>
		<li>In the terminal, download SRA Toolkit v. 2.8.1 using wget.
		NOTE: Downloading and installing programs on the computer cluster may require root access. Contact your computer cluster administrator if installation fails.</li>
		<li>Finish installing the program by typing tar -xvf $TARGZFILE.</li>
		<li>Search NCBI for the SRA accession number of the samples you want to download. It should have the format SRRXXXXXX.</li>
		<li>Obtain the RNA-seq data by typing [sratoolkit location]/bin/prefetch SRRXXXXXX in the terminal window.</li>
		<li>For paired-end files, type [sratoolkit location]/bin/fastq-dump --split-files SRRXXXXXX to get two FASTQ files (SRRXXXXXX_1.fastq and SRRXXXXXX_2.fastq). 
		NOTE: To do a Trinity de novo assembly use the command [sratoolkit location]/bin/fastq-dump --defline-seq &#39;@$sn[_$rn]/$ri&#39; --split-files SRRXXXXXX</li>
	</ol>
	</li>
</ol>
4. Trim adapters and low-quality reads (optional)
<ol>
	<li>Install or load Trimmomatic<sup>21</sup> v. 0.35 on the computing cluster.</li>
	<li>In the directory where the RNA-seq data files are located, type a command that includes the location of the Trimmomatic jar file, the input FASTQ files, the output FASTQ files, and optional parameters such as read length and quality. 
	NOTE: The command will vary with the quality and length of the raw reads and with the quality and length desired after trimming. For Illumina 43 bp reads with Nextera primers, we used: java -jar /data/apps/trimmomatic/0.35/trimmomatic-0.35.jar PE $READ1.FASTQ $READ2.FASTQ paired_READ1.FASTQ unpaired_READ1.FASTQ paired_READ2.FASTQ unpaired_READ2.FASTQ ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:17 MINLEN:30.</li>
</ol>
5. Obtain reference assembly
<ol>
	<li>Search Google, EnsemblGenomes, and NCBI Genomes and Nucleotide TSA (Transcriptome Shotgun Assembly) for a reference genome or assembled transcriptome for the species of interest (Figure 1). 
	NOTE: If a reference genome or transcriptome is not available or is of low quality, proceed to STEP 6 to generate a de novo assembly.</li>
	<li>If a reference genome or assembled transcriptome exists, download it as a fasta file to where the analysis will be performed following the steps below.
	<ol>
		<li>Find the link to download the genome, then right-click it and select Copy Link.</li>
		<li>In the terminal window, type wget followed by the pasted link address. If available, also download the GTF file and protein FASTA file for the reference genome.</li>
	</ol>
	</li>
</ol>
6. Generate a de novo assembly (Alternative to Step 5)
<ol>
	<li>Combine the RNA-seq READ1 and READ2 FASTQ files for all samples by typing cat *READ1.FASTQ &#62; $all_READ1.FASTQ and cat *READ2.FASTQ &#62; $all_READ2.FASTQ in the terminal window.</li>
	<li>Install or load Trinity<sup>22</sup> v. 2.8.5 on the computing cluster.</li>
	<li>Generate an assembly by typing in the terminal: Trinity --seqType fq --max_memory 20G --left $all_READ1.FASTQ --right $all_READ2.FASTQ.</li>
</ol>
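The concatenation in step 6.1 only works if the READ1 and READ2 files are joined in the same sample order, because Trinity pairs reads by their position in the two files. The following is a minimal Python sketch of that check, not part of the published protocol. It assumes a hypothetical naming scheme of `<sample>_READ1.FASTQ` and `<sample>_READ2.FASTQ`, and the function names are illustrative.

```python
import shutil
from pathlib import Path

# Hypothetical naming scheme: <sample>_READ1.FASTQ and <sample>_READ2.FASTQ
LEFT_SUFFIX = "_READ1.FASTQ"
RIGHT_SUFFIX = "_READ2.FASTQ"


def list_samples(read_dir, suffix, skip):
    """Return sorted sample names for files ending in `suffix`, ignoring the file named `skip`."""
    names = []
    for path in Path(read_dir).iterdir():
        if path.name.endswith(suffix) and path.name != skip:
            names.append(path.name[: -len(suffix)])
    return sorted(names)


def combine_reads(read_dir, prefix="all"):
    """Concatenate READ1 files and READ2 files in the same sample order.

    Raises ValueError if the two sets of files describe different samples.
    Returns the sample names in the order they were written.
    """
    read_dir = Path(read_dir)
    left_out = prefix + LEFT_SUFFIX
    right_out = prefix + RIGHT_SUFFIX
    # Output files from an earlier run are skipped so they are not re-read.
    left = list_samples(read_dir, LEFT_SUFFIX, left_out)
    right = list_samples(read_dir, RIGHT_SUFFIX, right_out)
    if left != right:
        raise ValueError("READ1 and READ2 files do not describe the same samples")
    for suffix, out_name in ((LEFT_SUFFIX, left_out), (RIGHT_SUFFIX, right_out)):
        with open(read_dir / out_name, "wb") as out:
            for sample in left:
                with open(read_dir / (sample + suffix), "rb") as handle:
                    shutil.copyfileobj(handle, out)
    return left
```

Calling `combine_reads(".")` in the read directory would write `all_READ1.FASTQ` and `all_READ2.FASTQ`, which can then be passed to Trinity with `--left` and `--right`.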
7. Map reads to the genome (7.1) or de novo transcriptome (7.2)
<ol>
	<li>Map reads to the reference genome using STAR<sup>23</sup> v. 2.6.0c and RSEM<sup>24</sup> v. 1.3.0.
	<ol>
		<li>Install or load STAR v. 2.6.0c and RSEM v. 1.3.0 on the computing cluster.</li>
		<li>Index the genome by typing rsem-prepare-reference --gtf $GENOME.GTF --star -p 16 $GENOME.FASTA $OUTPUT.</li>
		<li>Map reads and calculate expression for each sample by typing rsem-calculate-expression -p 16 --star --paired-end $READ1.FASTQ $READ2.FASTQ $INDEX $OUTPUT.</li>
		<li>Rename the results file to something descriptive using mv RSEM.genes.results $sample.genes.results.</li>
		<li>Generate a matrix of all counts by typing rsem-generate-data-matrix *.genes.results &#62; $OUTPUT (use *.isoforms.results instead for isoform-level counts).</li>
	</ol>
	</li>
	<li>Map RNA-seq reads to the Trinity de novo assembly using RSEM and Bowtie.
	<ol>
		<li>Install or load Trinity<sup>22</sup> v. 2.8.5, Bowtie<sup>25</sup> v. 1.0.0, and RSEM v. 1.3.0.</li>
		<li>Map reads and calculate expression for each sample by typing [trinity_location]/align_and_estimate_abundance.pl --prep-reference --transcripts $TRINITY.FASTA --seqType fq --left $READ1.FASTQ --right $READ2.FASTQ --est_method RSEM --aln_method bowtie --trinity_mode --output_dir $OUTPUT.</li>
		<li>Rename the results file to something descriptive using mv RSEM.genes.results $sample.genes.results.</li>
		<li>Generate a matrix of all counts by typing [trinity_location]/abundance_estimates_to_matrix.pl --est_method RSEM *.genes.results (use *.isoforms.results instead for isoform-level counts).</li>
	</ol>
	</li>
</ol>
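Step 7.1.3 has to be repeated for every sample, and it is easy to overwrite one sample's results with another's. The sketch below, which is an illustration rather than the authors' workflow, builds one rsem-calculate-expression command line per sample using only the options shown in step 7.1.3. The `<sample>_READ1.FASTQ` naming scheme, the function names, and the example sample and index names are assumptions.

```python
import shlex


def rsem_command(sample, index, threads=16):
    """Build the rsem-calculate-expression call from step 7.1.3 for one sample.

    Assumes (hypothetically) that reads are named <sample>_READ1.FASTQ and
    <sample>_READ2.FASTQ. The sample name is also used as the RSEM output name.
    """
    return [
        "rsem-calculate-expression", "-p", str(threads), "--star", "--paired-end",
        f"{sample}_READ1.FASTQ", f"{sample}_READ2.FASTQ", index, sample,
    ]


def mapping_script(samples, index):
    """Return one shell-quoted command line per sample, ready to paste or save."""
    return [shlex.join(rsem_command(sample, index)) for sample in samples]


# Example with made-up sample and index names; nothing is executed here.
for line in mapping_script(["head", "foot"], "hydra_index"):
    print(line)
```

Because RSEM names its result files after the final argument, using the sample name there should give each sample its own .genes.results file.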
8. Identify genes of interest
NOTE: The following steps can be done with nucleotide or protein FASTA files but work best and are more straightforward with protein sequences. Protein-to-protein BLAST searches are more likely to return hits when searching between different species.
<ol>
	<li>For a reference genome, use the protein FASTA file from STEP 5.2.2 or see Supplemental Materials to generate a custom gene feature GTF.</li>
	<li>For a de novo transcriptome, generate a protein FASTA using TransDecoder.
	<ol>
		<li>Install or load TransDecoder v. 5.5.0 on the computer cluster.</li>
		<li>Find the longest open reading frame and predicted peptide sequence by typing [TransDecoder location]/TransDecoder.LongOrfs -t $TRINITY.FASTA.</li>
	</ol>
	</li>
	<li>Search NCBI GenBank for homologs in closely related species.
	<ol>
		<li>Open an internet browser window and go to https://www.ncbi.nlm.nih.gov/genbank/.</li>
		<li>In the search bar, type the name of the gene of interest together with the name of a closely related species that has been sequenced, or of its genus or phylum. To the left of the search bar, select Protein, then click Search.</li>
		<li>Extract sequences by clicking Send to and then select File. Under Format, select FASTA then click Create File.</li>
		<li>Move the FASTA file of homologs to the computer cluster by typing scp $FASTA username@clusterlocation:/$DIR in a local terminal window, or use FileZilla to transfer files between the local computer and the cluster.</li>
	</ol>
	</li>
	<li>Search for candidate genes using BLAST+<sup>26</sup>.
	<ol>
		<li>Install or load BLAST+ v. 2.8.1 on the computer cluster.</li>
		<li>On the computer cluster, make a BLAST database from the genome or transcriptome translated protein FASTA by typing [BLAST+ location]/makeblastdb -in $PEP.FASTA -dbtype prot -out $OUTPUT.</li>
		<li>BLAST the homologous gene sequences from NCBI to the database of the species of interest by typing [BLAST+ location]/blastp -db $DATABASE -query $FASTA -evalue 1e-10 -outfmt 6 -max_target_seqs 1 -out $OUTPUT.</li>
		<li>View the output file using the command more. Copy unique gene IDs from the species of interest to a new text file.</li>
		<li>Extract the sequences of candidate genes by typing perl -ne &#39;if(/^&#62;(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV&#39; $gene_id.txt $PEP.FASTA &#62; $OUTPUT.</li>
	</ol>
	</li>
	<li>Confirm gene annotation using reciprocal BLAST.
	<ol>
		<li>On the internet browser go to https://blast.ncbi.nlm.nih.gov/Blast.cgi.</li>
		<li>Select blastp (the candidate sequences are proteins), paste the candidate sequences, select the Non-redundant protein sequences database, and click BLAST.</li>
	</ol>
	</li>
	<li>Identify additional genes by annotating all genes in the genome or transcriptome with gene ontology (GO) terms (see discussion).
	<ol>
		<li>Transfer the protein FASTA to the local computer.</li>
		<li>Download and install Blast2GO<sup>27,28,29</sup> v. 5.2 on the local computer.</li>
		<li>Open Blast2GO, click File, go to Load, go to Load Sequences, click Load Fasta File (fasta). Select the FASTA file and click Load.</li>
		<li>Click on Blast, choose NCBI Blast, and click Next. Edit the parameters if needed and click Next, edit the parameters on the following screen if needed, and click Run to find the most similar gene description.</li>
		<li>Click mapping then click Run to search Gene Ontology annotations for similar proteins.</li>
		<li>Next click interpro, select EMBL-EBI InterPro, and click Next. Edit parameters or click Next, and click Run to search for signatures of known gene families and domains.</li>
		<li>Export the annotations by clicking File, select Export, click Export Table. Click Browse, name the file, click Save, click Export.</li>
		<li>Search the annotation table for GO terms of interest to identify additional candidate genes. Extract the sequences from the FASTA file (STEP 8.4.5).</li>
	</ol>
	</li>
</ol>
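Steps 8.4.4 and 8.4.5 (copying unique gene IDs from the BLAST output and extracting their sequences) can also be scripted. The sketch below is a Python equivalent written for illustration and is not the Perl one-liner used in the protocol. It relies on the fact that in BLAST tabular output (-outfmt 6) the second column is the subject ID, which here is the gene ID from the species of interest. The function names are hypothetical.

```python
def unique_hit_ids(blast_lines):
    """Return subject IDs (column 2 of -outfmt 6) in first-seen order, without duplicates."""
    seen = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            seen.setdefault(fields[1], None)
    return list(seen)


def extract_fasta(fasta_lines, wanted_ids):
    """Return the FASTA lines (headers and sequence) of records whose ID is wanted.

    The record ID is the first whitespace-delimited token after '>', which is
    the same rule the Perl one-liner in step 8.4.5 applies.
    """
    wanted = set(wanted_ids)
    keep = False
    kept_lines = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted
        if keep:
            kept_lines.append(line.rstrip("\n"))
    return kept_lines
```

With real files, the two functions could be chained as `extract_fasta(open(pep_fasta), unique_hit_ids(open(blast_output)))`, where `pep_fasta` and `blast_output` stand for the paths used in steps 8.4.2 and 8.4.3.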
9. Phylogenetic trees
<ol>
	<li>Download and install MEGA<sup>30</sup> v. 7.0.26 on your local computer.</li>
	<li>Open MEGA, click on Align, click Edit/Build Alignment, select Create a new alignment, click OK, and select Protein.</li>
	<li>When the alignment window opens, click on Edit, click Insert sequences from file and select the FASTA with protein sequences of candidate genes and probable homologs.</li>
	<li>Select all sequences. Find the arm symbol and hover over it. It should say Align sequences using MUSCLE<sup>31</sup> algorithm. Click on the arm symbol and then click Align Protein to align the sequences. Edit parameters or click OK to align using default parameters.</li>
	<li>Visually inspect the alignment and make any manual changes, then save and close the alignment window.</li>
	<li>In the main MEGA window, click on Models, click Find Best DNA/Protein models (ML), select the alignment file and select corresponding parameters such as: Analysis: Model Selection (ML), Tree to use: Automatic (neighbor-joining tree), Statistical Method: Maximum Likelihood, Substitution Type: Amino Acid, Gap/missing data treatment: Use all sites, Branch site filter: None.</li>
	<li>Once the best model for the data is determined, go to the main MEGA window. Click Phylogeny and click Construct/Test Maximum Likelihood Tree and then select the alignment, if necessary. Select the appropriate parameters for the tree: Statistical method: Maximum Likelihood, Test of Phylogeny: Bootstrap method with 100 replicates, substitution type: amino acid, model: LG with Freqs. (+F), rates among sites: gamma distributed (G) with 5 discrete gamma categories, gap/missing data treatment: use all sites, ML heuristic method: Nearest-Neighbor-Interchange (NNI).</li>
</ol>
10. Visualize gene expression using TPM
<ol>
	<li>For Trinity, on the computer cluster, go to the directory where abundance_estimates_to_matrix.pl was run; one of the output files should be matrix.TPM.not_cross_norm. Transfer this file to your local computer. 
	NOTE: See Supplemental Materials for cross sample normalization.</li>
	<li>For TPMs from a genome analysis follow the steps below.
	<ol>
		<li>On the computer cluster, go to the RSEM installation location. Copy rsem-generate-data-matrix by typing cp rsem-generate-data-matrix rsem-generate-TPM-matrix. Use nano to edit the new file and change &#8220;my $offsite = 4&#8221; to &#8220;my $offsite = 5&#8221; so that the script reads the TPM column instead of the expected-count column.</li>
	</ol>
	</li>
	<li>Go to the directory that contains the RSEM .genes.results output files and type rsem-generate-TPM-matrix *.genes.results &#62; $OUTPUT to generate a TPM matrix (use *.isoforms.results instead for isoform-level values). Transfer the results to a local computer.</li>
	<li>Visualize the results in ggplot2.
	<ol>
		<li>Download R v. 4.0.0 and RStudio v. 1.2.1335 to a local computer.</li>
		<li>Open RStudio. On the right of the screen, go to the Packages tab and click Install. Type ggplot2 and click Install.</li>
		<li>On the R script window read in the TPM table by typing data&#60;-read.table(&#34;$tpm.txt&#34;,header = T)</li>
		<li>For bar graphs similar to Figure 4, type something similar to: p&#60;- ggplot() + geom_bar(aes(y=TPM, x=Symbol, fill=Tissue), data=data, stat=&#34;identity&#34;) 
		fill&#60;-c(&#34;#d7191c&#34;,&#34;#fdae61&#34;, &#34;#ffffbf&#34;, &#34;#abd9e9&#34;, &#34;#2c7bb6&#34;) 
		p&#60;-p+scale_fill_manual(values=fill) 
		p + theme(axis.text.x = element_text(angle = 90))</li>
	</ol>
	</li>
</ol>
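The ggplot2 call in step 10.4.4 expects a table with one row per gene and tissue and the columns Symbol, Tissue, and TPM. One way to build such a table directly from the RSEM .genes.results files is sketched below in Python. This is an illustration under stated assumptions, not the authors' procedure: it assumes each file is named `<tissue>.genes.results`, that the RSEM gene_id is used as the Symbol, and that a gene missing from a file is reported as 0.0. The function names are hypothetical.

```python
import csv
from pathlib import Path

RESULTS_SUFFIX = ".genes.results"


def read_tpm(path):
    """Map gene_id to TPM for one RSEM .genes.results file (tab-separated, with header)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["gene_id"]: float(row["TPM"]) for row in reader}


def long_tpm_rows(result_paths, genes_of_interest):
    """Return (Symbol, Tissue, TPM) tuples for the chosen genes, one per gene and file."""
    rows = []
    for path in sorted(Path(p) for p in result_paths):
        name = path.name
        # Assumption: the part of the file name before .genes.results is the tissue.
        tissue = name[: -len(RESULTS_SUFFIX)] if name.endswith(RESULTS_SUFFIX) else name
        tpm = read_tpm(path)
        for gene in genes_of_interest:
            # Assumption: a gene absent from a file is reported as 0.0.
            rows.append((gene, tissue, tpm.get(gene, 0.0)))
    return rows


def write_long_table(rows, out_path):
    """Write the rows as a tab-separated table with a Symbol, Tissue, TPM header."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["Symbol", "Tissue", "TPM"])
        writer.writerows(rows)
```

The resulting file can be read with read.table(..., header = T) as in step 10.4.3.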

The authors have nothing to disclose.

The purpose of this protocol is to provide an outline of the steps for characterizing a gene family using RNA-seq data. These methods have been shown to work for a variety of species and datasets<sup>4,34,35</sup>. The pipeline established here has been simplified and should be easy enough to be followed by a novice in bioinformatics. The significance of the protocol is that it outlines all the steps and necessary programs to complete a p...

Advances in sequencing technologies have facilitated the sequencing of genomes and transcriptomes of non-model organisms. In addition to the increased feasibility of sequencing DNA and RNA from many organisms, an abundance of data is publicly available to study genes of interest. The purpose of this protocol is to provide bioinformatic steps to investigate the molecular evolution and expression of genes that may play an important role in the organism of interest.
Investigating the evolution of a gene or gene family can provide insight into the evolution of biological systems. Members of a gene family are typically determined by identifying conserved motifs or homologous gene sequences. Gene family evolution was previously investigated using genomes from distantly related model organisms<sup>1</sup>. A limitation to this approach is that it is not clear how these gene families evolve in closely related species or what role different environmental selective pressures play. In this protocol, we include a search for homologs in closely related species. By generating a phylogeny at a phylum level, we can note trends in gene family evolution such as that of conserved genes or lineage-specific duplications. At this level, we can also investigate whether genes are orthologs or paralogs. While many homologs likely function similarly to each other, that is not necessarily the case<sup>2</sup>. Incorporating phylogenetic trees in these studies is important to resolve whether these homologous genes are orthologs or not. In eukaryotes, many orthologs retain similar functions within the cell, as evidenced by the ability of mammalian proteins to restore the function of yeast orthologs<sup>3</sup>. However, there are instances where a non-orthologous gene carries out a characterized function<sup>4</sup>.
Phylogenetic trees begin to delineate relationships between genes and species, yet function cannot be assigned solely based on genetic relationships. Gene expression studies combined with functional annotations and enrichment analysis provide strong support for gene function. Cases where gene expression can be quantified and compared across individuals or tissue types can be more telling of potential function. The following protocol follows methods used in investigating opsin genes in Hydra vulgaris<sup>7</sup>, but they can be applied to any species and any gene family. The results of such studies provide a foundation for further investigation into gene function and gene networks in non-model organisms. As an example, the investigation of the phylogeny of opsins, which are proteins that initiate the phototransduction cascade, gives context to the evolution of eyes and light detection<sup>8,9,10,11</sup>. In this case, non-model organisms, especially basal animal species such as cnidarians or ctenophores, can elucidate conservation or changes in the phototransduction cascade and vision across clades<sup>12,13,14</sup>. Similarly, determining the phylogeny, expression, and networks of other gene families will inform us about the molecular mechanisms underlying adaptations.

<table>
	<tbody>
		<tr>
			<th>Name</th>
			<th>Company</th>
			<th>Catalog Number</th>
			<th>Comments</th>
		</tr>
		<tr>
			<td>Bioanalyzer-DNA kit</td>
			<td>Agilent</td>
			<td>5067-4626</td>
			<td>wet lab materials</td>
		</tr>
		<tr>
			<td>Bioanalyzer-RNA kit</td>
			<td>Agilent</td>
			<td>5067-1513</td>
			<td>wet lab materials</td>
		</tr>
		<tr>
			<td>BLAST+ v. 2.8.1</td>
			<td></td>
			<td></td>
			<td>On computer cluster* 
			https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/</td>
		</tr>
		<tr>
			<td>Blast2GO (on your PC)</td>
			<td></td>
			<td></td>
			<td>On local computer 
			https://www.blast2go.com/b2g-register-basic</td>
		</tr>
		<tr>
			<td>boost v. 1.57.0</td>
			<td></td>
			<td></td>
			<td>On computer cluster</td>
		</tr>
		<tr>
			<td>Bowtie v. 1.0.0</td>
			<td></td>
			<td></td>
			<td>On computer cluster 
			https://sourceforge.net/projects/bowtie-bio/files/bowtie/1.3.0/</td>
		</tr>
		<tr>
			<td>Computing cluster (highly recommended)</td>
			<td></td>
			<td></td>
			<td>NOTE: Analyses of genomic data are best done on a high-performance computing cluster because files are very large.</td>
		</tr>
		<tr>
			<td>Cufflinks v. 2.2.1</td>
			<td></td>
			<td></td>
			<td>On computer cluster</td>
		</tr>
		<tr>
			<td>edgeR v. 3.26.8 (in R)</td>
			<td></td>
			<td></td>
			<td>In Rstudio 
			https://bioconductor.org/packages/release/bioc/html/edgeR.html</td>
		</tr>
		<tr>
			<td>gcc v. 6.4.0</td>
			<td></td>
			<td></td>
			<td>On computer cluster</td>
		</tr>
		<tr>
			<td>Java v. 11.0.2</td>
			<td></td>
			<td></td>
			<td>On computer cluster</td>
		</tr>
		<tr>
			<td>MEGA7 (on your PC)</td>
			<td></td>
			<td></td>
			<td>On local computer 
			https://www.megasoftware.net</td>
		</tr>
		<tr>
			<td>MEGAX v. 0.1</td>
			<td></td>
			<td></td>
			<td>On local computer 
			https://www.megasoftware.net</td>
		</tr>
		<tr>
			<td>NucleoSpin RNA II kit</td>
			<td>Macherey-Nagel</td>
			<td>740955.5</td>
			<td>wet lab materials</td>
		</tr>
		<tr>
			<td>perl 5.30.3</td>
			<td></td>
			<td></td>
			<td>On computer cluster</td>
		</tr>
		<tr>
			<td>python</td>
			<td></td>
			<td></td>
			<td>On computer cluster</td>
		</tr>
		<tr>
			<td>Qubit 2.0 Fluorometer</td>
			<td>ThermoFisher</td>
			<td>Q32866</td>
			<td>wet lab materials</td>
		</tr>
		<tr>
			<td>R v.4.0.0</td>
			<td></td>
			<td></td>
			<td>On computer cluster 
			https://cran.r-project.org/src/base/R-4/</td>
		</tr>
		<tr>
			<td>RNAlater</td>
			<td>ThermoFisher</td>
			<td>AM7021</td>
			<td>wet lab materials</td>
		</tr>
		<tr>
			<td>RNeasy kit</td>
			<td>Qiagen</td>
			<td>74104</td>
			<td>wet lab materials</td>
		</tr>
		<tr>
			<td>RSEM v. 1.3.0</td>
			<td></td>
			<td></td>
			<td>Computer software 
			https://deweylab.github.io/RSEM/</td>
		</tr>
		<tr>
			<td>RStudio v. 1.2.1335</td>
			<td></td>
			<td></td>
			<td>On local computer 
			https://rstudio.com/products/rstudio/download/#download</td>
		</tr>
		<tr>
			<td>Samtools v. 1.3</td>
			<td></td>
			<td></td>
			<td>Computer software</td>
		</tr>
		<tr>
			<td>SRA Toolkit v. 2.8.1</td>
			<td></td>
			<td></td>
			<td>On computer cluster 
			https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit</td>
		</tr>
		<tr>
			<td>STAR v. 2.6.0c</td>
			<td></td>
			<td></td>
			<td>On computer cluster 
			https://github.com/alexdobin/STAR</td>
		</tr>
		<tr>
			<td>StringTie v. 1.3.4d</td>
			<td></td>
			<td></td>
			<td>On computer cluster 
			https://ccb.jhu.edu/software/stringtie/</td>
		</tr>
		<tr>
			<td>Transdecoder v. 5.5.0</td>
			<td></td>
			<td></td>
			<td>On computer cluster 
			https://github.com/TransDecoder/TransDecoder/releases</td>
		</tr>
		<tr>
			<td>Trimmomatic v. 0.35</td>
			<td></td>
			<td></td>
			<td>On computer cluster 
			http://www.usadellab.org/cms/?page=trimmomatic</td>
		</tr>
		<tr>
			<td>Trinity v.2.8.5</td>
			<td></td>
			<td></td>
			<td>On computer cluster 
			https://github.com/trinityrnaseq/trinityrnaseq/releases</td>
		</tr>
		<tr>
			<td>TRIzol</td>
			<td>ThermoFisher</td>
			<td>15596018</td>
			<td>wet lab materials</td>
		</tr>
		<tr>
			<td>TruSeq RNA Library Prep Kit v2</td>
			<td>Illumina</td>
			<td>RS-122-2001</td>
			<td>wet lab materials</td>
		</tr>
		<tr>
			<td>TURBO DNA-free Kit</td>
			<td>ThermoFisher</td>
			<td>AM1907</td>
			<td>wet lab materials</td>
		</tr>
		<tr>
			<td>*Downloads and installation on the computer cluster may require root access. Contact your network administrator.</td>
			<td></td>
			<td></td>
			<td></td>
		</tr>
	</tbody>
</table>

A Bioinformatics Pipeline for Investigating Molecular Evolution and Gene Expression using RNA-seq

Distilling and reporting large datasets, such as whole genome or transcriptome data, is often a daunting task. One way to break down results is to focus on one or more gene families that are significant to the organism and study. In this protocol, we outline bioinformatic steps to generate a phylogeny and to quantify the expression of genes of interest. Phylogenetic trees can give insight into how genes are evolving within and between species as well as reveal orthology. These results can be enhanced using RNA-seq data to compare the expression of these genes in different individuals or tissues. Studies of molecular evolution and expression can reveal modes of evolution and conservation of gene function between species. The characterization of a gene family can serve as a springboard for future studies and can highlight an important gene family in a new genome or transcriptome paper.

The methods above are summarized in Figure 1 and were applied to a data set of Hydra vulgaris tissues. H. vulgaris is a fresh-water invertebrate that belongs to the phylum Cnidaria which also includes corals, jellyfish, and sea anemones. H. vulgaris can reproduce asexually by budding and they can regenerate their head and foot when bisected. In this study, we aimed to investigate the evolution and expression of opsin genes in Hydra<sup class="xref...

Watch this Scientific Journal Video about A Bioinformatics Pipeline for Investigating Molecular Evolution and Gene Expression using RNA-seq at JoVE.com

The purpose of this protocol is to investigate the evolution and expression of candidate genes using RNA sequencing data.


University of California, Irvine.

Video: A Bioinformatics Pipeline for Investigating Molecular Evolution and Gene Expression using RNA-seq

Molecular Evolution of the Tre Recombinase

Here we report the generation of Tre recombinase through directed, molecular evolution. Tre recombinase recognizes a pre-defined target sequence within the LTR sequences of the HIV-1 provirus, resulting in the excision and eradication of the provirus from infected human cells. 
We started with Cre, a 38-kDa recombinase, that recognizes a 34-bp double-stranded DNA sequence known as loxP. Because Cre can effectively eliminate genomic sequences, we set out to tailor a recombinase that could remove the sequence between the 5'-LTR and 3'-LTR of an integrated HIV-1 provirus. As a first step we identified sequences within the LTR sites that were similar to loxP and tested for recombination activity. Initially Cre and mutagenized Cre libraries failed to recombine the chosen loxLTR sites of the HIV-1 provirus. As the start of any directed molecular evolution process requires at least residual activity, the original asymmetric loxLTR sequences were split into subsets and tested again for recombination activity. Acting as intermediates, recombination activity was shown with the subsets. Next, recombinase libraries were enriched through reiterative evolution cycles. Subsequently, enriched libraries were shuffled and recombined. The combination of different mutations proved synergistic and recombinases were created that were able to recombine loxLTR1 and loxLTR2. This was evidence that an evolutionary strategy through intermediates can be successful. After a total of 126 evolution cycles individual recombinases were functionally and structurally analyzed. The most active recombinase -- Tre -- had 19 amino acid changes as compared to Cre. Tre recombinase was able to excise the HIV-1 provirus from the genome HIV-1 infected HeLa cells (see "HIV-1 Proviral DNA Excision Using an Evolved Recombinase", Hauber J., Heinrich-Pette-Institute for Experimental Virology and Immunology, Hamburg, Germany). While still in its infancy, directed molecular evolution will allow the creation of custom enzymes that will serve as tools of "molecular surgery" and molecular medicine.

Here we report the generation of Tre recombinase through directed, molecular evolution. Tre recombinase recognizes a pre-defined target sequence within the LTR sequences of the HIV-1 provirus, resulting in the excision and eradication of the provirus from infected human cells. While still in its infancy, directed molecular evolution will allow the creation of custom enzymes that will serve as tools of molecular surgery and molecular medicine.

Here we report the generation of Tre recombinase through directed, molecular evolution. Tre recombinase recognizes a pre-defined target sequence within ...

The Preparation of Primary Hematopoietic Cell Cultures From Murine Bone Marrow for Electroporation

It is becoming increasingly apparent that electroporation is the most effective way to introduce plasmid DNA or siRNA into primary cells. The Gene Pulser MXcell  electroporation system and Gene Pulser  electroporation buffer were specifically developed to transfect nucleic acids into mammalian cells and difficult-to-transfect cells, such as primary and stem cells.This video demonstrates how to establish primary hematopoietic cell cultures from murine bone marrow, and then prepare them for electroporation in the MXcell system. We begin by isolating femur and tibia. Bone marrow from both femur and tibia are then harvested and cultures are established. Cultured bone marrow cells are then transfected and analyzed.

This procedure describes how to establish primary hematopoietic cell cultures from murine bone marrow and is followed by transfection using the Gene Pulser MXCell electroporation system.

It is becoming increasingly apparent that electroporation is the most effective way to introduce plasmid DNA or siRNA into primary cells. The Gene Pulser ...

Using the Gene Pulser MXcell Electroporation System to Transfect Primary Cells with High Efficiency

It is becoming increasingly apparent that electroporation is the most effective way to introduce plasmid DNA or siRNA into primary cells. The Gene Pulser MXcell electroporation system and Gene Pulser electroporation buffer (Bio-Rad) were specifically developed to easily transfect nucleic acids into mammalian cells and difficult-to-transfect cells, such as primary and stem cells. We will demonstrate how to perform a simple experiment to quickly identify the best electroporation conditions. We will demonstrate how to run several samples through a range of electroporation conditions so that an experiment can be conducted at the same time as optimization is performed. We will also show how optimal conditions identified using 96-well electroporation plates can be used with standard electroporation cuvettes, facilitating the switch from electroporation plates to electroporation cuvettes while maintaining the same electroporation efficiency. In the video, we will also discuss some of the key factors that can lead to the success or failure of electroporation experiments.

This procedure shows how to use the Gene Pulser MXcell electroporation system to rapidly and easily identify the best electroporation conditions for mouse embryonic fibroblasts (MEFs) or other primary cells. Considerations for troubleshooting are also discussed in the associated video.

Using an Automated Cell Counter to Simplify Gene Expression Studies: siRNA Knockdown of IL-4 Dependent Gene Expression in Namalwa Cells

The use of siRNA mediated gene knockdown is continuing to be an important tool in studies of gene expression.  siRNA studies are being conducted not only to study the effects of downregulating single genes, but also to interrogate signaling pathways and other complex interaction networks.  These pathway analyses require both the use of relevant cellular models and methods that cause less perturbation to the cellular physiology.  Electroporation is increasingly being used as an effective way to introduce siRNA and other nucleic acids into difficult to transfect cell lines and primary cells without altering the signaling pathway under investigation. There are multiple critical steps to a successful siRNA experiment, and there are ways to simplify the work while improving the data quality at several experimental stages.  To help you get started with your siRNA mediated gene knockdown project, we will demonstrate how to perform a pathway study complete from collecting and counting the cells prior to electroporation through post transfection real-time PCR gene expression analysis.  The following study investigates the role of the transcriptional activator STAT6 in IL-4 dependent gene expression of CCL17 in a Burkitt lymphoma cell line (Namalwa).  The techniques demonstrated are useful for a wide range of siRNA-based experiments on both adherent and suspension cells.  We will also show how to streamline cell counting with the TC10 automated cell counter, how to electroporate multiple samples simultaneously using the MXcell electroporation system, and how to simultaneously assess RNA quality and quantity with the Experion automated electrophoresis system.

This procedure describes a quick and easy workflow to introduce siRNA into difficult to transfect cell lines and follow gene expression by real-time PCR. Use of an automated cell counter, multi-well electroporation plate, and automated electrophoresis station provide quick and reliable results without the need for expensive robotic handling.

Paraffin-Embedded and Frozen Sections of Drosophila Adult Muscles

The molecular characterization of muscular dystrophies and myopathies in humans has revealed the complexity of muscle disease, and genetic analysis of muscle specification, formation and function in model systems has provided valuable insight into muscle physiology. Therefore, identifying and characterizing molecular mechanisms that underlie muscle damage is critical. The structure of adult Drosophila multi-fiber muscles resembles that of vertebrate striated muscles 1, and the genetic tractability of Drosophila has made it a great system to analyze dystrophic muscle morphology and characterize the processes affecting muscular function in ageing adult flies 2. Here we present the histological technique for preparing paraffin-embedded and frozen sections of Drosophila thoracic muscles. These preparations allow the tissue to be stained with classical histological stains and labeled with protein-detecting dyes; cryosections in particular are ideal for immunohistochemical detection of proteins in intact muscles. This allows for analysis of muscle tissue structure, identification of morphological defects, and detection of the expression pattern of muscle/neuron-specific proteins in Drosophila adult muscles. These techniques can also be slightly modified for sectioning of other body parts.

Identification of mechanisms underlying muscle damage is crucial. Here we present the histological technique for preparing paraffin-embedded and frozen sections of Drosophila thoracic muscles. This allows analysis of muscle morphology and localization of protein and other muscle cell components.

In vitro Reconstitution of the Active T. castaneum Telomerase

Efforts to isolate the catalytic subunit of telomerase, TERT, in sufficient quantities for structural studies, have been met with limited success for more than a decade. Here, we present methods for the isolation of the recombinant Tribolium castaneum TERT (TcTERT) and the reconstitution of the active T. castaneum telomerase ribonucleoprotein (RNP) complex in vitro.
Telomerase is a specialized reverse transcriptase1 that adds short DNA repeats, called telomeres, to the 3' end of linear chromosomes2 that serve to protect them from end-to-end fusion and degradation. Following DNA replication, a short segment is lost at the end of the chromosome3 and without telomerase, cells continue dividing until eventually reaching their Hayflick Limit4. Additionally, telomerase is dormant in most somatic cells5 in adults, but is active in cancer cells6 where it promotes cell immortality7.
The minimal telomerase enzyme consists of two core components: the protein subunit (TERT), which comprises the catalytic subunit of the enzyme, and an integral RNA component (TER), which contains the template TERT uses to synthesize telomeres8,9. Prior to 2008, only structures for individual telomerase domains had been solved10,11. A major breakthrough in this field came from the determination of the crystal structure of the active12, catalytic subunit of T. castaneum telomerase, TcTERT1.
Here, we present methods for producing large quantities of the active, soluble TcTERT for structural and biochemical studies, and the reconstitution of the telomerase RNP complex in vitro for telomerase activity assays. An overview of the experimental methods used is shown in Figure 1.

A Protocol for Computer-Based Protein Structure and Function Prediction

Genome sequencing projects have deciphered millions of protein sequences, which require knowledge of their structure and function to improve the understanding of their biological role. Although experimental methods can provide detailed information for a small fraction of these proteins, computational modeling is needed for the majority of protein molecules, which are experimentally uncharacterized. The I-TASSER server is an on-line workbench for high-resolution modeling of protein structure and function. Given a protein sequence, a typical output from the I-TASSER server includes secondary structure prediction, predicted solvent accessibility of each residue, homologous template proteins detected by threading and structure alignments, up to five full-length tertiary structural models, and structure-based functional annotations for enzyme classification, Gene Ontology terms and protein-ligand binding sites. All the predictions are tagged with a confidence score that estimates how accurate the predictions are in the absence of experimental data. To accommodate the special requests of end users, the server provides channels to accept user-specified inter-residue distance and contact maps to interactively change the I-TASSER modeling; it also allows users to specify any protein as a template, or to exclude any template proteins during the structure assembly simulations. The structural information can be collected by the users based on experimental evidence or biological insights with the purpose of improving the quality of I-TASSER predictions. The server was ranked as the best program for protein structure and function prediction in the recent community-wide CASP experiments. There are currently more than 20,000 registered scientists from over 100 countries who are using the on-line I-TASSER server.

Guidelines for computer-based structural and functional characterization of proteins using the I-TASSER pipeline are described. Starting from a query protein sequence, 3D models are generated using multiple threading alignments and iterative structural assembly simulations. Functional inferences are thereafter drawn based on matches to proteins with known structure and function.

Analysis of Global RNA Synthesis at the Single Cell Level following Hypoxia

Hypoxia, or lowered oxygen availability, is involved in many physiological and pathological processes. At the molecular level, cells initiate a particular transcriptional program in order to mount an appropriate and coordinated cellular response. The cell possesses several oxygen sensor enzymes that require molecular oxygen as a cofactor for their activity. These range from prolyl-hydroxylases to histone demethylases. The majority of studies analyzing cellular responses to hypoxia are based on cell populations and averaged measurements, and as such single cell analysis of hypoxic cells is seldom performed. Here we describe a method for analyzing global RNA synthesis at the single cell level in hypoxia by using Click-iT RNA imaging kits in an oxygen controlled workstation, followed by microscopy analysis and quantification. In cancer cells exposed to hypoxia for different lengths of time, RNA is labeled and measured in each cell. This analysis allows the visualization of temporal and cell-to-cell changes in global RNA synthesis following hypoxic stress.

We describe a technique for analysis of global RNA synthesis in hypoxia using imaging. Click-chemistry labeling of RNA has not previously been performed under hypoxia and allows visualization of global RNA changes at the single cell level. This approach complements the existing averaged RNA techniques, allowing direct visualization of cell-to-cell changes in global RNA synthesis.

Measurement of Metabolic Rate in Drosophila using Respirometry

Metabolic disorders are a frequent problem affecting human health. Therefore, understanding the mechanisms that regulate metabolism is a crucial scientific task. Many disease-causing genes in humans have a fly homologue, making Drosophila a good model to study signaling pathways involved in the development of different disorders. Additionally, the tractability of Drosophila simplifies genetic screens to aid in identifying novel therapeutic targets that may regulate metabolism. In order to perform such a screen, a simple and fast method to identify changes in the metabolic state of flies is necessary. In general, carbon dioxide production is a good indicator of substrate oxidation and energy expenditure, providing information about metabolic state. In this protocol we introduce a simple method to measure CO2 output from flies. This technique can potentially aid in the identification of genetic perturbations affecting metabolic rate.

Metabolic disorders are among the most common diseases in humans. The genetically tractable model organism D. melanogaster can be used to identify novel genes that regulate metabolism. This paper describes a relatively simple method for studying the metabolic rate in flies by measuring their CO2 production.

An Experimental and Bioinformatics Protocol for RNA-seq Analyses of Photoperiodic Diapause in the Asian Tiger Mosquito, Aedes albopictus

Photoperiodic diapause is an important adaptation that allows individuals to escape harsh seasonal environments via a series of physiological changes, most notably developmental arrest and reduced metabolism. Global gene expression profiling via RNA-Seq can provide important insights into the transcriptional mechanisms of photoperiodic diapause. The Asian tiger mosquito, Aedes albopictus, is an outstanding organism for studying the transcriptional bases of diapause due to its ease of rearing, easily induced diapause, and the genomic resources available. This manuscript presents a general experimental workflow for identifying diapause-induced transcriptional differences in A. albopictus. Rearing techniques, conditions necessary to induce diapause and non-diapause development, methods to estimate percent diapause in a population, and RNA extraction and integrity assessment for mosquitoes are documented. A workflow to process RNA-Seq data from Illumina sequencers culminates in a list of differentially expressed genes. The representative results demonstrate that this protocol can be used to effectively identify genes differentially regulated at the transcriptional level in A. albopictus due to photoperiodic differences. With modest adjustments, this workflow can be readily adapted to study the transcriptional bases of diapause or other important life history traits in other mosquitoes.

RNA-Seq analyses are becoming increasingly important for identifying the molecular underpinnings of adaptive traits in non-model organisms. Here, a protocol to identify differentially expressed genes between diapause and non-diapause Aedes albopictus mosquitoes is described, from mosquito rearing, to RNA sequencing and bioinformatics analyses of RNA-Seq data.
