This protocol outlines bioinformatic steps for investigating the molecular evolution and expression of candidate genes. Here we provide thorough instructions so that anybody with minimal bioinformatic experience can run through this protocol. This pipeline can be applied to any organism and any gene family.
A common problem in bioinformatics is shell scripts that fail. When attempting this protocol, make sure you have the most up-to-date software, read the error files, and check the manual carefully. To begin, log in to the computer cluster account on a terminal or PuTTY application window.
On the terminal, download SRA Toolkit version 2.8.1 using wget, then finish installing the program. Search NCBI for the SRA accession number of the desired samples, then obtain the RNA sequence data in the terminal window. For paired-end data, obtain two FASTQ files per sample.
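As a rough illustration of the download step, the sketch below builds a fastq-dump command line for one run. It only assembles and prints the command; it does not run it. The accession SRR0000000, the output directory name, and the helper names are placeholders chosen for this example, not values from the protocol. The --split-files option asks fastq-dump to write the two mates of a paired-end run to separate files.

```python
import shlex


def fastq_dump_command(accession, out_dir="fastq", paired=True):
    """Build a fastq-dump command line (as a list) for one SRA run accession."""
    cmd = ["fastq-dump", "--outdir", out_dir]
    if paired:
        # --split-files writes forward and reverse reads to separate files
        cmd.append("--split-files")
    cmd.append(accession)
    return cmd


def expected_fastq_files(accession, paired=True):
    """File names fastq-dump is expected to produce for this accession."""
    if paired:
        return [f"{accession}_1.fastq", f"{accession}_2.fastq"]
    return [f"{accession}.fastq"]


if __name__ == "__main__":
    # SRR0000000 is a hypothetical accession used only for illustration
    print(shlex.join(fastq_dump_command("SRR0000000")))
    print(expected_fastq_files("SRR0000000"))
```

The printed command can be pasted into the cluster terminal, or passed to subprocess.run once the accession has been replaced with a real one.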
Find the reference genome online if one exists. To obtain a reference assembly, type wget in the terminal window, and paste the link address. If available, also copy the GTF file and protein FASTA file for the reference genome.
Index the genome, then map reads and calculate expression for each sample. Rename the results file to something descriptive, and generate a matrix of all counts. Open an internet browser window and go to NCBI GenBank.
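For the count-matrix step above, the following is a minimal sketch of how per-sample results could be combined into one table. It assumes each results file is tab-separated with a gene_id column and an expected_count column; those column names, the sample names, and the helper functions are assumptions made for this example, and the quantification software used may provide its own command for this.

```python
import csv
import io


def read_counts(handle, id_col="gene_id", count_col="expected_count"):
    """Read one tab-separated results file into {gene_id: count}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[count_col]) for row in reader}


def build_matrix(samples):
    """Combine {sample_name: {gene_id: count}} into a header and sorted rows.

    Genes absent from a sample are filled with 0.0.
    """
    names = list(samples)
    genes = sorted(set().union(*(counts.keys() for counts in samples.values())))
    header = ["gene_id"] + names
    rows = [[g] + [samples[n].get(g, 0.0) for n in names] for g in genes]
    return header, rows


if __name__ == "__main__":
    # Toy in-memory files stand in for real per-sample results files
    hypostome = read_counts(io.StringIO("gene_id\texpected_count\ng1\t5\ng2\t0\n"))
    tentacle = read_counts(io.StringIO("gene_id\texpected_count\ng1\t2\ng3\t7\n"))
    header, rows = build_matrix({"hypostome": hypostome, "tentacle": tentacle})
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(value) for value in row))
```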
In the search bar, type the name of the gene of interest and the names of closely related species that have been sequenced. On the left of the search bar, select Protein, then click Search. Extract the sequences by clicking Send to, and then select File.
Under Format, select FASTA, then click Create File. Move the FASTA file of homologues to the computer cluster using a local terminal window or FileZilla. Next, search for candidate genes using BLAST+. On the computer cluster, make a BLAST database from the genome- or transcriptome-translated protein FASTA file.
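The database step might look roughly like the sketch below, which assembles a makeblastdb command for a protein FASTA. The file name proteins.fasta, the database name species_db, and the helper name are hypothetical; the command is only printed, not executed.

```python
import shlex


def makeblastdb_command(fasta, db_name, dbtype="prot"):
    """Build a makeblastdb command line (as a list) for a FASTA file."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", db_name]


if __name__ == "__main__":
    # proteins.fasta and species_db are placeholder names for illustration
    print(shlex.join(makeblastdb_command("proteins.fasta", "species_db")))
```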
BLAST the homologous gene sequences from NCBI to the database of the species of interest, then view the output file using the command more. Copy unique gene IDs from the species of interest to a new text file. Extract the sequences of candidate genes.
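One way to script the search, the collection of unique gene IDs, and the sequence extraction is sketched below. It assumes the BLAST output was written in tabular format (-outfmt 6), where the second column is the subject ID and the eleventh is the e-value. The e-value cutoff of 1e-5, the file names, and all function names are illustrative choices, not values from the protocol.

```python
def blastp_command(query, db, out, evalue="1e-5"):
    """Build a blastp command line that writes tabular (outfmt 6) results."""
    return ["blastp", "-query", query, "-db", db, "-out", out,
            "-outfmt", "6", "-evalue", evalue]


def unique_subject_ids(blast_lines, max_evalue=1e-5):
    """Return subject IDs from outfmt 6 lines, in first-seen order, without repeats."""
    seen, ids = set(), []
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue and subject_id not in seen:
            seen.add(subject_id)
            ids.append(subject_id)
    return ids


def read_fasta(lines):
    """Parse FASTA lines into {first word of header: sequence}."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def extract_sequences(records, ids):
    """Keep only the records whose ID is in ids, preserving the order of ids."""
    return {i: records[i] for i in ids if i in records}


if __name__ == "__main__":
    # Placeholder file names; on the cluster, read real files with open(...)
    print(" ".join(blastp_command("homologues.fasta", "species_db", "hits.tsv")))
    hits = ["q1\tgeneA\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t200"]
    proteins = read_fasta([">geneA example", "MKT", "LLV", ">geneB", "MAA"])
    print(extract_sequences(proteins, unique_subject_ids(hits)))
```

Filtering on the e-value before collecting IDs is optional; setting max_evalue to a large number keeps every hit in the output file.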
To confirm gene annotation using reciprocal BLAST, go to the NCBI BLAST (Basic Local Alignment Search Tool) web page, select BLASTP, then paste the candidate sequences, select the non-redundant protein sequence database, and click BLAST. Open MEGA, click on Align, then Edit/Build Alignment, select Create a new alignment, and click OK. Select Protein. When the Alignment window opens, click on Edit.
Click Insert Sequences From File, and select the FASTA with protein sequences of candidate genes and probable homologues. Select All sequences. Find the arm symbol and hover over it.
It should say Align sequences using MUSCLE algorithm. Click on the arm symbol, and then click Align Protein to align the sequences. Edit parameters or click OK to use default parameters. This protocol was applied to tissues of Hydra vulgaris, which is a freshwater invertebrate that belongs to the phylum Cnidaria.
Opsin genes were investigated to gain insight into the evolution of eyes and light detection in animals. Sequences for opsin-related genes of H. vulgaris and other species were extracted into a FASTA file from the NCBI GenBank. The opsin genes were aligned in MEGA, making it possible to identify Hydra opsins that were missing a conserved lysine residue necessary to bind a light-sensitive molecule.
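The check for a missing conserved residue can also be done programmatically once the alignment is exported. The sketch below assumes the aligned sequences are already loaded as a dictionary of equal-length strings with "-" for gaps, and that the position of the conserved lysine is known in one reference sequence. The sequence names, the toy alignment, and the function names are made up for illustration.

```python
def alignment_column(aligned_ref, ref_position):
    """Map a 1-based residue position in the ungapped reference
    to a 0-based column index in the alignment."""
    count = 0
    for col, char in enumerate(aligned_ref):
        if char != "-":
            count += 1
            if count == ref_position:
                return col
    raise ValueError("ref_position is beyond the end of the reference sequence")


def missing_residue(alignment, ref_name, ref_position, residue="K"):
    """List sequences that lack the given residue at the column
    aligned to ref_position of the reference."""
    col = alignment_column(alignment[ref_name], ref_position)
    return [name for name, seq in alignment.items()
            if name != ref_name and seq[col].upper() != residue]


if __name__ == "__main__":
    # Toy alignment; the lysine is residue 3 of the ungapped reference "MAKT"
    toy = {
        "ref": "MA-KT",
        "opsinA": "MAGKT",
        "opsinB": "MA-RT",
        "opsinC": "M--KS",
    }
    print(missing_residue(toy, "ref", 3))
```

A sequence with a gap at that column is also reported, since a gap is not a lysine.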
A maximum-likelihood tree was generated using opsin sequences from Hydra vulgaris and other species. The phylogeny suggests opsin genes are evolving by lineage-specific duplications in cnidarians, and potentially by tandem duplication in H. vulgaris. Next, a differential expression analysis was performed in edgeR to investigate absolute expression of opsin genes.
To determine whether one or more opsins are up-regulated in the hypostome, or head, pair-wise comparisons of hypostome versus the body column, budding zone, foot, and tentacles were performed. It was found that 1,774 transcripts were differentially expressed between the hypostome and body column. The genes that were up-regulated across multiple comparisons were determined, and a functional enrichment in Blast2GO was performed.
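Finding the genes that are up-regulated across several pair-wise comparisons amounts to counting in how many comparisons each gene appears. A minimal sketch is given below, assuming the up-regulated transcript IDs from each comparison have already been exported as sets. The transcript IDs and the function name are placeholders, not results from this study.

```python
from collections import Counter


def shared_upregulated(comparisons, min_comparisons=None):
    """Return sorted gene IDs that are up-regulated in at least
    min_comparisons of the comparisons (default: all of them)."""
    if min_comparisons is None:
        min_comparisons = len(comparisons)
    counts = Counter(gene for genes in comparisons.values() for gene in set(genes))
    return sorted(gene for gene, n in counts.items() if n >= min_comparisons)


if __name__ == "__main__":
    # Hypothetical transcript IDs up-regulated in hypostome versus each tissue
    up_in_hypostome = {
        "body_column": {"t1", "t2", "t3"},
        "budding_zone": {"t1", "t3"},
        "foot": {"t1", "t4"},
        "tentacles": {"t1", "t3"},
    }
    print(shared_upregulated(up_in_hypostome))
    print(shared_upregulated(up_in_hypostome, min_comparisons=3))
```

The resulting list is the kind of input that can then be taken forward to a functional enrichment tool.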
Finally, the absolute expression of opsin genes was investigated in different tissues during different stages of budding, and during different time points of regeneration. Visual inspection of the alignment and tree will confirm whether candidate genes belong to the family of interest. Genes that are too different in sequence, or that group outside of everything else, are probably part of a different gene family.
Results from this protocol can be considered hypothesis-generating. This pipeline can highlight candidate genes to study functionally in future studies. After exploring Hydra opsin expression, we are now using similar techniques to investigate related genes across species in order to identify similarities and differences in function.