A Practical Guide to Phylogenetics for Nonexperts

Damien O'Halloran

doi:10.3791/50975

Summary

Here we describe a step-by-step pipeline for generating reliable phylogenies from nucleotide or amino acid sequence datasets. This guide aims to serve researchers or students new to phylogenetic analysis.

Abstract

Researchers across a wide range of fields now apply phylogenetics to their research questions, yet many of them are new to the topic and encounter avoidable problems. Here we compile a practical introduction to phylogenetics for nonexperts. We outline, step by step, a pipeline for generating reliable phylogenies from gene sequence datasets. We begin with a user guide for similarity search tools, via online interfaces as well as local executables. Next, we explore programs for generating multiple sequence alignments, followed by protocols for using software to determine best-fit models of evolution. We then outline protocols for reconstructing phylogenetic relationships via maximum likelihood and Bayesian criteria, and finally describe tools for visualizing phylogenetic trees. While this is by no means an exhaustive description of phylogenetic approaches, it does provide the reader with practical starting information on key software applications commonly used by phylogeneticists. We intend this article to serve as a practical training tool for researchers embarking on phylogenetic studies and as an educational resource that can be incorporated into a classroom or teaching lab.

Introduction

In order to understand how two (or more) species evolved, it is first necessary to obtain sequence or morphological data from each sample; these data represent quantities that we can use to measure their relationship through evolutionary space. Just like when measuring linear distance, having more data available (e.g. miles, inches, microns) will equate to a more accurate measurement. Ergo, the accuracy with which a researcher can deduce evolutionary distance is heavily influenced by the volume of informative data available to measure relationships. Furthermore, because different samples evolve at different rates and by different mechanisms, the method that we use to measure the relationship between two taxa also directly influences the accuracy of evolutionary measurements. Therefore, because evolutionary relationships are not directly observed but instead are extrapolated from sequence or morphological data, the problem of inferring evolutionary relationships becomes one of statistics. Phylogenetics is the branch of biology concerned with applying statistical models to patterns of evolution in order to optimally reconstruct the evolutionary history between taxa. This reconstruction between taxa is referred to as the taxa’s phylogeny.

To help bridge the gap in expertise between molecular biologists and evolutionary biologists, we describe here a step-by-step pipeline for inferring phylogenies from a set of sequences. First, we detail the steps involved in database interrogation using the Basic Local Alignment Search Tool (BLAST¹) algorithm, through the web-based interface and also with local executables. This is often the first step in obtaining a list of sequences similar to an unidentified query, although some researchers may also be interested in gathering data for a single group via web interfaces such as Phylota (http://www.phylota.net/). BLAST is an algorithm for comparing primary amino acid or nucleotide sequence data against a database of sequences to search for “hits” that resemble the query sequence. The BLAST program was designed by Stephen Altschul et al. at the National Institutes of Health (NIH)¹. The BLAST server consists of a number of different programs; some of the most common are listed here:

i) Nucleotide-nucleotide BLAST (blastn): This program requires a DNA sequence input and returns the most similar DNA sequences from the DNA database that the user specifies (e.g. for a specific organism).

ii) Protein-protein BLAST (blastp): Here the user inputs a protein sequence and the program returns the most similar protein sequences from the protein database that the user specifies.

iii) Position-Specific Iterative BLAST (PSI-BLAST) (blastpgp): The user inputs a protein sequence, and the program returns a set of closely related proteins from which a conserved profile is generated. A new query built from only these conserved “motifs” is then used to interrogate the protein database, which returns a larger group of proteins. A new profile is extracted from this larger group and the search is repeated, so that each round returns a larger set of proteins. By including related proteins in the query at each step, this program allows the user to identify sequences that are more divergent.

iv) Nucleotide 6-frame translation-protein (blastx): Here the user provides a nucleotide sequence input, which is converted into its six-frame conceptual translation products (i.e. both strands) and compared against a protein sequence database.

v) Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx): This program takes a DNA nucleotide sequence input and translates the input into all six-frame conceptual translation products which it compares against the six-frame translations of a nucleotide sequence database.

vi) Protein-nucleotide 6-frame translation (tblastn): This program uses a protein sequence input to compare against all six reading frames of a nucleotide sequence database.
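
The "six-frame conceptual translation" used by blastx, tblastx, and tblastn can be illustrated with a short Python sketch. This is not BLAST's own code; it only shows the idea, assuming the standard genetic code and a plain DNA string as input. The function names are hypothetical and were chosen for this illustration.

```python
# Illustrative sketch of a six-frame conceptual translation (not BLAST's implementation).
# Assumes the standard genetic code, with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; '*' marks a stop, 'X' an unrecognized codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translate three frames on the given strand (+) and three on its reverse complement (-)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frames("ATGGCCATTGTAATGGGCCGCTGA").items():
        print(frame, protein)
```

Each of the six strings is one candidate protein product of the input. A translated search compares all six against the database because the true reading frame and strand of an unannotated nucleotide query are usually unknown.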

Next, we describe commonly used programs for generating a Multiple Sequence Alignment (MSA) from a sequence dataset, and this is followed by a user guide to programs that determine the best-fit models of evolution for a sequence dataset. Phylogenetic reconstruction is a statistical problem, and because of this, phylogenetic methods need to incorporate a statistical framework. This statistical framework becomes an evolutionary model that incorporates sequence change within the dataset. The evolutionary model consists of a set of assumptions about the process of nucleotide or amino acid substitution, and the best-fit model for a particular dataset can be selected through statistical testing. The fit to the data of different models can be compared via likelihood ratio tests (LRTs) or information criteria to select the best-fit model within a set of possible ones. Two common information criteria are the Akaike information criterion (AIC)² and the Bayesian information criterion (BIC)³. Once an optimal alignment is generated, there are many different methods for creating a phylogeny from the aligned data; broadly, they can be divided into two categories: distance-based methods and sequence-based methods. Distance-based methods compute pairwise distances from sequences, and then use these distances to obtain the tree. Sequence-based methods use the sequence alignment directly, and usually search the tree space using an optimality criterion. We outline two sequence-based methods for reconstructing phylogenetic relationships: PhyML⁴, which implements the maximum likelihood framework, and MrBayes⁵, which uses Bayesian Markov Chain Monte Carlo inference. Likelihood and Bayesian methods provide a statistical framework for phylogenetic reconstruction. By providing user information on commonly used tree-building tools, we introduce the reader to the data required to infer phylogenetic relationships.
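
The two information criteria named above have simple closed forms: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where ln L is the maximized log-likelihood of a model, k is its number of free parameters, and n is the sample size (here taken to be the number of alignment sites, a common convention). The Python sketch below shows how candidate models could be ranked with these formulas. The model names, log-likelihoods, and parameter counts are made-up placeholders, not results from any real dataset or program.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def best_model(fits, n_sites, criterion="AIC"):
    """Return the name of the model with the lowest score.

    fits maps a model name to (log_likelihood, n_params).
    """
    def score(name):
        log_likelihood, n_params = fits[name]
        if criterion == "AIC":
            return aic(log_likelihood, n_params)
        if criterion == "BIC":
            return bic(log_likelihood, n_params, n_sites)
        raise ValueError(f"unknown criterion: {criterion}")

    return min(fits, key=score)


if __name__ == "__main__":
    # Hypothetical fits: (maximized log-likelihood, number of free parameters).
    example_fits = {"model_A": (-1000.0, 2), "model_B": (-990.0, 10)}
    for criterion in ("AIC", "BIC"):
        print(criterion, "prefers", best_model(example_fits, n_sites=500, criterion=criterion))
```

In this made-up example the two criteria disagree: the richer model wins under AIC, while BIC, whose penalty grows with the number of sites, prefers the simpler one. This is why model selection programs usually report which criterion a recommendation is based on.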

Protocol

1. Basic Local Alignment Search Tool (BLAST): Online Interface

Visit the BLAST¹ web server at the National Center for Biotechnology Information (NCBI): http://blast.ncbi.nlm.nih.gov/Blast.cgi (Figure 1).
Input a FASTA formatted text sequence (see Figure 2 for example) into the query box.
Click the appropriate BLAST program and relevant database or individual species of interest to use in the search and then click “BLAST”.
Note: A FASTA formatted sequence begins with a description line indicated by a “>” sign. The description must follow immediately after the “>” sign, and the sequence (i.e. nucleotides or amino acids) follows the description, starting on the next line. The output from the BLAST search can be viewed as HTML, plain text, XML, or hit tables (text or csv), with the default set to HTML (Figure 3).
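
The FASTA layout described in the note above is simple enough to check programmatically before submitting a query. The following Python sketch reads FASTA text into a dictionary of description-to-sequence pairs. The function name and the example records are hypothetical, and the parser handles only the basic layout described here.

```python
def parse_fasta(text):
    """Parse FASTA text into {description: sequence}.

    Each record starts with a '>' description line; the sequence follows on
    one or more lines and is joined into a single string.
    """
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            current = line[1:].strip()
            if not current:
                raise ValueError("description line has no text after '>'")
            if current in records:
                raise ValueError(f"duplicate description: {current}")
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    example = """>example_protein_1 hypothetical
MKVLAAGIVG
LLLAQPS
>example_protein_2
MRVLSAGIAG
"""
    for name, sequence in parse_fasta(example).items():
        print(name, len(sequence), sequence)
```

A check like this catches the most common input problems (a missing “>” line or an empty description) before a search is run.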

2. Basic Local Alignment Search Tool (BLAST): Local Executables

Download the latest BLAST command-line executables from this link:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
For PC users: double-click the latest blast win32.exe file, accept the license agreement, and click install.
Note: The default installation directory is C:\ncbi-blast-2.2.27+.
Configure the PC environment variable as follows:
1. Click the PC “start” button, and then right-click “computer”.
2. Click “Properties” and in the pop-up click on the “advanced” tab.
3. Click the “Environment Variables” button and in the new pop-up click the “new” button under the “User variables for user” section.
4. In the pop-up add the variable name “Path” and variable value “C:\ncbi-blast-2.2.27+\bin”.
  Note: the bin directory contains the executable (i.e. blastp, etc.).
For Mac users: Open the Terminal application (to do this, open “Finder” and search for “Terminal”; this will display the “Terminal” icon). In the terminal window type:
>ftp ftp.ncbi.nih.gov
Note: Alternatively, use the URL given above in the instructions for PC users.
To access the NCBI ftp site type “anonymous” for Name and Password, and then type:
>cd blast/executables/LATEST
List the executables by typing:
>ls
Get the latest version by typing the following (or whatever the latest version currently is):
>get ncbi-blast-2.2.7-macosx.tar.gz
Exit the NCBI ftp server site by typing “exit”.
Decompress the downloaded files by typing:
>tar -xzf ncbi-blast-2.2.7-macosx.tar.gz
Add the location of the binaries for the blast executable to your path so that the shell can search through this directory when looking for commands by typing:
>PATH=$PATH:new_folder_location
Check if this added the location to your path by typing:
>echo $PATH
Download preformatted BLAST databases (which are updated daily) from this link:
ftp://ftp.ncbi.nlm.nih.gov/blast/db/
Place the database into the “db” folder.
On a PC: open an MS-DOS prompt (to do this click “start” and type “cmd” in the search bar) and change the directory to the ncbi-blast folder by typing:
C:\Users>cd ..\ [moves up one folder]
C:\>cd ncbi-blast-2.2.27+
This will change the directory to:
C:\ncbi-blast-2.2.27+>
Create the database using the following “makeblastdb” command:
>makeblastdb -in db/briggsae.fasta -dbtype prot -out db/briggsae
Note: In the example below (Figure 4) the database is named “briggsae” and is comprised of one linkage group from the organism Caenorhabditis briggsae.
Create a query file called “test.txt” containing a FASTA formatted protein sequence and place it in the “db” folder.
Interrogate the database via a blastp search by typing the following command:
>blastp -query db/test.txt -db db/briggsae -out text.txt
On a Mac: download a database for local BLAST searches by accessing the NCBI ftp website as per the instructions above (step 2.4) and then type:
>lcd ../databases/
Download the genome or sequence of interest by typing:
>get NC_[Accession #].fna
Note: “.fna” refers to the FASTA formatted nucleotide sequence and “.faa” refers to the FASTA formatted amino acid sequences.
Type “quit” to exit the ftp site.
Make the database by typing:
>makeblastdb -in db/mouse.faa -out mouse -dbtype prot
Insert a FASTA formatted query sequence into the “bin” folder and interrogate the database with the following command:
>blastp -query "your query.fasta" -db "your database" -out results.txt
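
Results from a local search are easier to filter when written as a tab-separated hit table. The Python sketch below assumes the blastp command was run with the additional BLAST+ option `-outfmt 6`, which is not part of the commands listed above. With that option each line holds twelve default columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. The helper names, the e-value cutoff, and the example hits are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A subset of the twelve default columns of BLAST+ tabular output."""
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_tabular(text):
    """Parse tab-separated BLAST hits (assumed to come from -outfmt 6)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 columns, found {len(fields)}")
        hits.append(Hit(fields[0], fields[1], float(fields[2]),
                        int(fields[3]), float(fields[10]), float(fields[11])))
    return hits


def significant_hits(hits, max_evalue=1e-5):
    """Keep hits at or below an e-value cutoff (an arbitrary example threshold), best first."""
    return sorted((h for h in hits if h.evalue <= max_evalue), key=lambda h: h.evalue)


# Made-up example rows, for illustration only.
EXAMPLE_TABLE = "\n".join([
    "\t".join(["test", "subject_1", "98.50", "200", "3", "0", "1", "200", "5", "204", "2e-120", "410"]),
    "\t".join(["test", "subject_2", "31.20", "90", "60", "2", "10", "99", "40", "129", "0.53", "28.1"]),
    "\t".join(["test", "subject_3", "55.00", "180", "80", "1", "1", "180", "1", "180", "4e-40", "150"]),
])

if __name__ == "__main__":
    for hit in significant_hits(parse_tabular(EXAMPLE_TABLE)):
        print(hit.subject, hit.percent_identity, hit.evalue)
```

The surviving subject identifiers can then be used to collect the sequences that go into the multiple sequence alignment in the next section.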

3. Generating Multiple Sequence Alignments

Click on these links to access commonly used Multiple Sequence Alignment (MSA) programs:
ClustalW⁶ http://www.clustal.org/
Kalign⁷ http://msa.sbc.su.se/cgi-bin/msa.cgi
MAFFT⁸,⁹ http://mafft.cbrc.jp/alignment/software/
MUSCLE¹⁰ http://www.drive5.com/muscle/
T-Coffee¹¹ http://www.tcoffee.org/Projects/tcoffee/
PROBCONS¹² http://toolkit.tuebingen.mpg.de/probcons
Click on this link - http://tcoffee.crg.cat/apps/tcoffee/do:regular - and input FASTA formatted sequence data into the query box.
Note: A sample output from T-Coffee can be seen in Figure 5; similar residues are color coded.
Download the Clustal MSA as a command line version (ClustalW) or a graphical version (ClustalX) by clicking this link: http://www.clustal.org/clustal2/ - then click on the appropriate executable (i.e. win, Linux, Mac OS X).
Upload data as FASTA formatted sequence text and align (Figure 6).
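
Once sequences are aligned, every row has the same length, and simple pairwise distances can be read directly from the alignment. The Python sketch below illustrates the kind of calculation a distance-based method starts from. It computes the proportion of differing sites (the p-distance) for each pair and, for nucleotide data, the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3). The function names and the toy alignment are hypothetical, and skipping gapped columns pair by pair is one possible convention among several.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') or missing data ('?') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-?" or b in "-?":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for nucleotides: -(3/4) ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("the correction is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment):
    """Return {(name_a, name_b): Jukes-Cantor distance} for every pair of sequences."""
    return {
        (a, b): jukes_cantor(p_distance(alignment[a], alignment[b]))
        for a, b in combinations(alignment, 2)
    }


if __name__ == "__main__":
    # Toy nucleotide alignment, for illustration only.
    toy_alignment = {
        "taxon_1": "ACGTACGTAC",
        "taxon_2": "ACGTACGTAA",
        "taxon_3": "ACCTAC-TTA",
    }
    for pair, distance in distance_matrix(toy_alignment).items():
        print(pair, round(distance, 4))
```

The corrected distance is always at least as large as the raw p-distance, because the correction accounts for multiple substitutions at the same site that a simple count of differences cannot see.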

4. Determining Best-fit Models of Evolution

Click here to download the ProtTest¹³ program:
http://darwin.uvigo.es/our-software/
Once ProtTest is downloaded, double-click on the ProtTest.jar file.
Once ProtTest is launched, click on “select file” and load the sequence data (Figure 7).
Then click “start” and the program will begin (Figure 8).
Note: After completion of the run (Figure 8), the program will indicate the best model according to each criterion, e.g. “Best model according to AIC: WAG+I+G”.

5. Inferring Sequence Based Phylogenies by Maximum Likelihood or Bayesian Inference

Download PhyML⁴ here:
https://code.google.com/p/phyml/
Launch the executable by double clicking the appropriate application (i.e. phyml Windows, phyml Linux, etc.) and the interface window will pop up (Figure 9).
Load the input sequence as a PHYLIP formatted sequence by typing:
>”file name”.phy
Note: To convert between sequence formats, use the “Readseq” web program available at - http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi.
Launch the program by typing “Y”.
Download MrBayes⁵ here:
http://mrbayes.sourceforge.net/download.php
To start the program click on the executable file and read NEXUS formatted sequence data into the program by typing:
>execute “file name”.nex
Set the evolutionary model.
Select the number of generations to run by typing:
>mcmcp ngen = 1000000 [this sets the number of generations to 1000000]
>sump burnin =10000 [this sets the burnin to 10000]
Save the branch lengths in the results file by typing:
>mcmcp savebrlens = yes
Run the analysis by typing:
>mcmc
Summarize the trees using the “sumt” command.
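
PhyML expects PHYLIP formatted input and MrBayes expects NEXUS formatted input, so the same alignment usually has to be written out twice. As an alternative to a web converter, the Python sketch below shows one way to produce both files from an alignment held in memory. It writes a relaxed sequential PHYLIP layout (names separated from sequences by spaces) and a minimal NEXUS data block. The function names and the toy alignment are hypothetical, and the output should be checked against the documentation of the program version in use.

```python
def _check_alignment(alignment):
    """Return the alignment length, raising if the sequences differ in length."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return lengths.pop()


def _safe_name(name):
    """Replace whitespace in taxon names, since it separates name from sequence."""
    return "_".join(name.split())


def to_phylip(alignment):
    """Write a relaxed sequential PHYLIP file: a header line, then 'name  sequence' rows."""
    n_sites = _check_alignment(alignment)
    width = max(len(_safe_name(name)) for name in alignment) + 2
    lines = [f"{len(alignment)} {n_sites}"]
    for name, seq in alignment.items():
        lines.append(f"{_safe_name(name):<{width}}{seq}")
    return "\n".join(lines) + "\n"


def to_nexus(alignment, datatype="protein"):
    """Write a minimal NEXUS data block (datatype is e.g. 'protein' or 'dna')."""
    n_sites = _check_alignment(alignment)
    width = max(len(_safe_name(name)) for name in alignment) + 2
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(alignment)} nchar={n_sites};",
        f"  format datatype={datatype} gap=- missing=?;",
        "  matrix",
    ]
    for name, seq in alignment.items():
        lines.append(f"  {_safe_name(name):<{width}}{seq}")
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy protein alignment, for illustration only.
    toy_alignment = {"taxon 1": "MKVLA-GI", "taxon_2": "MRVLSAGI", "taxon_3": "MKILA-GV"}
    print(to_phylip(toy_alignment))
    print(to_nexus(toy_alignment))
```

Saving the two strings as, for example, a “.phy” file and a “.nex” file gives inputs of the kind loaded in the steps above.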

6. Visualizing Phylogenies

View a list of tree viewer programs here:
http://www.treedyn.org/overview/editors.html
Download the TreeView¹⁴ program here:
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
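
Tree-building programs commonly write their results as Newick strings, a parenthesized text format in which nested brackets are clades and an optional number after a colon is a branch length. A quick look at the structure of such a file can help before opening it in a viewer. The Python sketch below assumes a simple Newick string without quoted names or comments. The parser is a minimal illustration with hypothetical function names, not a replacement for a tree viewer.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, children."""
    text = text.strip().rstrip(";")
    pos = 0

    def read_node():
        nonlocal pos
        children = []
        if text[pos:pos + 1] == "(":
            pos += 1  # consume "("
            children.append(read_node())
            while text[pos:pos + 1] == ",":
                pos += 1  # consume ","
                children.append(read_node())
            if text[pos:pos + 1] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        name = text[start:pos].strip()
        length = None
        if text[pos:pos + 1] == ":":
            pos += 1  # consume ":"
            start = pos
            while pos < len(text) and text[pos] not in "(),":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = read_node()
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return root


def leaf_names(node):
    """List the tip (taxon) names from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


def render(node, depth=0):
    """Return the tree as indented text lines, one node per line."""
    label = node["name"] or "(internal node)"
    if node["length"] is not None:
        label += f" [branch length {node['length']}]"
    lines = ["  " * depth + label]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Toy tree, for illustration only.
    tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
    print(leaf_names(tree))
    print("\n".join(render(tree)))
```

The indented listing shows which taxa group together, which is the same information a tree viewer draws graphically.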

Results

Finding similarities to a query allows researchers to ascribe a potential identity to new sequences and also infer relationships between sequences. The file input type for BLAST¹ is FASTA formatted text sequence or GenBank accession number. FASTA formatted sequence begins with a description line indicated by a “>” sign (Figure 2). The description must follow immediately after the “>” sign, the sequence (i.e. nucleotides or amino acids) follow the descri...

Discussion

Our hope for this article is that it will serve as a starting point to guide researchers or students that are new to phylogenetics. Genome sequencing projects have become less expensive over the last few years and as a consequence the user demand for this technology is increasing, and now the production of large sequence datasets is commonplace in small labs. These datasets often provide researchers with sets of genes that require a phylogenetic framework to begin to understand their function. Furthermore, because phylog...

Disclosures

We have nothing to disclose.

Acknowledgements

We thank members of the O’Halloran lab for comments on the manuscript. We thank The George Washington University Department of Biological Sciences and Columbian College of Arts and Sciences for funding to D. O’Halloran.

Materials

Name	Company	Catalog Number	Comments
BLAST webpage			http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST executables			ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Preformatted BLAST databases			ftp://ftp.ncbi.nlm.nih.gov/blast/db/
Clustal			http://www.clustal.org/
Kalign			http://msa.sbc.su.se/cgi-bin/msa.cgi
MAFFT			http://mafft.cbrc.jp/alignment/software/
MUSCLE			http://www.drive5.com/muscle/
T-Coffee			http://www.tcoffee.org/Projects/tcoffee/
PROBCONS			http://toolkit.tuebingen.mpg.de/probcons
Se-Al			http://tree.bio.ed.ac.uk/software/seal/
BSEdit			http://www.bsedit.org/
JalView			http://www.jalview.org/
SeaView			http://pbil.univ-lyon1.fr/software/seaview.html
ProtTest			https://code.google.com/p/prottest3/
Java Runtime			http://www.java.com/en/download/chrome.jsp
Readseq			http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi
jModelTest			https://code.google.com/p/jmodeltest2/
PhyML			https://code.google.com/p/phyml/
MrBayes			http://mrbayes.sourceforge.net/download.php
TreeView			http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
TreeDyn			http://www.treedyn.org/

References

Altschul, S. F., Carroll, R. J., Lipman, D. J. Weights for data related by a tree. J. Mol. Biol. 207 (4), 647-653 (1989).
Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 19 (6), 706-723 (1974).
Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6 (2), 461-464 (1978).
Guindon, S., Gascuel, O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52 (5), 696-704 (2003).
Huelsenbeck, J. P., Ronquist, F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 17 (8), 754-755 (2001).
Thompson, J. D., Higgins, D. G., Gibson, T. J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22 (22), 4673-4680 (1994).
Lassmann, T., Sonnhammer, E. L. Kalign--an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics. 6, 298 (2005).
Katoh, K., Kuma, K., Toh, H., Miyata, T. MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33 (2), 511-518 (2005).
Katoh, K., Misawa, K., Kuma, K., Miyata, T. MAFFT: A novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic Acids Res. 30 (14), 3059-3066 (2002).
Edgar, R. C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32 (5), 1792-1797 (2004).
Notredame, C., Higgins, D. G., Heringa, J. T-coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302 (1), 205-217 (2000).
Do, C. B., Mahabhashyam, M. S., Brudno, M., Batzoglou, S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 15 (2), 330-340 (2005).
Darriba, D., Taboada, G. L., Doallo, R., Posada, D. ProtTest 3: Fast selection of best-fit models of protein evolution. Bioinformatics. 27 (8), 1164-1165 (2011).
Page, R. D. TreeView: An application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12 (4), 357-358 (1996).
Darriba, D., Taboada, G. L., Doallo, R., Posada, D. jModelTest 2: More models, new heuristics and parallel computing. Nat. Methods. 9 (8), 772 (2012).
Chevenet, F., Brun, C., Banuls, A. L., Jacq, B., Christen, R. TreeDyn: Towards dynamic graphics and annotations for analyses of trees. BMC Bioinformatics. 7, 439 (2006).
