Summary

Here we describe a step-by-step pipeline for generating reliable phylogenies from nucleotide or amino acid sequence datasets. This guide aims to serve researchers or students new to phylogenetic analysis. 

Abstract

Researchers across a wide range of fields now apply phylogenetics to their research questions, yet many are new to the topic and face a steep learning curve. Here we compile a practical introduction to phylogenetics for nonexperts. We outline, step by step, a pipeline for generating reliable phylogenies from gene sequence datasets. We begin with a user guide to similarity search tools, run through online interfaces as well as local executables. Next, we explore programs for generating multiple sequence alignments, followed by protocols for using software to determine best-fit models of evolution. We then outline protocols for reconstructing phylogenetic relationships via maximum likelihood and Bayesian criteria, and finally describe tools for visualizing phylogenetic trees. Although this is by no means an exhaustive description of phylogenetic approaches, it provides the reader with practical starting information on key software applications commonly used by phylogeneticists. We envision this article as a practical training tool for researchers embarking on phylogenetic studies and as an educational resource that could be incorporated into a classroom or teaching lab.

Introduction

In order to understand how two (or more) species evolved, it is first necessary to obtain sequence or morphological data from each sample; these data represent quantities that we can use to measure their relationship through evolutionary space. Just like when measuring linear distance, having more data available (e.g. miles, inches, microns) will equate to a more accurate measurement. Ergo, the accuracy with which a researcher can deduce evolutionary distance is heavily influenced by the volume of informative data available to measure relationships. Furthermore, because different samples evolve at different rates and by different mechanisms, the method that we use to measure the relationship between two taxa also directly influences the accuracy of evolutionary measurements. Therefore, because evolutionary relationships are not directly observed but instead are extrapolated from sequence or morphological data, the problem of inferring evolutionary relationships becomes one of statistics. Phylogenetics is the branch of biology concerned with applying statistical models to patterns of evolution in order to optimally reconstruct the evolutionary history between taxa. This reconstruction between taxa is referred to as the taxa’s phylogeny.

To help bridge the gap in expertise between molecular biologists and evolutionary biologists, we describe here a step-by-step pipeline for inferring phylogenies from a set of sequences. First, we detail the steps involved in database interrogation using the Basic Local Alignment Search Tool (BLAST1) algorithm through the web-based interface and also by using local executables. This is often the first step in obtaining a list of sequences similar to an unidentified query, although some researchers may also be interested in gathering data for a single group via web interfaces such as Phylota (http://www.phylota.net/). BLAST is an algorithm for comparing primary amino acid or nucleotide sequence data against a database of sequences to search for “hits” that resemble the query sequence. The BLAST program was designed by Stephen Altschul et al. at the National Institutes of Health (NIH)1. The BLAST server offers a number of different programs; some of the most common are listed here:

i) Nucleotide-nucleotide BLAST (blastn): This program requires a DNA sequence input and returns the most similar DNA sequences from the DNA database that the user specifies (e.g. for a specific organism).

ii) Protein-protein BLAST (blastp): Here the user inputs a protein sequence and the program returns the most similar protein sequences from the protein database that the user specifies.

iii) Position-Specific Iterative BLAST (PSI-BLAST) (blastpgp): The user inputs a protein sequence, and the program returns a set of closely related proteins from which a conserved profile is generated. A new query built only from these conserved “motifs” is then used to interrogate the protein database, returning a larger group of proteins from which a new profile is extracted. The process is repeated, with each round returning a larger set of proteins and a refined profile. By incorporating related proteins into the query at each step, this program allows the user to identify sequences that are more divergent.

iv) Nucleotide 6-frame translation-protein (blastx): Here the user provides a nucleotide sequence, which is converted into its six-frame conceptual translation products (i.e. both strands) and compared against a protein sequence database.

v) Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx): This program takes a DNA nucleotide sequence input and translates the input into all six-frame conceptual translation products which it compares against the six-frame translations of a nucleotide sequence database.

vi) Protein-nucleotide 6-frame translation (tblastn): This program uses a protein sequence input to compare against all six reading frames of a nucleotide sequence database.
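
The list above amounts to a lookup from query type and database type to a program name. A minimal Python sketch of that lookup follows; the function name `choose_blast_program` and the `translate_both` flag are illustrative assumptions, not part of BLAST itself, and PSI-BLAST is omitted because it is an iterative variant of a protein-protein search rather than a distinct query/database pairing.

```python
# Map (query type, database type) to the BLAST program described in the list above.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type: str, db_type: str, translate_both: bool = False) -> str:
    """Return the name of the BLAST program matching the query and database types.

    translate_both is a hypothetical flag for this sketch: it selects tblastx,
    which compares six-frame translations of a nucleotide query against
    six-frame translations of a nucleotide database.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"no program for query={query_type!r}, database={db_type!r}") from None
```

For example, `choose_blast_program("protein", "nucleotide")` returns `"tblastn"`, matching item vi.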

Next, we describe commonly used programs for generating a Multiple Sequence Alignment (MSA) from a sequence dataset, followed by a user guide to programs that determine the best-fit model of evolution for that dataset. Phylogenetic reconstruction is a statistical problem, so phylogenetic methods need a statistical framework. That framework takes the form of an evolutionary model describing sequence change within the dataset. The model consists of a set of assumptions about the process of nucleotide or amino acid substitution, and the best-fit model for a particular dataset can be selected through statistical testing. The fit of different models to the data can be compared via likelihood ratio tests (LRTs) or information criteria to select the best-fit model from a set of candidates. Two common information criteria are the Akaike information criterion (AIC)2 and the Bayesian information criterion (BIC)3. Once an optimal alignment is generated, there are numerous methods for inferring a phylogeny from the aligned data; broadly, they can be divided into two categories: distance-based methods and sequence-based methods. Distance-based methods compute pairwise distances from the sequences and then use these distances to build the tree. Sequence-based methods use the sequence alignment directly and usually search tree space using an optimality criterion. We outline two sequence-based methods for reconstructing phylogenetic relationships: PhyML4, which implements the maximum likelihood framework, and MrBayes5, which uses Bayesian Markov chain Monte Carlo inference. Both likelihood and Bayesian methods provide a statistical framework for phylogenetic reconstruction. By providing user information on commonly used tree-building tools, we introduce the reader to the data required to infer phylogenetic relationships.
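
To make the idea of a pairwise distance concrete, the sketch below computes the proportion of differing sites (the p-distance) between two aligned nucleotide sequences and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The Jukes-Cantor model is chosen here only as a simple, well-known illustration; this protocol does not prescribe it, and skipping gapped columns is an assumption of the sketch.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped (an assumption
    made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed nucleotide p-distance."""
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance is always at least as large as the observed proportion, because it accounts for multiple substitutions at the same site.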

Protocol

1. Basic Local Alignment Search Tool (BLAST): Online Interface

  1. Visit the BLAST1 web server at the National Center for Biotechnology Information (NCBI): http://blast.ncbi.nlm.nih.gov/Blast.cgi (Figure 1).
  2. Input a FASTA formatted text sequence (see Figure 2 for an example) into the query box.
  3. Select the appropriate BLAST program and the relevant database or individual species of interest to use in the search, and then click “BLAST”.
    Note: A FASTA formatted sequence begins with a description line indicated by a “>” sign. The description must follow immediately after the “>” sign, and the sequence (i.e. nucleotides or amino acids) follows the description on the next line. The output from the BLAST search can be viewed as HTML, plain text, XML, or hit tables (text or csv), with the default set to HTML (Figure 3).
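
The FASTA layout described in the note can be checked before submitting a query. The following Python sketch parses FASTA text into a dictionary and rejects input that breaks the rules stated above; the function name `parse_fasta` and the example record are hypothetical.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {description: sequence}.

    Raises ValueError if sequence data appears before the first '>' line,
    if a '>' line has no description, or if a description is repeated.
    """
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("a description must follow the '>' sign")
            if header in records:
                raise ValueError(f"duplicate description: {header}")
            records[header] = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' line")
            records[header].append(line)
    # Sequences may be wrapped over several lines; join them back together.
    return {name: "".join(parts) for name, parts in records.items()}


example = """>test_protein hypothetical example record
MKTAYIAKQR
QISFVKSHFS
"""
parsed = parse_fasta(example)
```

Here `parsed` holds a single entry whose sequence is the two wrapped lines joined into one string.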

2. Basic Local Alignment Search Tool (BLAST): Local Executables

  1. Download the latest BLAST command-line executables from this link:
    ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
  2. For PC users: double-click the latest BLAST win32.exe file, accept the license agreement, and click install.
    Note: The default installation directory is C:\ncbi-blast-2.2.27+.
  3. Configure the PC environment variable as follows:
    1. Click the PC “start” button, and then right click “computer”,
    2. Click “Properties” and in the pop-up click on the “advanced” tab
    3. Click the “Environment Variables button” and in the new pop-up click the “new” button under the “User variables for user” section
    4. In the pop-up add the variable name “Path” and variable value “C:\ncbi-blast-2.2.27+\bin”.
      Note: the bin directory contains the executable (i.e. blastp, etc.).
  4. For Mac users: Open the Terminal application (open “Finder”, search for “Terminal”, and click the “Terminal” icon). In the terminal window type:
    >ftp ftp.ncbi.nih.gov
    Note: The host name from the URL given above for PC users (ftp.ncbi.nlm.nih.gov) can also be used.
  5. To access the NCBI ftp site type “anonymous” for Name and Password, and then type:
    >cd blast/executables/blast+/LATEST
  6. List the executables by typing:
    >ls
  7. Get the latest version by typing the following (or whatever the latest version currently is):
    >get ncbi-blast-2.2.7-macosx.tar.gz
  8. Exit the NCBI ftp server site by typing “exit”.
  9. Decompress the downloaded files by typing:
    >tar -xzf ncbi-blast-2.2.7-macosx.tar.gz
  10. Add the location of the binaries for the blast executable to your path so that the shell can search through this directory when looking for commands by typing:
    >PATH=$PATH:new_folder_location
  11. Check if this added the location to your path by typing:
    >echo $PATH
  12. Download a preformatted BLAST database (these are updated daily) from:
    ftp://ftp.ncbi.nlm.nih.gov/blast/db/
  13. Place the database into the “db” folder.
  14. On a PC: open an MS-DOS prompt (to do this click “start” and type “cmd” in the search bar) and change the directory to the ncbi-blast folder by typing:
    C:\Users>cd ..\ [moves up one folder]
    C:\>cd ncbi-blast-2.2.27+
    This will change the directory to:
    C:\ncbi-blast-2.2.27+>
  15. Create the database using the following “makeblastdb” command:
    >makeblastdb -in db/briggsae.fasta -dbtype prot -out db/briggsae
    Note: In this example (Figure 4) the database is named “briggsae” and consists of one linkage group from the organism Caenorhabditis briggsae.
  16. Create a query protein sequence file called “test.txt” by saving a FASTA formatted protein sequence as a text file in the “db” folder.
  17. Interrogate the database via a blastp search by typing the following command:
    >blastp -query db/test.txt -db db/briggsae -out text.txt
  18. On a Mac: download a database for local BLAST searches by accessing the NCBI ftp website as per the instructions above (step 2.4) and then type:
    >lcd ../databases/
  19. Download the genome or sequence of interest by typing:
    >get NC_[Accession #].fna
    Note: “.fna” refers to the FASTA formatted nucleotide sequence and “.faa” refers to the FASTA formatted amino acid sequences.
  20. Type “quit” to exit the ftp site.
  21. Make the database by typing:
    >makeblastdb -in db/mouse.faa -out mouse -dbtype prot
  22. Insert a FASTA formatted query sequence into the “bin” folder and interrogate the database with the following command:
    >blastp -query "your query.fasta" -db "your database" -out results.txt
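
The two commands used in this section can also be assembled in a script, which helps avoid quoting mistakes when file names contain spaces. The Python sketch below only builds the argument lists; it does not run BLAST. The helper names are hypothetical, the flags are the ones shown in the steps above plus `-evalue`, and `shlex.join` assumes Python 3.8 or later.

```python
import shlex


def makeblastdb_command(fasta_path: str, db_name: str, dbtype: str = "prot") -> list:
    """Build the argument list for formatting a FASTA file as a BLAST database."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastp_command(query_path: str, db_name: str, out_path: str, evalue=None) -> list:
    """Build the argument list for a protein-protein search against a local database."""
    command = ["blastp", "-query", query_path, "-db", db_name, "-out", out_path]
    if evalue is not None:
        # Optional E-value cutoff; left at the BLAST default when not given.
        command += ["-evalue", str(evalue)]
    return command


# Render the commands as they would be typed at a prompt.
db_line = shlex.join(makeblastdb_command("db/briggsae.fasta", "db/briggsae"))
search_line = shlex.join(blastp_command("your query.fasta", "mouse", "results.txt"))
```

If the BLAST executables are on the path, either list could be passed to `subprocess.run(command, check=True)` to run the step directly.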

3. Generating Multiple Sequence Alignments

  1. Click on these links to access commonly used Multiple Sequence Alignment (MSA) programs:
    ClustalW6 http://www.clustal.org/
    Kalign7 http://msa.sbc.su.se/cgi-bin/msa.cgi
    MAFFT8,9 http://mafft.cbrc.jp/alignment/software/
    MUSCLE10 http://www.drive5.com/muscle/
    T-Coffee11 http://www.tcoffee.org/Projects/tcoffee/
    PROBCONS12 http://toolkit.tuebingen.mpg.de/probcons
  2. Open the T-Coffee web server at http://tcoffee.crg.cat/apps/tcoffee/do:regular and input FASTA formatted sequence data into the query box.
    Note: A sample output from T-Coffee can be seen in Figure 5; similar residues are color coded.
  3. Download the Clustal MSA as a command line version (ClustalW) or a graphical version (ClustalX) by clicking this link: http://www.clustal.org/clustal2/ - then click on the appropriate executable (i.e. win, Linux, Mac OS X).
  4. Upload data as FASTA formatted sequence text and align (Figure 6).
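
Before moving on to model selection, it can be worth confirming that the alignment is well formed, i.e. that every aligned sequence has the same length. The Python sketch below performs that check and reports the fraction of fully conserved, gap-free columns as a rough summary. Both function names are hypothetical, and the conservation measure is an illustrative choice rather than a score reported by any of the programs above.

```python
def alignment_length(aligned: dict) -> int:
    """Return the common length of all aligned sequences, or raise ValueError."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    length = lengths.pop()
    if length == 0:
        raise ValueError("alignment is empty")
    return length


def conserved_fraction(aligned: dict) -> float:
    """Fraction of columns where every sequence has the same residue and no gap."""
    length = alignment_length(aligned)
    sequences = [seq.upper() for seq in aligned.values()]
    conserved = sum(
        1 for column in zip(*sequences)
        if "-" not in column and len(set(column)) == 1
    )
    return conserved / length
```

For the toy alignment `{"a": "MK-LV", "b": "MKALV", "c": "MR-LV"}`, three of the five columns (M, L and V) are fully conserved.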

4. Determining Best-fit Models of Evolution

  1. Click here to download the ProtTest13 program:
    http://darwin.uvigo.es/our-software/
  2. Once ProtTest is downloaded, double-click on the ProtTest.jar file.
  3. Once ProtTest is launched, click on “select file” and load the sequence data (Figure 7).
  4. Then click “start” and the program will begin (Figure 8).
    Note: After completion of the run (Figure 8), the program will indicate the best model according to each criterion, e.g. “Best model according to AIC: WAG+I+G”.
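
To show what a line such as “Best model according to AIC” summarizes, the Python sketch below scores candidate models with the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, ln L the maximized log-likelihood, and n the sample size. Taking n as the alignment length is a common convention assumed here. The log-likelihoods and parameter counts in the example are invented for illustration and are not ProtTest output.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def best_model(candidates: dict, n_sites: int, criterion: str = "AIC"):
    """Return (name of the lowest-scoring model, all scores).

    candidates maps a model name to (log-likelihood, number of parameters).
    """
    if criterion not in ("AIC", "BIC"):
        raise ValueError("criterion must be 'AIC' or 'BIC'")
    scores = {}
    for name, (lnl, k) in candidates.items():
        scores[name] = aic(lnl, k) if criterion == "AIC" else bic(lnl, k, n_sites)
    return min(scores, key=scores.get), scores


# Hypothetical values for illustration only.
candidates = {
    "JTT": (-5230.4, 25),
    "WAG": (-5210.9, 25),
    "WAG+I+G": (-5150.2, 27),
}
winner, scores = best_model(candidates, n_sites=300, criterion="AIC")
```

With these made-up numbers the richer model wins because its gain in log-likelihood outweighs the penalty for two extra parameters; a lower score indicates a better trade-off between fit and complexity.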

5. Inferring Sequence Based Phylogenies by Maximum Likelihood or Bayesian Inference

  1. Download PhyML4 here:
    https://code.google.com/p/phyml/
  2. Launch the executable by double clicking the appropriate application (i.e. phyml Windows, phyml Linux, etc.) and the interface window will pop up (Figure 9).
  3. Load the input sequence as a PHYLIP formatted sequence by typing:
    >"file name".phy
    Note: To convert between sequence formats, use the “Readseq” web program available at http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi.
  4. Launch the program by typing “Y”.
  5. Download MrBayes5 here:
    http://mrbayes.sourceforge.net/download.php
  6. To start the program, click on the executable file and read NEXUS formatted sequence data into the program by typing:
    >execute "file name".nex
  7. Set the evolutionary model.
  8. Set the number of generations to run and the burn-in by typing:
    >mcmcp ngen=1000000 [this sets the number of generations to 1000000]
    >sump burnin=10000 [this sets the burnin to 10000]
  9. Save the branch lengths in the results file by typing:
    >mcmcp savebrlens = yes 
  10. Run the analysis by typing:
    >mcmc
  11. Summarize the trees using the “sumt” command.
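
As an alternative to a web converter, an alignment held in memory can be written out in PHYLIP format with a few lines of Python. The sketch below assumes the sequential PHYLIP layout (a header line with the number of sequences and sites, then one line per sequence) and pads or truncates names to 10 characters. The function name `to_phylip` is hypothetical, and it is worth checking the output against the documentation of the program that will read it.

```python
def to_phylip(aligned: dict, name_width: int = 10) -> str:
    """Format {name: aligned sequence} as sequential PHYLIP text.

    Names are truncated or padded to name_width characters, with spaces in
    names replaced by underscores (assumptions of this sketch).
    """
    if not aligned:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_sites = lengths.pop()

    lines = [f" {len(aligned)} {n_sites}"]
    seen = set()
    for name, seq in aligned.items():
        short = name.replace(" ", "_")[:name_width]
        if short in seen:
            # Truncation can make two different names collide.
            raise ValueError(f"name is not unique after truncation: {short}")
        seen.add(short)
        lines.append(f"{short.ljust(name_width)} {seq}")
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file ending in `.phy` gives an input of the kind requested in step 5.3.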

6. Visualizing Phylogenies

  1. View a list of tree viewer programs here:
    http://www.treedyn.org/overview/editors.html
  2. Download the TreeView14 program here:
    http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
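
Tree-building programs commonly save trees as Newick text, which viewers then draw. To show what that text encodes, the Python sketch below parses a simple Newick string into nested tuples and lists the taxa at the tips. It is a minimal illustration that assumes unquoted labels and no bracketed comments, so it is not a substitute for a full tree library; the function names are hypothetical.

```python
def parse_newick(text: str):
    """Parse a simple Newick string into (name, branch_length, children) tuples."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
        # Optional node label.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos].strip()
        # Optional branch length after a colon.
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = parse_node()
    if pos != len(text) - 1:
        raise ValueError("unexpected text after the tree")
    return root


def leaf_names(node) -> list:
    """Return tip labels in left-to-right order."""
    name, _, children = node
    if not children:
        return [name]
    leaves = []
    for child in children:
        leaves.extend(leaf_names(child))
    return leaves


tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
tips = leaf_names(tree)
```

In this example A and B are grouped together on an internal branch of length 0.05, with C as their sister, which is the relationship a tree viewer would draw.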

Results

Finding similarities to a query allows researchers to ascribe a potential identity to new sequences and also infer relationships between sequences. The file input type for BLAST1 is FASTA formatted text sequence or GenBank accession number. FASTA formatted sequence begins with a description line indicated by a “>” sign (Figure 2). The description must follow immediately after the “>” sign, the sequence (i.e. nucleotides or amino acids) follow the descri...

Discussion

Our hope for this article is that it will serve as a starting point to guide researchers or students that are new to phylogenetics. Genome sequencing projects have become less expensive over the last few years and as a consequence the user demand for this technology is increasing, and now the production of large sequence datasets is commonplace in small labs. These datasets often provide researchers with sets of genes that require a phylogenetic framework to begin to understand their function. Furthermore, because phylog...

Disclosures

We have nothing to disclose. 

Acknowledgements

We thank members of the O’Halloran lab for comments on the manuscript. We thank The George Washington University Department of Biological Sciences and Columbian College of Arts and Sciences for funding to D. O’Halloran.

Materials

Name                          Comments
BLAST webpage                 http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST executables             ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Preformatted BLAST databases  ftp://ftp.ncbi.nlm.nih.gov/blast/db/
Clustal                       http://www.clustal.org/
Kalign                        http://msa.sbc.su.se/cgi-bin/msa.cgi
MAFFT                         http://mafft.cbrc.jp/alignment/software/
MUSCLE                        http://www.drive5.com/muscle/
T-Coffee                      http://www.tcoffee.org/Projects/tcoffee/
PROBCONS                      http://toolkit.tuebingen.mpg.de/probcons
Se-Al                         http://tree.bio.ed.ac.uk/software/seal/
BSEdit                        http://www.bsedit.org/
JalView                       http://www.jalview.org/
SeaView                       http://pbil.univ-lyon1.fr/software/seaview.html
ProtTest                      https://code.google.com/p/prottest3/
Java Runtime                  http://www.java.com/en/download/chrome.jsp
Readseq                       http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi
jModelTest                    https://code.google.com/p/jmodeltest2/
PhyML                         https://code.google.com/p/phyml/
MrBayes                       http://mrbayes.sourceforge.net/download.php
TreeView                      http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
TreeDyn                       http://www.treedyn.org/

References

  1. Altschul, S. F., Carroll, R. J., Lipman, D. J. Weights for data related by a tree. J. Mol. Biol. 207 (4), 647-653 (1989).
  2. Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 19 (6), 706-723 (1974).
  3. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6 (2), 461-464 (1978).
  4. Guindon, S., Gascuel, O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52 (5), 696-704 (2003).
  5. Huelsenbeck, J. P., Ronquist, F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 17 (8), 754-755 (2001).
  6. Thompson, J. D., Higgins, D. G., Gibson, T. J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22 (22), 4673-4680 (1994).
  7. Lassmann, T., Sonnhammer, E. L. Kalign--an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics. 6, 298 (2005).
  8. Katoh, K., Kuma, K., Toh, H., Miyata, T. MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33 (2), 511-518 (2005).
  9. Katoh, K., Misawa, K., Kuma, K., Miyata, T. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30 (14), 3059-3066 (2002).
  10. Edgar, R. C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32 (5), 1792-1797 (2004).
  11. Notredame, C., Higgins, D. G., Heringa, J. T-coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302 (1), 205-217 (2000).
  12. Do, C. B., Mahabhashyam, M. S., Brudno, M., Batzoglou, S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 15 (2), 330-340 (2005).
  13. Darriba, D., Taboada, G. L., Doallo, R., Posada, D. ProtTest 3: Fast selection of best-fit models of protein evolution. Bioinformatics. 27 (8), 1164-1165 (2011).
  14. Page, R. D. TreeView: An application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12 (4), 357-358 (1996).
  15. Darriba, D., Taboada, G. L., Doallo, R., Posada, D. jModelTest 2: More models, new heuristics and parallel computing. Nat. Methods. 9 (8), 772 (2012).
  16. Chevenet, F., Brun, C., Banuls, A. L., Jacq, B., Christen, R. TreeDyn: Towards dynamic graphics and annotations for analyses of trees. BMC Bioinformatics. 7, 439 (2006).

Copyright © 2025 MyJoVE Corporation. All rights reserved.