Using Phylogenetic Analysis to Investigate Eukaryotic Gene Origin

Dechun Zhang; Xianzhao Kan; Sarah Elizabeth Huss; Lan Jiang; Li-Qing Chen; Yibing Hu

doi:10.3791/56684

A subscription to JoVE is required to view this content. Sign in or start your free trial.

Summary

A method of constructing a phylogenetic tree based on sequence homology of SWEETs from eukaryotes and SemiSWEETs from prokaryotes is described. Phylogenetic analysis is a useful tool for explaining the evolutionary relatedness between homologous proteins or genes from different organism groups.

Abstract

Phylogenetic analysis uses nucleotide or amino acid sequences or other parameters, such as domain sequences and three-dimensional structure, to construct a tree to show the evolutionary relationship among different taxa (classification units) at the molecular level. Phylogenetic analysis can also be used to investigate domain relationships within an individual taxon, particularly for organisms that have undergone substantial change in morphology and physiology, but for which researchers lack fossil evidence due to the organisms' long evolutionary history or scarcity of fossilization.

In this text, a detailed protocol is described for using the phylogenetic method, including amino acid sequence alignment using Clustal Omega, and subsequent phylogenetic tree construction using both Maximum Likelihood (ML) of Molecular Evolutionary Genetics Analysis (MEGA) and Bayesian Inference via MrBayes. To investigate the origin of eukaryotic Sugars Will Eventually be Exported Transporters (SWEET) genes, 228 SWEETs including 35 SWEET proteins from unicellular eukaryotes and 57 SemiSWEET proteins from prokaryotes were analyzed. Interestingly, SemiSWEETs were found in prokaryotes, but SWEETs were found in eukaryotes. Two phylogenetic trees constructed using theoretically distinct methods have consistently suggested that the first eukaryotic SWEET gene might stem from the fusion of a bacterial SemiSWEET gene and an archaeal SemiSWEET gene. It is worth noting that one should be cautious to draw a conclusion based only on phylogenetic analysis, although it is useful to explain the underlying relationship between different taxa, which is difficult or even impossible to discern through experimental means.

Introduction

DNA or RNA sequences carry genetic information for underlying phenotypes that can be analyzed through physiological and biochemical methods or observed through morphological and fossil evidence. In a sense, genetic information is more reliable than evaluating external phenotypes because the former is the basis for the latter. In evolutionary study, fossil evidence is very direct and convincing. However, many organisms, such as microorganisms, have little chance to form a fossil during long geologic ages. Therefore, molecular information such as nucleotide sequences and amino acid sequences from related extant organisms are of value for exploring evolutionary relationships¹. In the present study, a simple introduction of basic phylogenetic knowledge and an easy-to-learn protocol was provided for newcomers who need to construct a phylogenetic tree on their own.

Both DNA (nucleotide) and protein (amino acid) sequences can be used to infer phylogenetic relationships between homologous genes, organelles, or even organisms². DNA sequences are more likely to be affected by changes during evolution. In contrast, amino acid sequences are much more stable given that synonymous mutations in nucleotide sequences do not cause mutations in amino acid sequences. As a result, DNA sequences are useful for comparison of homologous genes from closely related organisms, whereas amino acid sequences are appropriate for homologous genes from distantly related organisms³.

A phylogenetic analysis begins with the alignment of amino acid or nucleotide sequences⁴ retrieved from an annotated genome sequencing database⁵ listed in FASTA format, i.e., putative or expressed protein sequences, RNA sequences, or DNA sequences. It is worth noting that it is critical to collect high-quality sequences for the analysis, and only homologous sequences can be used to analyze phylogenetic relationships. Many different platforms such as Clustal W, Clustal X, Muscle, T-coffee, MAFFT, can be used for sequence alignment. The most widely used is Clustal Omega⁶^,⁷ (http://www.ebi.ac.uk/Tools/msa/clustalo/), which can be used online or can be downloaded free of charge. The alignment tool has many parameters that the user can adjust before starting the alignment, but the default parameters work well in most cases. After the process is complete, the aligned sequences should be saved in the correct format for the next step. They should then be edited or trimmed using an editing software, such as BioEdit, because phylogenetic tree construction by MEGA requires the sequences to be of equal length (including both amino acid abbreviations and hyphens. In the aligned sequence, any position without an amino acid or nucleotide is represented by a hyphen "-"). Generally, all of the protruding amino acids or nucleotides at either end of the alignment should be removed. In addition, columns containing poorly aligned sequences in the alignment can be deleted because they convey little valuable information, and can sometimes give confusing or false information³. The columns containing one or more hyphens can be deleted at this time or in the later tree construction stage. Alternatively, they can be used for phylogenetic computation. When the sequence alignment and trimming is finished, the aligned sequences should be saved in FASTA format, or the desired format, for later use.

Many software platforms provide tree construction functions using different methods or algorithms. In general, the methods can be classified as either distance matrix methods or discrete data methods. Distance matrix methods are simple and fast to compute, while discrete data methods are complicated and time-consuming. For very closely related taxa with a high degree of sharing of amino acid or nucleotide sequence identity, a distance matrix method (Neighbor Joining: NJ; Unweighted Pair Group Method with Arithmetic mean: UPGMA) is appropriate; for distantly related taxa, a discrete data method (Maximum Likelihood: ML; Maximum Parsimony: MP; Bayesian Inference) is optimal³^,⁸. In this study, the ML methods in MEGA (6.0.6) and Bayesian Inference (MrBayes 3.2) were applied to construct phylogenetic trees⁹. Ideally, when the proper model and parameters are used, the results derived from different methods may be consistent, and they are thus more reliable and convincing.

For a ML phylogenetic tree constructed using MEGA¹⁰, the aligned sequence file in FASTA format must be uploaded into the program. The first step then is to select the optimal substitution model for the uploaded data. All available substitution models are compared based on the uploaded sequences, and their final scores will be shown in a results table. Select the model with the smallest Bayesian Information Criterion (BIC) score (listed first in the table), set ML parameters according to the recommended model, and start the computation. The computation time varies from several minutes to several days, depending on the complexity of the loaded data (length of the sequences and number of taxa) and the performance of the computer on which the programs are run. When the computation is finished, a phylogenetic tree will be shown in a new window. Save the file as "FileName.mat". After setting parameters to specify the appearance of the tree, save once more. Using this method, MEGA can generate publication grade phylogenetic tree figures.

For tree construction with MrBayes¹¹, the first step is to transform the aligned sequence, which is usually listed in FASTA format, into nexus format (.nex as the file type). Transforming FASTA files into nexus format can be processed in MEGA. Next, the aligned sequence in nexus format can be uploaded into MrBayes. When the file is successfully uploaded, specify detailed parameters for the tree computation. These parameters include details such as amino acid substitution model, variation rates, chain number for Markov chain Monte Carlo (MCMC) coupling, ngen number, average standard deviation of split frequencies, and so on. After these parameters have been specified, start the computation. In the end, two tree figures in ASC II code, one showing clade credibility and the other showing branch lengths, will be displayed on the screen.

The tree result will be saved automatically as "FileName.nex.con". This tree file can be opened and edited by FigTree, and the figure displayed in FigTree can be modified further to make it more suitable for publication.

In this study, 228 SWEET proteins, including 35 SWEETs from unicellular eukaryotes and 57 SemiSWEETs from prokaryotes, were analyzed as an example. Both the SWEETs and SemiSWEETs were characterized as glucose, fructose, or sucrose transporters across membranes¹²^,¹³. Phylogenetic analysis suggests that the two MtN3/saliva domains containing SWEETs might be derived from an evolutionary fusion of a bacterial SemiSWEET and of an archaeon¹⁴.

Protocol

1. Sequence Alignment

Collect amino acid sequences of eukaryotic SWEET and prokaryotic SemiSWEET in separate documents and list them in FASTA format. Download sequences from the National Center for Biotechnology Information (NCBI), European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ) databases by similarity search with the Basic Local Alignment Search Tool (BLAST) tool.
1. In the example files, collect 228 putative SWEET protein sequences possessing two MtN3/saliva domains (7 transmembrane helices) of eukaryotes and 57 SemiSWEET protein sequences possessing a single MtN3/saliva domain (3 transmembrane helices) of prokaryotes¹³.
2. To simplify the process, select 35 candidate SWEET proteins from unicellular eukaryotic organisms among the 228 putative SWEETs for phylogenetic tree construction. These sequences are attached so that the reader may practice on a real data set.
Align the 35 SWEET sequences by inputting them into Clustal Omega (http://www.ebi.ac.uk/Tools/msa/clustalo/).
1. Copy and paste the protein sequences in FASTA format into the input box or upload a sequence file in FASTA format. Specify that they are amino acid sequence by clicking the icon under the pull-down menu in the 'STEP 1' section.
2. Specify output format and other parameters in the 'STEP 2' section, if necessary. For this study, set output format as "clustal w/o number", and leave the other parameters on default settings. In most cases, the default parameters work well without any specification.
Submit and run the alignment in the 'STEP 3' section. It may take anywhere from several seconds to minutes until the alignment is finished. In the "Result Summary" panel, right-click the link under the "Alignment in CLUSTAL format" and save the aligned sequences as "35.clustal" (Figure 1).
Open the alignment result file in BioEdit.
1. On the main panel of BioEdit, click "Sequence" and select "Edit Mood"in the first pull-down menu, then click "Edit Residues" in the sub-menu (Figure 2).
2. Select the protruding sequences on the left side of the alignment with the cursor (the selected sequence will be shown in black) and click the "Delete" icon under the "Edit" menu to remove the selected sequences (Figure 3).
3. Select and delete the protruding sequences on the right side of the first MtN3/saliva domain, and save the trimmed first MtN3/saliva domain sequences as 35-I.fas (Figure 4). Likewise, delete the left and right side protruding sequences of the second MtN3/saliva domain and save it as 35-II.fas. The first and the second MtN3/saliva domain sequences can be predicted with RHYTHM (http://proteinformatics.charite.de/rhythm/inndex.php?site=helix) or TMHMM (http://www.cbs.dtu.dk/services/TMHMM/) in advance.
Open the file 35-I.fas with MEGA, and click "align" when prompted. Under the "Edit" menu, click "Select All", then click "Select Sequence(s)"; the names and sequences of the taxa will be selected in black (Figure 5).
1. Choose "Copy" from the "Edit" menu to copy the sequences onto the clipboard, and then paste the copied sequences into a doc file.
2. In the doc file, replace all "#" with ">", and then delete any unrelated characters to convert them to FASTA format. Add "-I" at the end of each taxon name to mark them as the first MtN3/saliva domain sequences. Process the second MtN3/saliva domain sequence following the same method and add "-II" after each taxon name.
Combine the first and second MtN3/saliva domain sequences in FASTA format in a doc file.
1. Load the combined sequences into Clustal Omega again and align the sequences as described above. Save the result as "35 realigned.clustal".
2. Open the "35 realigned.clustal" file in BioEdit, delete the uneven (protruding) amino acid residues at either end of the aligned sequences, and then save the sequences as "35 realigned.fas". Click "Yes" when warned that some non-standard characters cannot be saved.

2. Computation of the Phylogenetic Tree

Open "35 realigned.fas" in MEGA.
1. Click the "Data" menu and choose "Export Alignment", and save the alignment in PAUP format (nexus) as "35.nex" for later use in MrBayes (Figure 6).
2. Meanwhile, click the "Models" icon on the main panel of MEGA, choose "Find Best DNA/Protein Models (ML)", and click "OK" on the pop-up window. Click "Compute" to begin the model searching process (Figure 7). A new progress panel will open; this process lasts several minutes to several days, depending on the complexity of the loaded sequences and the computer's performance.
  NOTE: A table showing the results will open after the model searching process is finished ( Figure 8). The smallest BIC score will be listed first, followed by a series of different models with gradually increasing BIC scores. The first model "LG+G+F" with the smallest BIC score is the recommended model for ML tree based on the "35 realigned.fas" file.
Click the "Phylogeny" icon on the main panel of MEGA, click "Construct/Test the Maximum Likelihood Tree", and then click "Yes" on the pop-up panel. A new window will open showing different parameters that need to be specified (Figure 9).
1. First, set the bootstrap value in the test of the phylogeny box; 500 or 1,000 is adequate in most cases. Under the substitution model, choose "amino acid" as the substitution type. The purpose of choosing a substitution model is to estimate the true difference between sequences based on their present states³.
2. Select "LG with Freqs. (+F) model" (LG+F) in the model/method box. In the rates and pattern box, select "Gamma Distributed" (G) to describe rate variations across sites, i.e., giving more weight to changes at slowly evolving sites³. In the data subset box, select "Complete deletion" to remove all of the columns containing hyphens.
3. Keep all other parameters in their default states (Figure 9). After specification of these parameters, click the "Compute" icon to start the calculation.

3. Presentation of the Phylogenetic Tree

NOTE: A phylogenetic ML tree will be presented when the computation using MEGA is finished (Figure 10).

Under the pull-down menu of the "File" icon on the tree panel, choose "Save Current Session" to save the result (.mas is the default file type). In the present study, the result was saved as "35.mas". On the tree panel, many parameters including length of clade, tree style, tree topology, font of the taxon name, size, and color, are displayed and can be set to different options.
Save the final tree file by clicking the image icon, and save the figure in different formats or copy the image as the source for photo-editing.

4. Analysis of the Relationship of SWEETs and SemiSWEETs Using Sequence Alignment

NOTE: This step may not be needed in ordinary sequence analysis.

Align the 228 eukaryotic SWEETs and 57 prokaryotic SemiSWEETs in Clustal Omega as described above. The alignment results can be shown in Jalview, which is integrated in Clustal Omega, and copied to save in a photo editor (Figure 11).
NOTE: In the example alignment, some SemiSWEETs from α-Proteobacteria are aligned with the first MtN3/saliva domain of the SWEET sequences, whereas SemiSWEETs from Methanobacteria (archaea) are aligned with the second MtN3/saliva domain of the SWEET sequences.

5. Phylogenetic Tree Construction with MrBayes

For Bayesian Inferences with MrBayes, open the MrBayes executable file and a DOS interface will come up in a new window. The first step is to read the nexus data ﬁle. Input "execute 35.nex" after the prompt (remember to save the 35.nex file in the same directory of the MrBayes executable file, or point out the pathway of the file before uploading it). A "successful read matrix" message will be shown following the last of the listed taxa (Figure 12). The 35.nex file has already been prepared and saved in MEGA (see 2.1 above).
Set the evolutionary model.
1. After the prompt, type "prset aamodelpr = fixed(lg); lset rates = g". The "lg" and "g" correspond to the "LG" and "G" model which is set in MEGA. After successfully setting the model, type "mcmc nchains = 4 ngen = 5,000,000" after the prompt. Use of the "nchains=4" entry signifies a total number of one cold chain and three hot chains for Metropolis coupling. "ngen = 5,000,000" means to run 5,000,000 generations of Metropolis coupling for convergence of the cold and hot chains. In this study, average standard deviation of split frequencies below 0.01 was regarded as convergence of the hot and cold chains.
2. Note that the ngen number cannot be predicted accurately at the beginning of the process, and usually needs to be adjusted based on the change in the average standard deviation of split frequencies. In addition, the ngen number for convergence may be different each time when running the program based on the same data.
Run the analysis: This step lasts from several minutes to several days, depending on the complexity of input data and the performance of the computer. After completing the preset computation, a prompt will ask "Continue with analysis (yes/no)?" If "no" is typed after the prompt, the computing will stop (Figure 13), otherwise it will continue to compute after the number of further generations is input. When the computation is finished (with an average standard deviation of split frequencies <0.01 or 0.05), stop the computation by typing "no" after the inquiry prompt.
NOTE: 0.01 is a strict criterion, 0.05 is moderate and usually adequate.
Summarize the samples: Type "sump" after the prompt to summarize samples of model parameters (Figure 14). Then type "sumt relburnin=yes burninfrac=0.25" after the prompt to summarize tree samples. Detailed information about phylogenetic tree construction will be displayed as in Figure 15, followed by two tree figures that will appear in ASC II code on the screen, one showing clade credibility and the other showing branch lengths. At the same time, a tree file with the name of "35.nex.con" will be saved automatically.
For a better presentation of the phylogenetic tree, open the "35.nex.con" tree file with the FigTree tool (http://tree.bio.ed.ac.uk/software/figtree/), select a style or size to display the result (Figure 16), or even edit it in a photo editor to make it more reader-friendly.

Results

Phylogenetic trees show that all of the first MtN3/saliva domains of the 35 SWEET sequences clustered as one clade and the second MtN3/saliva domains of the SWEET sequences clustered as another clade. In addition, alignment results of the SWEETs and SemiSWEETs show that some SemiSWEETs from α-Proteobacteria aligned with the first MtN3/saliva domain of the SWEET sequences, whereas SemiSWEETs from Methanobacteria (archaea) aligned with the second MtN3/saliva domain of the SWEET sequenc...

Discussion

It is becoming increasingly popular in biological research to make a phylogenetic tree based on nucleotide or amino acid sequences⁸. Generally, there are three critical stages of the practice including sequence alignment, evaluation of the aligned sequences with the proper method or algorithm, and visualization of the computational result as a phylogenetic tree. In the presented study, three rounds of sequence alignment were conducted: first, the SWEET protein sequences, including the first a...

Disclosures

The authors have nothing to disclose.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (31371596), the Bio-technology Research Center, China Three Gorges University (2016KBC04), and the Natural Science Foundation of Jiangsu Province, China (BK20151424).

Materials

Name	Company	Catalog Number	Comments
Adobe Illustration			a graphical tool developed by Adobe Systems Software Ireland Ltd. Copyright © 2017
BioEdit			a biological sequence alignment editor written for Windows 95/98/NT/2000/XP/7. Copyright © Tom Hall
Clustal Omega			a package for making multiple sequence alignments of amino acid or nucleotide sequences. http://www.clustal.org/
CorelDRAW			a graphic design software. Copyright © 2017 Corel Corporation
FigTree			a graphical viewer of phylogenetic trees designed by the University of Edinburgh
MEGA			MolecularEvolutionary Genetics Analysis version6.0 http://www.megasoftware.net/home
MrBayes			an Bayesian phylogenetic inference tool
NVIDIA			a company designs graphics processing units (GPUs) for the gaming and professional markets. Corporation Copyright © 2017
PAUP			Phylogenetic Analysis Using Parsimony. David Swofford's program implements the maximum likelihood method under a number of nucleotide models.
Photoshop			a raster graphics editor developed and published by Adobe Systems Software Ireland Ltd. Copyright © 2017
RHYTHM			a knowledge based prediction of hekix contacts. Charité Berlin – Protein Formatics Group - Copyright 2007-2009
TMHMM			a tool for prediction of transmembrane helices in proteins. http://www.cbs.dtu.dk/services/TMHMM/
Compter			4 GB memory, Core 2 or above CPU. Windows 7, Windows 10

References

Nei, M., Kumar, S. . Molecular Evolution and Phylogenetics. , (2000).
Foth, B. J. Phylogenetic analysis to uncover organellar origins of nuclear-encoded genes. Methods Mol Biol. 390, 467-488 (2007).
Baldauf, S. L. Phylogeny for the faint of heart: a tutorial. Trends Genet. 19, 345-351 (2003).
Feng, D. F., Doolittle, R. F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 25, 351-360 (1987).
Persson, B. Bioinformatics in protein analysis. EXS. 88, 215-231 (2000).
Sievers, F., et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 7, 539 (2011).
Sievers, F., Higgins, D. G. Clustal omega. Curr Protoc Bioinformatics. 48, 1-16 (2014).
Yang, Z., Rannala, B. Molecular phylogenetics: principles and practice. Nat Rev Genet. 13, 303-314 (2012).
Hall, B. G. Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol Biol Evol. 22, 792-802 (2005).
Tamura, K., Stecher, G., Peterson, D., Filipski, A., Kumar, S. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol. 30, 2725-2729 (2013).
Ronquist, F., et al. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 61, 539-542 (2012).
Chen, L. Q., et al. Sugar transporters for intercellular exchange and nutrition of pathogens. Nature. 468, 527-532 (2010).
Xuan, Y., et al. Functional role of oligomerization for bacterial and plant SWEET sugar transporter family. Proc Natl Acad Sci USA. 110, 3685-3694 (2013).
Hu, Y., et al. Phylogenetic evidence for a fusion of archaeal and bacterial SemiSWEETs to form eukaryotic SWEETs and identification of SWEET hexose transporters in the amphibian chytrid pathogen Batrachochytrium dendrobatidis. FASEB J. 30, 3644-3654 (2016).
Holder, M. T., Zwickl, D. J., Dessimoz, C. Evaluating the robustness of phylogenetic methods to among-site variability in substitution processes. Philos Trans R Soc Lond B Biol Sci. 363, 4013-4021 (2008).
Alfaro, M. E., Holder, M. T. The Posterior and the Prior in Bayesian Phylogenetics. Annu Rev Ecol Evol Syst. 37, 19-42 (2006).
Suchard, M., Rambaut, A. Many-core algorithms for statistical phylogenetics. Bioinformatics. 25, 1370-1376 (2009).
Zierke, S., Bakos, J. FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods. BMC Bioinformatics. 11, 184 (2010).

Reprints and Permissions

Request permission to reuse the text or figures of this JoVE article

Request Permission

Explore More Articles

Phylogenetic Analysis Eukaryotic Gene Origin Sequence Alignment Clustal Omega SWEET Genes BioEdit MEGA MtN3 Saliva Domain Protein Sequences FASTA Format Sequence Trimming

This article has been published

Video Coming Soon

Keep me updated: