Mol Biol Evol. 2010 Mar;27(3):552-69.
doi: 10.1093/molbev/msp250. Epub 2009 Oct 15.

Fast and consistent estimation of species trees using supermatrix rooted triples

Michael DeGiorgio et al. Mol Biol Evol. 2010 Mar.

Abstract

Concatenated sequence alignments are often used to infer species-level relationships. Previous studies have shown that analysis of concatenated data using maximum likelihood (ML) can produce misleading results when loci have differing gene tree topologies due to incomplete lineage sorting. Here, we develop a polynomial time method that utilizes the modified mincut supertree algorithm to construct an estimated species tree from inferred rooted triples of concatenated alignments. We term this method SuperMatrix Rooted Triple (SMRT) and use the notation SMRT-ML when rooted triples are inferred by ML. We use simulations to investigate the performance of SMRT-ML under Jukes-Cantor and general time-reversible substitution models for four- and five-taxon species trees and also apply the method to an empirical data set of yeast genes. We find that SMRT-ML converges to the correct species tree in many cases in which ML on the full concatenated data set fails to do so. SMRT-ML can be conservative in that its output tree is often partially unresolved for problematic clades. We show analytically that when the species tree is clocklike and mutations occur under the Cavender-Farris-Neyman substitution model, as the number of genes increases, SMRT-ML is increasingly likely to infer the correct species tree even when the most likely gene tree does not match the species tree. SMRT-ML is therefore a computationally efficient and statistically consistent estimator of the species tree when gene trees are distributed according to the multispecies coalescent model.
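
As a rough illustration of the two steps the abstract describes (splitting a concatenated alignment into three-taxon alignments, then assembling the inferred rooted triples into one tree), the sketch below may help. It is not the authors' implementation: it substitutes the classic BUILD procedure for combining rooted triples in place of the modified mincut supertree algorithm that SMRT uses. BUILD only returns a tree when the triples are mutually compatible, whereas modified mincut can also handle conflicting triples. The function names, the dictionary alignment format, the toy sequences, and the hand-written triples are all assumptions for illustration; the ML inference of each rooted triple is not reproduced here.

```python
from itertools import combinations
from math import comb


def three_taxon_subalignments(alignment):
    """Yield (taxa, sub-alignment) for every set of three taxa.

    `alignment` is a hypothetical format: {taxon_name: concatenated_sequence}.
    """
    for trio in combinations(sorted(alignment), 3):
        yield trio, {name: alignment[name] for name in trio}


def build(taxa, triples):
    """Combine rooted triples with BUILD (a stand-in for modified mincut).

    Each triple is ((a, b), c), meaning a and b are closer to each other
    than either is to c. Returns a leaf name, a nested tuple of subtrees,
    or None if the triples are incompatible.
    """
    taxa = sorted(taxa)
    if len(taxa) == 1:
        return taxa[0]

    # Union-find over the taxa: join a and b for every triple ab|c
    # whose three taxa are all in the current set.
    parent = {t: t for t in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    taxon_set = set(taxa)
    relevant = [((a, b), c) for (a, b), c in triples if {a, b, c} <= taxon_set]
    for (a, b), _ in relevant:
        parent[find(a)] = find(b)

    components = {}
    for t in taxa:
        components.setdefault(find(t), []).append(t)
    if len(components) == 1:
        return None  # no split is possible: the triples conflict

    subtrees = []
    for members in components.values():
        subtree = build(members, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


# Toy concatenated alignment (made-up sequences, for the enumeration only).
alignment = {"A": "ACGTAC", "B": "ACGTAA", "C": "ACGGAA", "D": "ATGGCA"}
trios = [taxa for taxa, _ in three_taxon_subalignments(alignment)]
print(len(trios), comb(len(alignment), 3))  # 4 4

# Hand-written rooted triples consistent with (((AB)C)D).
triples = [(("A", "B"), "C"), (("A", "B"), "D"),
           (("A", "C"), "D"), (("B", "C"), "D")]
print(build(alignment.keys(), triples))  # ((('A', 'B'), 'C'), 'D')
```

When too few triples are supplied, BUILD returns a multifurcating node rather than guessing, which loosely mirrors the partially unresolved output the abstract mentions; for example, `build(["A", "B", "C"], [])` returns `('A', 'B', 'C')`.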

Figures

FIG. 1.
Four- and five-taxon clocklike species tree topologies. (A, B) Four-taxon species tree topologies with branch lengths x, y, and z. (C–E) Five-taxon species tree topologies with branch lengths w, x, y, and z. Branch lengths are in coalescent time units t/(2N_e), where t is the time in generations and N_e is the effective population size. For all simulations, we let z = 1.
FIG. 2.
Schematic of our simulation procedure. First, an n-taxon species tree is chosen with branch lengths, which is fed through COAL (Degnan and Salter 2005) to produce a set of n-taxon gene trees simulated under this species tree. Seq-Gen (Rambaut and Grassly 1997) is then used to create alignments of n species based on the gene trees, which are linked to create a single concatenated alignment. The concatenated alignment is analyzed under maximum likelihood (SM-ML) with PAUP* (Swofford 2003) to infer a species tree. The concatenated alignment is also broken into all $\binom{n}{3}$ alignments of three species, which are then fed through PAUP* to infer a total of $\binom{n}{3}$ rooted triples. These rooted triples are used as input to supertree (Page 2002) to infer a species tree (SMRT-ML). The dashed gray box represents the part of the procedure that is SMRT-ML.
FIG. 3.
Results of simulations for the four-taxon tree (((AB)C)D) (fig. 1A) generated under a JC model with θ = 0.01 and a molecular clock and analyzed under ML assuming a molecular clock and a JC model. (A–H) SM-ML (resimulated from Kubatko and Degnan 2007); (I–P) SMRT-ML. Data for each combination of branch lengths and number of loci were generated from 300 independent simulations.
FIG. 4.
Results of simulations for the four-taxon tree ((AB)(CD)) (fig. 1B) generated under a JC model with θ = 0.01 and a molecular clock and analyzed under ML assuming a molecular clock and a JC model. (A–H) SM-ML; (I–P) SMRT-ML. Data for each combination of branch lengths and number of loci were generated from 300 independent simulations.
FIG. 5.
Results of simulations for the five-taxon tree ((((AB)C)D)E) (fig. 1C) generated under a JC model with θ = 0.01 and a molecular clock and analyzed under ML assuming a molecular clock and a JC model. (A–E) SM-ML; (E–H) SMRT-ML. Data for each combination of branch lengths and number of loci were generated from 300 independent simulations.
FIG. 6.
Results of simulations for the five-taxon tree (((AB)C)(DE)) (fig. 1D) generated under a JC model with θ = 0.01 and a molecular clock and analyzed under ML assuming a molecular clock and a JC model. (A–E) SM-ML; (E–H) SMRT-ML. Data for each combination of branch lengths and number of loci were generated from 300 independent simulations.
FIG. 7.
Results of simulations for the five-taxon tree (((AB)(CD))E) (fig. 1E) generated under a JC model with θ = 0.01 and a molecular clock and analyzed under ML assuming a molecular clock and a JC model. (A–E) SM-ML; (E–H) SMRT-ML. Data for each combination of branch lengths and number of loci were generated from 300 independent simulations.
FIG. 8.
Results of simulations for the four-taxon tree (((AB)C)D) (fig. 1A) generated under a JC model with θ = 0.01 and a violation of the molecular clock and analyzed under ML assuming a molecular clock and a JC model. (A–H) SM-ML; (I–P) SMRT-ML. Data for each combination of branch lengths and number of loci were generated from 300 independent simulations.
FIG. 9.
A three-taxon gene tree within a model species tree with notation used in the paper. In all cases, the species tree has the topology ((AB)C). Dots represent coalescent events. (A) and (B) depict the same gene tree topology with different coalescent histories. The gene tree in (C) has the ((AC)B) topology; the gene tree in (D) has the ((BC)A) topology.
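
For readers who want numbers behind the three-taxon case in fig. 9, the sketch below evaluates the standard multispecies coalescent probabilities of the three rooted gene tree topologies when the species tree is ((AB)C). It relies on the well-known result that the matching topology has probability 1 - (2/3)e^(-x) and each discordant topology has probability (1/3)e^(-x), where x is the internal branch length in coalescent units t/(2N_e). This formula is textbook background rather than something stated on this page, and the function and parameter names are hypothetical.

```python
from math import exp


def coalescent_units(generations, effective_size):
    """Convert a time in generations to coalescent units t / (2 * N_e)."""
    return generations / (2 * effective_size)


def triple_gene_tree_probs(internal_branch):
    """Gene tree topology probabilities for the species tree ((AB)C).

    `internal_branch` is the length, in coalescent units, of the branch
    ancestral to A and B only (an assumed parameter name).
    """
    discordant = exp(-internal_branch) / 3
    return {
        "((AB)C)": 1 - 2 * discordant,  # matches the species tree
        "((AC)B)": discordant,
        "((BC)A)": discordant,
    }


# Example: 20,000 generations with N_e = 10,000 is one coalescent unit.
x = coalescent_units(20_000, 10_000)
for topology, prob in triple_gene_tree_probs(x).items():
    print(topology, round(prob, 4))
```

With three taxa the matching topology is never less probable than either discordant one, so the rooted triple is a natural unit to infer; as the internal branch shrinks toward zero, all three topologies approach probability 1/3.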
