Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2010 Feb;27(2):249-65.
doi: 10.1093/molbev/msp228. Epub 2009 Sep 25.

Rapid likelihood analysis on large phylogenies using partial sampling of substitution histories

Affiliations

Rapid likelihood analysis on large phylogenies using partial sampling of substitution histories

A P Jason de Koning et al. Mol Biol Evol. 2010 Feb.

Abstract

Likelihood-based approaches can reconstruct evolutionary processes in greater detail and with better precision from larger data sets. The extremely large comparative genomic data sets that are now being generated thus create new opportunities for understanding molecular evolution, but analysis of such large quantities of data poses escalating computational challenges. Recently developed Markov chain Monte Carlo methods that augment substitution histories are a promising approach to alleviate these computational costs. We analyzed the computational costs of several such approaches, considering how they scale with model and data set complexity. This provided a theoretical framework to understand the most important computational bottlenecks, leading us to combine novel variations of our conditional pathway integration approach with recent advances made by others. The resulting technique ("partial sampling" of substitution histories) is considerably faster than all other approaches we considered. It is accurate, simple to implement, and scales exceptionally well with dimensions of model complexity and data set size. In particular, the time complexity of sampling unobserved substitution histories using the new method is much faster than previously existing methods, and model parameter and branch length updates are independent of data set size. We compared the performance of methods on a 224-taxon set of mammalian cytochrome-b sequences. For a simple nucleotide substitution model, partial sampling was at least 10 times faster than the PhyloBayes program, which samples substitutions in continuous time, and about 100 times faster than when using fully integrated substitution histories. Under a general reversible model of amino acid substitution, the partial sampling method was 1,600 times faster than when using fully integrated substitution histories, confirming significantly improved scaling with model state-space complexity. Partial sampling of substitutions thus dramatically improves the utility of likelihood approaches for analyzing complex evolutionary processes on large data sets.

PubMed Disclaimer

Figures

F<sc>IG</sc>. 1
FIG. 1
Strategies for sampling substitution histories. (a) Substitutions are fully sampled in continuous time using a method such as Nielsen (2002) or Rodrigue et al.(2008). (b) Substitutions sampled in discrete time over short-branch segments of equal length (Hwang and Green 2004). (c) Substitutions partially sampled in continuous time to within short time intervals of different lengths using B1 CP intregration (B1 × M). (d) Substitutions partially sampled in continuous time as ancestral and transient states at fixed points that bisect long branches; the timing of substitutions between nodes is analytically integrated using uniformized B1 CP integration (B1u). For both (b,c), branch lengths are rounded to the nearest small segment size.
F<sc>IG</sc>. 2
FIG. 2
Validity and accuracy of methods assuming one substitution per branch segment. (a) The probability of more than one substitution per site with increasing branch segment size, calculated from the fully time-integrated Markov process. The mean ± standard deviation over 50 random GTR rate matrices is displayed for each segment size. (b) Approximate asymptotic expected error in rate parameters. Using the optimal uniformization constant, B1u is slightly more accurate than the standard B1 method (Krishnan et al. 2004), although both derivations give a similar result. Both B1 and B1u are much more accurate than HG, although they incur little additional computational expense.
F<sc>IG</sc>. 3
FIG. 3
Time to analyze 224 mammalian cytochrome-b DNA sequences. (a) Time to complete 100,000 generations of MCMC. GTR rate parameters were updated 99.8% of the time, and ancestral/transient states the remaining 0.2%. Speeds varied across two orders of magnitude, despite the simplicity of the model. (b) Time to finish 100 complete ancestral/transient state updates. (c) Time to complete 10,000 rate matrix updates. Methods marked with a dagger (†) are introduced in this manuscript.
F<sc>IG</sc>. 4
FIG. 4
Posterior parameter distributions inferred using fully integrated probability calculations and B1u. MCMC analyses were run for 100,000 generations, with the first 5,000 excluded as burn-in. (a) Fully integrated probability calculations based on the 224 taxon data set of cytochrome-b sequences; (b) Partially integrated calculations using B1u and the 224 taxon data set; (c) Fully integrated probability calculations on the pruned 10 taxon data set; (d) Partially integrated calculations using B1u and the pruned 10 taxon data set.
F<sc>IG</sc>. 5
FIG. 5
Predicted speed increase of B1u compared with full integration of ancestral states (left) and sampling of substitutions in continuous time (right). The effect of increasing model complexity is shown for increasing numbers of branches. Predictions are based on the approximate computational time complexity (table 1), calculated assuming 99.8% rate parameter updates requiring a likelihood function evaluation, and 0.2% missing data updates. Calculations were made assuming 1,000 sites, and the method of (Rodrigue et al. 2008) for sampling substitutions in continuous time with a maximum number of substitutions per branch set to 2. The number of effective branches was set to 1.1 times b, which was the observed value from the 224 taxon cytochrome-b data set using a maximum segment size of 0.08. (a) Predicted speed increase for reversible models. (b) Predicted speed increase for nonreversible models.
F<sc>IG</sc>. 6
FIG. 6
Comparison of ML support for 10 different tree topologies using partial sampling (B1u) and full integration of substitution histories (Full). (a) Support for each numbered tree topology. (b) Ten topologies and branch lengths were selected by bootstrapping the original 224 cytochrome-b data set and building a fast Neighbor-Joining tree. Full-sized tree diagrams are available for online viewing in the Supplementary Material online.

References

    1. Cormen TH. Introduction to algorithms. Cambridge (MA): The MIT Press; 2009.
    1. de Koning APJ, Gu W, Pollock DD. Ancestral sequence reconstruction in primate mitochondrial DNA: compositional bias and effect on functional inference (erratum) Mol Biol Evol. 2009;26:481. - PubMed
    1. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–376. - PubMed
    1. Gelman A, Rubin DB, Carlin JB, Stern HS. Bayesian data analysis. London, New York: Chapman and Hall; 1992.
    1. Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Machine Intel. 1984;6:721–741. - PubMed

Publication types