Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2015 May;64(3):472-91.
doi: 10.1093/sysbio/syv006. Epub 2015 Jan 27.

Quantifying MCMC exploration of phylogenetic tree space

Affiliations

Quantifying MCMC exploration of phylogenetic tree space

Chris Whidden et al. Syst Biol. 2015 May.

Abstract

In order to gain an understanding of the effectiveness of phylogenetic Markov chain Monte Carlo (MCMC), it is important to understand how quickly the empirical distribution of the MCMC converges to the posterior distribution. In this article, we investigate this problem on phylogenetic tree topologies with a metric that is especially well suited to the task: the subtree prune-and-regraft (SPR) metric. This metric directly corresponds to the minimum number of MCMC rearrangements required to move between trees in common phylogenetic MCMC implementations. We develop a novel graph-based approach to analyze tree posteriors and find that the SPR metric is much more informative than simpler metrics that are unrelated to MCMC moves. In doing so, we show conclusively that topological peaks do occur in Bayesian phylogenetic posteriors from real data sets as sampled with standard MCMC approaches, investigate the efficiency of Metropolis-coupled MCMC (MCMCMC) in traversing the valleys between peaks, and show that conditional clade distribution (CCD) can have systematic problems when there are multiple peaks.

Keywords: Markov chain Monte Carlo; phylogenetic methods; subtree prune-and-regraft; topological peaks; tree space.

PubMed Disclaimer

Figures

F<sc>igure</sc> 1.
Figure 1.
An SPR move.
F<sc>igure</sc> 2.
Figure 2.
Distance SPR graphs of the combined bacterial and archaeal golden runs showing at most the 4096 topologies with highest posterior probability (8192 for VL6). Node areas are scaled relative to posterior probability (PP; larger = higher probability) within each graph (but not with respect to the other graphs). Node color indicates SPR distance from the topology with highest posterior probability in each data set on a red–yellow–white scale (dark-light in the print version), with the highest probability tree colored red.
F<sc>igure</sc> 3.
Figure 3.
Cluster SPR graphs of the combined golden run eukaryote posteriors. Each graph contains either the 95% credible set or the 4096 topologies with highest PP (DS5, DS6, and DS10). Nodes are scaled relative to posterior probability within each graph (but not with respect to the other graphs). Nodes are colored by SPR-based descending PP clusters (gray scale in the print version).
F<sc>igure</sc> 4.
Figure 4.
Central trees of the two topological peaks in data set DS1. Only two SPR operations separate these trees, moving the blue (gray in the print version) and then green (light gray) clade to traverse from peak 1 to peak 2 and vice versa in the reverse direction. However, the sole intermediate topology is so unlikely that it was never visited in any of our tests, inducing a severe topological bottleneck. Longer paths through multiple trees outside of the 95% confidence interval are taken instead, resulting in long transit times between the peaks.
F<sc>igure</sc> 5.
Figure 5.
Weighted MCMC graphs for DS1 and DS4. Node diameters are scaled relative to posterior probability. Nodes are colored on a red–yellow–white scale (dark-light in the print version) with increasing distance from the topology with highest posterior probability. Edges connect trees in successive 100-iteration samples. Edge thickness and color are proportional to the number of MCMC transitions.
F<sc>igure</sc> 6.
Figure 6.
Comparison of posterior probability and mean commute time with (gray) and without (black) Metropolis-coupling.
F<sc>igure</sc> 7.
Figure 7.
Comparison of multidimensional scaling representations with SPR and RF distances. Nodes are colored by identified peaks to match the primary cluster of the peak (gray scale in the print version).
F<sc>igure</sc> 8.
Figure 8.
A comparison of posterior probability and CCD estimates for the aggregated golden runs on data sets DS1, DS3, DS4, and DS6. Probability is shown on a log-log scale in base 10. The top trees for each data set are colored by peak in DS1 and cluster for the other data sets. Transparency of points increases as posterior probability decreases.

References

    1. Aldous D.J. Mixing time for a Markov chain on cladograms. Combin. Probab. Comput. 2000;9:191–204.
    1. Allen B.L., Steel M. Subtree transfer operations and their induced metrics on evolutionary trees. Ann. Comb. 2001;5:1–15.
    1. Beiko R.G., Keith J.M., Harlow T.J., Ragan M.A. Searching for convergence in phylogenetic Markov chain Monte Carlo. Syst. Biol. 2006;55:553–565. - PubMed
    1. Bordewich M., Semple C. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Comb. 2005;8:409–423.
    1. Bouckaert R., Heled J., Kühnert D., Vaughan T., Wu C.-H., Xie D., Suchard M.A., Rambaut A., Drummond A.J. BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 2014;10:e1003537. - PMC - PubMed

Publication types