Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty

Guy Baele¹, Philippe Lemey, Trevor Bedford, Andrew Rambaut, Marc A Suchard, Alexander V Alekseyenko

PMID: 22403239
PMCID: PMC3424409
DOI: 10.1093/molbev/mss084

Mol Biol Evol. 2012 Sep;29(9):2157-67. doi: 10.1093/molbev/mss084. Epub 2012 Mar 7.

Affiliation

¹ Department of Microbiology and Immunology, KU Leuven, Leuven, Belgium. guy.baele@rega.kuleuven.be

Abstract

Recent developments in marginal likelihood estimation for model selection in the field of Bayesian phylogenetics and molecular evolution have emphasized the poor performance of the harmonic mean estimator (HME). Although these studies have shown the merits of new approaches applied to standard normally distributed examples and small real-world data sets, not much is currently known concerning the performance and computational issues of these methods when fitting complex evolutionary and population genetic models to empirical real-world data sets. Further, these approaches have not yet seen widespread application in the field due to the lack of implementations of these computationally demanding techniques in commonly used phylogenetic packages. We here investigate the performance of some of these new marginal likelihood estimators, specifically path sampling (PS) and stepping-stone (SS) sampling, for comparing models of demographic change and relaxed molecular clocks, using synthetic data and real-world examples for which unexpected inferences were made using the HME. Given the drastically increased computational demands of PS and SS sampling, we also investigate a posterior simulation-based analogue of Akaike's information criterion (AIC) through Markov chain Monte Carlo (MCMC), known as AICM, a model comparison approach that shares with the HME the appealing feature of having a low computational overhead over the original MCMC analysis. We confirm that the HME systematically overestimates the marginal likelihood and fails to yield reliable model classification. We show that the AICM performs better and may be a useful initial evaluation of model choice but that it is also, to a lesser degree, unreliable. We show that PS and SS sampling substantially outperform these estimators, and we adjust the conclusions made concerning previous analyses for the three real-world data sets that we reanalyzed.
The methods used in this article are now available in BEAST, a powerful user-friendly software package to perform Bayesian evolutionary analyses.
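
The four estimators compared in the abstract can be illustrated on a toy problem. The sketch below is not the BEAST implementation and does not use phylogenetic data. It assumes a conjugate normal model (known variance, normal prior on the mean), chosen because the marginal likelihood is then available in closed form and each power posterior can be sampled directly instead of by MCMC. The function names, the model, the power schedule (evenly spaced values raised to the power 1/0.3), and the AICM form used here (twice the sample variance of the log-likelihood minus twice its mean) are all assumptions made for illustration.

```python
import numpy as np

def log_mean_exp(x):
    """Numerically stable log(mean(exp(x)))."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return float(m + np.log(np.mean(np.exp(x - m))))

def hme(loglik):
    """Harmonic mean estimate of the log marginal likelihood (posterior draws)."""
    return -log_mean_exp(-np.asarray(loglik, dtype=float))

def aicm(loglik):
    """Assumed AICM form: 2*var(loglik) - 2*mean(loglik); lower is better."""
    loglik = np.asarray(loglik, dtype=float)
    return float(2.0 * np.var(loglik, ddof=1) - 2.0 * np.mean(loglik))

def stepping_stone(loglik_by_beta, betas):
    """Sum of log ratios of normalizing constants of adjacent power posteriors."""
    return sum(
        log_mean_exp((betas[k + 1] - betas[k]) * loglik_by_beta[k])
        for k in range(len(betas) - 1)
    )

def path_sampling(loglik_by_beta, betas):
    """Trapezoidal integral over beta of the mean log-likelihood."""
    means = np.array([np.mean(l) for l in loglik_by_beta])
    return float(np.sum(0.5 * (means[1:] + means[:-1]) * np.diff(betas)))

def toy_comparison(seed=0, n=20, sigma=1.0, tau=2.0, n_steps=32, n_draws=2000):
    """Hypothetical toy model: y_i ~ N(mu, sigma^2), mu ~ N(0, tau^2)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(1.0, sigma, size=n)

    def loglik(mu):
        # Log-likelihood of the full data set for each value in the array mu.
        resid = y[None, :] - mu[:, None]
        return (-0.5 * n * np.log(2 * np.pi * sigma**2)
                - 0.5 * np.sum(resid**2, axis=1) / sigma**2)

    # Powers concentrated near 0, where the power posterior changes fastest.
    betas = np.linspace(0.0, 1.0, n_steps + 1) ** (1.0 / 0.3)
    loglik_by_beta = []
    for b in betas:
        # The power posterior is normal here, so it is sampled exactly.
        prec = 1.0 / tau**2 + b * n / sigma**2
        mean = (b * y.sum() / sigma**2) / prec
        mu = rng.normal(mean, np.sqrt(1.0 / prec), size=n_draws)
        loglik_by_beta.append(loglik(mu))

    # Closed-form log marginal likelihood of the conjugate model.
    s2, t2 = sigma**2, tau**2
    quad = (np.sum(y**2) - t2 * y.sum() ** 2 / (s2 + n * t2)) / s2
    exact = -0.5 * (n * np.log(2 * np.pi * s2) + np.log1p(n * t2 / s2) + quad)

    posterior_loglik = loglik_by_beta[-1]  # beta = 1 is the ordinary posterior
    return {
        "exact": float(exact),
        "hme": hme(posterior_loglik),
        "aicm": aicm(posterior_loglik),
        "ss": stepping_stone(loglik_by_beta, betas),
        "ps": path_sampling(loglik_by_beta, betas),
    }

if __name__ == "__main__":
    for name, value in toy_comparison().items():
        print(f"{name:>5}: {value:.3f}")
```

Note the difference in cost that the abstract refers to: the HME and AICM reuse the draws from the ordinary posterior only, whereas PS and SS need draws from every power posterior along the path. AICM is on an information-criterion scale, so it is compared between models rather than against the log marginal likelihood.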

Figures

**Fig. 1.**
Differences in log-marginal likelihood estimates and AICM for two independent fittings (first fitting shown in white and second in gray) of the HIV data set using the HME, the posterior simulation-based analogue of Akaike's information criterion (AICM), PS, and SS sampling. For each estimator, the constant population size model (Con) was used as the reference model, and the top-performing model for each fitting is indicated with a star (*). For all estimators, we employ equal amounts of computational work (MCMC iterations) as well as an equal number of samples from which to estimate the marginal likelihood. The HME shows drastic differences in the overall ranking of the demographic models and, depending on the fitting, may very well select a constant population size as the preferred coalescent prior. The AICM is consistent across both fittings but selects a constant population size above all other coalescent priors. PS and SS consistently select the BSP coalescent prior as the optimal choice and put the constant population size far behind the other coalescent priors. PS and SS indicate that the expansion growth model (Expan) yields the second highest fit, whereas the exponential (Expo) and logistic (Log) growth models yield similar performance.

**Fig. 2.**
Evaluation of log BF estimates using PS (SS yields an indistinguishable plot), AICM, and the HME to compare model fit, with four pairwise comparisons being shown: a constant population size versus an exponential population size with growth rates of 0.01, 0.025, 0.05, and 0.10. An increasingly strong discriminatory behavior (low false positive rates and high true positive rates) can be seen for PS (and SS) up to a growth rate of 0.10, whereas the HME retains questionable performance. AICM performance lies in between that of the HME and PS/SS. Color-coded area under the curve values are given at the bottom right of each plot.

**Fig. 3.**
Differences in log-marginal likelihood estimates for two independent fittings (first fitting shown in white and second in gray) for the HSV data set (Firth et al. 2010) using HME, AICM, PS, and SS under a strict clock (SC), an uncorrelated relaxed clock with an exponential distribution (UCED), and an uncorrelated relaxed clock with a log-normal distribution (UCLD). The data were analyzed excluding the sampling dates (No) and including the sampling dates (Yes). We used the SC model excluding the sampling dates as the reference model, and the top-performing model for each fitting is indicated with a star (*). Equal amounts of computational work (MCMC iterations) and an equal number of posterior samples were used for all estimators to estimate the marginal likelihood. While the HME shows drastic differences in the overall ranking of the (clock) models, the AICM as well as PS and SS exhibit consistent behavior, although they disagree on the performance of an SC when the sampling dates are omitted.
