Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST

Guy Baele¹, Philippe Lemey¹, Andrew Rambaut²,³, Marc A Suchard⁴,⁵,⁶

Affiliations

¹ Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium.
² Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK.
³ Centre for Immunology, Infection and Evolution, University of Edinburgh, Ashworth Laboratories, King's Buildings, Edinburgh, UK.
⁴ Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA.
⁵ Department of Biostatistics, School of Public Health, University of California, Los Angeles, CA, USA.
⁶ Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA.

PMID: 28200071
PMCID: PMC6044345
DOI: 10.1093/bioinformatics/btx088

Guy Baele et al. Bioinformatics. 2017 Jun 15;33(12):1798-1805. doi: 10.1093/bioinformatics/btx088.

Abstract

Motivation: Advances in sequencing technology continue to deliver increasingly large molecular sequence datasets that are often heavily partitioned in order to accurately model the underlying evolutionary processes. In phylogenetic analyses, partitioning strategies involve estimating conditionally independent models of molecular evolution for different genes and for different positions within those genes. This requires estimating a large number of evolutionary parameters, which increases the computational burden of such analyses. The past two decades have also seen the rise of multi-core processors, in both the central processing unit (CPU) and graphics processing unit (GPU) markets, enabling massively parallel computations that are not yet fully exploited by many software packages for multipartite analyses.

Results: We propose a Markov chain Monte Carlo (MCMC) approach that uses an adaptive multivariate transition kernel to estimate, in parallel, a large number of parameters split across partitioned data by exploiting multi-core processing. Across several real-world examples, we demonstrate that our approach estimates these multipartite parameters more efficiently than standard approaches, which typically use a mixture of univariate transition kernels. In one case, when estimating the relative rate parameter of the non-coding partition in a heterochronous dataset, MCMC integration efficiency improves by more than 14-fold.
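To make the idea of an adaptive multivariate transition kernel concrete, the following is a minimal sketch of a generic adaptive Metropolis sampler that updates all parameters jointly with a multivariate normal proposal whose covariance is learned from the chain history. It is not the BEAST implementation: the function names, the `2.38² / d` scaling heuristic, the fixed initial phase (`adapt_start`, `init_sd`), the ridge term `eps`, and the correlated bivariate normal target are all assumptions chosen for illustration.

```python
import math
import random


def cholesky(a):
    """Return the lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def adaptive_metropolis(log_target, x0, n_iter, adapt_start=500,
                        init_sd=0.1, eps=1e-6, seed=1):
    """Jointly update all parameters with an adaptive multivariate normal proposal."""
    rng = random.Random(seed)
    d = len(x0)
    adapt_start = max(adapt_start, 2)   # need at least two states for a covariance
    scale = 2.38 ** 2 / d               # common scaling heuristic (an assumption here)
    x, logp = list(x0), log_target(x0)
    mean = [0.0] * d                    # running mean of the chain
    m2 = [[0.0] * d for _ in range(d)]  # running sums of products of deviations
    samples, accepted = [], 0
    for t in range(1, n_iter + 1):
        if t <= adapt_start:
            # Fixed, uncorrelated proposal until enough history has accumulated.
            cov = [[init_sd ** 2 if i == j else 0.0 for j in range(d)]
                   for i in range(d)]
        else:
            # Scaled sample covariance of the t-1 states so far, plus a small ridge.
            cov = [[scale * (m2[i][j] / (t - 2) + (eps if i == j else 0.0))
                    for j in range(d)] for i in range(d)]
        low = cholesky(cov)
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = [x[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        logp_y = log_target(y)
        # The proposal is symmetric, so the acceptance ratio is the target ratio.
        if rng.random() < math.exp(min(0.0, logp_y - logp)):
            x, logp = y, logp_y
            accepted += 1
        samples.append(list(x))
        # Welford-style update of the running mean and covariance sums.
        delta = [x[i] - mean[i] for i in range(d)]
        mean = [mean[i] + delta[i] / t for i in range(d)]
        for i in range(d):
            for j in range(d):
                m2[i][j] += delta[i] * (x[j] - mean[j])
    return samples, accepted / n_iter


def log_correlated_normal(x, rho=0.9):
    """Unnormalised log density of a bivariate normal (unit variances, correlation rho)."""
    return -0.5 * (x[0] ** 2 - 2.0 * rho * x[0] * x[1] + x[1] ** 2) / (1.0 - rho ** 2)


if __name__ == "__main__":
    draws, rate = adaptive_metropolis(log_correlated_normal, [0.0, 0.0], 10000)
    print(f"acceptance rate: {rate:.2f}")
```

Because the proposal covariance tracks the correlation between parameters, a single joint move can follow the shape of the posterior, which is the intuition behind preferring one multivariate kernel over a mixture of univariate kernels. The sketch recomputes the Cholesky factor at every iteration for simplicity; a practical implementation would likely update it less often.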

Availability and implementation: Our implementation is part of the BEAST code base, a widely used open source software package to perform Bayesian phylogenetic inference.

Contact: guy.baele@kuleuven.be.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Performance comparison on a single-gene carnivores dataset, partitioned according to codon position, across five replicates measured on 24-core and 40-core Xeon systems. The 24-core CPU system, while equipped with fewer processor cores than the 40-core CPU system, has a higher maximum processor frequency and much faster memory, which explains the difference in performance as measured in ESS per time unit. Mixing of all parameters of interest is compared across the default BEAST transition kernels, our proposed AVMVN transition kernel, and our proposed AVMVN transition kernel combined with our proposed load-balancing approach to further exploit multi-core parallelism (AVMVN + LB). All update schemes assign equal weight to updating continuous parameters and to updating the tree. The AVMVN transition kernel, equipped with our load-balancing approach, yields an increase in performance over the default BEAST transition kernels of between 171% and 424%, measured in ESS/minute, on a 24-core CPU system and of between 221% and 520%, measured in ESS/minute, on a 40-core CPU system.

**Fig. 2.**
Performance comparison on a full-genome Ebola virus dataset, partitioned according to codon position, across five replicates measured on 24-core and 40-core Xeon systems. Mixing of all parameters of interest is compared across the default BEAST transition kernels, the AVMVN transition kernel, and the AVMVN transition kernel combined with a load-balancing approach to further exploit multi-core parallelism (AVMVN + LB). All update schemes assign equal weight to updating continuous parameters and to updating the tree. Relative to the default BEAST transition kernels, the performance of the AVMVN transition kernel, equipped with our load-balancing approach, increases by between 76% and 1057%, measured in ESS/minute, on a 24-core CPU system and by between 134% and 1452% (for μ₄, the relative rate of the non-coding partition), measured in ESS/hour, on a 40-core CPU system.

**Fig. 3.**
Performance of the AVMVN transition kernel as a function of the number of cores in a multi-core CPU setup, measured as the time needed to run the analyses (in minutes for the carnivores dataset and in hours for the Ebola virus dataset) across five independent replicates. Both CPU systems we evaluate show the same trend: the run time decreases systematically as additional cores are used, until a saturation point is reached where creating additional partitions no longer increases performance because of the associated increase in overhead.
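The figure captions report performance as effective sample size (ESS) per unit of run time and as a percentage increase over a baseline. The sketch below shows one simple way such numbers can be computed. The ESS estimator used here (truncating the autocorrelation sum at the first non-positive lag) is an assumption made for illustration and is not necessarily the estimator used to produce the figures; the function names are hypothetical.

```python
def effective_sample_size(chain):
    """Estimate ESS as n / (1 + 2 * sum of the leading positive autocorrelations)."""
    n = len(chain)
    mean = sum(chain) / n
    centered = [v - mean for v in chain]
    c0 = sum(v * v for v in centered) / n
    if c0 == 0.0:
        raise ValueError("chain has zero variance; ESS is undefined")
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n):
        c = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        rho = c / c0
        if rho <= 0.0:
            break  # truncate at the first non-positive autocorrelation
        tau += 2.0 * rho
    return n / tau


def ess_per_minute(chain, runtime_seconds):
    """ESS divided by the wall-clock run time expressed in minutes."""
    return effective_sample_size(chain) / (runtime_seconds / 60.0)


def percent_increase(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


def fold_change(baseline, improved):
    """Ratio of `improved` to `baseline`."""
    return improved / baseline
```

Note the difference between the two ways of expressing a gain: an increase of P% corresponds to a fold change of 1 + P/100, so an increase of 1452% means the improved value is 15.52 times the baseline.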
