Make the most of your samples: Bayes factor estimators for high-dimensional models of sequence evolution

Guy Baele¹, Philippe Lemey, Stijn Vansteelandt

PMID: 23497171
PMCID: PMC3651733
DOI: 10.1186/1471-2105-14-85

Guy Baele et al. BMC Bioinformatics. 2013 Mar 6:14:85. doi: 10.1186/1471-2105-14-85.

Affiliation

¹ Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium. guy.baele@rega.kuleuven.be

Abstract

Background: Accurate model comparison requires extensive computation times, especially for parameter-rich models of sequence evolution. In the Bayesian framework, model selection is typically performed by evaluating a Bayes factor, the ratio of two marginal likelihoods (one for each model). Recently introduced techniques to estimate (log) marginal likelihoods, such as path sampling and stepping-stone sampling, offer increased accuracy over the traditional harmonic mean estimator at an increased computational cost. Most often, each model's marginal likelihood is estimated individually, so the resulting Bayes factor inherits the errors of both independent estimation processes.
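As a minimal sketch of the quantities named above: a log Bayes factor is the difference of two log marginal likelihoods, and the harmonic mean estimator approximates each log marginal likelihood from the log-likelihoods of posterior samples. The function names are hypothetical and the code is an illustration, not the authors' implementation.

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x)))."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def harmonic_mean_log_ml(log_likelihoods):
    """Harmonic mean estimate of the log marginal likelihood.

    Input: log-likelihood values evaluated at samples drawn from the
    posterior. The estimate is n / sum(1 / L_i), computed in log space.
    """
    ll = np.asarray(log_likelihoods, dtype=float)
    return np.log(ll.size) - log_sum_exp(-ll)

def log_bayes_factor(log_ml_1, log_ml_0):
    """Log Bayes factor of model 1 over model 0 from two separate
    log marginal likelihood estimates (their errors both carry over)."""
    return log_ml_1 - log_ml_0
```

Because the two inputs to `log_bayes_factor` are estimated independently, the estimation error of each run ends up in the difference, which is the motivation for the direct estimators assessed in this paper.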

Results: Here we assess the original 'model-switch' path sampling approach for direct Bayes factor estimation in phylogenetics, as well as an extension that uses more samples. Both construct a direct path between two competing models, thereby eliminating the need to calculate each model's marginal likelihood independently. Further, we provide a competing Bayes factor estimator based on an adaptation of the recently introduced stepping-stone sampling algorithm, and we determine appropriate settings for accurately calculating such Bayes factors, using context-dependent evolutionary models as an example. We show that modest effort suffices to roughly identify the increase in model fit, but that only drastically increased computation times provide the accuracy needed to detect more subtle details of the evolutionary process.
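The idea of a direct path between two models can be sketched on a toy problem. The example below assumes two unnormalized Gaussian densities `q0` and `q1` on a shared parameter, standing in for likelihood times prior under each model; the settings `S0`, `S1`, `M1` and all function names are hypothetical. Along the geometric path `q1^beta * q0^(1-beta)` the intermediate distributions are again Gaussian, so they can be sampled exactly here, whereas a phylogenetic analysis would need MCMC at each step. This is a generic illustration of path sampling and stepping-stone sampling along such a path, not the authors' implementation.

```python
import numpy as np

S0, S1, M1 = 1.0, 2.0, 1.0  # hypothetical toy settings

def log_q0(theta):
    """Unnormalized log density of 'model 0' (normalizer S0*sqrt(2*pi))."""
    return -0.5 * (theta / S0) ** 2

def log_q1(theta):
    """Unnormalized log density of 'model 1' (normalizer S1*sqrt(2*pi))."""
    return -0.5 * ((theta - M1) / S1) ** 2

def sample_path(beta, n, rng):
    """Exact draws from q1^beta * q0^(1-beta), which is Gaussian here."""
    precision = beta / S1**2 + (1.0 - beta) / S0**2
    mean = (beta * M1 / S1**2) / precision
    return rng.normal(mean, np.sqrt(1.0 / precision), size=n)

def direct_log_bf(betas, n=2000, seed=1):
    """Return (path sampling, stepping-stone) estimates of log(Z1/Z0)."""
    rng = np.random.default_rng(seed)
    betas = np.asarray(betas, dtype=float)
    # u[k] holds log q1 - log q0 evaluated at draws from step k of the path
    u = []
    for b in betas:
        theta = sample_path(b, n, rng)
        u.append(log_q1(theta) - log_q0(theta))
    # Path sampling: trapezoid rule over the per-step means of u
    means = np.array([x.mean() for x in u])
    ps = np.sum(0.5 * (means[1:] + means[:-1]) * np.diff(betas))
    # Stepping-stone: product of ratios Z_k / Z_{k-1}, each estimated by
    # importance sampling from step k-1 (accumulated in log space)
    ss = 0.0
    for k in range(1, len(betas)):
        w = (betas[k] - betas[k - 1]) * u[k - 1]
        ss += w.max() + np.log(np.mean(np.exp(w - w.max())))
    return ps, ss

betas = np.linspace(0.0, 1.0, 33)   # constant increments, for simplicity
ps_estimate, ss_estimate = direct_log_bf(betas)
exact = np.log(S1 / S0)             # known ratio of normalizers
```

In this toy setting both estimates can be compared against `exact`, which is not available for real phylogenetic models; there, agreement between runs in both directions along the path is used as a check instead.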

Conclusions: We show that our adaptation of stepping-stone sampling for direct Bayes factor calculation outperforms the original path sampling approach as well as an extension that exploits more samples. Our proposed approach for Bayes factor estimation also has preferable statistical properties over the use of individual marginal likelihood estimates for both models under comparison. Assuming a sigmoid function to determine the path between two competing models, we provide evidence that a single well-chosen sigmoid shape value requires less computational effort to approximate the true value of the (log) Bayes factor than the original approach. We show that the (log) Bayes factors calculated using path sampling and stepping-stone sampling differ drastically from those estimated using either of the harmonic mean estimators, supporting earlier claims that the latter systematically overestimate the performance of high-dimensional models, which we show can lead to erroneous conclusions. Based on our results, we argue that highly accurate estimation of differences in model fit for high-dimensional models requires much more computational effort than suggested in recent studies on marginal likelihood estimation.

Figures

**Figure 1**
**Sigmoid shape comparison.** Comparison of different integration settings for the three log Bayes factor estimators (left), showing that a sigmoidal shape with α=6.0 is closest to our flexible-increment approach, while α=10.0 yields a curve that slowly converges towards both ends of the integration interval. The constant-increment approach is clearly too crude an approximation of a path between the two models. Bidirectional errors for a sigmoidal shape of α=8.0 (middle), showing that such a curve yields large errors towards both priors and that a higher shape value would be preferred. Bidirectional errors for a sigmoidal shape of α=12.0 (right), showing that such a curve yields larger errors in the middle of the integration interval, although nowhere near the errors towards the priors for α=8.0.
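To illustrate how a shape value α can control the spacing of steps along the path, the sketch below builds a sigmoid schedule of path positions between 0 and 1. The exact parameterization used in the paper is not given in this abstract, so the form here (a logistic function evaluated on an even grid over [-α, α] and rescaled to hit both endpoints) is an assumption, as is the function name `sigmoid_path`.

```python
import numpy as np

def sigmoid_path(num_steps, alpha):
    """Path positions in [0, 1] following a logistic curve.

    Assumed form: logistic function on an even grid over [-alpha, alpha],
    rescaled so the first value is exactly 0 and the last exactly 1.
    Larger alpha places more of the steps close to the two endpoints.
    """
    x = np.linspace(-alpha, alpha, num_steps + 1)
    raw = 1.0 / (1.0 + np.exp(-x))
    return (raw - raw[0]) / (raw[-1] - raw[0])

# Compare the first increment for two shape values
first_step_a6 = np.diff(sigmoid_path(10, 6.0))[0]
first_step_a10 = np.diff(sigmoid_path(10, 10.0))[0]
```

Under this assumed form, a higher shape value gives smaller increments near both ends of the interval and larger ones in the middle, which matches the trade-off described in the caption between errors towards the priors and errors in the middle of the integration interval.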

**Figure 2**
**Model comparison using path sampling (PS) and stepping-stone sampling (SS) for the Laurasiatheria data set.** Laurasiatheria data set: visual comparison of annealing and melting estimates (shown side by side) for the log Bayes factor for different sigmoid shape values. For α=10.0 and α=12.0, these estimates are available for K = 2,000 (Q = 200 and Q = 400), 4,000, 8,000 and 16,000, while for α=8.0, α=9.0, α=11.0 and α=13.0 this last value has not been examined. Each subfigure shows annealing and melting estimates for a particular sigmoid shape value, for each of the three estimators discussed: path sampling, the extension of path sampling that uses the mean of a series of samples for each power posterior, and stepping-stone sampling.

**Figure 3**
**Comparison of annealing and melting estimates with increasing computational settings.** Laurasiatheria data set: visual comparison of the bidirectional mean log Bayes factor, estimated using stepping-stone sampling, for each sigmoid shape value with the corresponding intervals composed of both annealing and melting estimates. In general, these intervals decrease in width with increasing computational settings.

**Figure 4**
**Bidirectional errors.** Laurasiatheria data set: visual comparison of the bidirectional errors associated with each sigmoid shape value for the three estimators presented in this manuscript. It can be seen that sigmoid shape values between 10.0 and 12.0 are preferred for the Laurasiatheria data set.
