Syst Biol. 2014 May;63(3):309-21.

doi: 10.1093/sysbio/syt068. Epub 2013 Nov 4.

Posterior predictive Bayesian phylogenetic model selection

Paul O Lewis¹, Wangang Xie, Ming-Hui Chen, Yu Fan, Lynn Kuo

Affiliations

¹ Department of Ecology and Evolutionary Biology, University of Connecticut, 75 N. Eagleville Road, Unit 3043, Storrs, CT 06269, USA; AbbVie, 1 N. Waukegan Road, R436/AP9A-2, North Chicago, IL 60064, USA; Department of Statistics, University of Connecticut, 215 Glenbrook Road, Unit 4120, Storrs, CT 06269, USA; and Department of Bioinformatics and Computational Biology, Division of Quantitative Sciences, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030, USA.

PMID: 24193892
PMCID: PMC3985471
DOI: 10.1093/sysbio/syt068

Paul O Lewis et al. Syst Biol. 2014 May.

Abstract

We present two distinctly different posterior predictive approaches to Bayesian phylogenetic model selection and illustrate these methods using examples from green algal protein-coding cpDNA sequences and flowering plant rDNA sequences. The Gelfand-Ghosh (GG) approach allows dissection of an overall measure of model fit into components due to posterior predictive variance (GGp) and goodness-of-fit (GGg), which distinguishes this method from the posterior predictive P-value approach. The conditional predictive ordinate (CPO) method provides a site-specific measure of model fit useful for exploratory analyses and can be combined over sites yielding the log pseudomarginal likelihood (LPML) which is useful as an overall measure of model fit. CPO provides a useful cross-validation approach that is computationally efficient, requiring only a sample from the posterior distribution (no additional simulation is required). Both GG and CPO add new perspectives to Bayesian phylogenetic model selection based on the predictive abilities of models and complement the perspective provided by the marginal likelihood (including Bayes Factor comparisons) based solely on the fit of competing models to observed data.
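
The CPO and LPML quantities described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard Monte Carlo estimator in which the CPO for a site is the harmonic mean of that site's likelihoods over the posterior sample, and the function names and the small table of site log-likelihoods are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def log_cpo(site_log_likes):
    """log(CPO) for one site from its log-likelihoods at each posterior sample.

    Assumption: CPO is estimated as the harmonic mean of the site
    likelihoods over the posterior sample.
    """
    n_samples = len(site_log_likes)
    return math.log(n_samples) - log_sum_exp([-ll for ll in site_log_likes])

def lpml(log_like_matrix):
    """Sum of site-specific log(CPO) values (rows = sites, columns = samples)."""
    return sum(log_cpo(row) for row in log_like_matrix)

# Hypothetical site log-likelihoods: 3 sites x 4 posterior samples.
example = [
    [-1.2, -1.1, -1.3, -1.2],
    [-4.0, -3.5, -4.4, -3.9],
    [-0.7, -0.7, -0.7, -0.7],
]
site_values = [log_cpo(row) for row in example]
total = lpml(example)
```

Only the per-site log-likelihoods already evaluated at the sampled parameter values are needed, which is consistent with the abstract's remark that no additional simulation is required.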

Figures

**Figure 1.**
The notation y_i refers to a vector of length *n_c* (where *n_c* is the number of possible data patterns) that, assuming no missing data or ambiguities, has a single nonzero element. This element corresponds to the particular data pattern observed for site i and has value 1. The vector y₂ corresponding to site i = 2 is shown for illustration. The result (y_·) of summing y_i over all i = 1, 2, ..., *n_s* is also shown (where *n_s* is the number of sites).
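
The notation in this caption can be illustrated with a short sketch. The four-taxon alignment, the ordering of patterns, and the function names are assumptions made for illustration only; the caption itself does not specify how patterns are indexed.

```python
from itertools import product

def pattern_index(n_taxa, states="ACGT"):
    """Map every possible site pattern to a position in the length-n_c vector."""
    return {"".join(p): k for k, p in enumerate(product(states, repeat=n_taxa))}

def site_vector(pattern, index):
    """y_i: all zeros except a 1 at the position of the observed pattern."""
    y = [0] * len(index)
    y[index[pattern]] = 1
    return y

def pattern_counts(site_patterns, index):
    """y_dot: elementwise sum of the y_i vectors over all sites."""
    totals = [0] * len(index)
    for pattern in site_patterns:
        for k, value in enumerate(site_vector(pattern, index)):
            totals[k] += value
    return totals

# Hypothetical four-taxon alignment, one string per site (alignment column).
sites = ["AAAA", "ACAC", "AAAA", "GGGT", "ACAC", "AAAA"]
index = pattern_index(4)             # n_c = 4**4 = 256 possible patterns
y_2 = site_vector(sites[1], index)   # vector for site i = 2
y_dot = pattern_counts(sites, index) # pattern counts over n_s = 6 sites
```

Each y_i sums to 1, so the entries of y_· sum to the number of sites.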

**Figure 2.**
Generation of posterior predictive data sets under an HKY substitution model for a four-taxon problem. Numbers at the top represent the MCMC iteration at which a posterior sample was drawn, TL = tree length (sum of the five edge length parameters), κ = transition/transversion rate ratio, and *π_A*, *π_C*, *π_G*, *π_T* = nucleotide equilibrium relative frequencies. For each iteration, the sampled parameter values and tree are used to simulate one data set, which is summarized as a vector of pattern counts.
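
A rough sketch of the simulation step described in this caption is given below. Everything here is an assumption for illustration: the topology is fixed to ((1,2),(3,4)) rather than sampled, the posterior samples are invented numbers, the matrix exponential uses a simple Taylor series with scaling and squaring, and none of the function names come from the paper or its software.

```python
import random
from collections import Counter

BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rate_matrix(kappa, freqs):
    """HKY rate matrix (rows/columns ordered A, C, G, T), scaled to mean rate 1."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = freqs[j] * (kappa if (a, b) in TRANSITIONS else 1.0)
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the row is summed
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(4))
    return [[x / mean_rate for x in row] for row in q]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_probs(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t) via a truncated Taylor series with scaling and squaring."""
    step = [[x * t / 2 ** squarings for x in row] for row in q]
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, step)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def evolve(state, p, rng):
    """Draw the state at the end of an edge given the state at its start."""
    return rng.choices(range(4), weights=p[state])[0]

def simulate_pattern_counts(kappa, freqs, edge_lengths, n_sites, rng):
    """Simulate one data set on the tree ((1,2),(3,4)) and return pattern counts.

    edge_lengths = (t1, t2, t3, t4, t_internal); the topology is an assumption.
    """
    q = hky_rate_matrix(kappa, freqs)
    p1, p2, p3, p4, p_int = (transition_probs(q, t) for t in edge_lengths)
    counts = Counter()
    for _ in range(n_sites):
        x = rng.choices(range(4), weights=freqs)[0]  # node joining taxa 1 and 2
        y = evolve(x, p_int, rng)                    # node joining taxa 3 and 4
        tips = (evolve(x, p1, rng), evolve(x, p2, rng),
                evolve(y, p3, rng), evolve(y, p4, rng))
        counts["".join(BASES[s] for s in tips)] += 1
    return counts

# Hypothetical posterior samples: (kappa, frequencies, five edge lengths).
posterior_samples = [
    (4.0, (0.30, 0.20, 0.25, 0.25), (0.10, 0.12, 0.08, 0.11, 0.05)),
    (3.5, (0.28, 0.22, 0.24, 0.26), (0.09, 0.13, 0.10, 0.10, 0.04)),
]
rng = random.Random(1)
predictive = [simulate_pattern_counts(k, f, e, n_sites=500, rng=rng)
              for k, f, e in posterior_samples]
```

One simulated data set is produced per posterior sample, mirroring the one-to-one correspondence in the figure; in a real analysis the tree topology would vary across samples as well.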

**Figure 3.**
Comparison of prior models using the GG criterion for the gene *psa*B in the algae example data set. Solid line with squares: the mean number of distinct data patterns in posterior predictive data sets. Solid line with open circles: the overall GG measure (smallest is best). Dashed line with downward-pointing triangles: the goodness-of-fit component (GG_g) of the GG criterion. Dotted line with upward-pointing triangles: the variance component (GG_p) of the GG criterion. Trees shown below the plot are the last tree sampled for each of the eight analyses. The prior mean used for each analysis is 10^x, where x is the abscissa, except for the point labeled “hyper,” for which the exponential prior mean was a hyperparameter in a hierarchical model.
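
The split of GG into a goodness-of-fit part and a variance part can be sketched as below. The abstract and captions do not state the loss function, so this sketch substitutes the squared-error-loss form of the Gelfand-Ghosh criterion as a stand-in; the definition used in the paper may differ. The weight `k`, the function name, and the toy pattern counts are hypothetical.

```python
def gg_squared_error(observed, predictive_sets, k=1.0):
    """Return (GG, GG_g, GG_p) under squared-error loss (an assumed stand-in).

    observed        : pattern counts for the real data (one number per pattern)
    predictive_sets : pattern-count vectors, one per posterior predictive data set
    k               : weight on goodness-of-fit relative to predictive variance
    """
    n_sets = len(predictive_sets)
    n_patterns = len(observed)
    # Posterior predictive mean and variance of each pattern count.
    means = [sum(s[j] for s in predictive_sets) / n_sets for j in range(n_patterns)]
    variances = [sum((s[j] - means[j]) ** 2 for s in predictive_sets) / n_sets
                 for j in range(n_patterns)]
    gg_p = sum(variances)
    gg_g = (k / (k + 1.0)) * sum((m - y) ** 2 for m, y in zip(means, observed))
    return gg_g + gg_p, gg_g, gg_p

# Toy example: four patterns, three posterior predictive data sets.
observed = [50, 30, 15, 5]
predictive_sets = [[48, 32, 14, 6], [52, 28, 16, 4], [50, 30, 15, 5]]
gg, gg_g, gg_p = gg_squared_error(observed, predictive_sets)
```

In this toy case the predictive means equal the observed counts, so the goodness-of-fit term is zero and the whole criterion comes from predictive variance.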

**Figure 4.**
Plots of site-specific log(CPO) values for analyses of all four genes in the algae data set. a) Partitioned by gene (dotted vertical lines show gene boundaries). b) Partitioned by gene, but sorted by codon position, with 1st position (left), 2nd position (center), and 3rd position (right) (dotted vertical lines show codon position boundaries). c) Partitioned by codon position (dotted vertical lines show codon position boundaries).

**Figure 5.**
Plot of the log of the CPO ratio, equal to log(CPO3) − log(CPO5), where CPO5 is site-specific CPO estimated using the 5′ tree (on left), and CPO3 is site-specific CPO estimated using the 3′ tree (on right). Vertical dotted line separates 5′ end (left) from 3′ end (right) of the ribosomal protein gene *rps11*.

**Figure 6.**
Comparison of partition models for green algal protein-coding data. NONE is unpartitioned, GENE is partitioned by gene (four subsets), CODON is partitioned by codon position (three subsets), and BOTH is partitioned by both gene and codon position (12 subsets). Squares indicate GG, circles indicate SS, and triangles indicate LPML. Solid lines use scale on left; dotted line uses scale on right.

Cited by

Identifying model violations under the multispecies coalescent model using P2C2M.SNAPP.
Duckett DJ, Pelletier TA, Carstens BC. PeerJ. 2020 Jan 10;8:e8271. doi: 10.7717/peerj.8271. PMID: 31949994.
Differences in Performance among Test Statistics for Assessing Phylogenomic Model Adequacy.
Duchêne DA, Duchêne S, Ho SYW. Genome Biol Evol. 2018 Jun 1;10(6):1375-1388. doi: 10.1093/gbe/evy094. PMID: 29788113.
Bayesian Total-Evidence Dating Reveals the Recent Crown Radiation of Penguins.
Gavryushkina A, Heath TA, Ksepka DT, Stadler T, Welch D, Drummond AJ. Syst Biol. 2017 Jan 1;66(1):57-73. doi: 10.1093/sysbio/syw060. PMID: 28173531.
Chloroplast Phylogenomic Inference of Green Algae Relationships.
Sun L, Fang L, Zhang Z, Chang X, Penny D, Zhong B. Sci Rep. 2016 Feb 5;6:20528. doi: 10.1038/srep20528. PMID: 26846729.
Assessing model adequacy for Bayesian Skyline plots using posterior predictive simulation.
Fonseca EM, Duckett DJ, Almeida FG, Smith ML, Thomé MTC, Carstens BC. PLoS One. 2022 Jul 25;17(7):e0269438. doi: 10.1371/journal.pone.0269438. PMID: 35877611.
