Syst Biol. 2014 May;63(3):309-21.
doi: 10.1093/sysbio/syt068. Epub 2013 Nov 4.

Posterior predictive Bayesian phylogenetic model selection

Paul O Lewis et al. Syst Biol. 2014 May.

Abstract

We present two distinctly different posterior predictive approaches to Bayesian phylogenetic model selection and illustrate these methods using examples from green algal protein-coding cpDNA sequences and flowering plant rDNA sequences. The Gelfand-Ghosh (GG) approach allows dissection of an overall measure of model fit into components due to posterior predictive variance (GGp) and goodness-of-fit (GGg), which distinguishes this method from the posterior predictive P-value approach. The conditional predictive ordinate (CPO) method provides a site-specific measure of model fit useful for exploratory analyses and can be combined over sites, yielding the log pseudomarginal likelihood (LPML), which is useful as an overall measure of model fit. CPO provides a useful cross-validation approach that is computationally efficient, requiring only a sample from the posterior distribution (no additional simulation is required). Both GG and CPO add new perspectives to Bayesian phylogenetic model selection based on the predictive abilities of models and complement the perspective provided by the marginal likelihood (including Bayes factor comparisons), which is based solely on the fit of competing models to observed data.
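
As a rough illustration of why CPO needs only a posterior sample, the sketch below estimates each site's CPO as the harmonic mean of that site's likelihood across posterior samples and sums log(CPO) over sites to obtain LPML. This is the standard Monte Carlo estimator of CPO, not code from the paper; the function names and the toy log-likelihood values are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_cpo(site_log_likes):
    """Estimate log(CPO_i) from the log-likelihoods of one site.

    site_log_likes holds log L_i(theta_t) for each posterior sample t.
    CPO_i is estimated by the harmonic mean of the site likelihoods:
    CPO_i ~= T / sum_t [1 / L_i(theta_t)].
    """
    t = len(site_log_likes)
    return math.log(t) - log_sum_exp([-v for v in site_log_likes])


def lpml(log_like_matrix):
    """Sum log(CPO_i) over sites.

    Rows are posterior samples, columns are sites.
    """
    n_sites = len(log_like_matrix[0])
    return sum(
        log_cpo([row[i] for row in log_like_matrix]) for i in range(n_sites)
    )


# Hypothetical site log-likelihoods: 3 posterior samples x 2 sites.
samples = [
    [-1.0, -2.0],
    [-1.0, -2.5],
    [-1.0, -3.0],
]
per_site = [log_cpo([row[i] for row in samples]) for i in range(2)]
total = lpml(samples)
```

Working on the log scale avoids underflow, since individual site likelihoods can be very small. Sites with unusually low log(CPO) are those the model predicts poorly.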

Figures

Figure 1.
The notation y_i refers to a vector of length n_c that (assuming no missing data or ambiguities) has a single nonzero element (where n_c is the number of possible data patterns). This single nonzero element corresponds to the particular data pattern observed for site i and has value 1. The vector y_2 corresponding to site i = 2 is shown for illustration. The result (y·) of summing y_i over all i = 1, 2, ..., n_s is also shown (where n_s is the number of sites).
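
The indicator vectors and their sum described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the two-taxon alignment, the pattern ordering, and the function names are all assumptions made for the example.

```python
from itertools import product


def pattern_indicator(pattern, patterns):
    """Return the 0/1 vector for one site: a single 1 at the observed pattern."""
    return [1 if p == pattern else 0 for p in patterns]


def pattern_counts(sites, patterns):
    """Sum the per-site indicator vectors to get the vector of pattern counts."""
    counts = [0] * len(patterns)
    for site in sites:
        indicator = pattern_indicator(site, patterns)
        counts = [c + y for c, y in zip(counts, indicator)]
    return counts


# Hypothetical toy case: two taxa, so 4^2 = 16 possible data patterns.
patterns = ["".join(p) for p in product("ACGT", repeat=2)]
sites = ["AA", "AC", "AA", "GG"]  # one pattern per site

y2 = pattern_indicator(sites[1], patterns)  # indicator vector for site i = 2
y_dot = pattern_counts(sites, patterns)     # sum over all sites
```

The entries of `y_dot` sum to the number of sites, and each entry is the number of sites showing the corresponding pattern.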
Figure 2.
Generation of posterior predictive data sets under an HKY substitution model for a four-taxon problem. Numbers at the top represent the MCMC iteration at which a posterior sample was drawn, TL = tree length (sum of the five edge length parameters), κ = transition/transversion rate ratio, and π_A, π_C, π_G, π_T = nucleotide equilibrium relative frequencies. For each iteration, the sampled parameter values and tree are used to simulate one data set, the vector of pattern counts of which constitutes the posterior predictive data set for that iteration.
Figure 3.
Comparison of prior models using the GG criterion for the gene psaB in the algae example data set. Solid line with squares: the mean number of distinct data patterns in posterior predictive data sets. Solid line with open circles: the overall GG measure (smallest is best). Dashed line with downward-pointing triangles: the goodness-of-fit component (GGg) of the GG criterion. Dotted line with upward-pointing triangles: the variance component (GGp) of the GG criterion. Trees shown below the plot are the last tree sampled for each of the eight analyses. The prior mean used for each analysis is 10^x, where x is the abscissa, except for the point labeled “hyper”, in which the exponential prior mean was a hyperparameter in a hierarchical model.
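
To make the split of GG into a goodness-of-fit part and a variance part concrete, here is a minimal sketch that uses the squared-error-loss form of the Gelfand-Ghosh criterion applied to pattern counts. This is an assumption for illustration only: the loss function used in the paper may differ, and the weight `k`, the function name, and the toy counts are hypothetical.

```python
def gelfand_ghosh(observed, predictive, k=1.0):
    """Squared-error-loss Gelfand-Ghosh criterion (illustrative assumption).

    observed:   list of observed pattern counts
    predictive: list of posterior predictive count vectors, one per sample
    k:          weight on the goodness-of-fit term (assumed value)

    Returns (gg, ggg, ggp), where gg = ggg + ggp and smaller is better.
    """
    n_rep = len(predictive)
    n_cells = len(observed)
    # Posterior predictive mean of each pattern count.
    means = [sum(rep[j] for rep in predictive) / n_rep for j in range(n_cells)]
    # Posterior predictive variance of each pattern count (divisor n_rep).
    variances = [
        sum((rep[j] - means[j]) ** 2 for rep in predictive) / n_rep
        for j in range(n_cells)
    ]
    ggp = sum(variances)
    ggg = (k / (k + 1.0)) * sum(
        (means[j] - observed[j]) ** 2 for j in range(n_cells)
    )
    return ggp + ggg, ggg, ggp


# Hypothetical counts for three data patterns and three predictive data sets.
observed = [3, 1, 0]
predictive = [[3, 1, 0], [2, 2, 0], [4, 0, 0]]
gg, ggg, ggp = gelfand_ghosh(observed, predictive)
```

In this toy case the predictive means match the observed counts exactly, so the goodness-of-fit term is zero and the whole criterion comes from predictive variance. A model whose predictive data sets scatter widely is penalized through the variance term even when it fits well on average.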
Figure 4.
Plots of site-specific log(CPO) values for analyses of all four genes in the algae data set. a) Partitioned by gene (dotted vertical lines show gene boundaries). b) Partitioned by gene, but sorted by codon position, with 1st position (left), 2nd position (center), and 3rd position (right) (dotted vertical lines show codon position boundaries). c) Partitioned by codon (dotted vertical lines show codon position boundaries).
Figure 5.
Plot of the log of the CPO ratio, equal to log(CPO_3) − log(CPO_5), where CPO_5 is the site-specific CPO estimated using the 5′ tree (on left), and CPO_3 is the site-specific CPO estimated using the 3′ tree (on right). The vertical dotted line separates the 5′ end (left) from the 3′ end (right) of the ribosomal protein gene rps11.
Figure 6.
Comparison of partition models for green algal protein-coding data. NONE is unpartitioned, GENE is partitioned by gene (four subsets), CODON is partitioned by codon position (three subsets), and BOTH is partitioned by both gene and codon position (12 subsets). Squares indicate GG, circles indicate SS, and triangles indicate LPML. Solid lines use scale on left; dotted line uses scale on right.
