BMC Bioinformatics. 2016 Jun 24;17:253.
doi: 10.1186/s12859-016-1132-4.

EmpPrior: using outside empirical data to inform branch-length priors for Bayesian phylogenetics


John J Andersen et al. BMC Bioinformatics.

Abstract

Background: Branch-length parameters are a central component of phylogenetic models and of intrinsic biological interest. Default branch-length priors in some Bayesian phylogenetic software can be unintentionally informative and lead to branch- and tree-length estimates that are unreasonable. Alternatively, priors may be uninformative, but lead to diffuse posterior estimates. Despite the widespread availability of relevant datasets from other groups, biologists rarely leverage outside information to specify branch-length priors that are specific to the analysis they are conducting.

Results: We developed the software package EmpPrior to facilitate the collection and incorporation of relevant, outside information when setting branch-length priors for phylogenetics. EmpPrior efficiently queries TreeBASE to find data that are similar to focal data, in terms of taxonomic and genetic sampling, and uses them to inform branch-length priors for the focal analysis. EmpPrior consists of two components: EmpPrior-search, written in Java to query TreeBASE, and EmpPrior-fit, written in R to parameterize branch-length distributions. In an example analysis, we show how the use of relevant, outside data is made possible by EmpPrior and improves tree-length estimates from a focal dataset.
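As a rough illustration of what parameterizing a branch-length distribution from outside data can look like, the sketch below fits an exponential distribution to branch lengths by maximum likelihood. It is not EmpPrior-fit's code (which the abstract says is written in R). The function name, dataset names, and branch-length values are hypothetical, and the exponential is only one of the distributions such a tool might fit.

```python
from statistics import fmean


def fit_exponential_rate(branch_lengths):
    """ML estimate of the rate of an exponential distribution.

    For an exponential density rate * exp(-rate * t), the ML estimate of
    the rate is the reciprocal of the sample mean.
    """
    lengths = list(branch_lengths)
    if not lengths or any(b < 0 for b in lengths):
        raise ValueError("need a non-empty list of non-negative branch lengths")
    mean = fmean(lengths)
    if mean == 0:
        raise ValueError("all branch lengths are zero; rate is undefined")
    return 1.0 / mean


# Hypothetical ML branch lengths from two outside datasets (illustrative only).
outside_trees = {
    "outside_dataset_A": [0.012, 0.030, 0.008, 0.021, 0.004],
    "outside_dataset_B": [0.050, 0.020, 0.015, 0.035, 0.010],
}

# One rate per outside dataset; these could then guide the prior chosen
# for the focal analysis.
rates = {name: fit_exponential_rate(bl) for name, bl in outside_trees.items()}
```

The fitted rate (or a summary across outside datasets) would then be used as the exponential prior parameter for the focal analysis, in place of a software default.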

Conclusion: EmpPrior is easy to use, fast, and improves both the accuracy and precision of branch-length estimates in many circumstances. While EmpPrior's focus is on branch lengths, the strategy it employs could easily be extended to address other prior parameterization problems in phylogenetics.

Keywords: Bayesian phylogenetics; Branch lengths; Informed priors.


Figures

Fig. 1
Flowchart for generating informed branch-length priors with EmpPrior. EmpPrior-search queries TreeBASE to find data similar to the focal data. Outside data are then used as input for maximum-likelihood (ML) tree searches. Branch-length distributions are fit to ML trees in EmpPrior-fit and parameter estimates are used to set priors for analysis of the focal data.
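The step between the ML tree search and the distribution fit requires pulling branch lengths out of each ML tree. A minimal sketch of that step, assuming the ML trees are available as Newick strings, is given below. The function name and example tree are hypothetical, and the regular expression is a simplification that would misread quoted labels containing colons.

```python
import re

# Matches ":<number>" as used for branch lengths in Newick strings,
# including optional scientific notation (e.g. ":1e-3").
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def newick_branch_lengths(newick):
    """Return all branch lengths found in a Newick string, in order."""
    return [float(x) for x in _BRANCH_LENGTH.findall(newick)]


# Hypothetical unrooted four-taxon ML tree.
example_tree = "((A:0.1,B:0.2):0.05,C:0.3,D:0.4);"
lengths = newick_branch_lengths(example_tree)
tree_length = sum(lengths)  # sum of all branch lengths
```

The resulting list of branch lengths, or the tree length, is the kind of input a distribution-fitting step would consume.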
Fig. 2
EmpPrior-search graphical user interface (GUI). The EmpPrior-search GUI allows users to specify the gene name and constraints on the number of taxa in a series of text fields at the bottom. These restrictions help to ensure that datasets returned from the search can provide relevant information to inform analysis of the focal data. A window in the middle of the GUI logs information about the progress of the TreeBASE search and post-processing of datasets. A progress bar at the top provides users with a rough idea of EmpPrior-search’s progress. An optional post-processing step can be turned on with a radio button at the bottom, causing EmpPrior-search to attempt to extract the gene of interest from a multi-gene dataset. Due to inconsistencies in gene naming and data file formatting, this step can sometimes produce unreliable results. Users should always manually inspect relevant datasets to ensure that they have been parsed properly.
Fig. 3
Log-likelihood surfaces for c and α of the compound Dirichlet branch-length distribution. Both log-likelihood surfaces were calculated using maximum-likelihood (ML) branch lengths based on a dataset of cytochrome b and 16S sequences from alpine newts (Mesotriton alpestris) with TreeBASE Study ID S1777 [11]. The left plot shows log-likelihoods based on the compound Dirichlet distribution [3] for different values of the internal:external branch-length ratio (c) with all other parameters fixed. The right plot shows log-likelihoods for different values of the concentration parameter (α) with all other parameters fixed. The dashed line in each plot shows the ML estimate for each parameter returned by EmpPrior-fit.
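A log-likelihood surface of this kind can be sketched by evaluating the branch-length density over a grid of values for one parameter while holding the others fixed. The code below assumes a gamma-Dirichlet form of the compound Dirichlet: a gamma distribution on the tree length (shape alpha_T, rate beta_T) and a Dirichlet on branch-length proportions, with parameter α for external branches and αc for internal branches. The names alpha_T and beta_T, the function names, and the branch lengths are assumptions for illustration. This is not EmpPrior-fit's implementation.

```python
import math


def compound_dirichlet_logpdf(external, internal, alpha_T, beta_T, alpha, c):
    """Log density of branch lengths under an assumed gamma-Dirichlet prior."""
    n_ext, n_int = len(external), len(internal)
    n = n_ext + n_int
    T = sum(external) + sum(internal)  # tree length

    # Gamma(shape=alpha_T, rate=beta_T) log density on the tree length.
    log_gamma = (alpha_T * math.log(beta_T) - math.lgamma(alpha_T)
                 + (alpha_T - 1.0) * math.log(T) - beta_T * T)

    # Dirichlet log density on branch-length proportions t / T.
    a_ext, a_int = alpha, alpha * c
    a_sum = a_ext * n_ext + a_int * n_int
    log_dir = (math.lgamma(a_sum)
               - n_ext * math.lgamma(a_ext) - n_int * math.lgamma(a_int))
    log_dir += sum((a_ext - 1.0) * math.log(t / T) for t in external)
    log_dir += sum((a_int - 1.0) * math.log(t / T) for t in internal)

    # Jacobian of the change of variables from (T, proportions) to lengths.
    return log_gamma + log_dir - (n - 1) * math.log(T)


def profile(grid, loglik):
    """Evaluate loglik on a grid; return the surface and the best grid value."""
    surface = [(v, loglik(v)) for v in grid]
    best = max(surface, key=lambda pair: pair[1])[0]
    return surface, best


# Hypothetical ML branch lengths for an unrooted four-taxon tree.
external = [0.10, 0.20, 0.30, 0.15]
internal = [0.05]

# Surface for c with alpha_T, beta_T, and alpha held fixed at 1.
c_grid = [0.1, 0.25, 0.5, 1.0, 2.0]
c_surface, c_best = profile(
    c_grid,
    lambda c: compound_dirichlet_logpdf(external, internal, 1.0, 1.0, 1.0, c),
)
```

A grid search only locates the maximum to the resolution of the grid; a numerical optimizer would refine the estimate, and the same `profile` call with a grid over α gives the second surface.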

References

    1. Brown JM, Hedtke SM, Lemmon AR, Lemmon EM. When trees grow too long: Investigating the causes of highly inaccurate Bayesian branch-length estimates. Syst Biol. 2010;59:145–161. doi: 10.1093/sysbio/syp081.
    2. Marshall DC. Cryptic failure of partitioned Bayesian phylogenetic analyses: lost in the land of long trees. Syst Biol. 2010;59:108–117. doi: 10.1093/sysbio/syp080.
    3. Rannala B, Zhu T, Yang Z. Tail paradox, partial identifiability and influential priors in Bayesian branch length inference. Mol Biol Evol. 2012;29:325–335. doi: 10.1093/molbev/msr210.
    4. Zhang C, Rannala B, Yang Z. Robustness of compound Dirichlet priors for Bayesian inference of branch lengths. Syst Biol. 2012;61:779–784. doi: 10.1093/sysbio/sys030.
    5. Liang L-J, Weiss RE, Redelings B, Suchard MA. Improving phylogenetic analyses by incorporating additional information from genetic sequence databases. Bioinformatics. 2009;25:2530–6. doi: 10.1093/bioinformatics/btp473.
