Mol Biol Evol. 2017 Aug 1;34(8):2101-2114.
doi: 10.1093/molbev/msx126.

StarBEAST2 Brings Faster Species Tree Inference and Accurate Estimates of Substitution Rates

Huw A Ogilvie et al. Mol Biol Evol. 2017.

Abstract

Fully Bayesian multispecies coalescent (MSC) methods like *BEAST estimate species trees from multiple sequence alignments. Today thousands of genes can be sequenced for a given study, but using that many genes with *BEAST is intractably slow. An alternative is to use heuristic methods, which compromise accuracy or completeness in return for speed. A common heuristic is concatenation, which assumes that the evolutionary history of each gene tree is identical to the species tree. Concatenation is an inconsistent estimator of species tree topology and a worse estimator of divergence times, and it induces spurious substitution rate variation when incomplete lineage sorting is present. Another class of heuristics, directly motivated by the MSC, avoids many of the pitfalls of concatenation but cannot be used to estimate divergence times. To enable fuller use of available data and more accurate inference of species tree topologies, divergence times, and substitution rates, we have developed a new version of *BEAST called StarBEAST2. To improve convergence rates, we add analytical integration of population sizes, novel MCMC operators, and other optimizations. Computational performance improved by 13.5× and 13.8×, respectively, when analyzing two empirical data sets, and by an average of 33.1× across 30 simulated data sets. To enable accurate estimates of per-species substitution rates, we introduce species tree relaxed clocks, and show that StarBEAST2 is a more powerful and robust estimator of rate variation than concatenation. StarBEAST2 is available through the BEAUti package manager in BEAST 2.4 and above.

Keywords: concatenation; incomplete lineage sorting; multispecies coalescent; phylogenetic methods; relaxed clocks; species trees.

Figures

Fig. 1.
Two-species phylogeny used to illustrate species tree relaxed clocks. There are two extant species “A” and “B”, and one ancestral species “AB.” Within the species tree there is a single gene tree with extant individuals “a” and “b.” The single speciation event occurs at time T1, and the single coalescence event occurs at time t1. Gene tree rates are computed according to table 1.
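Table 1 is not reproduced on this page. As a hedged sketch of how gene tree rates could be derived for this two-species example, the Python below assumes that the rate of a gene tree branch is the time-weighted mean of the rates of the species branches it passes through, multiplied by a per-gene clock rate. The function name, the branch rates, and the times T1 = 1.0 and t1 = 1.5 are illustrative assumptions, not values from the paper.

```python
def gene_branch_rate(segments, gene_rate=1.0):
    """Time-weighted mean of species branch rates, scaled by a per-gene rate.

    segments: list of (duration, species_branch_rate) pairs, one pair per
    species branch that the gene tree branch passes through.
    """
    total_time = sum(duration for duration, _ in segments)
    if total_time <= 0:
        raise ValueError("gene tree branch must have positive length")
    weighted = sum(duration * rate for duration, rate in segments)
    return gene_rate * weighted / total_time


# Hypothetical values for the two-species example (all assumed).
T1, t1 = 1.0, 1.5                        # speciation time, coalescence time
rate_A, rate_B, rate_AB = 0.5, 1.5, 1.0  # species branch rates

# Gene branch "a" spends T1 in species A, then t1 - T1 in ancestral species AB.
rate_a = gene_branch_rate([(T1, rate_A), (t1 - T1, rate_AB)])
# Gene branch "b" spends T1 in species B, then t1 - T1 in ancestral species AB.
rate_b = gene_branch_rate([(T1, rate_B), (t1 - T1, rate_AB)])
```

Under this assumption, the further back the coalescence time t1 lies beyond T1, the more each gene branch rate is pulled toward the ancestral rate.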
Fig. 2.
Accuracy of branch substitution rates and lengths inferred by BEAST concatenation and StarBEAST2. Deviation is the difference of each estimated rate and length from the true value. Estimated rates and lengths are the posterior expectation of the overall substitution rate and length for each species tree branch. Black crosses in each panel indicate the point of perfect accuracy. Each panel shows the distributions for the labeled extant or ancestral branch. N = 96.
Fig. 3.
Impact of operators, population size integration and clock models on convergence. The effective sample size (ESS) per hour for a given replicate used the smallest ESS out of all recorded statistics. Topology refers to the replacement of naïve nearest-neighbor interchange and subtree prune and regraft operators with coordinated operators. Height refers to the addition of operators which make coordinated changes to node heights. Uncorrelated log-normal relaxed clocks were applied to each gene tree (GT-UCLN) or to the species tree (ST-UCLN). N = 30.
Fig. 4.
Convergence of different methods applied to simulated and empirical data sets. The effective sample size (ESS) per hour for a given replicate used the slowest ESS rate out of all recorded statistics. Methods are BEAST concatenation, *BEAST, and StarBEAST2 with uncorrelated log-normal relaxed clocks applied to each gene tree (GT-UCLN) or to the species tree (ST-UCLN). Two Pseudacris *BEAST outliers with ESS rates below 0.1 are not shown. N = 30.
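The convergence measure used in the two captions above can be sketched as follows: take the smallest ESS among all logged statistics and divide by the wall-clock run time in hours. The statistic names and numbers are made up for illustration, and the ESS values are assumed to have been computed beforehand by a separate tool.

```python
def ess_per_hour(ess_by_statistic, runtime_seconds):
    """Smallest ESS across all recorded statistics, divided by run time in hours."""
    if not ess_by_statistic:
        raise ValueError("need at least one recorded statistic")
    if runtime_seconds <= 0:
        raise ValueError("run time must be positive")
    slowest = min(ess_by_statistic.values())
    return slowest / (runtime_seconds / 3600.0)


# Hypothetical ESS values for one replicate that ran for two hours.
example = {
    "posterior": 850.0,
    "species_tree_height": 420.0,
    "clock_rate_stdev": 210.0,
}
rate = ess_per_hour(example, runtime_seconds=7200)  # 210 / 2 hours
```

Using the minimum rather than the mean makes the measure conservative: a chain counts as converging only as fast as its worst-mixing statistic.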
Fig. 5.
Coverage and accuracy of species branch lengths using different methods. Methods are StarBEAST2, *BEAST, and BEAST concatenation with uncorrelated log-normal relaxed clocks applied to each gene tree (GT-UCLN) or to the species tree (ST-UCLN). (A, B) The percentages of true branch lengths present within the corresponding 95% highest posterior density (HPD) credible intervals. (C, D) The difference between the sum of estimated branch lengths and the sum of true branch lengths as a percentage of the sum of true branch lengths. (E, F) The sum of absolute differences between estimated and simulated branch lengths as a percentage of true tree length. N = 30.
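The three branch length measures in this caption can be written out as a minimal sketch. It assumes that the true lengths, the point estimates, and the 95% HPD intervals are already available as parallel lists; the function names and example numbers are hypothetical.

```python
def coverage_percent(true_lengths, hpd_intervals):
    """Percentage of true branch lengths that fall inside their HPD interval."""
    inside = sum(lo <= t <= hi for t, (lo, hi) in zip(true_lengths, hpd_intervals))
    return 100.0 * inside / len(true_lengths)


def tree_length_bias_percent(estimated, true_lengths):
    """Signed error in total tree length, as a percentage of the true tree length."""
    return 100.0 * (sum(estimated) - sum(true_lengths)) / sum(true_lengths)


def absolute_error_percent(estimated, true_lengths):
    """Sum of absolute per-branch errors, as a percentage of the true tree length."""
    total_abs = sum(abs(e - t) for e, t in zip(estimated, true_lengths))
    return 100.0 * total_abs / sum(true_lengths)


# Made-up values for a three-branch tree.
true_lengths = [1.0, 2.0, 1.0]
estimated = [1.5, 1.5, 1.0]
hpds = [(0.5, 2.0), (1.0, 1.8), (0.8, 1.2)]

coverage = coverage_percent(true_lengths, hpds)
bias = tree_length_bias_percent(estimated, true_lengths)
abs_error = absolute_error_percent(estimated, true_lengths)
```

With these numbers an overestimate and an underestimate cancel in the signed measure (0%) but not in the absolute one (25%), which is why the two kinds of panel can tell different stories.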
Fig. 6.
Coverage and accuracy of species tree topologies using different methods. Methods are StarBEAST2, *BEAST, and BEAST concatenation with uncorrelated log-normal relaxed clocks applied to each gene tree (GT-UCLN) or to the species tree (ST-UCLN). (A) The percentage of true species tree topologies within the 95% credible set of topologies. (B) The average rooted Robinson–Foulds (RF) distance between the maximum clade credibility (MCC) species tree topology and the simulated true topology. Error bars are 95% confidence intervals calculated by bootstrapping. N = 30.
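As a hedged illustration of the topology measure in panel B, the sketch below computes a rooted RF distance as the number of clades found in one tree but not the other, and a percentile bootstrap interval for the mean distance. The nested-tuple tree encoding, the percentile method, and the example trees are assumptions; the caption does not specify the bootstrap variant, and some tools report half of this count.

```python
import random


def clades(tree):
    """Set of clades (frozensets of tip labels) in a nested-tuple rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):  # a tip label
            return frozenset([node])
        tips = frozenset().union(*(visit(child) for child in node))
        found.add(tips)
        return tips

    visit(tree)
    return found


def rooted_rf(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


def bootstrap_mean_ci(values, n_boot=2000, seed=1):
    """Percentile bootstrap 95% interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]


# Hypothetical four-species trees that disagree on one clade.
true_tree = ((("A", "B"), "C"), "D")
mcc_tree = ((("A", "C"), "B"), "D")
distance = rooted_rf(true_tree, mcc_tree)

# Hypothetical per-replicate distances.
low, high = bootstrap_mean_ci([0, 0, 2, 0, 2, 0, 0, 4, 0, 2])
```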
Fig. 7.
Estimates of species tree branch rates using BEAST concatenation versus StarBEAST2. Estimated rates are the posterior expectations of each branch rate from each replicate. Root branch rates, which were fixed at 1, were excluded. In blue are simple linear regression lines of best fit, and in red are the y=x lines showing a perfect relationship between estimates and truth. N = 30.
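The regression lines described here are ordinary least-squares fits. The sketch below shows the calculation with true rates on the x axis and estimated rates on the y axis; that axis assignment and the numbers are assumptions for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up branch rates: the estimates are compressed toward the mean.
true_rates = [0.5, 1.0, 1.5, 2.0]
estimated_rates = [0.75, 1.0, 1.25, 1.5]
slope, intercept = fit_line(true_rates, estimated_rates)
```

A fitted line that coincides with y=x (slope 1, intercept 0) would indicate unbiased estimates; a slope below 1, as in this made-up example, means that rate variation is underestimated.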
