PLoS Comput Biol. 2014 Dec 4;10(12):e1003919.
doi: 10.1371/journal.pcbi.1003919. eCollection 2014 Dec.

Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration

Alexandra Gavryushkina et al. PLoS Comput Biol.

Abstract

Phylogenetic analyses that include fossils or molecular sequences sampled through time require models that allow one sample to be a direct ancestor of another sample. Previously available phylogenetic inference tools assume that all samples are tips, so they do not allow for this possibility. We have developed and implemented a Bayesian Markov Chain Monte Carlo (MCMC) algorithm to infer what we call sampled ancestor trees, that is, trees in which sampled individuals can be direct ancestors of other sampled individuals. We use a family of birth-death models where individuals may remain in the tree process after sampling; in particular, we extend the birth-death skyline model [Stadler et al., 2013] to sampled ancestor trees. This method allows the detection of sampled ancestors as well as estimation of the probability that an individual will be removed from the process when it is sampled. We show that even if sampled ancestors are not of specific interest in an analysis, failing to account for them leads to significant bias in parameter estimates. We also show that sampled ancestor birth-death models where every sample comes from a different time point are non-identifiable and thus require one parameter to be known in order to infer the other parameters. We apply our phylogenetic inference accounting for sampled ancestors to epidemiological data, where the possibility of sampled ancestors enables us to identify individuals that infected other individuals after being sampled and to infer fundamental epidemiological parameters. We also apply the method to infer divergence times and diversification rates when fossils are included along with extant species samples, so that fossilisation events are modelled as a part of the tree branching process. Such modelling has many advantages, as argued in the literature. The sampler is available as an open-source BEAST2 package (https://github.com/CompEvol/sampled-ancestors).

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Full tree versus reconstructed tree.
A full tree produced by the sampled ancestor birth-death process on the left and a reconstructed tree on the right. The sampled nodes are indicated by dots labeled by letters A through H. Nodes A, B and D are sampled ancestors. The reconstructed tree is represented by a sampled ancestor tree, which consists of a ranked tree topology together with the node ages. In the reconstructed tree the root is a sampled node. In the skyline model, birth-death parameters vary from interval to interval. There are two intervals in this figure, bounded by the time of origin, the parameter shift time, and the present time. A separate set of four parameters applies in each of the two intervals. There are additional sampling attempts at two time points, each with its own sampling probability.
Figure 2. The Wilson Balding operator.
The operator proposes a sampled ancestor tree topology and node ages and may propose a tree of larger or smaller dimension (the number of nodes in the tree) than the original tree. First, it prunes a subtree rooted at an edge (blue edge), either from a branch, coloured black, in case a.1 or from a node, coloured black, in case a.2. Then it attaches the subtree either to an edge (black edge) at a random height in case b.1 or to a leaf (black node) in case b.2. Case a.1 followed by b.2 removes a node from the tree, and case a.2 followed by b.1 introduces a new node into the tree.
Figure 3. Properties of the tree estimated from simulated data (fossilized birth-death process).
The graph shows median estimates (black dots) and 95% HPD intervals (grey lines) against true values for the tree height (on the left) and number of sampled ancestors (on the right). The upper row shows the estimates obtained from the analyses of simulated sequence data of all sampled nodes and the bottom row shows the estimates from the analyses where only sequence data from the extant samples was used.
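The 95% HPD (highest posterior density) intervals reported in these figures can be approximated from MCMC samples as the shortest interval that contains 95% of the sampled values. The following is a minimal sketch of that standard empirical estimate; the function name `hpd_interval` and the sample values are hypothetical and do not come from the paper or from BEAST2.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Return (low, high) of the shortest interval holding `mass` of the samples.

    Illustrative empirical estimate only; not the code used in the paper.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    xs = sorted(samples)
    n = len(xs)
    # Number of consecutive sorted samples the interval must cover.
    k = max(1, math.ceil(mass * n))
    best_low, best_high = xs[0], xs[k - 1]
    # Slide a window of k samples and keep the narrowest one.
    for i in range(1, n - k + 1):
        low, high = xs[i], xs[i + k - 1]
        if high - low < best_high - best_low:
            best_low, best_high = low, high
    return best_low, best_high


# Hypothetical posterior samples of a tree height, for illustration only.
example_samples = [0, 10, 11, 12, 13, 14, 50, 60, 70, 80]
example_interval = hpd_interval(example_samples, mass=0.5)
```

Plotting such intervals against the true simulated values, as in the figure, shows how often the interval covers the truth.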
Figure 4. Uncertainty in estimates for simulated data (fossilized birth-death process).
The graph shows the widths of relative 95% HPD intervals of the turnover rate against tree sizes for the simulated fossilized birth-death process. The black dots are the interval widths for posterior distributions obtained from the analyses of simulated sequence data of all sampled nodes, and the red triangles are the interval widths from the analyses of sequence data of only extant samples.
Figure 5. Parameter estimates for simulated data (transmission process).
The graph shows median estimates (black dots) and 95% HPD intervals (grey lines) against true values for the turnover rate (on the left) and the removal probability (on the right).
Figure 6. ROC curve for identifying sampled ancestors based on simulated data (transmission process).
The posterior distribution of trees obtained from a Bayesian MCMC analysis of simulated sequence data can be used to detect sampled ancestors. We identify a node as being a sampled ancestor if the posterior probability that the node is a sampled ancestor is greater than some threshold. The curve is parameterised by the threshold and shows the trade-off between true positive rate (sensitivity) and false positive rate (1 − specificity) for different values of the threshold (any increase in sensitivity will be accompanied by a decrease in specificity). The dashed diagonal line corresponds to a ‘random guess’ test. The closer the ROC curve is to the upper-left border of the ROC space (the whole area of the graph), the more accurate the test. The optimal value of the threshold for this curve is 0.45.
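The thresholding rule in this caption can be sketched in a few lines. The example below computes the false and true positive rates for a given threshold and then scans a grid of thresholds. The caption does not say how the optimal threshold was chosen, so picking it with Youden's J statistic (true positive rate minus false positive rate) is an assumption made here, and the function names and data are hypothetical.

```python
def rates(posteriors, is_ancestor, threshold):
    """Return (false_positive_rate, true_positive_rate) at one threshold.

    A node is called a sampled ancestor when its posterior probability
    is strictly greater than the threshold, as in the caption.
    """
    positives = sum(1 for truth in is_ancestor if truth)
    negatives = len(is_ancestor) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true ancestor and one non-ancestor")
    tp = sum(1 for p, truth in zip(posteriors, is_ancestor) if p > threshold and truth)
    fp = sum(1 for p, truth in zip(posteriors, is_ancestor) if p > threshold and not truth)
    return fp / negatives, tp / positives


def best_threshold(posteriors, is_ancestor, thresholds):
    """Pick the threshold maximising Youden's J = TPR - FPR.

    ASSUMPTION: the source does not state its optimality criterion.
    Ties are resolved in favour of the first threshold in `thresholds`.
    """
    def youden_j(threshold):
        fpr, tpr = rates(posteriors, is_ancestor, threshold)
        return tpr - fpr

    return max(thresholds, key=youden_j)


# Hypothetical posterior probabilities and true labels, for illustration only.
example_posteriors = [0.9, 0.8, 0.7, 0.5, 0.4, 0.2, 0.1]
example_truth = [True, True, True, False, True, False, False]
example_grid = [i / 100 for i in range(101)]

# One (FPR, TPR) point per threshold traces out the ROC curve.
example_curve = [rates(example_posteriors, example_truth, t) for t in example_grid]
example_best = best_threshold(example_posteriors, example_truth, example_grid)
```

In a simulation study the true labels are known, which is what makes this kind of calibration of the threshold possible.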
Figure 7. Divergence time estimates for the bear dataset.
The estimates are obtained from the analyses with DPPDiv (left bars with blue dots) and BEAST2 (right bars with red dots) implementations of the fossilised birth-death model, which give the same results. The bars are 95% HPD intervals and the dots are mean estimates. The node numbering follows the original analysis: nodes 1 and 2 represent the most recent common ancestors of the bear clade and two outgroups (gray wolf and spotted seal). Node 3 is the most recent common ancestor of all living bear species, and nodes 4-9 are the divergence times within the bear clade.
Figure 8. A tree sampled from the posterior of the HIV-1 dataset analysis.
The tree exhibits three estimated sampled ancestors shown as red circles. The samples with positive posterior probabilities of being sampled ancestors are shown in colour (red for the nodes with evidence of being sampled ancestors and blue for other nodes with non-zero probabilities), with the posterior probabilities in round brackets.
