Virus Evol. 2016 Mar 2;2(1):vew003.
doi: 10.1093/ve/vew003. eCollection 2016 Jan.

The effects of sampling strategy on the quality of reconstruction of viral population dynamics using Bayesian skyline family coalescent methods: A simulation study

Matthew D Hall et al. Virus Evol. 2016.

Abstract

The ongoing large-scale increase in the total amount of genetic data for viruses and other pathogens has led to a situation in which it is often not possible to include every available sequence in a phylogenetic analysis and expect the procedure to complete in a reasonable computational time. This raises questions about how a set of sequences should be selected for analysis, particularly if the data are used to infer more than just the phylogenetic tree itself. The design of sampling strategies for molecular epidemiology has been a neglected field of research. This article describes a large-scale simulation exercise that was undertaken to select an appropriate strategy when using the GMRF skygrid, one of the Bayesian skyline family of coalescent methods, to reconstruct past population dynamics. The simulated scenarios were intended to represent sampling for the population of an endemic virus across multiple geographical locations. Large phylogenies were simulated under a coalescent or structured coalescent model, and sequences were simulated from these trees; the resulting datasets were then downsampled for analysis according to a variety of schemes. Variation in results between different replicates of the same scheme was appreciable, and as a result, we recommend that, where possible, analyses be repeated with different datasets in order to establish that elements of a reconstruction are not simply the result of the particular set of samples selected. We show that an individual stochastic choice of sequences can introduce spurious behaviour in the median line of the skygrid plot and that even marginal likelihood estimation can suggest complicated dynamics that were not in fact present. We recommend that the median line should not be used to infer historical events on its own.
Sampling sequences with uniform probability with respect to both time and spatial location (deme) never performed worse than sampling with probability proportional to the effective population size at that time and in that location and frequently was superior. As a result, we recommend this approach in the design of future studies. We also confirm that the inclusion of many recent sequences from a single geographical location in an analysis tends to result in a spurious bottleneck effect in the reconstruction and caution against interpreting this as genuine.
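The contrast between the two downsampling schemes compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code, and it rests on one reading of the abstract: sequences are grouped into strata by (time bin, deme), each draw first picks a stratum and then a sequence within it, "uniform" means every non-empty stratum is equally likely, and "proportional" means strata are weighted by a user-supplied value standing in for effective population size. The function name `stratified_downsample`, the record layout, and the weights are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_downsample(records, n, stratum_weight=None, seed=None):
    """Draw n sequence IDs without replacement.

    records: iterable of (seq_id, time_bin, deme) tuples (hypothetical layout).
    stratum_weight: optional dict mapping (time_bin, deme) -> positive weight,
        e.g. an assumed effective population size. If None, every non-empty
        stratum is equally likely to be chosen (the "uniform" scheme).
    """
    records = list(records)
    if n > len(records):
        raise ValueError("cannot draw more sequences than are available")

    rng = random.Random(seed)

    # Group sequence IDs by (time bin, deme) stratum.
    pools = defaultdict(list)
    for seq_id, time_bin, deme in records:
        pools[(time_bin, deme)].append(seq_id)

    chosen = []
    while len(chosen) < n:
        # Only strata that still hold unsampled sequences are eligible.
        strata = [s for s, pool in pools.items() if pool]
        if stratum_weight is None:
            weights = [1.0] * len(strata)
        else:
            weights = [stratum_weight[s] for s in strata]
        stratum = rng.choices(strata, weights=weights, k=1)[0]
        pool = pools[stratum]
        # Pick one sequence uniformly within the chosen stratum.
        chosen.append(pool.pop(rng.randrange(len(pool))))
    return chosen


# Toy data: 3 time bins x 2 demes x 10 sequences per stratum.
toy = [(f"seq_{t}_{d}_{i}", t, d)
       for t in range(3) for d in ("small", "large") for i in range(10)]

# Uniform over strata.
uniform_sample = stratified_downsample(toy, 12, seed=1)

# Proportional to assumed (made-up) population sizes per stratum.
sizes = {(t, d): (100.0 if d == "large" else 10.0)
         for t in range(3) for d in ("small", "large")}
proportional_sample = stratified_downsample(toy, 12, stratum_weight=sizes, seed=1)
```

Calling the function repeatedly with different seeds gives the kind of replicate datasets that the abstract recommends analysing in order to check that a feature of a reconstruction is not an artefact of one particular draw.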

Keywords: coalescent; phylodynamics; sampling; simulation.

Figures

Figure 1.
Depiction of the population structure used in structured coalescent simulations. Circles represent demes; two are small, two medium, and two large. Thick arrows represent fast rates of movement between demes (0.05 transitions per lineage per year) and thin arrows slower rates (0.025 per lineage per year).
Figure 2.
Overlaid median lines of reconstructed skygrid plots for 50 replicates of the uniform temporal sampling scheme in Scenario 1. The black line is the true population size.
Figure 3.
KDEs for the distribution of statistics indicating the accuracy and precision of the skygrid reconstructions in Scenario 1.
Figure 4.
Skygrid reconstructions for the 50 replicates of the uniform sampling scheme in Scenario 1, sorted by increasing percent error. The black line is the true EPS, the dark blue line the median estimate, and the 95 per cent HPD region is in light blue.
Figure 5.
Scatter plots depicting the relationship between sample size and statistics used to evaluate the performance of the skygrid reconstruction for 100 replicates of the uniform sampling scheme in Scenario 1. The red line represents the best-fit model determined by weighted least squares regression and corrected Akaike information criterion and the grey lines the limit of the 95 per cent confidence interval given by the fitted values of the error term of this model. (a) plots adjusted percent error against sample size and (b) HPD size against sample size.
Figure 6.
KDEs for the distribution of statistics indicating the accuracy and precision of the skygrid reconstructions in Scenario 2. Distributions are coloured by sampling scheme.
Figure 7.
KDEs for the distribution of statistics indicating the accuracy and precision of the skygrid reconstructions in Scenario 3. Distributions are coloured by sampling scheme.
Figure 8.
Overlaid median lines of 50 reconstructed skygrid plots for Scenario 5, where additional samples are selected from one deme in the last 0.25 years of the timeline. The black line is the true population size. Plot titles refer to the oversampled deme and its size.
Figure 9.
KDEs for the distribution of statistics indicating the accuracy and precision of the skygrid reconstructions in Scenario 4. Each graph plots the distributions of each statistic for several temporal sampling schemes when the spatial sampling scheme is fixed.
Figure 10.
KDEs for the distribution of statistics indicating the accuracy and precision of the skygrid reconstructions in Scenario 4. Each graph plots the distributions of each statistic for several spatial sampling schemes when the temporal sampling scheme is fixed.
Figure 11.
KDEs for the distribution of statistics indicating the accuracy and precision of the skygrid reconstructions in Scenario 5. Distributions are coloured by sampling scheme.
