BMC Bioinformatics. 2015 Nov 4;16:357.
doi: 10.1186/s12859-015-0810-y.

Estimation of evolutionary parameters using short, random and partial sequences from mixed samples of anonymous individuals

Steven H Wu et al. BMC Bioinformatics.

Abstract

Background: Over the last decade, next generation sequencing (NGS) has become widely available, and is now the sequencing technology of choice for most researchers. Nonetheless, NGS presents a challenge for evolutionary biologists who wish to estimate evolutionary genetic parameters from a mixed sample of unlabelled or untagged individuals, especially when the reconstruction of full-length haplotypes can be unreliable. We propose two novel approaches, least squares estimation (LS) and Approximate Bayesian Computation Markov chain Monte Carlo estimation (ABC-MCMC), to infer evolutionary genetic parameters from a collection of short-read sequences obtained from a mixed sample of anonymous DNA. Both approaches use only the frequencies of nucleotides at each site, without reconstructing either the full-length alignment or the phylogeny.
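As an illustration of the input both approaches rely on, the sketch below tabulates sitewise nucleotide frequencies from mapped short reads. It is a minimal example and not the authors' code: the function name `site_frequencies`, the `(start, sequence)` read representation with 0-based start positions, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def site_frequencies(reads, genome_length):
    """Return per-site nucleotide frequencies from mapped short reads.

    `reads` is an iterable of (start, sequence) pairs, where `start` is the
    0-based position of the first base (an assumed, hypothetical format).
    Sites with no coverage are reported as None.
    """
    counts = [Counter() for _ in range(genome_length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            # Ignore bases that fall outside the region or are ambiguous.
            if 0 <= pos < genome_length and base in NUCLEOTIDES:
                counts[pos][base] += 1

    freqs = []
    for site in counts:
        depth = sum(site.values())
        if depth == 0:
            freqs.append(None)
        else:
            freqs.append({b: site[b] / depth for b in NUCLEOTIDES})
    return freqs


# Toy data: four short reads covering a region of length 5.
reads = [(0, "ACG"), (1, "CGT"), (1, "TGT"), (2, "GT")]
freqs = site_frequencies(reads, 5)
```

Each base of each read is visited once, so the cost of building this summary grows with the total number of sequenced bases rather than with the number of underlying individuals.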

Results: We used simulations to evaluate the performance of these algorithms, and our results demonstrate that LS performs poorly because bootstrap 95% Confidence Intervals (CIs) tend to under- or over-estimate the true values of the parameters. In contrast, the 95% Highest Posterior Density (HPD) intervals recovered from ABC-MCMC enclosed the true parameter values at a rate approximately equivalent to that obtained using BEAST, a program that implements Bayesian MCMC estimation of evolutionary parameters using full-length sequences. Because information is lost when sitewise nucleotide frequencies alone are used, the ABC-MCMC 95% HPDs are wider than those obtained by BEAST.
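For readers unfamiliar with HPD intervals, the following sketch shows one common way to estimate one from MCMC samples: take the narrowest window that contains the requested fraction of the sorted samples. The function name `hpd_interval` is hypothetical, and this is a generic estimator for unimodal posteriors, not necessarily the procedure used by the authors or by BEAST.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    # Number of samples each candidate window must contain.
    k = max(1, math.ceil(mass * n))
    best_lo, best_hi = xs[0], xs[k - 1]
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


# Tiny example: the narrowest window holding half of the four samples.
lo, hi = hpd_interval([10, 1, 2.5, 2], mass=0.5)
```

A wider HPD for the same coverage, as reported for ABC-MCMC relative to BEAST, corresponds to a larger `hi - lo` from a calculation of this kind.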

Conclusion: We propose two novel algorithms to estimate evolutionary genetic parameters based on the proportion of each nucleotide at each site. The LS method cannot be recommended as a standalone method for evolutionary parameter estimation. On the other hand, parameters recovered by ABC-MCMC are comparable to those obtained using BEAST, but with larger 95% HPDs. One major advantage of ABC-MCMC is that computational time scales linearly with the number of short-read sequences, and is independent of the number of full-length sequences in the original data. This allows us to perform the analysis on NGS datasets with large numbers of short-read fragments. The source code for ABC-MCMC is available at https://github.com/stevenhwu/SF-ABC.
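To make the general ABC-MCMC idea concrete, here is a toy sketch of a likelihood-free Metropolis sampler with a uniform prior and a symmetric proposal. It is not the SF-ABC implementation and does not model population size or mutation rate: the problem (inferring a single nucleotide frequency from read counts), the function names, the tolerance `eps`, the step size, and the starting value are all assumptions chosen for illustration.

```python
import random


def abc_mcmc(observed, simulate, distance, prior_lo, prior_hi,
             start, step, eps, n_iter, rng):
    """Likelihood-free MCMC with a uniform prior and symmetric proposal.

    A proposal is accepted only if it lies inside the prior bounds and data
    simulated under it fall within `eps` of the observed summary statistic.
    """
    theta = start
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.uniform(-step, step)
        if prior_lo <= proposal <= prior_hi:
            simulated = simulate(proposal, rng)
            if distance(simulated, observed) <= eps:
                theta = proposal
        chain.append(theta)
    return chain


N_READS = 200  # hypothetical read depth at one site


def simulate_frequency(p, rng):
    """Simulate the observed frequency of one nucleotide among N_READS reads."""
    hits = sum(rng.random() < p for _ in range(N_READS))
    return hits / N_READS


def abs_distance(a, b):
    return abs(a - b)


rng = random.Random(1)
observed_freq = 0.3  # toy summary statistic, not real data
chain = abc_mcmc(observed_freq, simulate_frequency, abs_distance,
                 prior_lo=0.0, prior_hi=1.0, start=0.3, step=0.05,
                 eps=0.03, n_iter=5000, rng=rng)
posterior = chain[len(chain) // 10:]  # drop the first 10% as burn-in
posterior_mean = sum(posterior) / len(posterior)
```

The sampler never evaluates a likelihood; it only simulates summary statistics and compares them with the observed ones, so each iteration costs one simulation of the summary rather than a calculation over full-length sequences or a tree.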

Figures

Fig. 1
Flow chart of the full ABC-MCMC algorithm.
Fig. 2
Plot of 95% Confidence Intervals and 95% Highest Posterior Densities of population size recovered using the LS bootstrap, ABC-MCMC and BEAST. The green lines are the 95% CIs of LS bootstraps, the red lines are the 95% HPDs of ABC-MCMC, and the blue lines are the 95% HPDs obtained using BEAST. The true value of population size is shown as a solid black line. Note that the vertical axis is measured on a log scale.
Fig. 3
Plot of 95% Confidence Intervals and 95% Highest Posterior Densities of mutation rate recovered using the LS bootstrap, ABC-MCMC and BEAST. The green lines are the 95% CIs of LS bootstraps, the red lines are the 95% HPDs of ABC-MCMC, and the blue lines are the 95% HPDs obtained using BEAST. The true value of mutation rate is shown as a solid black line.
Fig. 4
Trace plot from ABC-MCMC for both effective population size and mutation rate after removing the first 10% of the generations as burn-in. This demonstrates that the MCMC chain mixes well.
Fig. 5
The prior and posterior distributions for the effective population size from ABC-MCMC.
