Genome Res. 2015 Feb;25(2):268-79.
doi: 10.1101/gr.178756.114. Epub 2015 Jan 6.

Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data

Anand Bhaskar et al. Genome Res. 2015 Feb.

Abstract

With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal that is difficult to pick up with small sample sizes. Lastly, we use our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing data set of tens of thousands of individuals assayed at a few hundred genic regions.
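
As a rough illustration of the kind of analytic result the abstract refers to, the sketch below covers only the simplest textbook case, a constant population size, where the expected number of sites with i derived alleles in a sample of n sequences is θ/i. It is not the authors' algorithm, which handles piecewise-exponential size histories and uses automatic differentiation. The function names and the example counts are hypothetical, and the Poisson composite likelihood is an assumed modeling choice.

```python
import math


def expected_sfs_constant(n, theta):
    """Expected count of sites with i derived alleles (i = 1..n-1)
    under a constant-size coalescent: theta / i."""
    return [theta / i for i in range(1, n)]


def poisson_log_likelihood(observed, expected):
    """Composite log-likelihood that treats each frequency class
    as an independent Poisson count (an assumed model)."""
    return sum(
        x * math.log(e) - e - math.lgamma(x + 1)
        for x, e in zip(observed, expected)
    )


def watterson_theta(observed):
    """Closed-form maximizer of the Poisson likelihood above for the
    constant-size case: segregating sites / harmonic number H_{n-1}."""
    n = len(observed) + 1
    harmonic = sum(1.0 / i for i in range(1, n))
    return sum(observed) / harmonic


# Made-up counts for a sample of n = 4 sequences (illustration only).
observed = [6, 3, 2]
theta_hat = watterson_theta(observed)
best = poisson_log_likelihood(observed, expected_sfs_constant(4, theta_hat))
```

For a non-constant size history the expected spectrum has no such simple form, so the likelihood would have to be maximized numerically; that is where exact gradients become useful.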

Figures

Figure 1.
Population size N(t) as a function of time (measured in generations) for (A) several choices of t1 and r1 in Scenario 1, and (B) Scenario 2. The present time corresponds to t = 0.
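
A piecewise-exponential size history like the ones plotted in this figure can be sketched as follows. This is a minimal illustration under stated assumptions: time t is measured in generations before the present, growth is continuous at a given rate per generation (so size scales as exp(rate × elapsed time)), and the ancestral size of 10,000 is a placeholder. The function name and the epoch encoding are hypothetical; t1 = 100 and r1 = 6.4% are taken from the Figure 4 caption.

```python
import math


def population_size(t, n_ancestral, epochs):
    """Population size t generations before the present (t = 0 is today).

    epochs: list of (onset, rate) pairs ordered from oldest to most recent;
    each epoch grows exponentially at `rate` per generation from its onset
    until the next epoch's onset, or until the present for the last epoch.
    Before the oldest onset the size is constant at n_ancestral.
    """
    size = n_ancestral
    for k, (onset, rate) in enumerate(epochs):
        if t >= onset:
            break  # t predates this epoch, so no further growth applies
        end = epochs[k + 1][0] if k + 1 < len(epochs) else 0.0
        elapsed = onset - max(t, end)  # generations of this epoch before t
        size *= math.exp(rate * elapsed)
    return size


# One growth epoch: t1 = 100 generations at r1 = 6.4% per generation.
one_epoch = [(100.0, 0.064)]
size_now = population_size(0.0, 10_000, one_epoch)
size_before_growth = population_size(150.0, 10_000, one_epoch)
```

A two-epoch history, as in Scenario 2, would be passed as two (onset, rate) pairs with the older epoch listed first.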
Figure 2.
Performance of our method on simulated data. Each violin plot is generated using 100 simulated data sets with 100 unlinked loci of 10 kb each over 10,000 diploid individuals. The gray solid horizontal lines indicate the true values for the simulation parameters. The median inferred parameter values, indicated by dashed black lines, match the true parameter values very well. Panels A and B, respectively, show violin plots of the duration and rate of exponential growth in the population size for each of the six simulation parameter settings of Scenario 1, illustrated in Figure 1A. Panels C and D show violin plots of the onset times (t1 and t2) and exponential growth rates (r1 and r2) for the two epochs of exponential growth in Scenario 2, illustrated in Figure 1B.
Figure 3.
Performance of fastsimcoal (Excoffier et al. 2013) on simulated data for Scenario 1. Panels A and B, respectively, show violin plots of the inferred duration and rate of exponential growth in the population size for 100 simulated data sets for each of the simulation parameter settings in Scenario 1. These are the same simulated data sets used to generate Figure 2, A and B and Supplemental Figure S1. The gray solid horizontal lines indicate the true values for the simulation parameters. When applying fastsimcoal, for computational reasons we used 200 and 500 coalescent tree simulations per likelihood function estimation for Scenario 1 and Scenario 2, respectively, and limited the number of rounds of conditional expectation maximization (ECM cycles) to 40. On one of these 100 simulated data sets, fastsimcoal appeared to exhibit runaway behavior and produced unreasonable estimates after 40 ECM cycles; this data set was excluded from these plots.
Figure 4.
Mutation rates inferred by our method. (A) Inferred mutation rates for simulated data sets with 100 loci from 10,000 diploids under Scenario 1 with t1 = 100 and r1 = 6.4%. The mutation rates at the 100 loci were drawn randomly from the range [1.1 × 10⁻⁸, 3.8 × 10⁻⁸]. The loci are sorted in ascending order of the simulated mutation rates. The increasing solid line indicates the mutation rates used in the simulation, while the circle and the vertical bars, respectively, denote the median and one standard deviation of the inferred mutation rate over 100 simulated data sets. (B) Inferred mutation rates for each of the 185 genes in the exome-sequencing data set of Nelson et al. (2012). The solid line connects our point estimates for the mutation rate, while the light vertical bars denote 95% confidence intervals that were constructed by a resampling block bootstrap procedure with 1000 bootstrap samples. The dashed line connects the point estimates of the mutation rate inferred by Nelson et al. (2012). While the mutation rates estimated by our method and that of Nelson and coworkers are very close to each other, the mutation rates estimated by our method are systematically higher at each locus owing to the lower population expansion rate inferred by our method.
Figure 5.
Calibration plots for asymptotic (A,B) and bootstrap (C,D) confidence intervals of the duration and rate of exponential growth for Scenario 1 with t1 = 100 generations and r1 = 6.4% per generation, for 200 simulated data sets of 10,000 diploids, each with 100 unlinked loci of length m. For each confidence level α on the x-axis, the y-axis shows the fraction of data sets where the true parameter value lies outside the predicted 100(1 − α)% confidence interval. The straight black lines denote the plot that would be obtained from an idealized confidence interval estimation procedure. (A,B) Asymptotic confidence interval calibration plots for the inferred (A) duration and (B) rate of exponential growth. As the locus length m increases, linkage disequilibrium causes the composite log-likelihood approximation in Equation 9 to become increasingly inaccurate, thus leading to poorly calibrated asymptotic confidence intervals for m = 10 kb. (C,D) Bootstrap confidence interval calibration plots using 200 bootstrap replicates per simulated data set for the inferred (C) duration and (D) rate of exponential growth. The bootstrap confidence intervals are much better calibrated than those produced by the asymptotic confidence interval estimation procedure.
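
The quantity on the y-axis of such a calibration plot can be computed along the following lines. This is a hedged sketch, not the authors' procedure: it assumes percentile bootstrap intervals, whereas the caption does not say how the bootstrap intervals were formed, and the function names and toy numbers are hypothetical.

```python
import math


def percentile_interval(replicates, alpha):
    """100(1 - alpha)% percentile interval from bootstrap replicate
    estimates (assumed interval type; order statistics are rounded
    outward so the interval is never narrower than requested)."""
    ordered = sorted(replicates)
    last = len(ordered) - 1
    lower = ordered[math.floor((alpha / 2) * last)]
    upper = ordered[math.ceil((1 - alpha / 2) * last)]
    return lower, upper


def miss_fraction(true_value, replicate_sets, alpha):
    """Fraction of data sets whose interval fails to cover true_value.
    A well-calibrated procedure gives a value close to alpha."""
    misses = 0
    for replicates in replicate_sets:
        lower, upper = percentile_interval(replicates, alpha)
        if not (lower <= true_value <= upper):
            misses += 1
    return misses / len(replicate_sets)


# Toy input: two "data sets", each with five bootstrap replicate estimates.
toy_sets = [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
fraction = miss_fraction(2, toy_sets, alpha=0.5)
```

Evaluating miss_fraction over a grid of α values and plotting it against α gives a curve that can be compared with the diagonal expected from an ideal procedure.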

References

    1. Balding DJ, Nichols RA. 1997. Significant genetic correlations among Caucasians at forensic DNA loci. Heredity 78: 583–589.
    2. Bhaskar A, Song YS. 2014. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann Stat 42: 2469–2493.
    3. Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, et al. 2008. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 4: e1000083.
    4. Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, Altshuler D, Ardlie KG, Hirschhorn JN. 2005. Demonstrating stratification in a European American population. Nat Genet 37: 868–872.
    5. Chen H. 2012. The joint allele frequency spectrum of multiple populations: a coalescent theory approach. Theor Popul Biol 81: 179–195.
