PLoS Genet. 2016 Mar 4;12(3):e1005877.

doi: 10.1371/journal.pgen.1005877. eCollection 2016 Mar.

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Computation Approach

Simon Boitard¹,², Willy Rodríguez³, Flora Jay⁴,⁵, Stefano Mona¹, Frédéric Austerlitz⁴

Affiliations

¹ Institut de Systématique, Évolution, Biodiversité ISYEB - UMR 7205 - CNRS & MNHN & UPMC & EPHE, Ecole Pratique des Hautes Etudes, Sorbonne Universités, Paris, France.
² GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France.
³ UMR CNRS 5219, Institut de Mathématiques de Toulouse, Université de Toulouse, Toulouse, France.
⁴ UMR 7206 Eco-anthropologie et Ethnobiologie, Muséum National d'Histoire Naturelle, CNRS, Université Paris Diderot, Paris, France.
⁵ LRI, Paris-Sud University, CNRS UMR 8623, Orsay, France.

PMID: 26943927
PMCID: PMC4778914
DOI: 10.1371/journal.pgen.1005877

Simon Boitard et al. PLoS Genet. 2016.

Abstract

Inferring the ancestral dynamics of effective population size is a long-standing question in population genetics, which can now be tackled much more accurately thanks to the massive genomic data available in many species. Several promising methods that take advantage of whole-genome sequences have been recently developed in this context. However, they can only be applied to rather small samples, which limits their ability to estimate recent population size history. Besides, they can be very sensitive to sequencing or phasing errors. Here we introduce a new approximate Bayesian computation approach named PopSizeABC that allows estimating the evolution of the effective population size through time, using a large sample of complete genomes. This sample is summarized using the folded allele frequency spectrum and the average zygotic linkage disequilibrium at different bins of physical distance, two classes of statistics that are widely used in population genetics and can be easily computed from unphased and unpolarized SNP data. Our approach provides accurate estimations of past population sizes, from the very first generations before present back to the expected time to the most recent common ancestor of the sample, as shown by simulations under a wide range of demographic scenarios. When applied to samples of 15 or 25 complete genomes in four cattle breeds (Angus, Fleckvieh, Holstein and Jersey), PopSizeABC revealed a series of population declines, related to historical events such as domestication or modern breed creation. We further highlight that our approach is robust to sequencing errors, provided summary statistics are computed from SNPs with common alleles.
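The two classes of summary statistics named in the abstract can be illustrated with a minimal Python sketch. This is not the PopSizeABC implementation: the function names, the 0/1/2 genotype coding, and the use of the squared correlation between genotype vectors as the zygotic LD measure are assumptions made for this example. It only shows why neither statistic needs phased haplotypes or a known ancestral allele.

```python
import numpy as np


def folded_afs(genotypes):
    """Folded allele frequency spectrum from unphased, unpolarized genotypes.

    genotypes: integer array of shape (n_individuals, n_snps), coded 0/1/2 as
    the count of an arbitrary allele (assumed coding for this sketch).
    Returns the proportion of polymorphic SNPs with minor allele count 1..n.
    """
    genotypes = np.asarray(genotypes, dtype=int)
    n = genotypes.shape[0]
    counts = genotypes.sum(axis=0)
    # Folding: the minor allele count does not depend on which allele was counted.
    minor = np.minimum(counts, 2 * n - counts)
    minor = minor[minor > 0]  # drop monomorphic sites
    spectrum = np.bincount(minor, minlength=n + 1)[1:]
    total = spectrum.sum()
    return spectrum / total if total > 0 else spectrum.astype(float)


def zygotic_ld(genotypes, positions, bins, maf_min=0.2):
    """Average squared genotype correlation per physical distance bin.

    positions: SNP coordinates in base pairs; bins: list of (low, high) pairs.
    Only SNPs with a minor allele frequency above maf_min are used.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions)
    n = genotypes.shape[0]
    freq = genotypes.sum(axis=0) / (2 * n)
    maf = np.minimum(freq, 1 - freq)
    # Also drop SNPs with constant genotypes, for which correlation is undefined.
    keep = (maf > maf_min) & (genotypes.std(axis=0) > 0)
    if keep.sum() < 2:
        return np.full(len(bins), np.nan)
    g, pos = genotypes[:, keep], positions[keep]
    r2 = np.corrcoef(g, rowvar=False) ** 2
    i, j = np.triu_indices(len(pos), k=1)
    dist = np.abs(pos[j] - pos[i])
    out = []
    for low, high in bins:
        mask = (dist >= low) & (dist < high)
        out.append(r2[i[mask], j[mask]].mean() if mask.any() else np.nan)
    return np.array(out)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    toy = rng.integers(0, 3, size=(25, 200))          # 25 diploid individuals
    pos = np.sort(rng.integers(0, 2_000_000, size=200))
    print(folded_afs(toy))
    print(zygotic_ld(toy, pos, bins=[(0, 10_000), (10_000, 100_000)]))
```

The all-pairs correlation matrix is convenient for a toy example but grows quadratically with the number of SNPs, so a genome-wide analysis would need to restrict pairs by distance before computing correlations.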

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Optimization of ABC procedure.**
Prediction error (left panel) and bias (right panel) for the estimated population size in each time window, evaluated from 2,000 random population size histories (see Methods). Summary statistics considered in the ABC analysis were (i) the AFS and (ii) the average zygotic LD for several distance bins. These statistics were computed from n = 25 diploid individuals, using all SNPs for AFS statistics and SNPs with a MAF above 20% for LD statistics. The posterior distribution of each parameter was obtained by rejection, ridge regression [33] or neural network regression [32]. For each of these approaches, the tolerance rate used was the one providing the lowest prediction errors among the values tested, which ranged from 0.001 to 0.05. Population size point estimates were obtained from the median or the mode of the posterior distribution. The prediction errors were scaled in order that point estimates obtained from the prior distribution would result in a prediction error of 1.

**Fig 2. Accuracy of ABC estimation and relative importance of the summary statistics.**
Prediction error for the estimated population size in each time window (left) and standard deviation of this error (right), evaluated from 2,000 random population size histories. Summary statistics considered in the ABC analysis included different combinations of (i) the AFS (possibly without the overall proportion of SNPs) and (ii) the average zygotic LD for several distance bins. These statistics were computed from n = 25 diploid individuals, using all SNPs for AFS statistics and only those with a MAF above 20% for LD statistics. The posterior distribution of each parameter was obtained by neural network regression [32], with a tolerance rate of 0.005. Population size point estimates correspond to the median of the posterior distribution. The prediction errors were scaled in order that point estimates obtained from the prior distribution would result in a prediction error of 1.

**Fig 3. Estimation of population size history using ABC in six different simulated scenarios.**
The six scenarios are: a small constant population size (N = 500, top left), a large constant population size (N = 50,000, top right), a decline scenario mimicking the population size history in Holstein cattle (middle left), an expansion scenario mimicking the population size history in CEU human (middle right), a scenario with one expansion followed by one bottleneck (bottom left) and a zigzag scenario similar to that used in [10] (bottom right), with one expansion followed by two bottlenecks. For each scenario, the true population size history is shown by the dotted black line, the average estimated history over 20 PODs is shown by the solid black line, the estimated histories for five random PODs are shown by solid colored lines, and the 90% credible interval for one of these PODs is shown by the dotted red lines. The expected time to the most recent common ancestor (TMRCA) of the sample, E[*TMRCA*], is indicated by the vertical dotted black line. Summary statistics considered in the ABC analysis were (i) the AFS and (ii) the average zygotic LD for several distance bins. These statistics were computed from n = 25 diploid individuals, using all SNPs for AFS statistics and SNPs with a MAF above 20% for LD statistics. The posterior distribution of each parameter was obtained by neural network regression [32], with a tolerance rate of 0.005. Population size point estimates were obtained from the median of the posterior distribution.

**Fig 4. Estimation of population size history using MSMC with two haplotypes in five different simulated scenarios.**
For each scenario, the five PODs considered for MSMC estimation were the same as in Fig 3. The expected TMRCA shown here is also the same as in Fig 3; it corresponds to samples of 50 haploid sequences.

**Fig 5. Influence of phasing and sequencing errors on ABC estimation.**
Estimation of population size history in the Holstein cattle breed using ABC, based on whole genome NGS data from n = 25 animals. Summary statistics considered in the ABC analysis were (i) the AFS and (ii) the average LD for several distance bins. LD statistics were computed either from haplotypes or from genotypes, using SNPs with a MAF above 20%. AFS statistics were computed using either all SNPs or SNPs with a MAF above 10 or 20%. The posterior distribution of each parameter was obtained by neural network regression [32], with a tolerance rate of 0.005. Population size point estimates were obtained from the median of the posterior distribution. Generation time was assumed to be five years.

**Fig 6. Estimation of population size history in four cattle breeds using ABC.**
The four breeds are Angus (n = 25 animals), Fleckvieh (n = 25), Holstein (n = 25) and Jersey (n = 15). Estimations were obtained independently in each breed, based on whole genome NGS data from sampled animals. Summary statistics considered in the ABC analysis were (i) the AFS and (ii) the average zygotic LD for several distance bins. These statistics were computed using SNPs with a MAF above 20%. Other parameter settings are the same as in Fig 5.

**Fig 7. Comparison of summary statistics for the estimation of population size history in three scenarios.**
The three scenarios are “bottleneck1 recent small” (top), “bottleneck cattle middle age” (middle) and “zigzag small” (bottom). Summary statistics considered in the ABC analysis were either the AFS statistics alone (left column), the LD statistics alone (middle column), or the AFS and LD statistics together (right column). All other settings, and the legend, are as in Fig 3.
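The figure legends refer to a tolerance rate and to point estimates taken as the median of the posterior distribution. The following minimal Python sketch shows what these two notions mean for plain rejection ABC, the simplest of the three approaches compared in Fig 1. It is not the neural network regression used for the other figures. The function names, the scaling of statistics by their standard deviation, and the one-parameter toy model are all assumptions made for this example.

```python
import numpy as np


def abc_rejection(sim_params, sim_stats, obs_stats, tolerance=0.005):
    """Keep the simulations whose summary statistics are closest to the data.

    sim_params: array (n_sim, n_params) of parameters drawn from the prior.
    sim_stats:  array (n_sim, n_stats) of statistics simulated from them.
    obs_stats:  array (n_stats,) of statistics computed on the observed data.
    tolerance:  fraction of simulations accepted (the tolerance rate).
    """
    sim_params = np.asarray(sim_params, dtype=float)
    sim_stats = np.asarray(sim_stats, dtype=float)
    obs_stats = np.asarray(obs_stats, dtype=float)
    # Put all statistics on a comparable scale (assumed choice: standard deviation).
    scale = sim_stats.std(axis=0)
    scale[scale == 0] = 1.0
    dist = np.sqrt((((sim_stats - obs_stats) / scale) ** 2).sum(axis=1))
    n_keep = max(1, int(round(tolerance * len(dist))))
    accepted = np.argsort(dist)[:n_keep]
    return sim_params[accepted]


def posterior_median(accepted_params):
    """Point estimate: per-parameter median of the accepted sample."""
    return np.median(accepted_params, axis=0)


if __name__ == "__main__":
    # Toy model (hypothetical): one parameter, log10 of a population size,
    # observed through a single noisy statistic.
    rng = np.random.default_rng(0)
    theta = rng.uniform(1.0, 5.0, size=(20_000, 1))
    stats = theta + rng.normal(0.0, 0.1, size=theta.shape)
    accepted = abc_rejection(theta, stats, np.array([3.0]), tolerance=0.005)
    print(len(accepted), posterior_median(accepted))
```

A smaller tolerance rate keeps only simulations very close to the observed statistics, which reduces approximation error but leaves fewer accepted draws to describe the posterior. That trade-off is why the tolerance rate is tuned in Fig 1.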

Cited by

Conservation Genomic Analysis of the Asian Honeybee in China Reveals Climate Factors Underlying Its Population Decline.
Sang H, Li Y, Sun C. Insects. 2022 Oct 19;13(10):953. doi: 10.3390/insects13100953. PMID: 36292899.
Evaluation of the accuracy of imputed sequence variant genotypes and their utility for causal variant detection in cattle.
Pausch H, MacLeod IM, Fries R, Emmerling R, Bowman PJ, Daetwyler HD, Goddard ME. Genet Sel Evol. 2017 Feb 21;49(1):24. doi: 10.1186/s12711-017-0301-x. PMID: 28222685.
Construction of PRDM9 allele-specific recombination maps in cattle using large-scale pedigree analysis and genome-wide single sperm genomics.
Zhou Y, Shen B, Jiang J, Padhi A, Park KE, Oswalt A, Sattler CG, Telugu BP, Chen H, Cole JB, Liu GE, Ma L. DNA Res. 2018 Apr 1;25(2):183-194. doi: 10.1093/dnares/dsx048. PMID: 29186399.
Population structure, genetic diversity, and selective signature of Chaka sheep revealed by whole genome sequencing.
Cheng J, Zhao H, Chen N, Cao X, Hanif Q, Pi L, Hu L, Chaogetu B, Huang Y, Lan X, Lei C, Chen H. BMC Genomics. 2020 Jul 29;21(1):520. doi: 10.1186/s12864-020-06925-z. PMID: 32727368.
The Genomic Footprints of the Fall and Recovery of the Crested Ibis.
Feng S, Fang Q, Barnett R, Li C, Han S, Kuhlwilm M, Zhou L, Pan H, Deng Y, Chen G, Gamauf A, Woog F, Prys-Jones R, Marques-Bonet T, Gilbert MTP, Zhang G. Curr Biol. 2019 Jan 21;29(2):340-349.e7. doi: 10.1016/j.cub.2018.12.008. Epub 2019 Jan 10. PMID: 30639104.
