Review

Syst Biol. 2017 Jan 1;66(1):e30-e46.

doi: 10.1093/sysbio/syw056.

Statistical Inference in the Wright-Fisher Model Using Allele Frequency Data

Paula Tataru¹, Maria Simonsen¹, Thomas Bataillon¹, Asger Hobolth¹

PMID: 28173553
PMCID: PMC5837693
DOI: 10.1093/sysbio/syw056

Paula Tataru et al. Syst Biol. 2017.

Affiliation

¹ Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark.

Abstract

The Wright–Fisher model provides an elegant mathematical framework for understanding allele frequency data. In particular, the model can be used to infer the demographic history of species and identify loci under selection. A crucial quantity for inference under the Wright–Fisher model is the distribution of allele frequencies (DAF). Despite the apparent simplicity of the model, the calculation of the DAF is challenging. We review and discuss strategies for approximating the DAF, and how these are used in methods that perform inference from allele frequency data. Various evolutionary forces can be incorporated in the Wright–Fisher model, and we consider these in turn. We begin our review with the basic bi-allelic Wright–Fisher model where random genetic drift is the only evolutionary force. We then consider mutation, migration, and selection. In particular, we compare diffusion-based and moment-based methods in terms of accuracy, computational efficiency, and analytical tractability. We conclude with a brief overview of the multi-allelic process with a general mutation model.

Keywords: Allele frequency; diffusion; inference; moments; selection; Wright–Fisher.
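
As a minimal sketch of why the DAF is expensive to compute exactly, the pure drift case can be iterated directly as a Markov chain. The code assumes the standard formulation in which the next generation's allele count is binomial with size 2N and probability equal to the current frequency. The function names and the values `two_n=40`, `start_count=10`, and `generations=20` are illustrative choices, not taken from the paper.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def wf_transition_matrix(two_n):
    """Pure drift Wright-Fisher transition matrix (assumed standard form).

    Entry [i][j] is the probability of moving from i to j allele copies
    in one generation: Binomial(2N, i / 2N) evaluated at j.
    """
    return [
        [binom_pmf(j, two_n, i / two_n) for j in range(two_n + 1)]
        for i in range(two_n + 1)
    ]


def exact_daf(two_n, start_count, generations):
    """Distribution of the allele count after the given number of generations."""
    matrix = wf_transition_matrix(two_n)
    dist = [0.0] * (two_n + 1)
    dist[start_count] = 1.0  # start from a known allele count
    for _ in range(generations):
        # One vector-matrix product per generation.
        dist = [
            sum(dist[i] * matrix[i][j] for i in range(two_n + 1))
            for j in range(two_n + 1)
        ]
    return dist


# Illustrative values only.
daf = exact_daf(two_n=40, start_count=10, generations=20)
mean_freq = sum(j / 40 * p for j, p in enumerate(daf))
var_freq = sum((j / 40) ** 2 * p for j, p in enumerate(daf)) - mean_freq**2
```

Each generation costs on the order of (2N)² operations in this sketch, which suggests why approximations become attractive for realistic population sizes. Under pure drift the mean frequency stays at its starting value, which is a convenient sanity check on `mean_freq`.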

Figures

**Figure 1.**
Data types. The gray boxes represent the unobserved history of the populations, together with the corresponding population allele frequency, whereas the white boxes indicate the observed data: the generation when the data are sampled, the size of the sample, and the allele count, that is, how many alleles of a given type have been observed among the genotyped individuals. Given the population frequency, the allele count follows a binomial distribution whose size is the sample size and whose probability is the population frequency. To calculate the likelihood of the data, the DAF of the population frequency is needed. a) Time series data where, typically, one population is sampled at different (known) generations. b) Single time-point data, where multiple populations are sampled just once, typically in the present. The history of the populations is given as a tree. The leaves and internal nodes represent the sampled and ancestral populations, respectively. The branch lengths reflect the amount of time populations have diverged since the split from the ancestral population.
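
The sampling step described in this caption can be sketched as follows: the likelihood of one observed allele count is the binomial sampling probability averaged over the DAF of the population frequency. The function name `sample_likelihood` and the representation of the DAF as a list over the counts 0..2N are assumptions made for illustration.

```python
from math import comb


def sample_likelihood(allele_count, sample_size, daf):
    """Likelihood of an observed allele count at a single time point.

    `daf` is assumed to hold the probability of each population allele
    count 0..2N, so the population frequency at index j is j / 2N.
    """
    two_n = len(daf) - 1
    total = 0.0
    for j, prob in enumerate(daf):
        x = j / two_n  # population allele frequency
        # Binomial sampling of the observed count given frequency x.
        sampling = (
            comb(sample_size, allele_count)
            * x**allele_count
            * (1.0 - x) ** (sample_size - allele_count)
        )
        total += sampling * prob
    return total


# Illustrative check: all probability mass on population frequency 0.5.
point_mass = [0.0, 0.0, 1.0, 0.0, 0.0]
half = sample_likelihood(allele_count=1, sample_size=2, daf=point_mass)
```

For time series data the same sum would be applied at each sampled generation, with the DAF propagated between sampling points. That step is omitted here.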

**Figure 2.**
Dynamics in the pure drift bi-allelic Wright–Fisher model. The child inherits the parental allele.

**Figure 3.**
a) Simulation under the pure drift model (equation (1)) with […] and […]. The vertical bars indicate three sampled time-points. The x-axis denotes the time measured in scaled number of generations. b) DAF at the three sampled time-points. The vertical bars indicate the simulated allele frequencies.

**Figure 4.**
Fit of various approximations to the pure drift true DAF, calculated using the Markov chain property for […] and a range of […] and […]. Each column shows a different type of approximation, indicated at the top of the figure. a) Hellinger distance on log scale between the approximated and true DAF. The three “[…]” markers in each of the heatmaps indicate the combinations of […] and […] used in b). b) True (dashed lines) and approximated (solid lines) DAF for […] and different values of […]. The truncated normal, beta, and beta with spikes are discretized as in Tataru et al. (2015). The diffusion DAF is calculated as in Zhao et al. (2013), with […] and […]. We used […] for computational reasons, but we see similar patterns for larger […].
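
The Hellinger distance used in panel a) can be sketched for two discretized distributions on the same grid. The two example vectors below are made-up numbers for illustration, not values from the figure, and the figure's discretization scheme (Tataru et al. 2015) is not reproduced here.

```python
from math import sqrt


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions on one grid.

    Returns 0 for identical distributions and 1 for distributions with
    disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must be defined on the same grid")
    # Bhattacharyya coefficient: the overlap between p and q.
    overlap = sum(sqrt(a * b) for a, b in zip(p, q))
    # Guard against tiny negative values caused by rounding.
    return sqrt(max(0.0, 1.0 - overlap))


# Hypothetical "true" and "approximated" DAFs on a five-point grid.
true_daf = [0.1, 0.2, 0.4, 0.2, 0.1]
approx_daf = [0.05, 0.25, 0.4, 0.25, 0.05]
distance = hellinger_distance(true_daf, approx_daf)
```

Because the distance is bounded between 0 and 1, plotting it on a log scale, as the caption describes, makes small differences between good approximations visible.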

**Figure 5.**
Dynamics in the bi-allelic Wright–Fisher model with mutations. If the parental allele is […], the child has the same allele with probability […], and a mutation occurs with probability […]. If the parental allele is […], the child allele is […] with probability […], and becomes […] with probability […].

**Figure 6.**
Fit of various approximations to the true DAF with neutral mutations, calculated using the Markov chain property for […], and a range of […] and […]. Each column shows a different type of approximation, indicated at the top of the figure. a) Hellinger distance on log scale between the approximated and true DAF. The three “[…]” markers in each of the heatmaps indicate the combinations of […] and […] used in b). b) True (dashed lines) and approximated (solid lines) DAF for […] and different values of […]. Calculations are performed as for Figure 4. For comparison purposes, the a) heatmap and b) […]-axis scales are the same as in Figure 4.

**Figure 7.**
Fit of various approximations to the true DAF with selection, calculated using the Markov chain property for […], […], and a range of […] and […]. Each column shows a different type of approximation, indicated at the top of the figure. a) Hellinger distance on log scale between the approximated and true DAF. The three “[…]” markers in each of the heatmaps indicate the combinations of […] and […] used in b). b) True (dashed lines) and approximated (solid lines) DAF for […] and different values of […]. Calculations are performed as for Figure 4. For comparison purposes, the a) heatmap and b) […]-axis scales are the same as in Figure 4.

**Figure 8.**
Dynamics in the pure drift multi-allelic Wright–Fisher model for […]. The child inherits the parental allele.

**Figure 9.**
Dynamics in the multi-allelic Wright–Fisher model with mutations for […]. If the parental allele is […], the child receives the same allele with probability […] and another allele with probability […], for […].

**Figure 10.**
Fit of the Dirichlet distribution (dotted lines) to the true mean and covariance of the multi-allelic JC Wright–Fisher model (solid lines) with a) […] ([…] small), and b) […] ([…] large). All six plots are calculated for […], […], and different values of […].
