J Comput Biol. 2012 Jun;19(6):650-61.
doi: 10.1089/cmb.2012.0033.

Efficient simulation and likelihood methods for non-neutral multi-allele models

Paul Joyce et al. J Comput Biol. 2012 Jun.

Abstract

Throughout the 1980s, Simon Tavaré made numerous significant contributions to population genetics theory. As genetic data, in particular DNA sequences, became more readily available, the need to connect population-genetic models to data became the central issue. The seminal work of Griffiths and Tavaré (1994a, 1994b, 1994c) was among the first to develop a likelihood method to estimate population-genetic parameters using full DNA sequences. Now we are in the genomics era, where methods need to scale up to handle massive data sets, and Tavaré has led the way to new approaches. However, performing statistical inference under non-neutral models has proved elusive. In tribute to Simon Tavaré, we present an article in the spirit of his work that provides a computationally tractable method for simulating and analyzing data under a class of non-neutral population-genetic models. Computational methods for approximating likelihood functions and generating samples under a class of allele-frequency based non-neutral parent-independent mutation models were proposed by Donnelly, Nordborg, and Joyce (DNJ) (Donnelly et al., 2001). DNJ (2001) simulated samples of allele frequencies from non-neutral models using neutral models as the auxiliary distribution in a rejection algorithm. However, patterns of allele frequencies produced by neutral models are dissimilar to patterns of allele frequencies produced by non-neutral models, making the rejection method inefficient. For example, in some cases the methods in DNJ (2001) require 10^9 rejections before a sample from the non-neutral model is accepted. Our method simulates samples directly from the distribution of non-neutral models, making simulation methods a practical tool to study the behavior of the likelihood and to perform inference on the strength of selection.
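
To make the inefficiency of the neutral-proposal rejection scheme concrete, here is a minimal Python sketch of that older approach (not the direct sampler this article proposes). The abstract does not state the target density, so the form used here is an assumption: a symmetric balancing-selection density proportional to exp(-sigma * sum(x_i^2)) times a symmetric Dirichlet(theta/K) neutral density. The function name and the `max_tries` cap are hypothetical.

```python
import numpy as np

def sample_nonneutral_by_rejection(K, theta, sigma, rng, max_tries=1_000_000):
    """Draw one allele-frequency vector by rejection from a neutral proposal.

    ASSUMED target density (not given in the abstract):
        f(x) proportional to exp(-sigma * sum(x_i^2)) * Dirichlet(x; theta/K, ..., theta/K)
    Returns the accepted vector and the number of proposals it took.
    """
    alpha = np.full(K, theta / K)
    for tries in range(1, max_tries + 1):
        x = rng.dirichlet(alpha)  # proposal from the neutral model
        # On the simplex sum(x^2) >= 1/K, so this ratio lies in (0, 1]
        # (up to floating-point rounding).
        accept_prob = np.exp(-sigma * (np.sum(x**2) - 1.0 / K))
        if rng.random() < accept_prob:
            return x, tries
    raise RuntimeError("no sample accepted within max_tries")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for sigma in (1.0, 5.0, 10.0):
        counts = [sample_nonneutral_by_rejection(10, 2.0, sigma, rng)[1]
                  for _ in range(200)]
        print(f"sigma={sigma:5.1f}  mean proposals per accepted sample: {np.mean(counts):.1f}")
```

Under this assumed density the acceptance probability shrinks exponentially in sigma whenever a neutral draw is far from the uniform frequency vector, which is the mechanism behind the large rejection counts described above.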

Figures

FIG. 1.
Percent error in normalizing constant c(Σ, θν) obtained by Monte Carlo integration using samples simulated under the neutral model when the selection matrix in the non-neutral model is diagonal and all values of selection parameters are equal. Monte Carlo averages in calculating the normalizing constants are based on 10^3, 10^4, 10^5, and 10^6 samples drawn from the neutral model, illustrated from dark to light color respectively. The true value of c(Σ, θν) is assumed to be the value computed by numerical analysis methods. As the strength of selection (given by parameter σ = (20, 40, 60, 80, 100)) increases, the relative error in c(Σ, θν) obtained by Monte Carlo integration increases on average because of substantial difference between allele frequency patterns in the non-neutral model and allele frequency patterns in the neutral model. Percent error values larger than 10% are given at the top of the corresponding bar numerically.
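
The Monte Carlo integration described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the normalizing constant is taken to be the neutral-model expectation of exp(x' Σ x), the neutral model is a Dirichlet with parameter vector θν, and balancing selection is encoded as Σ = -σI. The caption confirms only that Σ is diagonal with equal entries, so the sign convention and the function name `mc_normalizing_constant` are hypothetical.

```python
import numpy as np

def mc_normalizing_constant(Sigma, alpha, n_samples, rng):
    """Monte Carlo estimate of c = E_neutral[exp(x' Sigma x)].

    ASSUMPTION: the neutral model is Dirichlet(alpha) and the selection
    term enters the density as exp(x' Sigma x).
    Returns the estimate and its Monte Carlo standard error.
    """
    x = rng.dirichlet(alpha, size=n_samples)              # neutral draws, shape (n, K)
    quad = np.einsum("ni,ij,nj->n", x, Sigma, x)          # x' Sigma x for each draw
    weights = np.exp(quad)
    estimate = weights.mean()
    std_error = weights.std(ddof=1) / np.sqrt(n_samples)
    return estimate, std_error

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, theta = 10, 2.0
    alpha = np.full(K, theta / K)                         # theta * nu with uniform nu
    for sigma in (20.0, 60.0, 100.0):
        Sigma = -sigma * np.eye(K)                        # assumed sign convention
        for n in (10**3, 10**4, 10**5):
            c, se = mc_normalizing_constant(Sigma, alpha, n, rng)
            print(f"sigma={sigma:5.0f}  n={n:>6}  c_hat={c:.3e}  rel. std. error={se / c:.2f}")
```

As σ grows, the estimate is dominated by the few neutral draws that happen to be close to uniform frequencies, so the relative standard error printed here tends to rise, in line with the trend the caption reports.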
FIG. 2.
Posterior distributions (top) and box plots (bottom) of selection and mutation parameters. The “true” population frequencies are simulated under a K = 10 allele model with 10 replicates (e.g., 10 loci) and a diagonal selection matrix where all elements are equal (i.e., symmetric balancing selection model). The “true” selection parameter is σ = 10 and the mutation parameter is θ = 2. Top left and right histograms are posterior samples of selection and mutation parameters respectively, both obtained by approximate Bayesian computation as explained in Algorithm-ABC in the text. Bottom left and right series of box plots are from posterior samples for 30 independent runs of “true” population frequencies, obtained by the same procedure as above.
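
For readers unfamiliar with approximate Bayesian computation, the following is a generic ABC rejection sketch in Python. It is not the paper's Algorithm-ABC, whose details are not given here. Everything specific is an assumption made for illustration: the uniform priors, the two summary statistics, the tolerance `eps`, the assumed density exp(-sigma * sum(x_i^2)) times a symmetric Dirichlet(theta/K), and the use of a slow neutral-proposal simulator, which is why the example stays at weak selection.

```python
import numpy as np

def simulate_loci(K, theta, sigma, n_loci, rng):
    """Simulate frequency vectors at n_loci loci under the ASSUMED density,
    using rejection from the neutral Dirichlet (practical only for small sigma)."""
    alpha = np.full(K, theta / K)
    loci = []
    while len(loci) < n_loci:
        x = rng.dirichlet(alpha)
        if rng.random() < np.exp(-sigma * (np.sum(x**2) - 1.0 / K)):
            loci.append(x)
    return np.array(loci)

def summaries(freqs):
    """Hypothetical summary statistics: mean homozygosity, mean largest frequency."""
    return np.array([np.mean(np.sum(freqs**2, axis=1)),
                     np.mean(freqs.max(axis=1))])

def abc_rejection(observed, K, n_draws, eps, rng,
                  sigma_max=5.0, theta_min=1.0, theta_max=4.0):
    """Keep prior draws whose simulated summaries fall within eps of the observed ones."""
    s_obs = summaries(observed)
    accepted = []
    for _ in range(n_draws):
        sigma = rng.uniform(0.0, sigma_max)        # assumed uniform prior on sigma
        theta = rng.uniform(theta_min, theta_max)  # assumed uniform prior on theta
        sim = simulate_loci(K, theta, sigma, observed.shape[0], rng)
        if np.linalg.norm(summaries(sim) - s_obs) < eps:
            accepted.append((sigma, theta))
    return np.array(accepted).reshape(-1, 2)       # columns: sigma, theta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    observed = simulate_loci(K=10, theta=2.0, sigma=3.0, n_loci=10, rng=rng)
    posterior = abc_rejection(observed, K=10, n_draws=2000, eps=0.03, rng=rng)
    print("accepted draws:", len(posterior))
    if len(posterior) > 0:
        print("posterior means (sigma, theta):", posterior.mean(axis=0))
```

The accepted (sigma, theta) pairs play the role of the posterior samples shown in the histograms; a tighter `eps` gives a better approximation at the cost of fewer accepted draws.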
FIG. 3.
Kernel density estimates of posterior distributions of selection (top) and mutation (bottom) parameters obtained by analyzing allele frequencies under three non-neutral models. The models differ in their assumptions on the number of allelic types in the population, K. The “true” population has K = 10 alleles, but only the 7 largest-frequency alleles are observed in the sample. Green plots are obtained by analyzing these 7 frequencies under the assumption that the r = 7 largest-frequency alleles out of K = 10 total alleles are observed in the data. Red plots are obtained by analyzing the 7 frequencies under the assumption that these are all the existing allelic types in the population. Hence, the second model corresponds to using a K = 7 allele model, whereas the frequencies are actually generated under a model with K = 10. Using the second model, the mutation parameter is overestimated (bottom, red plot) because the three small-frequency alleles are missing. Consequently, estimates for the selection parameter are not accurate. Blue plots are obtained by assuming ideal conditions where all 10 allelic types are observed and the sample frequencies are a proxy for the population frequencies. Hence, the blue plots are obtained under the correct model with no missing data, and inference in this case is the gold standard for the other two models.
