Genetics. 2007 Apr;175(4):1787-802.
doi: 10.1534/genetics.106.061317. Epub 2007 Jan 21.

Inference of population structure under a Dirichlet process model


John P Huelsenbeck et al. Genetics. 2007 Apr.

Abstract

Inferring population structure from genetic data sampled from some number of individuals is a formidable statistical problem. One widely used approach considers the number of populations to be fixed and calculates the posterior probability of assigning individuals to each population. More recently, the assignment of individuals to populations and the number of populations have both been considered random variables that follow a Dirichlet process prior. We examined the statistical behavior of assignment of individuals to populations under a Dirichlet process prior. First, we examined a best-case scenario, in which all of the assumptions of the Dirichlet process prior were satisfied, by generating data under a Dirichlet process prior. Second, we examined the performance of the method when the genetic data were generated under a population genetics model with symmetric migration between populations. We examined the accuracy of population assignment using a distance on partitions. The method can be quite accurate with a moderate number of loci. As expected, inferences on the number of populations are more accurate when θ = 4N_e u is large and when the migration rate (4N_e m) is low. We also examined the sensitivity of inferences of population structure to choice of the parameter of the Dirichlet process model. Although inferences could be sensitive to the choice of the prior on the number of populations, this sensitivity occurred when the number of loci sampled was small; inferences are more robust to the prior on the number of populations when the number of sampled loci is large. Finally, we discuss several methods for summarizing the results of a Bayesian Markov chain Monte Carlo (MCMC) analysis of population structure. We develop the notion of the mean population partition, which is the partition of individuals to populations that minimizes the squared partition distance to the partitions sampled by the MCMC algorithm.
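
The relationship between the Dirichlet process parameter and the prior mean number of populations can be made concrete. For n individuals and concentration parameter α, a standard result gives E(K) = Σ α/(α + i) for i = 0, ..., n − 1. The sketch below is not the authors' code; the function names and the sample size of 100 individuals are hypothetical. It evaluates this sum and solves for the α that yields a chosen prior mean.

```python
def expected_k(alpha: float, n: int) -> float:
    """Prior mean number of populations for n individuals under a DP prior."""
    return sum(alpha / (alpha + i) for i in range(n))


def alpha_for_expected_k(target: float, n: int, tol: float = 1e-10) -> float:
    """Solve expected_k(alpha, n) == target for alpha by bisection.

    Assumes 1 < target < n; E(K) increases monotonically with alpha.
    """
    if not 1.0 < target < n:
        raise ValueError("target must lie strictly between 1 and n")
    lo, hi = 1e-12, 1.0
    while expected_k(hi, n) < target:  # widen the bracket until it holds the root
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if expected_k(mid, n) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical example: 100 individuals, prior mean of five populations.
alpha = alpha_for_expected_k(5.0, 100)
print(alpha, expected_k(alpha, 100))
```

Because E(K) depends on both α and n, the same prior mean corresponds to a different α for each data set size.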


Figures

Figure 1. Example of the calculation of the mean partition. The mean partition minimizes the sum of the squared distances to the partitions sampled by the MCMC algorithm.
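
The calculation described in this caption can be sketched in a few lines. Two assumptions are made here that the source does not confirm: the distance is taken to be the number of individuals that must be removed for two partitions to agree (one common definition of partition distance), and the search for the minimizer is restricted to the sampled partitions themselves rather than all possible partitions. The function names and the toy samples are hypothetical.

```python
from itertools import permutations


def partition_distance(p, q):
    """Minimum number of items to remove so that partitions p and q agree.

    p and q are equal-length sequences of population labels. Cluster
    matchings are enumerated by brute force, so this is only practical
    for a handful of populations.
    """
    if len(p) != len(q):
        raise ValueError("partitions must cover the same individuals")
    labels_p, labels_q = sorted(set(p)), sorted(set(q))
    # overlap[a][b] = number of individuals labelled a in p and b in q
    overlap = {a: {b: 0 for b in labels_q} for a in labels_p}
    for a, b in zip(p, q):
        overlap[a][b] += 1
    # Match the partition with fewer populations into the one with more.
    if len(labels_p) <= len(labels_q):
        small, large = labels_p, labels_q
        weight = lambda s, l: overlap[s][l]
    else:
        small, large = labels_q, labels_p
        weight = lambda s, l: overlap[l][s]
    best = max(
        sum(weight(s, l) for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return len(p) - best


def mean_partition(samples):
    """Sampled partition with the smallest summed squared distance to all samples."""
    def cost(candidate):
        return sum(partition_distance(candidate, s) ** 2 for s in samples)
    return min(samples, key=cost)


# Toy MCMC output: four sampled partitions of four individuals.
samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
print(mean_partition(samples))
```

The distance ignores label names, so `[0, 0, 1, 1]` and `[1, 1, 0, 0]` are treated as the same partition.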
Figure 2. The mean partitions for analyses of the Impala (Aepyceros melampus) data (Lorenzen et al. 2004, 2006). Analyses were performed in which the number of populations was fixed (K = 1, K = 2, K = 3, K = 4, K = 5, K = 6, K = 7) or in which the number of populations was a random variable with a Dirichlet process prior [E(K) = 2, E(K) = 5, E(K) = 10, E(K) = 20]. The assignment of individuals (boxes) is indicated by color.
Figure 3. The mean partitions for analyses of the Taita thrush data (Galbusera et al. 2000). Analyses were performed in which the number of populations was fixed (K = 1, K = 2, K = 3, K = 4, K = 5, K = 6, K = 7) or in which the number of populations was a random variable with a Dirichlet process prior [E(K) = 2, E(K) = 5, E(K) = 10, E(K) = 20]. The assignment of individuals (boxes) is indicated by color.
Figure 4. The mean partitions for analyses of the Mus musculus data (Orth et al. 1998). Analyses were performed in which the number of populations was fixed (K = 1, K = 2, K = 3, K = 4, K = 5, K = 6, K = 7) or in which the number of populations was a random variable with a Dirichlet process prior [E(K) = 2, E(K) = 5, E(K) = 10, E(K) = 20]. The assignment of individuals (boxes) is indicated by color.
Figure 5. The probabilities and Bayes' factors of all pairs of individuals being grouped together into the same population. Each triangle shows all n(n − 1)/2 pairs of individuals. The top left corner of a triangle shows the probability/Bayes' factor for individuals 1 and 2, the top right corner of the triangle shows the probability/Bayes' factor for individuals 1 and n, and the bottom corner of the triangle shows the probability/Bayes' factor for individuals n − 1 and n. Generally speaking, Bayes' factors >10 are strong evidence for two individuals being grouped together in the same population whereas Bayes' factors formula image are strong evidence against grouping two individuals into the same population. All analyses were run assuming a Dirichlet process prior on the number of populations and a prior mean for the number of populations of E(K) = 5.
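
The pairwise quantities in this caption can be estimated directly from sampled partitions. In the sketch below the posterior probability that two individuals share a population is the fraction of MCMC samples in which their labels match, and the Bayes factor is the posterior odds divided by the prior odds. The prior co-assignment probability under a Dirichlet process with concentration parameter α is 1/(1 + α). Whether the authors computed the quantities in exactly this way is an assumption, and the function names and toy samples are hypothetical.

```python
import math


def coassignment_probability(samples, i, j):
    """Fraction of sampled partitions that place individuals i and j together."""
    same = sum(1 for s in samples if s[i] == s[j])
    return same / len(samples)


def coassignment_bayes_factor(samples, i, j, alpha):
    """Posterior odds over prior odds that i and j share a population."""
    post = coassignment_probability(samples, i, j)
    if post == 1.0:
        return math.inf  # every sample groups the pair; the odds are unbounded
    prior = 1.0 / (1.0 + alpha)  # DP prior probability of co-assignment
    return (post / (1.0 - post)) / (prior / (1.0 - prior))


# Toy MCMC output: four sampled partitions of three individuals.
samples = [[0, 0, 1], [0, 0, 0], [0, 1, 1], [0, 0, 1]]
print(coassignment_probability(samples, 0, 1))
print(coassignment_bayes_factor(samples, 0, 1, alpha=1.0))
```

With a finite number of samples the estimate can reach exactly 0 or 1, so extreme Bayes factors mainly reflect the length of the MCMC run.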
Figure 6. The marginal likelihoods Pr(X | K) when the number of populations (K) is fixed to different values (K = 1, 2, …, 7). (a) Impala data (Lorenzen et al. 2004, 2006); (b) Taita thrush data (Galbusera et al. 2000); (c) Mus musculus data (Orth et al. 1998).
Figure 7. The prior, Pr(K), and posterior, Pr(K | X), probability distributions for the number of populations for the Impala data set when the prior mean of the number of populations varies.
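
The prior Pr(K) in this caption has a closed form under a Dirichlet process, and it can also be built up one individual at a time. The sketch below uses the sequential view: individual m founds a new population with probability α/(α + m − 1) and otherwise joins an existing one. It is an illustration, not the authors' implementation; the function name and the example values of α and n are hypothetical.

```python
def prior_num_populations(alpha: float, n: int) -> list:
    """Return [Pr(K=0), Pr(K=1), ..., Pr(K=n)] under a DP prior, for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    probs = [0.0] * (n + 1)
    probs[1] = 1.0  # the first individual always founds a population
    for m in range(2, n + 1):  # add individual m to the m - 1 already placed
        denom = alpha + m - 1
        new = [0.0] * (n + 1)
        for k in range(1, m + 1):
            stay = probs[k] * (m - 1) / denom    # joins an existing population
            grow = probs[k - 1] * alpha / denom  # founds a new population
            new[k] = stay + grow
        probs = new
    return probs


# Hypothetical example: prior on K for 20 individuals with alpha = 1.
prior = prior_num_populations(1.0, 20)
print(max(range(len(prior)), key=prior.__getitem__))  # most probable K a priori
```

Comparing this prior with the posterior frequencies of K from an MCMC run shows how much the data, rather than the prior, drive the inferred number of populations.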


