Methods Ecol Evol. 2018 Apr;9(4):1006-1016.
doi: 10.1111/2041-210X.12968. Epub 2018 Jan 30.

A fast likelihood solution to the genetic clustering problem

Marie-Pauline Beugin et al. Methods Ecol Evol. 2018 Apr.

Abstract

The investigation of genetic clusters in natural populations is a ubiquitous problem in a range of fields relying on the analysis of genetic data, such as molecular ecology, conservation biology and microbiology. Typically, genetic clusters are defined as distinct panmictic populations, or parental groups in the context of hybridisation. Two types of methods have been developed for identifying such clusters: model-based methods, which are usually computationally intensive but yield results that can be interpreted in the light of an explicit population genetic model, and geometric approaches, which are less interpretable but considerably faster.

Here, we introduce snapclust, a fast maximum-likelihood solution to the genetic clustering problem, which combines the advantages of both model-based and geometric approaches. Our method relies on maximising the likelihood of a fixed number of panmictic populations, using a combination of a geometric approach and fast likelihood optimisation with the Expectation-Maximisation (EM) algorithm. It can be used for assigning genotypes to populations and, optionally, for identifying various types of hybrids between two parental populations. Several goodness-of-fit statistics can also be used to guide the choice of the retained number of clusters.

Using extensive simulations, we show that snapclust performs comparably to current gold standards for genetic clustering as well as hybrid detection, with some advantages for identifying hybrids after several backcrosses, while being orders of magnitude faster than other model-based methods. We also illustrate how snapclust can be used for identifying the optimal number of clusters, and for subsequently assigning individuals to various hybrid classes simulated from an empirical microsatellite dataset.

snapclust is implemented in the package adegenet for the free software R, and is therefore easily integrated into existing pipelines for genetic data analysis. It can be applied to any kind of co-dominant markers, and can easily be extended to more complex models including, for instance, varying ploidy levels. Given its flexibility and computational efficiency, it provides a useful complement to the existing toolbox for the study of genetic diversity in natural populations.
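The idea of combining an initial grouping with EM optimisation of a panmictic-population likelihood can be illustrated with a minimal Python sketch. This is not the adegenet implementation: it assumes diploid biallelic markers coded as 0, 1 or 2 copies of a reference allele, Hardy-Weinberg proportions within each group, equal prior weights for all groups, and an initial labelling supplied by the caller (standing in for the geometric step). The names `hwe_log_prob`, `em_cluster`, `init_labels` and `eps` are hypothetical and chosen for illustration only.

```python
import math


def hwe_log_prob(g, p):
    """Log-probability of genotype g (0, 1 or 2 copies of the reference
    allele) under Hardy-Weinberg proportions with allele frequency p."""
    coef = (1, 2, 1)[g]
    return math.log(coef) + g * math.log(p) + (2 - g) * math.log(1 - p)


def em_cluster(genotypes, init_labels, k, n_iter=20, eps=0.01):
    """Soft-assign individuals to k panmictic groups by EM.

    genotypes   : list of per-individual lists of 0/1/2 allele counts
    init_labels : initial hard group label for each individual
                  (assumed to come from some fast geometric clustering)
    Returns (membership, freqs, log_lik); log_lik is evaluated in the last
    E-step, using the allele frequencies estimated just before it.
    """
    n_loci = len(genotypes[0])
    # Start from one-hot membership probabilities given by the initial labels.
    membership = [[1.0 if c == lab else 0.0 for c in range(k)]
                  for lab in init_labels]
    freqs, log_lik = [], float("-inf")
    for _ in range(n_iter):
        # M-step: membership-weighted allele frequencies per group and locus,
        # clamped away from 0 and 1 so that the logarithms stay finite.
        freqs = []
        for c in range(k):
            weight_sum = sum(row[c] for row in membership)
            group_freqs = []
            for j in range(n_loci):
                allele_sum = sum(row[c] * ind[j]
                                 for row, ind in zip(membership, genotypes))
                p = allele_sum / (2 * weight_sum) if weight_sum > 0 else 0.5
                group_freqs.append(min(max(p, eps), 1 - eps))
            freqs.append(group_freqs)
        # E-step: membership probabilities proportional to the genotype
        # likelihood in each group (equal prior weights assumed).
        membership, log_lik = [], 0.0
        for ind in genotypes:
            logs = [sum(hwe_log_prob(g, f[j]) for j, g in enumerate(ind))
                    for f in freqs]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            membership.append([w / total for w in weights])
            log_lik += top + math.log(total / k)
    return membership, freqs, log_lik


# Toy data: four individuals rich in the reference allele, four poor in it.
# The fourth individual is deliberately mislabelled in the initial grouping.
toy = [
    [2, 2, 2, 2, 2, 2], [2, 2, 1, 2, 2, 2], [2, 1, 2, 2, 2, 2],
    [2, 2, 2, 2, 1, 2],
    [0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
membership, freqs, log_lik = em_cluster(toy, [0, 0, 0, 1, 1, 1, 1, 1], k=2)
labels = [row.index(max(row)) for row in membership]
print(labels)
```

In this toy run the mislabelled individual is moved back to the group whose allele frequencies match its genotype, which is the kind of refinement the EM step is meant to provide on top of a fast initial grouping.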

Keywords: EM algorithm; SNP; genetic assignment; genetic clustering; hybridisation; microsatellites; population membership; relative performances.

Figures

Figure 1
Comparison of the various methods on simulated genetic clusters. Notes: This figure shows the distribution of (a) the true positive rates (TPR) and (b) the true negative rates (TNR) obtained over all 360 simulations for the four methods: snapclust (SC), BAPS (B), STRUCTURE (S) and find.clusters (FC), for the clustering of individuals in the absence of hybrids. The width of the envelopes reflects the density of points.
Figure 2
Comparison of snapclust (red) and NEWHYBRIDS (blue) for the identification of hybrids using simulated data. Notes: This figure shows the distributions of (a) the mean proportion of correct group assignment and (b) the support (i.e. group membership probability) for the true class across all simulated datasets. Three hybrid classes are considered in the simulations in addition to the parental class: first‐generation hybrids (F1), first‐generation backcrosses (BC1) and second‐generation backcrosses (BC2). The width of the envelopes reflects the density of points.
Figure 3
Illustration of snapclust using simulated hybrids from cattle breed microsatellite data. Notes: (a) Akaike Information Criterion (AIC) value as a function of the number of populations (K) considered. (b) Individual probabilities of assignment obtained with the function snapclust.em for the different types of individuals present in the dataset. (c) First axis of the discriminant analysis of principal components carried out on the hybrid groups found using the snapclust analysis.
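
Panel (a) of Figure 3 plots an information criterion against K. As a rough illustration of how such a curve can guide the choice of K, the sketch below computes AIC = 2p - 2 ln L for several candidate values and keeps the K with the lowest value. The function names `aic` and `choose_k` are hypothetical, and the log-likelihoods and parameter counts in the example are invented numbers, not results from the paper; how the free parameters of a fitted model are counted is left to the caller.

```python
def aic(log_lik, n_params):
    """Akaike Information Criterion: 2 * (number of free parameters)
    minus 2 * (maximised log-likelihood). Lower is better."""
    return 2 * n_params - 2 * log_lik


def choose_k(fits):
    """Return the K with the lowest AIC.

    fits maps each candidate K to a (log_lik, n_params) pair taken from a
    model fitted with that number of clusters.
    """
    return min(fits, key=lambda k: aic(*fits[k]))


# Invented example values (not taken from the paper):
# K -> (maximised log-likelihood, number of free parameters)
example_fits = {1: (-520.0, 10), 2: (-410.0, 20), 3: (-405.0, 30)}
for k, (log_lik, n_params) in sorted(example_fits.items()):
    print(k, aic(log_lik, n_params))
print("selected K:", choose_k(example_fits))
```

With these invented numbers, going from K = 2 to K = 3 improves the log-likelihood by less than the penalty for the ten extra parameters, so K = 2 is retained.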
