Mol Biol Evol. 2010 Jul;27(7):1673-85.

doi: 10.1093/molbev/msq053. Epub 2010 Feb 25.

A population genetic hidden Markov model for detecting genomic regions under selection

Andrew D Kern¹, David Haussler

PMID: 20185453
PMCID: PMC2912474
DOI: 10.1093/molbev/msq053

Andrew D Kern et al. Mol Biol Evol. 2010 Jul.

Affiliation

¹ Department of Biological Sciences, Dartmouth College, Hanover, NH, USA. andrew.d.kern@dartmouth.edu

Abstract

Recently, hidden Markov models have been applied to numerous problems in genomics. Here, we introduce an explicit population genetics hidden Markov model (popGenHMM) that uses single nucleotide polymorphism (SNP) frequency data to identify genomic regions that have experienced recent selection. Our popGenHMM assumes that SNP frequencies are emitted independently following diffusion approximation expectations but that neighboring SNP frequencies are partially correlated by selective state. We give results from the training and application of our popGenHMM to a set of early release data from the Drosophila Population Genomics Project (dpgp.org) that consists of approximately 7.8 Mb of resequencing from 32 North American Drosophila melanogaster lines. These results demonstrate the potential utility of our model, making predictions based on the site frequency spectrum (SFS) for regions of the genome that represent selected elements.
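As a rough illustration of the decoding idea described in the abstract, the following minimal sketch runs Viterbi decoding over derived-allele counts with a two-state HMM. It is not the authors' popGenHMM implementation: the function names, the transition probabilities, and the `skewed_sfs` emission distribution (a toy stand-in for a diffusion-derived selected SFS) are all assumptions chosen for illustration.

```python
import math

def neutral_sfs(n):
    """Standard neutral SFS: P(i derived alleles) proportional to 1/i, i = 1..n-1."""
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]

def skewed_sfs(n, decay):
    """Toy 'selected' SFS with an excess of rare alleles (an assumption, not the paper's formula)."""
    weights = [math.exp(-decay * i / n) / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]

def viterbi(counts, emissions, trans, start):
    """Most likely state path for a sequence of derived-allele counts (log space)."""
    states = list(emissions)
    score = {s: math.log(start[s]) + math.log(emissions[s][counts[0] - 1]) for s in states}
    back = []
    for c in counts[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = (score[best_prev] + math.log(trans[best_prev][s])
                            + math.log(emissions[s][c - 1]))
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

n = 20  # sample size (number of chromosomes)
emissions = {"neutral": neutral_sfs(n), "selected": skewed_sfs(n, decay=20.0)}
# Hypothetical transition and start probabilities; each row sums to 1.
trans = {"neutral": {"neutral": 0.9, "selected": 0.1},
         "selected": {"neutral": 0.1, "selected": 0.9}}
start = {"neutral": 0.5, "selected": 0.5}

# Derived-allele counts at consecutive SNPs; the middle run is dominated by singletons.
counts = [10, 15, 7, 12, 18, 1, 1, 1, 2, 1, 1, 1, 1, 9, 14, 6, 11]
path = viterbi(counts, emissions, trans, start)
print(list(zip(counts, path)))
```

In a trained model the transition probabilities and selection parameters would be estimated from data rather than fixed by hand as they are here.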

Figures

**FIG. 1.**
A two-state popGenHMM. To the left, a graphical representation of the popGenHMM is shown with states depicted by nodes and transitions among states shown with the unlabeled edges. As the model is Markovian, the transition probabilities exiting a node sum to 1. To the right, a histogram representing the expected SFS from each of the states of the model is given, assuming a sample size of n = 20. Two states are shown, a neutral state (blue) that emits allele frequencies based on the neutral SFS shown to the right and a state labeled negative (red) that represents a selected state. In this case, selection is negative (α = −10) and the corresponding SFS for emissions is shown to the right. Note that for a two-state popGenHMM, the selected state need not be negative.
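The expected SFS for a neutral or selected state can be approximated numerically. The sketch below assumes the classical diffusion result for genic selection, in which the density of derived-allele frequency x is proportional to (1 − e^(−2α(1−x))) / ((1 − e^(−2α)) x (1 − x)), and then applies binomial sampling for a sample of size n. The scaling of α in this formula is an assumption (conventions differ by factors of 2 between papers), and the function names and the midpoint-rule integration are illustrative choices rather than the authors' code.

```python
import math

def selection_weight(x, alpha):
    """g(x) such that the frequency density is proportional to g(x) / (x * (1 - x)).

    For alpha = 0 this reduces to (1 - x), giving the neutral 1/x density.
    """
    if alpha == 0:
        return 1.0 - x
    return math.expm1(-2.0 * alpha * (1.0 - x)) / math.expm1(-2.0 * alpha)

def expected_sfs(n, alpha, steps=2000):
    """Probability of observing i = 1..n-1 derived alleles in a sample of size n."""
    h = 1.0 / steps
    mids = [(k + 0.5) * h for k in range(steps)]  # midpoint-rule grid on (0, 1)
    weights = [selection_weight(x, alpha) for x in mids]
    raw = []
    for i in range(1, n):
        # Binomial sampling times the density; the 1 / (x * (1 - x)) factor is
        # folded into the exponents so the integrand has no singularities.
        total = sum(x ** (i - 1) * (1.0 - x) ** (n - i - 1) * w
                    for x, w in zip(mids, weights))
        raw.append(math.comb(n, i) * total * h)
    norm = sum(raw)
    return [r / norm for r in raw]

n = 20
sfs_neutral = expected_sfs(n, 0.0)
sfs_negative = expected_sfs(n, -10.0)
sfs_positive = expected_sfs(n, 10.0)
for i in (1, 2, n - 1):
    print(i, round(sfs_neutral[i - 1], 4), round(sfs_negative[i - 1], 4),
          round(sfs_positive[i - 1], 4))
```

Under these assumptions, negative selection shifts probability toward rare derived alleles and positive selection shifts it toward high-frequency derived alleles, relative to the neutral spectrum.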

**FIG. 2.**
A three-state popGenHMM. To the left, a graphical representation of the popGenHMM is shown with states depicted by nodes and transitions among states shown with the unlabeled edges. See caption of figure 1 for details. To the right, a histogram representing the expected SFS from each of the states of the model is given, assuming a sample size of n = 20. Three states are shown: a neutral state (blue), a selected state labeled negative (red; α = −10), and a second selected state labeled positive (yellow; α = 10).

**FIG. 3.**
Proportion of bottleneck simulations rejecting neutrality. Shown is a comparison of the false-positive rate of the popGenHMM and the CLRT of Kim and Stephan (2002). Two bottleneck severities, f = 0.05 and f = 0.1, are shown for each model.

**FIG. 4.**
Power to detect a single selective sweep. Shown is a comparison of the power of our popGenHMM as a function of the strength of selection, α = 2Ns, in comparison with CLSW (Kim and Stephan 2002), SweepFinder (Nielsen et al. 2005), and a sliding window implementation of Tajima's D (Tajima 1989). Each point consists of 1,000 coalescent simulations with a single selective sweep (stochastic trajectory) that has finished its sojourn through the population just before the current generation (τ = 0). We simulated 20 kb from samples of size n = 50 with θ/bp = 0.01 and ρ/bp = 0.025.

**FIG. 5.**
Power to detect a locus undergoing recurrent directional selection. Shown is a comparison of the power of our popGenHMM as a function of the strength of selection, α = 2Nσ, in comparison with CLSW (Kim and Stephan 2002), SweepFinder (Nielsen et al. 2005), and a sliding window implementation of Tajima's D (Tajima 1989) to detect selection on a locus evolving according to the normal shift model. Each point consists of 1,000 samples drawn from forward population genetic simulations in which we simulated 20 kb from samples of size n = 50 with θ/bp = 0.01 and ρ/bp = 0.025. Selected sites occur in the middle fifth of the simulated locus (bases 8,000–12,000) and θ is constant across the entire region.

**FIG. 6.**
Power to detect a locus undergoing recurrent negative selection. Shown is a comparison of the power of our popGenHMM as a function of the strength of selection, α = 2Nσ, in comparison with SweepFinder (Nielsen et al. 2005) and a sliding window implementation of Tajima's D (Tajima 1989) to detect selection on a locus evolving according to the exponential shift model. Each point consists of 1,000 samples drawn from forward population genetic simulations in which we simulated 20 kb from samples of size n = 50 with θ/bp = 0.01 and ρ/bp = 0.025. Selected sites occur in the middle fifth of the simulated locus (bases 8,000–12,000) and θ is constant across the entire region.
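The power comparisons in figures 4 to 6 use a sliding window implementation of Tajima's D as one baseline. A minimal sketch of such a statistic is given below, using the standard Tajima (1989) formula computed from derived-allele counts. The captions do not state the window size or step, so the `window` and `step` arguments, the function names, and the toy input are hypothetical.

```python
import math

def tajimas_d(counts, n):
    """Tajima's D from derived-allele counts at segregating sites; n is the sample size."""
    s = len(counts)
    if s == 0:
        return None  # D is undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Mean pairwise differences, summed over sites.
    pi = sum(2.0 * c * (n - c) for c in counts) / (n * (n - 1))
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

def sliding_d(positions, counts, n, length, window, step):
    """Tajima's D in windows of `window` bp, advanced by `step` bp, over [0, length)."""
    results = []
    for start in range(0, length - window + 1, step):
        inside = [c for p, c in zip(positions, counts) if start <= p < start + window]
        results.append((start, tajimas_d(inside, n)))
    return results

# Toy input: two singletons in the first kilobase, one intermediate-frequency SNP in the second.
windows = sliding_d(positions=[100, 200, 1500], counts=[1, 1, 25],
                    n=50, length=2000, window=1000, step=1000)
print(windows)
```

An excess of rare alleles gives negative D and an excess of intermediate-frequency alleles gives positive D, which is why the statistic is a common SFS-based comparison point.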

**FIG. 7.**
Distribution of length normalized scores for elements. Shown are histograms of LOD scores/length (s) for each of the elements predicted. Negatively selected elements are shown in the top panel and positively selected elements in the bottom panel. The very different distributions are a function of the short lengths of the positive elements predicted.

**FIG. 8.**
Browser shot of a negatively selected element prediction. This negative element prediction is shown as the top browser track for this region of the genome. This element has a strong prediction corresponding to a LOD score of 33.1. Shown below the prediction are two tracks corresponding to divergence between *Drosophila simulans* and *D. melanogaster* (labeled Div) and nucleotide diversity (π; Tajima 1983) within *D. melanogaster* (labeled Pi). For both these tracks, darker colors represent greater relative levels of divergence and polymorphism. See text for details.

**FIG. 9.**
Browser shot of a positively selected element prediction. This positive element prediction is shown as the fourth browser track from the top for this region of the genome. This element has a prediction score of LOD = 29.6. See caption of figure 8 for details.

Cited by

Evolutionary forces shaping genomic islands of population differentiation in humans.
Hofer T, Foll M, Excoffier L. BMC Genomics. 2012 Mar 22;13:107. doi: 10.1186/1471-2164-13-107. PMID: 22439654. Free PMC article.

Supervised Machine Learning for Population Genetics: A New Paradigm.
Schrider DR, Kern AD. Trends Genet. 2018 Apr;34(4):301-312. doi: 10.1016/j.tig.2017.12.005. Epub 2018 Jan 10. PMID: 29331490. Free PMC article. Review.

Detecting Selection from Linked Sites Using an F-Model.
Galimberti M, Leuenberger C, Wolf B, Szilágyi SM, Foll M, Wegmann D. Genetics. 2020 Dec;216(4):1205-1215. doi: 10.1534/genetics.120.303780. Epub 2020 Oct 16. PMID: 33067324. Free PMC article.

Genomics of isolation in hybrids.
Gompert Z, Parchman TL, Buerkle CA. Philos Trans R Soc Lond B Biol Sci. 2012 Feb 5;367(1587):439-50. doi: 10.1098/rstb.2011.0196. PMID: 22201173. Free PMC article.

A population genetics-phylogenetics approach to inferring natural selection in coding sequences.
Wilson DJ, Hernandez RD, Andolfatto P, Przeworski M. PLoS Genet. 2011 Dec;7(12):e1002395. doi: 10.1371/journal.pgen.1002395. Epub 2011 Dec 1. PMID: 22144911. Free PMC article.
