PLoS Genet. 2016 Mar 15;12(3):e1005928.
doi: 10.1371/journal.pgen.1005928. eCollection 2016 Mar.

S/HIC: Robust Identification of Soft and Hard Sweeps Using Machine Learning

Daniel R Schrider et al. PLoS Genet. 2016.

Abstract

Detecting the targets of adaptive natural selection from whole genome sequencing data is a central problem for population genetics. However, to date most methods have shown sub-optimal performance under realistic demographic scenarios. Moreover, over the past decade there has been a renewed interest in determining the importance of selection from standing variation in adaptation of natural populations, yet very few methods for inferring this model of adaptation at the genome scale have been introduced. Here we introduce a new method, S/HIC, which uses supervised machine learning to precisely infer the location of both hard and soft selective sweeps. We show that S/HIC has unrivaled accuracy for detecting sweeps under demographic histories that are relevant to human populations, and distinguishing sweeps from linked as well as neutrally evolving regions. Moreover, we show that S/HIC is uniquely robust among its competitors to model misspecification. Thus, even if the true demographic model of a population differs catastrophically from that specified by the user, S/HIC still retains impressive discriminatory power. Finally, we apply S/HIC to the case of resequencing data from human chromosome 18 in a European population sample, and demonstrate that we can reliably recover selective sweeps that have been identified earlier using less specific and sensitive methods.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Examples of the five classes used by S/HIC.
S/HIC classifies each window as a hard sweep (blue), linked to a hard sweep (purple), a soft sweep (red), linked to a soft sweep (orange), or neutral (gray). The classifier does this by examining the values of various summary statistics in 11 different windows in order to infer the mode of evolution in the central window (the horizontal blue, purple, red, orange, and gray brackets). Regions that are centered on a hard (soft) selective sweep are defined as hard (soft). Regions that are not centered on a sweep but whose diversity is affected by a nearby hard (soft) selective sweep are defined as hard-linked (soft-linked). Remaining windows are defined as neutral. S/HIC is trained on simulated examples of these five classes in order to distinguish selective sweeps from linked and neutral regions in population genomic data.
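The five-way labelling described in the Fig 1 caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `label_window`, the 0-based window indexing with the central window at index 5, and the simplification that a sweep in any of the other ten windows makes the central window "linked" are all assumptions made for the example.

```python
from typing import Optional

N_WINDOWS = 11            # number of windows examined, per the Fig 1 caption
CENTER = N_WINDOWS // 2   # index of the central window (assumed 0-based)


def label_window(sweep_window: Optional[int], sweep_type: Optional[str]) -> str:
    """Return the class of the central window given where a sweep lies.

    sweep_window: index (0..10) of the window containing the sweep,
                  or None if the region contains no sweep.
    sweep_type:   "hard" or "soft"; ignored when there is no sweep.
    """
    if sweep_window is None:
        return "neutral"
    if sweep_type not in ("hard", "soft"):
        raise ValueError("sweep_type must be 'hard' or 'soft'")
    if not 0 <= sweep_window < N_WINDOWS:
        raise ValueError("sweep_window must be between 0 and 10")
    if sweep_window == CENTER:
        return sweep_type              # region centered on the sweep
    return f"{sweep_type}-linked"      # sweep lies in a flanking window


if __name__ == "__main__":
    print(label_window(5, "hard"))     # sweep in the central window
    print(label_window(2, "soft"))     # sweep in a flanking window
    print(label_window(None, None))    # no sweep at all
```

Labels produced this way could serve as training targets for simulated regions, with one simulation per combination of sweep type and sweep position.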
Fig 2. ROC curves showing the true and false positive rates of various methods/statistics when tasked with discriminating between regions containing a hard sweep and neutrally evolving regions.
A) For intermediate strengths of selection (α~U(2.5×10², 2.5×10³)). B) For stronger selective sweeps (α~U(2.5×10³, 2.5×10⁴)). C) For weaker sweeps (α~U(2.5×10¹, 2.5×10²)). Here, and for all other ROC curves unless otherwise noted, methods that require training from simulated sweeps were trained by combining three different training sets: one where α~U(2.5×10¹, 2.5×10²), one where α~U(2.5×10², 2.5×10³), and one where α~U(2.5×10³, 2.5×10⁴).
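The pooled training design mentioned in this caption could be mimicked as follows. This is a sketch only: `ALPHA_RANGES`, `sample_training_alphas`, and the equal number of draws per range are assumptions, since the caption states only that three training sets were combined. The simulations themselves are not shown.

```python
import random

# The three selection-strength ranges from the caption, with the
# exponents read as powers of ten: weak, intermediate, strong.
ALPHA_RANGES = [
    (2.5e1, 2.5e2),
    (2.5e2, 2.5e3),
    (2.5e3, 2.5e4),
]


def sample_training_alphas(n_per_range, seed=None):
    """Draw n_per_range values of alpha uniformly from each range.

    Returns a list of (alpha, (low, high)) tuples so that each draw
    can be traced back to the training set it belongs to.
    """
    rng = random.Random(seed)
    samples = []
    for low, high in ALPHA_RANGES:
        for _ in range(n_per_range):
            samples.append((rng.uniform(low, high), (low, high)))
    return samples


if __name__ == "__main__":
    for alpha, (low, high) in sample_training_alphas(2, seed=1):
        print(f"alpha = {alpha:.1f} drawn from U({low:g}, {high:g})")
```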
Fig 3. ROC curves showing the true and false positive rates of various methods/statistics when tasked with discriminating between regions containing a sweep (either hard or soft) and unselected regions (either neutral or linked to sweeps).
A) For intermediate strengths of selection (α~U(2.5×10², 2.5×10³)). B) For stronger selective sweeps (α~U(2.5×10³, 2.5×10⁴)). C) For weaker sweeps (α~U(2.5×10¹, 2.5×10²)).
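The binary task behind these ROC curves (sweep versus unselected) can be illustrated with a small pure-Python sketch. It assumes each region has a true five-class label and a single numeric score where higher means "more sweep-like"; the names `SWEEP_CLASSES` and `roc_points`, the class label strings, and the scoring convention are hypothetical and not taken from the paper.

```python
# Classes counted as positives: regions containing a sweep (hard or soft).
# Everything else (neutral or linked to a sweep) is a negative.
SWEEP_CLASSES = {"hard", "soft"}


def roc_points(true_classes, sweep_scores):
    """Return (false positive rate, true positive rate) pairs.

    One point is produced per distinct score threshold, starting from
    (0.0, 0.0); a region is called a sweep when score >= threshold.
    """
    is_sweep = [c in SWEEP_CLASSES for c in true_classes]
    n_pos = sum(is_sweep)
    n_neg = len(is_sweep) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sweep and one unselected region")

    points = [(0.0, 0.0)]
    for threshold in sorted(set(sweep_scores), reverse=True):
        tp = sum(1 for s, p in zip(sweep_scores, is_sweep) if p and s >= threshold)
        fp = sum(1 for s, p in zip(sweep_scores, is_sweep) if not p and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


if __name__ == "__main__":
    classes = ["hard", "neutral", "soft", "hard-linked"]
    scores = [0.9, 0.1, 0.8, 0.4]
    for fpr, tpr in roc_points(classes, scores):
        print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Restricting the negatives to neutral regions only gives the easier hard-sweep-versus-neutral comparison of Fig 2.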
Fig 4. Heatmaps showing the fraction of regions at varying distances from sweeps inferred to belong to each class by S/HIC, SFselect+, and evolBoosting+.
The location of any sweep relative to the classified window (or "Neutral" if there is no sweep) is shown on the y-axis, while the inferred class is shown on the x-axis. Here, α~U(2.5×10², 2.5×10³). A) Results for S/HIC. B) SFselect+. C) evolBoosting+.
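Each row of such a heatmap is a set of fractions that sum to one: for all test regions sharing a true sweep location, the share assigned to each inferred class. A minimal sketch of that tabulation, assuming parallel lists of true locations and predicted classes (the function name `class_fractions` and the label strings are hypothetical):

```python
from collections import Counter, defaultdict


def class_fractions(true_locations, predicted_classes):
    """Tabulate, per true sweep location, the fraction of regions
    assigned to each inferred class (one heatmap row per location)."""
    counts = defaultdict(Counter)
    for location, predicted in zip(true_locations, predicted_classes):
        counts[location][predicted] += 1
    return {
        location: {cls: n / sum(row.values()) for cls, n in row.items()}
        for location, row in counts.items()
    }


if __name__ == "__main__":
    # "0" marks a sweep in the classified window itself; "Neutral" means no sweep.
    locations = ["Neutral", "Neutral", "0", "0", "0", "0"]
    predictions = ["neutral", "soft-linked", "hard", "hard", "soft", "hard"]
    for location, row in class_fractions(locations, predictions).items():
        print(location, row)
```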
Fig 5. Heatmaps showing the fraction of regions at varying distances from strong sweeps inferred to belong to each class by S/HIC, SFselect+, and evolBoosting+.
The location of any sweep relative to the classified window (or "Neutral" if there is no sweep) is shown on the y-axis, while the inferred class is shown on the x-axis. Here, α~U(2.5×10³, 2.5×10⁴). A) Results for S/HIC. B) SFselect+. C) evolBoosting+.
Fig 6. ROC curves showing the true and false positive rates of various methods/statistics when tasked with discriminating between regions containing a sweep (either hard or soft) and unselected regions (either neutral or linked to sweeps) when testing on simulations with Tennessen et al.’s European demographic model.
Here, α~U(5×10³, 5×10⁵), and the methods that require training from simulated sweeps were trained from the same simulations with equilibrium demography as used for Figs 2–5. Note that Tajima’s D and Kim and Nielsen’s ω were omitted from this figure, as we simply used the values of these statistics to generate ROC curves without respect to any demographic model.
Fig 7. Heatmaps showing the fraction of regions simulated under Tennessen et al.’s European demographic model located at varying distances from sweeps inferred to belong to each class by S/HIC, SFselect+, and evolBoosting+.
The location of any sweep relative to the classified window (or "Neutral" if there is no sweep) is shown on the y-axis, while the inferred class is shown on the x-axis. Here, α~U(5×10³, 5×10⁵). These three classifiers were trained from simulations with equilibrium demography. A) Results for S/HIC. B) SFselect+. C) evolBoosting+.
Fig 8. Browser screenshot showing patterns of variation around a putative selective sweep in Europeans within L3MBTL4 on chr18.
Values of π, Tajima’s D, Kelly’s ZnS, and Nielsen et al.’s composite likelihood ratio, all from Pybus et al. [59], are shown. Beneath these statistics we show the classifications from S/HIC (red: hard sweep; faded red: hard-linked; blue: soft sweep; faded blue: soft-linked; black: neutral). This image was generated using the UCSC Genome Browser (http://genome.ucsc.edu).

References

    1. Akey JM. Constructing genomic maps of positive selection in humans: Where do we go from here? Genome Res. 2009;19(5):711–22. doi: 10.1101/gr.086652.108
    2. Wollstein A, Stephan W. Inferring positive selection in humans from genomic data. Investigative Genetics. 2015;6(1):5.
    3. Berry AJ, Ajioka J, Kreitman M. Lack of polymorphism on the Drosophila fourth chromosome resulting from selection. Genetics. 1991;129(4):1111–7.
    4. Kaplan NL, Hudson R, Langley C. The "hitchhiking effect" revisited. Genetics. 1989;123(4):887–99.
    5. Maynard Smith J, Haigh J. The hitch-hiking effect of a favourable gene. Genet Res. 1974;23(1):23–35.