2021 Apr 5;12(4):527.
doi: 10.3390/genes12040527.

On the Unfounded Enthusiasm for Soft Selective Sweeps III: The Supervised Machine Learning Algorithm That Isn't

Eran Elhaik et al. Genes (Basel).

Abstract

In the last 15 years or so, soft selective sweep mechanisms have been catapulted from a curiosity of little evolutionary importance to a ubiquitous mechanism claimed to explain most adaptive evolution and, in some cases, most evolution. This transformation was aided by a series of articles by Daniel Schrider and Andrew Kern. Within this series, a paper entitled "Soft sweeps are the dominant mode of adaptation in the human genome" (Schrider and Kern, Mol. Biol. Evolut. 2017, 34(8), 1863-1877) attracted a great deal of attention, in particular in conjunction with another paper (Kern and Hahn, Mol. Biol. Evolut. 2018, 35(6), 1366-1371), for purporting to discredit the Neutral Theory of Molecular Evolution (Kimura 1968). Here, we address an alleged novelty in Schrider and Kern's paper, i.e., the claim that their study involved an artificial intelligence technique called supervised machine learning (SML). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known empirically to be true. Curiously, Schrider and Kern did not possess a training dataset of genomic segments known a priori to have evolved either neutrally or through soft or hard selective sweeps. Thus, their claim of using SML is thoroughly and utterly misleading. In the absence of legitimate training datasets, Schrider and Kern used: (1) simulations that employ many manipulatable variables and (2) a system of data cherry-picking rivaling the worst excesses in the literature. These two factors, in addition to the lack of negative controls and the irreproducibility of their results due to incomplete methodological detail, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S/HIC) should be taken with a huge shovel of salt.

Keywords: artificial intelligence (AI); evolutionary biology; molecular and genome evolution; population size; selective sweeps; supervised machine learning (SML).

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Population structure and demography. Changes in population size over time for six populations. (A) Changes in the effective population sizes over time in six populations, as inferred by PSMC. Lines represent the within-population median PSMC estimate, smoothed by fitting a cubic spline passing through bin midpoints. The original figure was published by Auton et al. (The 1000 Genomes Project Consortium 2015, Figure 2). This figure was created using code and data provided by Dr. Adam Auton to include only the relevant populations. The X-axis is log-scaled. (B) Schrider and Kern’s [4] data plotted for comparison with Auton et al.’s figure. Schrider and Kern (2017) sampled 26 data points from (A) and scaled them by θ and N0. We θ-scaled the X-axis (to put each population on the same timescale) and log-scaled it to increase the similarity with (A).
Figure 2
Illustration of the annotation method for the simulated “training data” used by Schrider and Kern [4]. Since Schrider and Kern [4] lacked true training data derived from a sample of the true genomic data with the features of interest, they simulated their own dataset. To annotate it so that it can be used to train their classifier, they randomly selected 1.1 Mb regions from the human genome, annotated them using public datasets like phastCons, and copied the annotation to their simulated data. To illustrate the problem with this approach, we start with a real sequence from the human genome (top) for which an annotation exists in phastCons. Let us assume that within this sequence, one region was found to be extremely conserved (red), i.e., subject to strong purifying selection. We then take another string of letters of identical length (bottom), call it the training sequence, and annotate the corresponding positions as “evolving neutrally.” If the “training” sequence is the start of the first sentence in A Tale of Two Cities by Charles Dickens (1859), then the string “… s the worst…” will be deemed to have evolved under purifying selection.
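The positional copying described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Schrider and Kern's pipeline: the `Interval` class, the `copy_annotation` function, the placeholder DNA string, and the coordinates of the conserved region are all hypothetical and were chosen so that the region lines up with the “s the worst” example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A labelled region in 0-based, half-open coordinates (hypothetical type)."""
    start: int
    end: int
    label: str


def copy_annotation(intervals, target):
    """Apply intervals to `target` purely by coordinate.

    Nothing about the content of `target` is consulted, which is the
    problem the caption points out.
    """
    return [(iv.label, target[iv.start:iv.end]) for iv in intervals]


# The "training" sequence: the start of A Tale of Two Cities.
training = "It was the best of times, it was the worst of times"

# Hypothetical stand-in for a real genomic sequence of identical length.
real = ("ACGT" * 13)[:len(training)]

# Assumed annotation of the real sequence: one conserved region.
real_annotation = [Interval(31, 42, "conserved")]

# Copy the annotation onto the unrelated training sequence.
transferred = copy_annotation(real_annotation, training)
print(transferred)  # [('conserved', 's the worst')]
```

Because the transfer depends only on position, any string of the right length receives the same labels, whatever it contains.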

References

    1. Jensen J.D. On the unfounded enthusiasm for soft selective sweeps. Nat. Commun. 2014;5:5281. doi: 10.1038/ncomms6281.
    2. Harris R.B., Sackman A., Jensen J.D. On the unfounded enthusiasm for soft selective sweeps II: Examining recent evidence from humans, flies, and viruses. PLoS Genet. 2018;14:e1007859. doi: 10.1371/journal.pgen.1007859.
    3. Schrider D.R., Kern A.D. S/HIC: Robust Identification of Soft and Hard Sweeps Using Machine Learning. PLoS Genet. 2016;12:e1005928. doi: 10.1371/journal.pgen.1005928.
    4. Schrider D.R., Kern A.D. Soft Sweeps Are the Dominant Mode of Adaptation in the Human Genome. Mol. Biol. Evol. 2017;34:1863–1877. doi: 10.1093/molbev/msx154.
    5. Kern A.D., Schrider D.R. diploS/HIC: An Updated Approach to Classifying Selective Sweeps. G3 Genes Genomes Genet. 2018;8:1959–1970. doi: 10.1534/g3.118.200262.
