On the Unfounded Enthusiasm for Soft Selective Sweeps III: The Supervised Machine Learning Algorithm That Isn't

Eran Elhaik¹, Dan Graur²

Affiliations

¹ Department of Biology, Lund University, Sölvegatan 35, 22362 Lund, Sweden.
² Department of Biology & Biochemistry, University of Houston, Science & Research Building 2, Suite #342, 3455 Cullen Blvd., Houston, TX 77204-5001, USA.

PMID: 33916341
PMCID: PMC8066263
DOI: 10.3390/genes12040527

Genes (Basel). 2021 Apr 5;12(4):527. doi: 10.3390/genes12040527.

Abstract

In the last 15 years or so, soft selective sweep mechanisms have been catapulted from a curiosity of little evolutionary importance to a ubiquitous mechanism claimed to explain most adaptive evolution and, in some cases, most evolution. This transformation was aided by a series of articles by Daniel Schrider and Andrew Kern. Within this series, a paper entitled "Soft sweeps are the dominant mode of adaptation in the human genome" (Schrider and Kern, Mol. Biol. Evolut. 2017, 34(8), 1863-1877) attracted a great deal of attention, in particular in conjunction with another paper (Kern and Hahn, Mol. Biol. Evolut. 2018, 35(6), 1366-1371), for purporting to discredit the Neutral Theory of Molecular Evolution (Kimura 1968). Here, we address an alleged novelty in Schrider and Kern's paper, i.e., the claim that their study involved an artificial intelligence technique called supervised machine learning (SML). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known empirically to be true. Curiously, Schrider and Kern did not possess a training dataset of genomic segments known a priori to have evolved either neutrally or through soft or hard selective sweeps. Thus, their claim of using SML is thoroughly and utterly misleading. In the absence of legitimate training datasets, Schrider and Kern used: (1) simulations that employ many manipulatable variables and (2) a system of data cherry-picking rivaling the worst excesses in the literature. These two factors, in addition to the lack of negative controls and the irreproducibility of their results due to incomplete methodological detail, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S/HIC) should be taken with a huge shovel of salt.

Keywords: artificial intelligence (AI); evolutionary biology; molecular and genome evolution; population size; selective sweeps; supervised machine learning (SML).

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Figure 1**
Population structure and demography. Changes in population size over time for six populations. (A) Changes in effective population size over time in six populations, as inferred by PSMC. Lines represent the within-population median PSMC estimate, smoothed by fitting a cubic spline passing through bin midpoints. The original figure was published by Auton et al. (The 1000 Genomes Project Consortium 2015, Figure 2). This figure was created using code and data provided by Dr. Adam Auton to include only the relevant populations. The X-axis is log-scaled. (B) Schrider and Kern's [4] data, plotted for comparison with Auton et al.'s figure. Schrider and Kern (2017) sampled 26 data points from (A) and scaled them by θ and N₀. We θ-scaled the X-axis (to put each population on the same timescale) and log-scaled it to increase the similarity with (A).

**Figure 2**
Illustration of the annotation method for the simulated “training data” used by Schrider and Kern [4]. Since Schrider and Kern [4] lacked true training data derived from a sample of the true genomic data with the features of interest, they simulated their own dataset. To annotate it so that it can be used to train their classifier, they randomly selected 1.1 Mb regions from the human genome, annotated them using public datasets like *phastCons*, and copied the annotation to their simulated data. To illustrate the problem with this approach, we start with a real sequence from the human genome (top) for which an annotation exists in phastCons. Let us assume that within this sequence, one region was found to be extremely conserved (red), i.e., subject to strong purifying selection. We then take another string of letters of identical length (bottom), call it the training sequence, and annotate the corresponding positions as “evolving neutrally.” If the “training” sequence is the start of the first sentence in *A Tale of Two Cities* by Charles Dickens (1859), then the string “… s the worst…” will be deemed to have evolved under purifying selection.
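
The positional copying described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DNA string and the interval coordinates are made-up placeholders, not data from phastCons or from Schrider and Kern [4], and `copy_annotation` is a hypothetical helper name. The only point is that a label attached to coordinates carries over to whatever happens to occupy those coordinates in an unrelated string.

```python
# The "training" string: the start of the first sentence of A Tale of Two Cities.
TRAINING_TEXT = "It was the best of times, it was the worst of times"

# Hypothetical stand-in for a real genomic segment of the same length.
REAL_SEQUENCE = ("ACGT" * 13)[: len(TRAINING_TEXT)]

# Hypothetical annotation of the real sequence: half-open [start, end)
# intervals that a conservation track marked as strongly conserved.
CONSERVED_INTERVALS = [(31, 42)]


def copy_annotation(intervals, target):
    """Return the substrings of `target` that inherit each label by position alone."""
    return [target[start:end] for start, end in intervals]


# The label was derived from REAL_SEQUENCE, yet it is applied to TRAINING_TEXT.
labeled = copy_annotation(CONSERVED_INTERVALS, TRAINING_TEXT)
print(labeled)  # ['s the worst']
```

Nothing in `copy_annotation` inspects the content of either sequence, which is the problem the caption describes: the "conserved" label says something about the real sequence and nothing about the string it is pasted onto.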

References

1. Jensen J.D. On the unfounded enthusiasm for soft selective sweeps. Nat. Commun. 2014;5:5281. doi: 10.1038/ncomms6281.
2. Harris R.B., Sackman A., Jensen J.D. On the unfounded enthusiasm for soft selective sweeps II: Examining recent evidence from humans, flies, and viruses. PLoS Genet. 2018;14:e1007859. doi: 10.1371/journal.pgen.1007859.
3. Schrider D.R., Kern A.D. S/HIC: Robust Identification of Soft and Hard Sweeps Using Machine Learning. PLoS Genet. 2016;12:e1005928. doi: 10.1371/journal.pgen.1005928.
4. Schrider D.R., Kern A.D. Soft Sweeps Are the Dominant Mode of Adaptation in the Human Genome. Mol. Biol. Evol. 2017;34:1863–1877. doi: 10.1093/molbev/msx154.
5. Kern A.D., Schrider D.R. diploS/HIC: An Updated Approach to Classifying Selective Sweeps. G3 Genes Genomes Genet. 2018;8:1959–1970. doi: 10.1534/g3.118.200262.
