IntroUNET: Identifying introgressed alleles via semantic segmentation

doi:10.1371/journal.pgen.1010657

PLoS Genet. 2024 Feb 20;20(2):e1010657.

doi: 10.1371/journal.pgen.1010657. eCollection 2024 Feb.

Dylan D Ray¹, Lex Flagel²,³, Daniel R Schrider¹

Affiliations

¹ Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America.
² Division of Data Science, Gencove Inc., New York, New York, United States of America.
³ Department of Plant and Microbial Biology, University of Minnesota, Saint Paul, Minnesota, United States of America.

PMID: 38377104
PMCID: PMC10906877
DOI: 10.1371/journal.pgen.1010657

Dylan D Ray et al. PLoS Genet. 2024.

Abstract

A growing body of evidence suggests that gene flow between closely related species is a widespread phenomenon. Alleles that introgress from one species into a close relative are typically neutral or deleterious, but sometimes confer a significant fitness advantage. Given the potential relevance to speciation and adaptation, numerous methods have therefore been devised to identify regions of the genome that have experienced introgression. Recently, supervised machine learning approaches have been shown to be highly effective for detecting introgression. One especially promising approach is to treat population genetic inference as an image classification problem, and feed an image representation of a population genetic alignment as input to a deep neural network that distinguishes among evolutionary models (i.e. introgression or no introgression). However, if we wish to investigate the full extent and fitness effects of introgression, merely identifying genomic regions in a population genetic alignment that harbor introgressed loci is insufficient; ideally we would be able to infer precisely which individuals have introgressed material and at which positions in the genome. Here we adapt a deep learning algorithm for semantic segmentation, the task of correctly identifying the type of object to which each individual pixel in an image belongs, to the task of identifying introgressed alleles. Our trained neural network is thus able to infer, for each individual in a two-population alignment, which of that individual's alleles were introgressed from the other population. We use simulated data to show that this approach is highly accurate, and that it can be readily extended to identify alleles that are introgressed from an unsampled "ghost" population, performing comparably to a supervised learning method tailored specifically to that task.
Finally, we apply this method to data from Drosophila, showing that it is able to accurately recover introgressed haplotypes from real data. This analysis reveals that introgressed alleles are typically confined to lower frequencies within genic regions, suggestive of purifying selection, but are found at much higher frequencies in a region previously shown to be affected by adaptive introgression. Our method's success in recovering introgressed haplotypes in challenging real-world scenarios underscores the utility of deep learning approaches for making richer evolutionary inferences from genomic data.

Copyright: © 2024 Ray et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Image representation of an example input tensor (left column) and its corresponding output (right column), from a simulated scenario of bidirectional gene flow.**
Here, the two populations are shown as separate matrices, although they are actually part of the same input tensor (i.e. they are the two values along the “channel” dimension in the tensor). The input alignments are represented as black and white images where the ancestral allele is shown in black and the derived allele in white. The output matrices show the locations of alleles in a recipient population that were introgressed from the donor population. Thus, the white pixels in the output for population 1 show alleles that were introgressed from population 2, and the white pixels in the output for population 2 represent alleles introgressed from population 1.
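
The tensor layout described in this caption can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `build_input_tensor`, the toy genotypes, and the requirement that both populations share the same (individuals, sites) shape are not taken from the paper.

```python
import numpy as np

def build_input_tensor(pop1, pop2):
    """Stack two (individuals x sites) 0/1 alignments along a leading channel axis."""
    pop1 = np.asarray(pop1, dtype=np.uint8)
    pop2 = np.asarray(pop2, dtype=np.uint8)
    if pop1.shape != pop2.shape:
        # Assumption for this sketch: both channels have identical shape.
        raise ValueError("both populations must have the same (individuals, sites) shape")
    return np.stack([pop1, pop2], axis=0)  # shape: (2, individuals, sites)

# Toy data: 0 = ancestral allele (black pixel), 1 = derived allele (white pixel).
pop1 = [[0, 1, 0, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 0]]
pop2 = [[1, 1, 0, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0]]
x = build_input_tensor(pop1, pop2)

# Hypothetical ground-truth output with the same shape as the input:
# a 1 marks an allele introgressed from the other population.
y = np.zeros_like(x)
y[0, 1, 1:3] = 1  # second individual of population 1 carries a tract from population 2
```

In this layout, channel 0 of the output marks alleles in population 1 that came from population 2, and channel 1 marks the reverse direction, matching the two output panels in the figure.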

**Fig 2. UNet++ type architecture [68] used for all the problems in this paper.**
The black arrows represent a residual block consisting of two convolutions where each convolution in the series is summed with the previous one, and the convolution layers are concatenated before a non-linear activation (ELU) [74] is applied. The example output of the network is color scaled from 0 to 1 and represents the probability of introgression at a given allele for a given individual. The loss function (represented by the bold $L$) is computed with the ground truth from the simulation and is the weighted binary cross entropy function (Eq 3). The weights and biases of the convolution operations are updated via gradient descent during training. The architecture we use for the problems discussed actually contains four down- and up-sampling operations rather than the three portrayed here.
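
As a rough illustration of a weighted binary cross entropy loss, the NumPy sketch below up-weights the positive (introgressed) class. Eq 3 is not reproduced in this excerpt, so the exact weighting used in the paper may differ; the function name `weighted_bce` and the `pos_weight` parameter are assumptions made for this example.

```python
import numpy as np

def weighted_bce(y_true, p_pred, pos_weight=1.0, eps=1e-7):
    """Mean per-pixel binary cross entropy with an extra weight on positive pixels.

    y_true: 0/1 ground-truth mask; p_pred: predicted probabilities in [0, 1].
    pos_weight is a hypothetical knob, not a value taken from the paper.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) when a prediction is exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    per_pixel = -(pos_weight * y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(per_pixel.mean())

# One introgressed pixel among four, predicted with low confidence.
y_true = [[0, 0, 1, 0]]
p_pred = [[0.1, 0.2, 0.3, 0.1]]
loss_unweighted = weighted_bce(y_true, p_pred, pos_weight=1.0)
loss_weighted = weighted_bce(y_true, p_pred, pos_weight=3.0)
```

Weighting of this kind is commonly used when one class is much rarer than the other, since missed positive pixels then contribute more to the loss than they would under an unweighted average.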

**Fig 3. Example inputs and outputs (both true and inferred) for each of the three problems we used to assess IntroUNET’s effectiveness.**
(A) A simulated example of the simple test scenario of a two-population split followed by a recent single-pulse introgression event (bidirectional, in this case). The first column shows the population genetic alignments for this example, with the two panels corresponding to the two input channels (population 1 and population 2). The second shows the true histories of introgression for this example (again, with white pixels representing introgressed alleles); note that both population 1 and population 2 have introgressed alleles. The third and fourth columns show IntroUNET’s inference on this simulation, with the former showing the most probable class (i.e. introgression or no introgression) for each individual at each polymorphism, and the latter showing the inferred probability of introgression (i.e. the raw softmax output for the introgression class). The color bar for these plots is shown in panel (A), and the scaling is the same for the panels below as well. (B) A simulated example of the archaic ghost introgression scenario. The four columns are the same as in panel (A), but here we are examining a recipient population and a reference population, with the goal of identifying introgression only in the former. Thus, our output has only one population/channel. (C) A simulated example of our *Drosophila* introgression scenario. The four columns are the same as in (A) and (B), and here we are concerned with identifying introgression from *D. simulans* to *D. sechellia*, so again our output has only one channel (i.e. introgressed alleles in *D. sechellia*).

**Fig 4. Accuracy of IntroUNET on the simple introgression scenario.**
(A) Confusion matrix, precision-recall curve, and ROC curve showing IntroUNET’s accuracy when trained to detect introgression in the direction of population 1 to population 2 and tested on data with introgression in this same direction. (B) Same as (A), but for a network trained and tested on data with introgression from population 2 to population 1. (C) Same as (A) and (B), but for bidirectional introgression. Note that all of these metrics evaluate IntroUNET’s ability to accurately identify individual alleles (i.e. a prediction is made for each pixel in each input image in the test set, and the accuracy of this prediction is evaluated).
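
Per-pixel evaluation of this kind can be sketched as follows. The helper `pixel_metrics`, the 0.5 threshold, and the toy arrays are assumptions for illustration; they are not the paper's evaluation code or data.

```python
import numpy as np

def pixel_metrics(y_true, p_pred, threshold=0.5):
    """Confusion-matrix counts, precision, and recall over every pixel (allele)."""
    y_true = np.asarray(y_true).astype(bool).ravel()
    # Call a pixel introgressed when its predicted probability reaches the threshold.
    y_hat = np.asarray(p_pred, dtype=float).ravel() >= threshold
    tp = int(np.sum(y_true & y_hat))
    fp = int(np.sum(~y_true & y_hat))
    fn = int(np.sum(y_true & ~y_hat))
    tn = int(np.sum(~y_true & ~y_hat))
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

# Toy (individuals x sites) truth mask and predicted probabilities.
y_true = np.array([[0, 1, 1, 0],
                   [0, 0, 1, 0]])
p_pred = np.array([[0.1, 0.9, 0.4, 0.7],
                   [0.2, 0.1, 0.8, 0.3]])
m = pixel_metrics(y_true, p_pred)
```

Sweeping `threshold` from 0 to 1 and recording precision and recall at each value would trace out a precision-recall curve of the kind shown in the figure.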

**Fig 5. Accuracy of IntroUNET and ArchIE on the archaic ghost introgression scenario.**
(A-B) Confusion matrices, (C) precision-recall curves, and (D) ROC curves showing IntroUNET’s and ArchIE’s [40] accuracy when trained to detect introgression from a ghost population to a recipient population when given population genetic data from the recipient population and a closely related reference population.

**Fig 6. Accuracy of IntroUNET on the *Drosophila* introgression scenario.**
(A) Confusion matrix for the uncalibrated IntroUNET when applied to test data simulated under the *Drosophila* scenario as specified in the Methods. (B) Confusion matrix for the recalibrated IntroUNET. (C) and (D) show the precision-recall and ROC curves for the *Drosophila* IntroUNET; note that these curves are not affected by recalibration.
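
One way to see why recalibration can change a confusion matrix while leaving the ROC curve untouched: if the recalibration is a strictly increasing remapping of the scores, the ranking of pixels is preserved, so rank-based metrics are unchanged, but calls at a fixed threshold can shift. The sketch below assumes a Platt-style logistic remapping as a stand-in; this excerpt does not state which recalibration method the paper used, and the parameters `a` and `b` are hypothetical rather than fitted.

```python
import numpy as np

def auc(y_true, scores):
    """ROC AUC as the chance that a random positive outscores a random negative (ties count half)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]  # every positive-negative pair
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

def platt_like(p, a, b, eps=1e-7):
    """Strictly increasing logistic remapping of the logit (requires a > 0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    logit = np.log(p / (1.0 - p))
    return 1.0 / (1.0 + np.exp(-(a * logit + b)))

y_true = np.array([0, 0, 1, 0, 1, 1])
p_raw = np.array([0.05, 0.20, 0.30, 0.40, 0.45, 0.90])
p_cal = platt_like(p_raw, a=1.0, b=1.0)  # hypothetical parameters

calls_raw = p_raw >= 0.5  # calls at a fixed 0.5 threshold before remapping
calls_cal = p_cal >= 0.5  # calls at the same threshold after remapping
auc_raw, auc_cal = auc(y_true, p_raw), auc(y_true, p_cal)
```

In this toy example the number of pixels called introgressed at the 0.5 threshold changes after remapping, while the AUC is identical; the same ranking argument applies to the precision-recall curve.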

**Fig 7. The distributions of predicted frequencies of introgressed haplotypes in (A) genic (red) and intergenic (blue) regions across the genome and (B) the sweep region on chr3R (blue) and other regions of the genome (red).**
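
A frequency summary of this kind could be computed from a predicted mask roughly as below. This is a sketch under assumptions: the per-site averaging, the helper name `introgressed_frequency`, and the `is_genic` annotation are illustrative and may not match how the paper defines haplotype frequencies. The toy values are not intended to reproduce the reported pattern.

```python
import numpy as np

def introgressed_frequency(pred_mask):
    """Per-site fraction of sampled haplotypes called introgressed.

    pred_mask: (individuals x sites) array of 0/1 calls for the recipient population.
    """
    return np.asarray(pred_mask, dtype=float).mean(axis=0)

# Toy predicted calls for four haplotypes at four sites.
pred_mask = np.array([[1, 1, 0, 0],
                      [1, 0, 0, 0],
                      [1, 1, 0, 1],
                      [0, 0, 0, 0]])
is_genic = np.array([True, True, False, False])  # hypothetical per-site annotation

freq = introgressed_frequency(pred_mask)
genic_freq = freq[is_genic]        # frequencies at sites annotated as genic
intergenic_freq = freq[~is_genic]  # frequencies at the remaining sites
```

Histograms of the two resulting arrays would give distributions comparable in form to those in panel (A).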

Update of

IntroUNET: identifying introgressed alleles via semantic segmentation.
Ray DD, Flagel L, Schrider DR. bioRxiv [Preprint]. 2024 Jan 23:2023.02.07.527435. doi: 10.1101/2023.02.07.527435. Update in: PLoS Genet. 2024 Feb 20;20(2):e1010657. doi: 10.1371/journal.pgen.1010657. PMID: 36865105 Free PMC article.

Cited by

Digital Image Processing to Detect Adaptive Evolution.
Amin MR, Hasan M, DeGiorgio M. Mol Biol Evol. 2024 Dec 6;41(12):msae242. doi: 10.1093/molbev/msae242. PMID: 39565932 Free PMC article.
Tree sequences as a general-purpose tool for population genetic inference.
Whitehouse LS, Ray D, Schrider DR. bioRxiv [Preprint]. 2024 Oct 5:2024.02.20.581288. doi: 10.1101/2024.02.20.581288. Update in: Mol Biol Evol. 2024 Nov 1;41(11):msae223. doi: 10.1093/molbev/msae223. PMID: 39185244 Free PMC article.
Interpreting Generative Adversarial Networks to Infer Natural Selection from Genetic Data.
Riley R, Mathieson I, Mathieson S. bioRxiv [Preprint]. 2023 Jul 9:2023.03.07.531546. doi: 10.1101/2023.03.07.531546. Update in: Genetics. 2024 Apr 3;226(4):iyae024. doi: 10.1093/genetics/iyae024. PMID: 36945387 Free PMC article.
Tree Sequences as a General-Purpose Tool for Population Genetic Inference.
Whitehouse LS, Ray DD, Schrider DR. Mol Biol Evol. 2024 Nov 1;41(11):msae223. doi: 10.1093/molbev/msae223. PMID: 39460991 Free PMC article.
Estimation of spatial demographic maps from polymorphism data using a neural network.
Smith CCR, Patterson G, Ralph PL, Kern AD. Mol Ecol Resour. 2024 Oct;24(7):e14005. doi: 10.1111/1755-0998.14005. Epub 2024 Aug 16. PMID: 39152666 Free PMC article.

References

1. Mallet J, Besansky N, Hahn MW. How reticulated are species? BioEssays. 2016;38(2):140–149. doi: 10.1002/bies.201500149
2. Rieseberg LH, Wendel JF, et al. Introgression and its consequences in plants. Hybrid zones and the evolutionary process. 1993;70:109.
3. Suvorov A, Kim BY, Wang J, Armstrong EE, Peede D, D’agostino ER, et al. Widespread introgression across a phylogeny of 155 Drosophila genomes. Current Biology. 2022;32(1):111–123. doi: 10.1016/j.cub.2021.10.052
4. Vanderpool D, Minh BQ, Lanfear R, Hughes D, Murali S, Harris RA, et al. Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression. PLoS biology. 2020;18(12):e3000954. doi: 10.1371/journal.pbio.3000954
5. Arnegard ME, McGee MD, Matthews B, Marchinko KB, Conte GL, Kabir S, et al. Genetics of ecological divergence during speciation. Nature. 2014;511(7509):307–311. doi: 10.1038/nature13301
