Mol Biol Evol. 2013 Sep;30(9):2224-34.

doi: 10.1093/molbev/mst112. Epub 2013 Jun 18.

SweeD: likelihood-based detection of selective sweeps in thousands of genomes

Pavlos Pavlidis¹, Daniel Živkovic, Alexandros Stamatakis, Nikolaos Alachiotis

Affiliation

¹ Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg, Heidelberg, Germany. pavlidisp@gmail.com

PMID: 23777627
PMCID: PMC3748355
DOI: 10.1093/molbev/mst112

Pavlos Pavlidis et al. Mol Biol Evol. 2013 Sep.

Abstract

The advent of modern DNA sequencing technology is the driving force in obtaining complete intraspecific genomes that can be used to detect loci that have been subject to positive selection in the recent past. Based on selective sweep theory, beneficial loci can be detected by examining the single nucleotide polymorphism patterns in intraspecific genome alignments. In the last decade, a plethora of algorithms for identifying selective sweeps have been developed. However, the majority of these algorithms have not been designed for analyzing whole-genome data. We present SweeD (Sweep Detector), an open-source tool for the rapid detection of selective sweeps in whole genomes. It analyzes site frequency spectra and represents a substantial extension of the widely used SweepFinder program. The sequential version of SweeD is up to 22 times faster than SweepFinder and, more importantly, is able to analyze thousands of sequences. We also provide a parallel implementation of SweeD for multi-core processors. Furthermore, we implemented a checkpointing mechanism that allows SweeD to be deployed on cluster systems with queue execution time restrictions and allows long-running analyses to be resumed after processor failures. In addition, the user can specify various demographic models via the command line to calculate their theoretically expected site frequency spectra. Therefore, in contrast to SweepFinder, the neutral site frequencies can optionally be calculated directly from a given demographic model. We show that an increase in sample size results in more precise detection of positive selection. Thus, the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection. We validate SweeD via simulations and by scanning the first chromosome from the human 1000 Genomes Project for selective sweeps. We compare SweeD results with results from a linkage-disequilibrium-based approach and identify common outliers.
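The abstract notes that SweeD analyzes site frequency spectra (SFS). As a rough illustration of what an SFS is, the following minimal sketch counts, for a small alignment, how many polymorphic sites carry the derived allele in exactly k of n sequences. The input encoding (strings of `0` for ancestral and `1` for derived) and the function name `site_frequency_spectrum` are assumptions made for this example; it is not SweeD's input format or implementation, and it does not compute the likelihood score that SweeD derives from the SFS.

```python
from collections import Counter


def site_frequency_spectrum(alignment):
    """Return the unfolded SFS of a 0/1 alignment (hypothetical helper).

    alignment: list of equal-length strings, one per sequence,
               where '1' marks the derived allele at a site.
    Returns a list of length n - 1 whose entry k - 1 is the number of
    sites at which exactly k of the n sequences carry the derived allele.
    """
    n = len(alignment)
    n_sites = len(alignment[0])
    counts = Counter()
    for site in range(n_sites):
        derived = sum(1 for seq in alignment if seq[site] == "1")
        # Monomorphic sites (0 or n derived copies) are not part of the SFS.
        if 0 < derived < n:
            counts[derived] += 1
    return [counts[k] for k in range(1, n)]


# Toy alignment: 4 sequences, 4 sites (illustrative data only).
toy = ["1100", "1010", "1000", "0000"]
sfs = site_frequency_spectrum(toy)
```

In this toy alignment the first site is derived in three sequences, the second and third sites in one sequence each, and the fourth site is monomorphic and therefore ignored.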

Keywords: high-performance computing; positive selection; selective sweep; site frequency spectrum.

Figures

**Fig. 1.**
Comparison of peak memory consumption between SweeD and SweepFinder. Simulated data sets of 100 SNPs (A) and 100,000 SNPs (B) and 25, 50, 100, 200, and 400 respective sequences were used for the measurements. Memory consumption was quantified with the massif tool of the valgrind software (Seward and Nethercote 2005). In most cases, SweeD consumes more memory than SweepFinder due to the lookup table implementation. However, memory consumption is in the order of MBs even for very large data sets.

**Fig. 2.**
Speedup measurements using up to 48 cores for the analysis of simulated data sets consisting of 100 (A) and 10,000 (B) sequences with 10,000, 100,000, and 1,000,000 SNPs, respectively.

**Fig. 3.**
Assessment of the accuracy of predicting the selective sweep position for various sample sizes. The x axis in both plots shows the distance d of the reported selective sweep position from the true selective sweep position. Distance is grouped in bins of size 10,000, i.e., d₁ = 10,000, d₂ = 20,000, … , d₂₀ = 200,000. For each bin i, the y axis shows the frequency of simulated data sets with a reported selective sweep position at a distance less than *d_i*. (A) Plot refers to a constant population model and (B) refers to a bottlenecked population model. Details regarding the simulation parameters are described in the main text. The straight line depicts the expected percentage of simulations at each bin if the position of a reported selective sweep were distributed uniformly along the simulated fragment of 400 kb. The figure shows that the accuracy of detecting selective sweeps increases with the sample size in both constant-size and bottlenecked populations. By comparing A with B, we see that the detection of a selective sweep is more accurate in constant-size than in bottlenecked populations.
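The binning described in the Fig. 3 caption can be sketched as follows. The bin size (10,000), the number of bins (20), and the 400 kb fragment length come from the caption; the function names and the toy distances are hypothetical. The uniform baseline additionally assumes that the true sweep sits at the midpoint of the fragment, which the caption does not state, so treat that part as an assumption rather than the authors' procedure.

```python
def cumulative_accuracy(distances, bin_size=10_000, n_bins=20):
    """Fraction of data sets whose reported sweep lies closer than d_i.

    distances: absolute distances (in bp) between the reported and the
               true sweep position, one value per simulated data set.
    Returns one fraction per bin i, with d_i = i * bin_size.
    """
    total = len(distances)
    return [
        sum(1 for d in distances if d < i * bin_size) / total
        for i in range(1, n_bins + 1)
    ]


def uniform_expectation(fragment_length=400_000, bin_size=10_000, n_bins=20):
    """Expected fractions if reported positions were uniform on the fragment.

    Assumption (not stated in the caption): the true sweep is at the
    fragment midpoint, so P(distance < d) = d / (fragment_length / 2).
    """
    half = fragment_length / 2
    return [min(1.0, i * bin_size / half) for i in range(1, n_bins + 1)]


# Toy distances for four hypothetical data sets (not results from the paper).
toy_distances = [5_000, 15_000, 15_000, 250_000]
curve = cumulative_accuracy(toy_distances)
baseline = uniform_expectation()
```

A method that localizes sweeps well produces a curve that rises above the baseline in the first few bins; a curve that tracks the baseline is no better than a uniformly random guess.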

**Fig. 4.**
Scan of the human chromosome 1 for selective sweeps. (A) The x axis denotes the position on chromosome 1, and the y axis shows the CLR evaluated by SweeD (upper panel) and the ω-statistic (bottom panel) evaluated by OmegaPlus. (B) The joint plot for SweeD and OmegaPlus. Red points denote outliers at a significance level of 1%. The genes located in the outlier regions are described in the supplementary material (supplementary table S1 in supplementary section S4, Supplementary Material online).

**Fig. 5.**
Comparison of the time (in seconds) required to estimate the average SFS by using either simulations or SweeD. The four thin black lines represent time needed by simulating a sample from a bottlenecked population 10 times (solid line), 100 times (dashed line), or 1,000 times (dotted line). Number of simulated replications affects the accuracy of estimation; more replications result in more accurate estimation (supplementary fig. S3 in supplementary section S5, Supplementary Material online). The thick gray line shows the time needed for SweeD (with the MPFR library) to estimate the average SFS of the same demographic scenario. The command line used for generating the simulated data sets is provided in the supplementary section S6, Supplementary Material online.

**Fig. 6.**
Comparison of memory consumption (A) and run-time (B) of SweeD (where the SFS is computed by the data itself) and SweeD using the MPFR library to calculate the analytical SFS. Simulated standard neutral data sets of 500 SNPs and 25, 50, 100, 200, and 400 sequences were used for the measurements. Memory consumption was quantified with the massif tool of the valgrind software (Seward and Nethercote 2005).

Cited by

Detection of selection signatures in farmed coho salmon (Oncorhynchus kisutch) using dense genome-wide information.
López ME, Cádiz MI, Rondeau EB, Koop BF, Yáñez JM. Sci Rep. 2021 May 6;11(1):9685. doi: 10.1038/s41598-021-86154-w. PMID: 33958603.
Deciphering the Genetic Landscape: Insights Into the Genomic Signatures of Changle Goose.
Chen H, Wu Y, Zhu Y, Luo K, Zheng S, Tang H, Xuan R, Huang Y, Li J, Xiong R, Fang X, Wang L, Gong Y, Miao J, Zhou J, Tan H, Wang Y, Wu L, Ouyang J, Huang M, Yan X. Evol Appl. 2024 Aug 22;17(8):e13768. doi: 10.1111/eva.13768. PMID: 39175938.
Forty Years of Inferential Methods in the Journals of the Society for Molecular Biology and Evolution.
Russo CAM, Eyre-Walker A, Katz LA, Gaut BS. Mol Biol Evol. 2024 Jan 3;41(1):msad264. doi: 10.1093/molbev/msad264. PMID: 38197288.
Genomic analyses provide insights into peach local adaptation and responses to climate change.
Li Y, Cao K, Li N, Zhu G, Fang W, Chen C, Wang X, Guo J, Wang Q, Ding T, Wang J, Guan L, Wang J, Liu K, Guo W, Arús P, Huang S, Fei Z, Wang L. Genome Res. 2021 Apr;31(4):592-606. doi: 10.1101/gr.261032.120. PMID: 33687945.
Extensive crop-wild hybridization during Brassica evolution and selection during the domestication and diversification of Brassica crops.
Saban JM, Romero AJ, Ezard THG, Chapman MA. Genetics. 2023 Apr 6;223(4):iyad027. doi: 10.1093/genetics/iyad027. PMID: 36810660.

References

1. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.
2. Alachiotis N, Stamatakis A, Pavlidis P. OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets. Bioinformatics. 2012;28:2274–2275.
3. Ansel J, Arya K, Cooperman G. DMTCP: transparent checkpointing for cluster computations and the desktop. 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS’09). IEEE (Montreal, Canada) 2009; pp. 1–12.
4. Chen GK, Marjoram P, Wall JD. Fast and flexible simulation of DNA sequence data. Genome Res. 2009;19:136–142.
5. Evans SN, Shvets Y, Slatkin M. Non-equilibrium theory of the allele frequency spectrum. Theor Popul Biol. 2007;71:109–119.
