Mol Biol Evol. 2013 Sep;30(9):2224-34.
doi: 10.1093/molbev/mst112. Epub 2013 Jun 18.

SweeD: likelihood-based detection of selective sweeps in thousands of genomes

Pavlos Pavlidis et al. Mol Biol Evol. 2013 Sep.

Abstract

The advent of modern DNA sequencing technology is the driving force in obtaining complete intraspecific genomes that can be used to detect loci that have been subject to positive selection in the recent past. Based on selective sweep theory, beneficial loci can be detected by examining the single nucleotide polymorphism patterns in intraspecific genome alignments. In the last decade, a plethora of algorithms for identifying selective sweeps have been developed. However, the majority of these algorithms have not been designed for analyzing whole-genome data. We present SweeD (Sweep Detector), an open-source tool for the rapid detection of selective sweeps in whole genomes. It analyzes site frequency spectra and represents a substantial extension of the widely used SweepFinder program. The sequential version of SweeD is up to 22 times faster than SweepFinder and, more importantly, is able to analyze thousands of sequences. We also provide a parallel implementation of SweeD for multi-core processors. Furthermore, we implemented a checkpointing mechanism that allows SweeD to be deployed on cluster systems with queue execution time restrictions, as well as to resume long-running analyses after processor failures. In addition, the user can specify various demographic models via the command line to calculate their theoretically expected site frequency spectra. Therefore, in contrast to SweepFinder, the neutral site frequencies can optionally be calculated directly from a given demographic model. We show that an increase in sample size results in more precise detection of positive selection. Thus, the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection. We validate SweeD via simulations and by scanning the first chromosome from the human 1000 Genomes Project for selective sweeps. We compare SweeD results with results from a linkage-disequilibrium-based approach and identify common outliers.

Keywords: high-performance computing; positive selection; selective sweep; site frequency spectrum.
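
As a rough illustration of the site frequency spectrum (SFS) that the abstract refers to, the following Python sketch counts, for each site of a small alignment, how many sequences carry the derived allele. The input encoding (strings of '0' for ancestral and '1' for derived), the function names, and the toy alignment are assumptions made for this example; they do not reflect SweeD's input formats or its internal implementation. The second function uses the standard neutral, constant-size expectation that the SFS is proportional to 1/i, which is a textbook result rather than something stated in the abstract, and it does not cover the other demographic models SweeD supports.

```python
def site_frequency_spectrum(alignment):
    """Return the unfolded SFS of a 0/1 alignment.

    alignment: list of equal-length strings, one per sequence
               ('0' = ancestral allele, '1' = derived allele; assumed encoding).
    Result: list sfs of length n + 1, where sfs[i] is the number of sites
            at which exactly i of the n sequences carry the derived allele.
    """
    n = len(alignment)
    if n == 0:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")

    sfs = [0] * (n + 1)
    for site in range(length):
        derived_count = sum(1 for seq in alignment if seq[site] == "1")
        sfs[derived_count] += 1
    return sfs


def neutral_expected_sfs(n):
    """Relative expected SFS for n sequences under the standard neutral model.

    Returns probabilities for derived-allele counts i = 1 .. n - 1,
    proportional to 1 / i and normalized to sum to 1.
    """
    if n < 2:
        raise ValueError("need at least two sequences")
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy alignment: 4 sequences, 4 sites (hypothetical data).
toy = ["0101", "0110", "0100", "0000"]
observed = site_frequency_spectrum(toy)
expected = neutral_expected_sfs(len(toy))
```

For the toy alignment, `observed` counts one monomorphic site, two singletons, and one site where three sequences carry the derived allele. A sweep scan compares spectra of this kind along the genome against a neutral expectation such as `expected`.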

Figures

Fig. 1.
Comparison of peak memory consumption between SweeD and SweepFinder. Simulated data sets of 100 SNPs (A) and 100,000 SNPs (B), each with 25, 50, 100, 200, or 400 sequences, were used for the measurements. Memory consumption was quantified with the massif tool of the valgrind software (Seward and Nethercote 2005). In most cases, SweeD consumes more memory than SweepFinder due to the lookup table implementation. However, memory consumption is on the order of megabytes even for very large data sets.
Fig. 2.
Speedup measurements using up to 48 cores for the analysis of simulated data sets consisting of 100 (A) and 10,000 (B) sequences with 10,000, 100,000, and 1,000,000 SNPs, respectively.
Fig. 3.
Assessment of the accuracy of predicting the selective sweep position for various sample sizes. The x axis in both plots shows the distance d of the reported selective sweep position from the true selective sweep position. Distances are grouped in bins of size 10,000, i.e., d1 = 10,000, d2 = 20,000, … , d20 = 200,000. For each bin i, the y axis shows the frequency of simulated data sets with a reported selective sweep position at a distance less than di. (A) refers to a constant population model and (B) refers to a bottlenecked population model. Details regarding the simulation parameters are described in the main text. The straight line depicts the expected percentage of simulations at each bin if the position of a reported selective sweep were distributed uniformly along the simulated fragment of 400 kb. The figure shows that the accuracy of detecting selective sweeps increases with the sample size in both constant-size and bottlenecked populations. Comparing A with B shows that the detection of a selective sweep is more accurate in constant-size than in bottlenecked populations.
Fig. 4.
Scan of the human chromosome 1 for selective sweeps. (A) The x axis denotes the position on chromosome 1, and the y axis shows the CLR evaluated by SweeD (upper panel) and the ω-statistic (bottom panel) evaluated by OmegaPlus. (B) The joint plot for SweeD and OmegaPlus. Red points denote outliers at a significance level of 1%. The genes located in the outlier regions are described in the supplementary material (supplementary table S1 in supplementary section S4, Supplementary Material online).
Fig. 5.
Comparison of the time (in seconds) required to estimate the average SFS by using either simulations or SweeD. The four thin black lines represent the time needed to simulate a sample from a bottlenecked population 10 times (solid line), 100 times (dashed line), or 1,000 times (dotted line). The number of simulated replicates affects the accuracy of the estimate; more replicates result in a more accurate estimate (supplementary fig. S3 in supplementary section S5, Supplementary Material online). The thick gray line shows the time needed for SweeD (with the MPFR library) to estimate the average SFS of the same demographic scenario. The command line used for generating the simulated data sets is provided in supplementary section S6, Supplementary Material online.
Fig. 6.
Comparison of memory consumption (A) and run-time (B) of SweeD (where the SFS is computed from the data itself) and SweeD using the MPFR library to calculate the analytical SFS. Simulated standard neutral data sets of 500 SNPs and 25, 50, 100, 200, and 400 sequences were used for the measurements. Memory consumption was quantified with the massif tool of the valgrind software (Seward and Nethercote 2005).
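
The cumulative accuracy curve described in the caption of Fig. 3 can be sketched in a few lines of Python. The function names and the example distances below are hypothetical. The uniform baseline additionally assumes that the true sweep sits at the center of the 400 kb fragment, which is consistent with the largest bin being 200,000 but is not stated in the caption.

```python
def cumulative_accuracy(distances, bin_size=10_000, n_bins=20):
    """Fraction of data sets whose reported sweep lies closer than d_i.

    distances: absolute distances between reported and true sweep position,
               one per simulated data set.
    Returns a list with one value per bin, where d_i = i * bin_size.
    """
    if not distances:
        raise ValueError("need at least one distance")
    total = len(distances)
    return [
        sum(1 for d in distances if d < i * bin_size) / total
        for i in range(1, n_bins + 1)
    ]


def uniform_expectation(fragment_length=400_000, bin_size=10_000, n_bins=20):
    """Expected fraction per bin if reported positions were uniform.

    Assumes (for this sketch only) that the true sweep is at the center of
    the fragment, so P(distance < d) = 2 * d / fragment_length, capped at 1.
    """
    return [
        min(1.0, 2 * i * bin_size / fragment_length)
        for i in range(1, n_bins + 1)
    ]


# Hypothetical distances (in bp) for four simulated data sets.
example = [5_000, 15_000, 250_000, 9_999]
curve = cumulative_accuracy(example)
baseline = uniform_expectation()
```

A method that localizes sweeps better than chance produces a `curve` lying above `baseline` in the early bins, which is the comparison the straight line in the figure supports.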

References

    1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.
    Alachiotis N, Stamatakis A, Pavlidis P. OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets. Bioinformatics. 2012;28:2274–2275.
    Ansel J, Arya K, Cooperman G. DMTCP: transparent checkpointing for cluster computations and the desktop. In: 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS’09). Montreal, Canada: IEEE; 2009. pp. 1–12.
    Chen GK, Marjoram P, Wall JD. Fast and flexible simulation of DNA sequence data. Genome Res. 2009;19:136–142.
    Evans SN, Shvets Y, Slatkin M. Non-equilibrium theory of the allele frequency spectrum. Theor Popul Biol. 2007;71:109–119.
