Genome Res. 2011 Jun;21(6):936-9.
doi: 10.1101/gr.111120.110. Epub 2010 Oct 27.

Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads

Gerton Lunter et al. Genome Res. 2011 Jun.

Abstract

High-volume sequencing of DNA and RNA is now within reach of any research laboratory and is quickly becoming established as a key research tool. In many workflows, each of the short sequences ("reads") resulting from a sequencing run is first "mapped" (aligned) to a reference sequence to infer the genomic location from which the read derived, a challenging task because of the high data volumes and often large genomes. Existing read mapping software excels in either speed (e.g., BWA, Bowtie, ELAND) or sensitivity (e.g., Novoalign), but not in both. In addition, performance often deteriorates in the presence of sequence variation, particularly for short insertions and deletions (indels). Here, we present a read mapper, Stampy, which uses a hybrid mapping algorithm and a detailed statistical model to achieve both speed and sensitivity, particularly when reads include sequence variation. This results in a higher usable sequence yield and improved accuracy compared to existing software.

Figures

Figure 1.
Recall rates for four sets of 2 million simulated 72-bp paired-end reads, mapped back to the human reference by five read mapping algorithms. Reads included errors following an empirical distribution, as well as additional simulated polymorphisms: 0.1% single nucleotide variants (snp0.001), two single nucleotide variants per read (snp2), and a single large deletion or insertion per read pair (largedeletion and largeinsertion). For details of the simulation procedure, see Supplemental material.
Figure 2.
Recall rates for simulated 72-bp paired-end reads, one of which overlaps a single insertion (A) or deletion (B) of various lengths (horizontal axes). For results for shorter and single-end reads, see Supplemental Figures S8 and S9. A read was required to overlap at least one correct base, but the indel was not required to be correctly called; for indel call rates, see Supplemental Figures S10 and S11.
Figure 3.
Receiver operator characteristics for 72-bp paired-end reads, each of which overlaps a single insertion or deletion of 1–30 bp. For results for shorter and single-end reads, see Supplemental Figure S3.
Figure 4.
Recall rates for 72-bp paired-end reads at a range of divergences to the human reference (horizontal axes; average number of substitutions per site). For results for shorter and single-end reads, see Supplemental Figure S12.
Figure 5.
Pairwise concordance of independently mapped reads. The data (two human samples from the 1000 Genomes Project [The 1000 Genomes Project Consortium 2010]; a divergent mouse subspecies; and human mRNA from an MCF-7 cell line; see text) were mapped to the human or mouse reference genomes (both NCBI build 37) by considering each read of a pair independently. Concordance was calculated as the proportion of reads that mapped to within 500 bp (for genomic DNA) or 10,000 bp (for the mRNA data set) of their mates.
Figure 6.
Reference bias at heterozygous indel sites. The plot shows the cumulative distribution of the proportion of reads supporting the non-reference allele in an individual (NA12878) sequenced to high coverage in the 1000 Genomes Project (The 1000 Genomes Project Consortium 2010), and mapped using MAQ, BWA, and Stampy, across high-confidence heterozygous indel sites (see Supplemental material). A left shift of the curve indicates a bias toward the reference allele.
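
The concordance measure described in the Figure 5 caption can be illustrated with a short sketch. This is not the authors' code: the function name `pairwise_concordance`, the `(chromosome, coordinate)` representation of a mapped read, and the choice to count only mapped reads in the denominator are all assumptions made for illustration, since the caption does not specify them.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical representation: a mapped read is (chromosome, coordinate);
# None means the read was left unmapped.
Position = Optional[Tuple[str, int]]


def pairwise_concordance(
    pairs: Iterable[Tuple[Position, Position]],
    max_distance: int = 500,
) -> float:
    """Fraction of mapped reads lying within max_distance bp of their mate.

    Assumption: unmapped reads are excluded from the denominator, and a read
    whose mate is unmapped or on another chromosome counts as discordant.
    """
    mapped_reads = 0
    concordant_reads = 0
    for first, second in pairs:
        # Each read of the pair is scored separately against its mate.
        for read, mate in ((first, second), (second, first)):
            if read is None:
                continue
            mapped_reads += 1
            if (
                mate is not None
                and read[0] == mate[0]
                and abs(read[1] - mate[1]) <= max_distance
            ):
                concordant_reads += 1
    return concordant_reads / mapped_reads if mapped_reads else 0.0
```

Under these assumptions, genomic DNA would use the default threshold of 500 bp, while the mRNA data set would be scored with `max_distance=10_000`, matching the two thresholds given in the caption.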

References

    1. The 1000 Genomes Project Consortium 2010. A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073
    2. Cox AJ 2007. ELAND: Efficient large-scale alignment of nucleotide databases. Illumina, San Diego
    3. Langmead B, Trapnell C, Pop M, Salzberg SL 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25 doi: 10.1186/gb-2009-10-3-r25
    4. Li H, Durbin R 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760
    5. Li H, Ruan J, Durbin R 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18: 1851–1858