Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads

Gerton Lunter¹, Martin Goodson

PMID: 20980556
PMCID: PMC3106326
DOI: 10.1101/gr.111120.110

Genome Res. 2011 Jun;21(6):936-9. doi: 10.1101/gr.111120.110. Epub 2010 Oct 27.

Affiliation

¹ Wellcome Trust Centre for Human Genetics, Oxford OX3 7BN, United Kingdom. gerton.lunter@well.ox.ac.uk

Abstract

High-volume sequencing of DNA and RNA is now within reach of any research laboratory and is quickly becoming established as a key research tool. In many workflows, each of the short sequences ("reads") resulting from a sequencing run is first "mapped" (aligned) to a reference sequence to infer the genomic location from which the read derived, a challenging task because of the high data volumes and often large genomes. Existing read mapping software excels in either speed (e.g., BWA, Bowtie, ELAND) or sensitivity (e.g., Novoalign), but not in both. In addition, performance often deteriorates in the presence of sequence variation, particularly for short insertions and deletions (indels). Here, we present a read mapper, Stampy, which uses a hybrid mapping algorithm and a detailed statistical model to achieve both speed and sensitivity, particularly when reads include sequence variation. This results in a higher useable sequence yield and improved accuracy compared to that of existing software.

Figures

**Figure 1.**
Recall rates for four sets of 2 million simulated 72-bp paired-end reads, mapped back to the human reference by five read mapping algorithms. Reads included errors following an empirical distribution, as well as additional simulated polymorphisms: 0.1% single nucleotide variants (snp0.001), two single nucleotide variants per read (snp2), and a single large deletion or insertion per read pair (largedeletion and largeinsertion). For details of the simulation procedure, see Supplemental material.

**Figure 2.**
Recall rates for simulated 72-bp paired-end reads, one of which overlaps a single insertion (A) or deletion (B) of various lengths (horizontal axes). For results for shorter and single-end reads, see Supplemental Figures S8 and S9. A read was required to overlap at least one correct base, but the indel was not required to be correctly called; for indel call rates, see Supplemental Figures S10 and S11.

**Figure 3.**
Receiver operator characteristics for 72-bp paired-end reads, each of which overlaps a single insertion or deletion of 1–30 bp. For results for shorter and single-end reads, see Supplemental Figure S3.

**Figure 4.**
Recall rates for 72-bp paired-end reads at a range of divergences to the human reference (horizontal axes; average number of substitutions per site). For results for shorter and single-end reads, see Supplemental Figure S12.

**Figure 5.**
Pairwise concordance of independently mapped reads. The data (two human samples from the 1000 Genomes Project [The 1000 Genomes Project Consortium 2010]; a divergent mouse subspecies; and human mRNA from an MCF-7 cell line; see text) were mapped to the human or mouse reference genomes (both NCBI build 37), treating each read of a pair independently. Concordance was calculated as the proportion of reads that mapped to within 500 bp (for genomic DNA) or 10,000 bp (for the mRNA data set) of their mates.
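
The concordance measure in this caption can be illustrated with a short Python sketch. It is not the authors' code: the `Placement` type, the function name, the use of leftmost mapped coordinates to measure distance, and the choice to count only mapped reads in the denominator are all assumptions made for illustration, since the caption does not specify them.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Placement(NamedTuple):
    """Hypothetical record of where a single read was mapped."""
    chrom: str
    pos: int  # leftmost mapped coordinate (assumed distance reference)


Pair = Tuple[Optional[Placement], Optional[Placement]]


def pairwise_concordance(pairs: Iterable[Pair], max_distance: int = 500) -> float:
    """Fraction of mapped reads placed within max_distance of their mate.

    Each read of a pair is assumed to have been mapped independently;
    None marks an unmapped read. Unmapped reads are excluded from the
    denominator (an assumption, not stated in the caption).
    """
    mapped = 0
    concordant = 0
    for first, second in pairs:
        # Consider each read in turn, with the other read as its mate.
        for read, mate in ((first, second), (second, first)):
            if read is None:
                continue
            mapped += 1
            if (
                mate is not None
                and read.chrom == mate.chrom
                and abs(read.pos - mate.pos) <= max_distance
            ):
                concordant += 1
    return concordant / mapped if mapped else 0.0
```

Under these assumptions, genomic DNA would use the default `max_distance=500`, and the mRNA data set would use `max_distance=10_000` to allow for mates separated by introns.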

**Figure 6.**
Reference bias at heterozygous indel sites. The plot shows the cumulative distribution of the proportion of reads supporting the non-reference allele in an individual (NA12878) sequenced to high coverage in the 1000 Genomes Project (The 1000 Genomes Project Consortium 2010), and mapped using MAQ, BWA, and Stampy, across high-confidence heterozygous indel sites (see Supplemental material). A left shift of the curve indicates a bias toward the reference allele.

References

1. The 1000 Genomes Project Consortium 2010. A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073
2. Cox AJ 2007. ELAND: Efficient large-scale alignment of nucleotide databases. Illumina, San Diego
3. Langmead B, Trapnell C, Pop M, Salzberg SL 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25 doi: 10.1186/gb-2009-10-3-r25
4. Li H, Durbin R 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760
5. Li H, Ruan J, Durbin R 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18: 1851–1858
