Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Comparative Study
. 2012 Jun;19(6):796-813.
doi: 10.1089/cmb.2012.0022. Epub 2012 Apr 16.

Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis

Affiliations
Comparative Study

Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis

Sophie Schbath et al. J Comput Biol. 2012 Jun.

Abstract

Mapping short reads against a reference genome is classically the first step of many next-generation sequencing data analyses, and it should be as accurate as possible. Because of the large number of reads to handle, numerous sophisticated algorithms have been developped in the last 3 years to tackle this problem. In this article, we first review the underlying algorithms used in most of the existing mapping tools, and then we compare the performance of nine of these tools on a well controled benchmark built for this purpose. We built a set of reads that exist in single or multiple copies in a reference genome and for which there is no mismatch, and a set of reads with three mismatches. We considered as reference genome both the human genome and a concatenation of all complete bacterial genomes. On each dataset, we quantified the capacity of the different tools to retrieve all the occurrences of the reads in the reference genome. Special attention was paid to reads uniquely reported and to reads with multiple hits.

PubMed Disclaimer

Figures

FIG. 1.
FIG. 1.
The hashing algorithm. (A) The genome is cut into overlapping 3-mers, and their respective positions in the genome are stored. (B) The read is cut into 3-mers. The 3-mers from the reads are compared to 3-mers from the genome using a hashing procedure. (C) Positions for each seed are sorted and compared to the other seeds. (D) Compatible positions are kept.
FIG. 2.
FIG. 2.
The seed and extend algorithm. (A) A read ATCA is sought for in GATTACA, using seeds of size 2, with one error. Each seed maps once (left part). After extension of each seed (right part), it turns out that only one mapping contains only one error. (B) A read GTCA is sought for in GATTACA, using spaced seeds of size 2, with two errors. Notice that using simple seeds would not retrieve any hit. The spaced seed mask is used to generate the seeds: xC and xA. The hashing algorithm retrieves many hits. After extension, three hits are kept.
FIG. 3.
FIG. 3.
The suffix tree of the genome GATTACA. Dotted arrows indicate that the tree continues there. Double circle indicate that a suffix ends there.
FIG. 4.
FIG. 4.
Suffix array of the genome GATTACA. Last column is the Burrows-Wheeler transform.
FIG. 5.
FIG. 5.
Reconstructing the suffix array.
FIG. 6.
FIG. 6.
Looking for the read GAT.
FIG. 7.
FIG. 7.
The magic trick explained.
FIG. 8.
FIG. 8.
Histogram of the logarithm of the number of occurrences of the 1,122,893 (respectively 2,620,394) reads from formula image (resp. formula image) occurring more than once in the reference genome (left) (resp. (right)).

References

    1. Chen Y. Souaiaia T. Chen T. PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics. 2009;5:2514–2521. - PMC - PubMed
    1. David M. Dzamba M. Lister D., et al. SHRiMP2: sensitive yet practical short read mapping. Bioinformatics. 2011;27:1011–1012. - PubMed
    1. Ferragina P. Manzini G. Opportunistic data structures with applications. Proc. 41st Symp. Found. Comput. Sci. 2000:390–398.
    1. Homer N. Merriman B. Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PLoS ONE. 2009;4:e7767. - PMC - PubMed
    1. Jiang H. Wong WH. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics. 2008;24:2395–2396. - PMC - PubMed

Publication types

LinkOut - more resources