Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Sep;5(9):e1000502.
doi: 10.1371/journal.pcbi.1000502. Epub 2009 Sep 11.

Fast mapping of short sequences with mismatches, insertions and deletions using index structures

Affiliations

Fast mapping of short sequences with mismatches, insertions and deletions using index structures

Steve Hoffmann et al. PLoS Comput Biol. 2009 Sep.

Abstract

With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error-type in Illumina reads are mismatches, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1
Figure 1. Longest prefix matches may fail to deliver the position of the optimally scoring local alignment.
Assume a simple scoring scheme that assigns a score of +1 to a single character match and a score of 0 to a single character mismatch, a single insertions or deletion. Using longest prefix matches bears the risk of ignoring differences in the best, i.e. optimally scoring, local alignment. Its retrieval fails if a longer match can be obtained at another position of the reference sequence by matching a character, that is inserted, deleted, or mismatched in the best local alignment. Depending on the length of the reference genome and its nucleotide composition the probability is determined by the length of the substring that can be matched to the position of the best local alignment before the first difference occurs. (A) The optimally scoring alignment of the read P: = cttcttcggc begins at position 3 of the reference genome S: = atacttcttcggcaga. Let Pi denote the ith suffix of the read P. For each Pi, the starting positions of the longest match in S comprise the position of Pi in the best local alignment (solid green lines). That is, the longest match of P 0 begins at position 3, the longest match of P 1 begins at position 4, the longest match of P 2 begins at position 5 and so forth. (B) For the read P: = cttcgtcggc, the retrieval of the best local alignment fails for all Pi, i<5 (dashed red line) due to the inclusion of a character that results in a mismatch in the optimally scoring local alignment. (C) The read P: = cttctgcggc contains, with respect to the best local alignment, a mismatch at position 5 of the read. Here the position 5 of the read is not included in the longest prefix match and nearly all substrings align correctly to the reference genome.
Figure 2
Figure 2. Matching stems and matching branches.
We give an explanation based on a suffix trie which is equivalent to the suffix interval tree shown in Fig. 5 (see Methods). The suffix trie for S$ with S: = acttcttcggc (left) holds twelve leaves. Each numbered leaf corresponds to exactly one suffix in S. Nodes with only one child are not explicitly shown. Note, that internal nodes implicitly represent all leafs in their respective subtree. Thus, internal nodes can be regarded as sets of suffixes. The right panel holds the longest matches for different matching paths in the trie. Matching the first three suffixes of the read P: = cgtcggc results in three different paths in the suffix trie. Each path is equivalent to a sequence of suffix intervals, a matching stem, in the enhanced suffix array. Let formula image denote the matching stem for Pi = ith suffix of P. The qth interval in formula image, denoted by formula image, implicitly represents the set of suffixes in S matching P[ii+q−1]. The path for the first suffix P 0 is of length two (green solid line). Hence, the equivalent matching stem formula image is a sequence of three intervals: formula image, formula image and formula image. Since formula image only represents the suffix S 7, the longest prefix match of P 0 is of length 2 occurring at position 7 of the reference sequence (right panel). The matching stem formula image for P 1 (red solid line) ends with formula image. Therefore, matches of length one occur at positions 8 and 9 in S. The longest prefix match for P 3 occurs at position 6 of S (dashed orange line). Note, that the intervals formula image of formula image equivalently represent S 6. An alternative path leads to a match with position 4. The branch formula image denotes the alternative that accepts the mismatch of g and t at position 1 of P 0.
Figure 3
Figure 3. Comparison of recall rates and running time for several short read aligners.
Average running time for the different programs (A) in matching runs with 500 000 reads in two different data sets (logarithmic scale; S refers to segemehl). The differences are uniformly distributed and consist of only mismatches (B) or mismatches, insertions and deletions (C). The recall rate describes the fraction of reads which was mapped to the correct position. All programs were used with default parameters. Bowtie was called with option –all and SOAP with option –r 2.
Figure 4
Figure 4. segemehl recall rates for varying difference values and distributions.
Recall rates are depicted for k = 0 (dashed) and k = 1 (solid). For terminal–, 3′– and 5′– increased difference distributions, segemehl achieves a recall rate above 80% for reads with 4 errors.
Figure 5
Figure 5. The enhanced suffix array yields a tree structure of nested suffix intervals.
The enhanced suffix array for the sequence S: = attcttcggc (left) and its suffix interval tree (right), equivalent to the suffix trie in Fig. 2, is shown. The array suf represents the lexicographical order of the suffixes in S$. In other words, S suf[0], S suf[1], …, S suf[n] is the sequence of suffixes of S$ in ascending lexicographic order. The lcp-table lcp is an array of integers such that for each h, 1≤hn, lcp[h] is the length of the longest common prefix of S suf[h−1] and S suf[h]. A suffix interval [lr, h] denotes an interval in the suffix array with lcp[i]≥h for all i, l+1≤ir, i.e. all suffixes in the interval [l+1‥r] have a longest common prefix of length at least h. Additionally, requiring l = 0 or lcp[l]<h makes the suffix interval left maximal and requiring r = n or lcp[r+1]<h makes it right maximal. The suffix interval [0‥10, 0] spans the whole suffix array and is equivalent to the root of a suffix interval tree. This interval contains five subintervals, one for each character in S$, with h = 1. Equivalently, the root node of the suffix interval tree has five children. Note, that two children, labeled by 0 and 11, are singletons. The child nodes of singletons are not explicitly shown here.
Figure 6
Figure 6. The branch closure.
The suffix interval [lr, h], representing some string formula image of length h, is split into its children [lu, h+1], [u+1‥v, h+1] and [v+1‥r, h+1] by matching an additional character a∈{A, C, T}. We proceed building formula image by matching the character C (solid bold green line). Beforehand, alternative suffix intervals are stored in formula image, either representing mismatches (dashed red line), insertions (dashed dotted black line) or deletions (dotted blue line). formula image holds suffix link intervals that in turn branch off from formula image. The branch closure formula image holds all such alternative intervals.
Figure 7
Figure 7. Algorithm.
Enumeration of exact and inexact seeds.

References

    1. Myers G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J ACM. 1999;46:395–415.
    1. Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discr Algorithms. 2004;2:53–86.
    1. Rothberg JM, Leamon JH. The development and impact of 454 sequencing. Nat Biotechnol. 2008;26:1117–1124. - PubMed
    1. Bennett S. Solexa Ltd. Pharmacogenomics. 2004;5:433–438. - PubMed
    1. Huse S, Huber J, Morrison H, Sogin M, Welch D. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology. 2007;8:R143. - PMC - PubMed

Publication types