. 2009 Sep;5(9):e1000502.

doi: 10.1371/journal.pcbi.1000502. Epub 2009 Sep 11.

Fast mapping of short sequences with mismatches, insertions and deletions using index structures

Steve Hoffmann¹, Christian Otto, Stefan Kurtz, Cynthia M Sharma, Philipp Khaitovich, Jörg Vogel, Peter F Stadler, Jörg Hackermüller

Affiliations

PMID: 19750212
PMCID: PMC2730575
DOI: 10.1371/journal.pcbi.1000502

Fast mapping of short sequences with mismatches, insertions and deletions using index structures

Steve Hoffmann et al. PLoS Comput Biol. 2009 Sep.

. 2009 Sep;5(9):e1000502.

doi: 10.1371/journal.pcbi.1000502. Epub 2009 Sep 11.

Authors

Steve Hoffmann¹, Christian Otto, Stefan Kurtz, Cynthia M Sharma, Philipp Khaitovich, Jörg Vogel, Peter F Stadler, Jörg Hackermüller

Affiliation

¹ Bioinformatics Group, Department of Computer Science, University of Leipzig, Leipzig, Germany.

PMID: 19750212
PMCID: PMC2730575
DOI: 10.1371/journal.pcbi.1000502

Abstract

With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error-type in Illumina reads are mismatches, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Longest prefix matches may fail to deliver the position of the optimally scoring local alignment.**
Assume a simple scoring scheme that assigns a score of +1 to a single character match and a score of 0 to a single character mismatch, a single insertions or deletion. Using longest prefix matches bears the risk of ignoring differences in the best, i.e. optimally scoring, local alignment. Its retrieval fails if a longer match can be obtained at another position of the reference sequence by matching a character, that is inserted, deleted, or mismatched in the best local alignment. Depending on the length of the reference genome and its nucleotide composition the probability is determined by the length of the substring that can be matched to the position of the best local alignment before the first difference occurs. (A) The optimally scoring alignment of the read P: = cttcttcggc begins at position 3 of the reference genome S: = atacttcttcggcaga. Let *P_i* denote the *i^th* suffix of the read P. For each *P_i*, the starting positions of the longest match in S comprise the position of *P_i* in the best local alignment (solid green lines). That is, the longest match of P ₀ begins at position 3, the longest match of P ₁ begins at position 4, the longest match of P ₂ begins at position 5 and so forth. (B) For the read P: = cttcgtcggc, the retrieval of the best local alignment fails for all *P_i*, i<5 (dashed red line) due to the inclusion of a character that results in a mismatch in the optimally scoring local alignment. (C) The read P: = cttctgcggc contains, with respect to the best local alignment, a mismatch at position 5 of the read. Here the position 5 of the read is not included in the longest prefix match and nearly all substrings align correctly to the reference genome.

**Figure 2. Matching stems and matching branches.**
We give an explanation based on a suffix trie which is equivalent to the suffix interval tree shown in Fig. 5 (see Methods). The suffix trie for S$ with S: = acttcttcggc (left) holds twelve leaves. Each numbered leaf corresponds to exactly one suffix in S. Nodes with only one child are not explicitly shown. Note, that internal nodes implicitly represent all leafs in their respective subtree. Thus, internal nodes can be regarded as sets of suffixes. The right panel holds the longest matches for different matching paths in the trie. Matching the first three suffixes of the read P: = cgtcggc results in three different paths in the suffix trie. Each path is equivalent to a sequence of suffix intervals, a matching stem, in the enhanced suffix array. Let denote the matching stem for *P_i* = *i^th* suffix of P. The *q^th* interval in , denoted by , implicitly represents the set of suffixes in S matching P[i‥i+q−1]. The path for the first suffix P ₀ is of length two (green solid line). Hence, the equivalent matching stem is a sequence of three intervals: , and . Since only represents the suffix S ₇, the longest prefix match of P ₀ is of length 2 occurring at position 7 of the reference sequence (right panel). The matching stem for P ₁ (red solid line) ends with . Therefore, matches of length one occur at positions 8 and 9 in S. The longest prefix match for P ₃ occurs at position 6 of S (dashed orange line). Note, that the intervals of equivalently represent S ₆. An alternative path leads to a match with position 4. The branch denotes the alternative that accepts the mismatch of g and t at position 1 of P ₀.

formula image — **Figure 2. Matching stems and matching branches.**
We give an explanation based on a suffix trie which is equivalent to the suffix interval tree shown in Fig. 5 (see Methods). The suffix trie for S$ with S: = acttcttcggc (left) holds twelve leaves. Each numbered leaf corresponds to exactly one suffix in S. Nodes with only one child are not explicitly shown. Note, that internal nodes implicitly represent all leafs in their respective subtree. Thus, internal nodes can be regarded as sets of suffixes. The right panel holds the longest matches for different matching paths in the trie. Matching the first three suffixes of the read P: = cgtcggc results in three different paths in the suffix trie. Each path is equivalent to a sequence of suffix intervals, a matching stem, in the enhanced suffix array. Let denote the matching stem for *P_i* = *i^th* suffix of P. The *q^th* interval in , denoted by , implicitly represents the set of suffixes in S matching P[i‥i+q−1]. The path for the first suffix P ₀ is of length two (green solid line). Hence, the equivalent matching stem is a sequence of three intervals: , and . Since only represents the suffix S ₇, the longest prefix match of P ₀ is of length 2 occurring at position 7 of the reference sequence (right panel). The matching stem for P ₁ (red solid line) ends with . Therefore, matches of length one occur at positions 8 and 9 in S. The longest prefix match for P ₃ occurs at position 6 of S (dashed orange line). Note, that the intervals of equivalently represent S ₆. An alternative path leads to a match with position 4. The branch denotes the alternative that accepts the mismatch of g and t at position 1 of P ₀.

**Figure 3. Comparison of recall rates and running time for several short read aligners.**
Average running time for the different programs (A) in matching runs with 500 000 reads in two different data sets (logarithmic scale; S refers to segemehl). The differences are uniformly distributed and consist of only mismatches (B) or mismatches, insertions and deletions (C). The recall rate describes the fraction of reads which was mapped to the correct position. All programs were used with default parameters. Bowtie was called with option –all and SOAP with option –r 2.

**Figure 4. segemehl recall rates for varying difference values and distributions.**
Recall rates are depicted for k = 0 (dashed) and k = 1 (solid). For terminal–, 3′– and 5′– increased difference distributions, segemehl achieves a recall rate above 80% for reads with 4 errors.

**Figure 5. The enhanced suffix array yields a tree structure of nested suffix intervals.**
The enhanced suffix array for the sequence S: = attcttcggc (left) and its suffix interval tree (right), equivalent to the suffix trie in Fig. 2, is shown. The array suf represents the lexicographical order of the suffixes in S$. In other words, S _suf[0], S _suf[1], …, S _suf[n] is the sequence of suffixes of S$ in ascending lexicographic order. The lcp-table lcp is an array of integers such that for each h, 1≤h≤n, lcp[h] is the length of the longest common prefix of S _suf[h−1] and S _suf[h]. A suffix interval [l‥r, h] denotes an interval in the suffix array with lcp[i]≥h for all i, l+1≤i≤r, i.e. all suffixes in the interval [l+1‥r] have a longest common prefix of length at least h. Additionally, requiring l = 0 or lcp[l]<h makes the suffix interval left maximal and requiring r = n or lcp[r+1]<h makes it right maximal. The suffix interval [0‥10, 0] spans the whole suffix array and is equivalent to the root of a suffix interval tree. This interval contains five subintervals, one for each character in S$, with h = 1. Equivalently, the root node of the suffix interval tree has five children. Note, that two children, labeled by 0 and 11, are singletons. The child nodes of singletons are not explicitly shown here.

**Figure 6. The branch closure.**
The suffix interval [l‥r, h], representing some string of length h, is split into its children [l‥u, h+1], [u+1‥v, h+1] and [v+1‥r, h+1] by matching an additional character a∈{A, C, T}. We proceed building by matching the character C (solid bold green line). Beforehand, alternative suffix intervals are stored in , either representing mismatches (dashed red line), insertions (dashed dotted black line) or deletions (dotted blue line). holds suffix link intervals that in turn branch off from . The branch closure holds all such alternative intervals.

**Figure 7. Algorithm.**
Enumeration of exact and inexact seeds.

See this image and copyright information in PMC

References

1. Myers G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J ACM. 1999;46:395–415.
1. Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discr Algorithms. 2004;2:53–86.
1. Rothberg JM, Leamon JH. The development and impact of 454 sequencing. Nat Biotechnol. 2008;26:1117–1124. - PubMed
1. Bennett S. Solexa Ltd. Pharmacogenomics. 2004;5:433–438. - PubMed
1. Huse S, Huber J, Morrison H, Sogin M, Welch D. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology. 2007;8:R143. - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Fast mapping of short sequences with mismatches, insertions and deletions using index structures

Affiliation

Fast mapping of short sequences with mismatches, insertions and deletions using index structures

Authors

Affiliation

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources