Lossless filter for multiple repeats with bounded edit distance

Pierre Peterlongo¹, Gustavo Akio Tominaga Sacomoto, Alair Pereira do Lago, Nadia Pisanti, Marie-France Sagot

Affiliations

PMID: 19183438
PMCID: PMC2661881
DOI: 10.1186/1748-7188-4-3

Lossless filter for multiple repeats with bounded edit distance

Pierre Peterlongo et al. Algorithms Mol Biol. 2009.

. 2009 Jan 30:4:3.

doi: 10.1186/1748-7188-4-3.

Authors

Pierre Peterlongo¹, Gustavo Akio Tominaga Sacomoto, Alair Pereira do Lago, Nadia Pisanti, Marie-France Sagot

Affiliation

¹ Equipe-projet Symbiose, IRISA/CNRS, Campus de Beaulieu, Rennes, France. pierre.peterlongo@irisa.fr

PMID: 19183438
PMCID: PMC2661881
DOI: 10.1186/1748-7188-4-3

Abstract

Background: Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part in the analysis of biological sequences and of their phylogenetic relationship. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions, is however known to be a computationally expensive task, and consequently exact methods can usually not be applied in practice.

Results: The filter TUIUIU that we introduce in this paper provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment or repeats inference method, eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate repeat. It consists in the verification of several strong necessary conditions that can be checked in a fast way. We implemented three versions of the filter. The first is simply a straightforward extension to the case of multiple sequences of an application of conditions already existing in the literature. The second uses a stronger condition which, as our results show, enable to filter sensibly more with negligible (if any) additional time. The third version uses an additional condition and pushes the sensibility of the filter even further with a non negligible additional time in many circumstances; our experiments show that it is particularly useful with large error rates. The latter version was applied as a preprocessing of a multiple alignment tool, obtaining an overall time (filter plus alignment) on average 63 and at best 530 times smaller than before (direct alignment), with in most cases a better quality alignment.

Conclusion: To the best of our knowledge, TUIUIU is the first filter designed for multiple repeats and for dealing with error rates greater than 10% of the repeats length.

PubMed Disclaimer

Figures

**Figure 1**
**A (L, d, 2)-repeat and a parallelogram**. An example a (L, d, 2)-repeat with L = 11, d = 2. Diagonals 30, 31, and 32 are shown. Among them, 30 and 32 have distance 2, while 30 and 31 (as well as 31 and 32) are consecutive. Assuming that q = 2, a q-hit is represented by a thicker diagonal of length 2 plus a small black circle representing its pair of coordinates. The q-hit (19, 49) refers to the q-gram TA, and 19 (resp. 49) is its first (resp. second) projection. The q-hit (17, 49) refers to the same q-gram TA but has a different first projection. The words inside the grey boxes are two distinct fragments of the same sequence s, namely s[10, 20] and s [42, 52]; they have length 11, and their edit distance is 2. We obtain one word from the other by deleting s[13] – hence no q-hits in positions 12, 13 – and by inserting s [48] – no q-hits in positions 47, 48. We have p = 6 and the set $S$ of 7 q-hits in diagonals 31 and 32 satisfies the properties (1), (2), (3), (4) and (5). If we add the q-hit in diagonal 30 in order to obtain a new set $S^{'}$ , properties (1), (2) still hold, but properties (3), (4) and (5) are no longer satisfied.

**Figure 2**
**Detection of (L, d, r)-repeats and two overlapping parallelograms**. Two parallelograms that overlap. The dark grey parallelogram in the figure detects q-hits between w = s[a, a + L - 1] and any word w_k= s[i, j] with i ∈ [c + a, c + d + a] and j ∈ [c + a + L - 1, c + d + a + L - 1]. The word z = s[c + a, c + d + a + L - 1] of length L + d contains the word w_kwhich in turn contains the word x = s[c + d + a, c + a + L - 1] of length L - d. Analogously for the light gray parallelogram, the word z' = s[c' + a, c' + d + a + L - 1] of length L + d contains a word ${w^{'}}_{k}$ which in turn contains the word x' = s[c' + d + a, c' + a + L - 1] of length L - d. The words w_kand ${w^{'}}_{k}$ are not shown because their length is variable. They necessarily overlap because they both contain the word p = s[c' + d + a, c + a + L - 1], that is the overlap between x and x'.

**Figure 3**
**Enlarging parallelograms from d + 1 to d + b diagonals**. Four parallelograms with b + d diagonals, starting in diagonals kb, for any integer k. Example with d = 3, b = 4. Notice that two consecutive enlarged parallelograms of this form share d common diagonals and that any parallelogram with d + 1 diagonals is contained in one such enlarged parallelogram with b + d diagonals.

**Figure 4**
**Application to multiple sequences**. Application of TUIUIU* to multiple sequences. For a sliding window w on a sequence s_i, parallelograms are tested on all other sequences. In this example, we assumed we found three fine/good/excellent parallelograms among four sequences.

**Figure 5**
**Tests on random generated sequences**. Application of the three versions of TUIUIU with parameters (L, d, r) = (1000, 100, 5) on five random sequences of a total size of 1.5 Mb, each containing approximate occurrences of a planted repeat of length 1 kb. We planted repeats whose occurrences have a pairwise maximum distance that ranges from 0 to 300 (each test has the same value for all pairs). Each test was performed 20 times: the average result is reported.

**Figure 6**
**Influence of q-gram size q over selectiveness and running time**. Influence of q-gram size q over selectiveness and running time for Human Chromosome 22 with parameters (L, d, r) = (260,13,280). Variant EXCELLENT gets slower and less selective if we increase q from q = 12 on.

**Figure 7**
**Influence of number of repeats r over selectiveness and running time**. Influence of number of repeats r over selectiveness and running time. Human Chromosome 22 with parameters (L, d, r) = (260, 13, r) with q = 14. The selectiveness of 0 obtained for methods GOOD and EXCELLENT are not drawn on the log scale. As r grows, the selectiveness decreases because more frequent repeats are rarer than the less frequent ones. Also, variant FINE gets less selective as r increases. Moreover, this illustrates a case in which EXCELLENT is time consuming while not bringing an improvement on the selectiveness with respect to GOOD as we could see also in Figure 5 and Table 2.

**Figure 8**
**Influence of q-gram size q over selectiveness and running time**. Influence of q-gram size q over selectiveness and running time for CFTR dataset with parameters (L, d, r) = (100, 12, 5). This test shows that EXCELLENT is essential when using a small q, which enables to filter for a high error rate such as 12%. For instance, with q = 3, EXCELLENT reduces the selectiveness of 100% observed for both FINE and GOOD to 0.01%.

**Figure 9**
**Influence of maximal error d over selectiveness and running time**. Influence of maximal error d over selectiveness and running time for CFTR dataset with parameters (L, d, r) = (100, d, 5) with q = 6. A selectiveness of 0 is not drawn because of the log scale.

See this image and copyright information in PMC

References

1. Lowe CB, Bejerano G, Haussler D. Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. National Academy of Sciences. 2007;104:8005–8010. - PMC - PubMed
1. Ugarkovic D. Functional elements residing within satellite DNA. EMBO reports. 2005;6:1035–1039. - PMC - PubMed
1. Rigoutsos I, Huynh T, Miranda K, Tsirigos A, McHardy A, Platt D. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. National Academy of Science. 2006;103:6605–6610. - PMC - PubMed
1. Burkhardt S, Crauser A, Ferragina P, Lenhof HP, Vingron M. Proceedings of the third annual international conference on Computational molecular biology (Recomb 99) ACM Press; 1999. q-Gram Based Database Searching Using a Suffix Array (QUASAR)
1. Pevzner P, Waterman M. Multiple Filtration and Approximate Pattern Matching. Algorithmica. 1995;13:135–154.

LinkOut - more resources

Full Text Sources
Molecular Biology Databases
- NIAID Data Ecosystem - Find datasets on Infectious and Immune-mediated Diseases

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Lossless filter for multiple repeats with bounded edit distance

Affiliation

Lossless filter for multiple repeats with bounded edit distance

Authors

Affiliation

Abstract

Figures

References

LinkOut - more resources

Full Text Sources

Molecular Biology Databases