Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Jan 30:4:3.
doi: 10.1186/1748-7188-4-3.

Lossless filter for multiple repeats with bounded edit distance

Affiliations

Lossless filter for multiple repeats with bounded edit distance

Pierre Peterlongo et al. Algorithms Mol Biol. .

Abstract

Background: Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part in the analysis of biological sequences and of their phylogenetic relationship. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions, is however known to be a computationally expensive task, and consequently exact methods can usually not be applied in practice.

Results: The filter TUIUIU that we introduce in this paper provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment or repeats inference method, eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate repeat. It consists in the verification of several strong necessary conditions that can be checked in a fast way. We implemented three versions of the filter. The first is simply a straightforward extension to the case of multiple sequences of an application of conditions already existing in the literature. The second uses a stronger condition which, as our results show, enable to filter sensibly more with negligible (if any) additional time. The third version uses an additional condition and pushes the sensibility of the filter even further with a non negligible additional time in many circumstances; our experiments show that it is particularly useful with large error rates. The latter version was applied as a preprocessing of a multiple alignment tool, obtaining an overall time (filter plus alignment) on average 63 and at best 530 times smaller than before (direct alignment), with in most cases a better quality alignment.

Conclusion: To the best of our knowledge, TUIUIU is the first filter designed for multiple repeats and for dealing with error rates greater than 10% of the repeats length.

PubMed Disclaimer

Figures

Figure 1
Figure 1
A (L, d, 2)-repeat and a parallelogram. An example a (L, d, 2)-repeat with L = 11, d = 2. Diagonals 30, 31, and 32 are shown. Among them, 30 and 32 have distance 2, while 30 and 31 (as well as 31 and 32) are consecutive. Assuming that q = 2, a q-hit is represented by a thicker diagonal of length 2 plus a small black circle representing its pair of coordinates. The q-hit (19, 49) refers to the q-gram TA, and 19 (resp. 49) is its first (resp. second) projection. The q-hit (17, 49) refers to the same q-gram TA but has a different first projection. The words inside the grey boxes are two distinct fragments of the same sequence s, namely s[10, 20] and s [42, 52]; they have length 11, and their edit distance is 2. We obtain one word from the other by deleting s[13] – hence no q-hits in positions 12, 13 – and by inserting s [48] – no q-hits in positions 47, 48. We have p = 6 and the set S of 7 q-hits in diagonals 31 and 32 satisfies the properties (1), (2), (3), (4) and (5). If we add the q-hit in diagonal 30 in order to obtain a new set S, properties (1), (2) still hold, but properties (3), (4) and (5) are no longer satisfied.
Figure 2
Figure 2
Detection of (L, d, r)-repeats and two overlapping parallelograms. Two parallelograms that overlap. The dark grey parallelogram in the figure detects q-hits between w = s[a, a + L - 1] and any word wk = s[i, j] with i ∈ [c + a, c + d + a] and j ∈ [c + a + L - 1, c + d + a + L - 1]. The word z = s[c + a, c + d + a + L - 1] of length L + d contains the word wk which in turn contains the word x = s[c + d + a, c + a + L - 1] of length L - d. Analogously for the light gray parallelogram, the word z' = s[c' + a, c' + d + a + L - 1] of length L + d contains a word wk which in turn contains the word x' = s[c' + d + a, c' + a + L - 1] of length L - d. The words wk and wk are not shown because their length is variable. They necessarily overlap because they both contain the word p = s[c' + d + a, c + a + L - 1], that is the overlap between x and x'.
Figure 3
Figure 3
Enlarging parallelograms from d + 1 to d + b diagonals. Four parallelograms with b + d diagonals, starting in diagonals kb, for any integer k. Example with d = 3, b = 4. Notice that two consecutive enlarged parallelograms of this form share d common diagonals and that any parallelogram with d + 1 diagonals is contained in one such enlarged parallelogram with b + d diagonals.
Figure 4
Figure 4
Application to multiple sequences. Application of TUIUIU* to multiple sequences. For a sliding window w on a sequence si, parallelograms are tested on all other sequences. In this example, we assumed we found three fine/good/excellent parallelograms among four sequences.
Figure 5
Figure 5
Tests on random generated sequences. Application of the three versions of TUIUIU with parameters (L, d, r) = (1000, 100, 5) on five random sequences of a total size of 1.5 Mb, each containing approximate occurrences of a planted repeat of length 1 kb. We planted repeats whose occurrences have a pairwise maximum distance that ranges from 0 to 300 (each test has the same value for all pairs). Each test was performed 20 times: the average result is reported.
Figure 6
Figure 6
Influence of q-gram size q over selectiveness and running time. Influence of q-gram size q over selectiveness and running time for Human Chromosome 22 with parameters (L, d, r) = (260,13,280). Variant EXCELLENT gets slower and less selective if we increase q from q = 12 on.
Figure 7
Figure 7
Influence of number of repeats r over selectiveness and running time. Influence of number of repeats r over selectiveness and running time. Human Chromosome 22 with parameters (L, d, r) = (260, 13, r) with q = 14. The selectiveness of 0 obtained for methods GOOD and EXCELLENT are not drawn on the log scale. As r grows, the selectiveness decreases because more frequent repeats are rarer than the less frequent ones. Also, variant FINE gets less selective as r increases. Moreover, this illustrates a case in which EXCELLENT is time consuming while not bringing an improvement on the selectiveness with respect to GOOD as we could see also in Figure 5 and Table 2.
Figure 8
Figure 8
Influence of q-gram size q over selectiveness and running time. Influence of q-gram size q over selectiveness and running time for CFTR dataset with parameters (L, d, r) = (100, 12, 5). This test shows that EXCELLENT is essential when using a small q, which enables to filter for a high error rate such as 12%. For instance, with q = 3, EXCELLENT reduces the selectiveness of 100% observed for both FINE and GOOD to 0.01%.
Figure 9
Figure 9
Influence of maximal error d over selectiveness and running time. Influence of maximal error d over selectiveness and running time for CFTR dataset with parameters (L, d, r) = (100, d, 5) with q = 6. A selectiveness of 0 is not drawn because of the log scale.

References

    1. Lowe CB, Bejerano G, Haussler D. Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. National Academy of Sciences. 2007;104:8005–8010. - PMC - PubMed
    1. Ugarkovic D. Functional elements residing within satellite DNA. EMBO reports. 2005;6:1035–1039. - PMC - PubMed
    1. Rigoutsos I, Huynh T, Miranda K, Tsirigos A, McHardy A, Platt D. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. National Academy of Science. 2006;103:6605–6610. - PMC - PubMed
    1. Burkhardt S, Crauser A, Ferragina P, Lenhof HP, Vingron M. Proceedings of the third annual international conference on Computational molecular biology (Recomb 99) ACM Press; 1999. q-Gram Based Database Searching Using a Suffix Array (QUASAR)
    1. Pevzner P, Waterman M. Multiple Filtration and Approximate Pattern Matching. Algorithmica. 1995;13:135–154.