Direct vs 2-stage approaches to structured motif finding

Maria Federico¹, Mauro Leoncini, Manuela Montangero, Paolo Valente

Affiliations

Affiliation

¹ Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy. leoncini@unimore.it.

PMID: 22908910
PMCID: PMC3564690
DOI: 10.1186/1748-7188-7-20

Direct vs 2-stage approaches to structured motif finding

Maria Federico et al. Algorithms Mol Biol. 2012.

. 2012 Aug 21;7(1):20.

doi: 10.1186/1748-7188-7-20.

Authors

Maria Federico¹, Mauro Leoncini, Manuela Montangero, Paolo Valente

Affiliation

¹ Dipartimento di Scienze Fisiche, Informatiche e Matematiche, Università di Modena e Reggio Emilia, 41125 Modena, Via Campi 213/b, Italy. leoncini@unimore.it.

PMID: 22908910
PMCID: PMC3564690
DOI: 10.1186/1748-7188-7-20

Abstract

Background: The notion of DNA motif is a mathematical abstraction used to model regions of the DNA (known as Transcription Factor Binding Sites, or TFBSs) that are bound by a given Transcription Factor to regulate gene expression or repression. In turn, DNA structured motifs are a mathematical counterpart that models sets of TFBSs that work in concert in the gene regulations processes of higher eukaryotic organisms. Typically, a structured motif is composed of an ordered set of isolated (or simple) motifs, separated by a variable, but somewhat constrained number of "irrelevant" base-pairs. Discovering structured motifs in a set of DNA sequences is a computationally hard problem that has been addressed by a number of authors using either a direct approach, or via the preliminary identification and successive combination of simple motifs.

Results: We describe a computational tool, named SISMA, for the de-novo discovery of structured motifs in a set of DNA sequences. SISMA is an exact, enumerative algorithm, meaning that it finds all the motifs conforming to the specifications. It does so in two stages: first it discovers all the possible component simple motifs, then combines them in a way that respects the given constraints. We developed SISMA mainly with the aim of understanding the potential benefits of such a 2-stage approach w.r.t. direct methods. In fact, no 2-stage software was available for the general problem of structured motif discovery, but only a few tools that solved restricted versions of the problem. We evaluated SISMA against other published tools on a comprehensive benchmark made of both synthetic and real biological datasets. In a significant number of cases, SISMA outperformed the competitors, exhibiting a good performance also in most of the cases in which it was inferior.

Conclusions: A reflection on the results obtained lead us to conclude that a 2-stage approach can be implemented with many advantages over direct approaches. Some of these have to do with greater modularity, ease of parallelization, and the possibility to perform adaptive searches of structured motifs. As another consideration, we noted that most hard instances for SISMA were easy to detect in advance. In these cases one may initially opt for a direct method; or, as a viable alternative in most laboratories, one could run both direct and 2-stage tools in parallel, halting the computations when the first halts.

PubMed Disclaimer

Figures

**Figure 1**
**Simple and structured motifs.** Part (a): some DNA sequences with instances of the two simple motifs GATAAG (one base substitution tolerated) and TATAAAA (up two base substitutions tolerated), highlighted in blue and red, respectively. Part (b): three instances of a structured motif (6,1)-[2,4]-(6,1)-[3,5] -(7,2).

**Figure 2**
**Risotto vs SISMA_Smile on synthetic datasets (a) Number of iterations in which SISMA_Smile outperforms Risotto or vice versa, and (b) number of tools failures.** Notice that when one tool fails, the other might end computation, hence, failures might not sum up to the total number of runs.

**Figure 3**
**Risotto’s and SISMA_Smile’s running times on synthetic datasets.** Worst and average run times and standard deviations (in seconds) for SISMA_Smile and Risotto. Average runtime and standard deviations have been computed omitting best and worst runs. (a) (b) (c) Running times of all runs are considered. (a’) (b’) (c’) Only running times of runs in which both tools end computation are considered. Notice that the scale on Y-axes is not the same for all charts.

**Figure 4**
**Examples: SISMA_Smile vs Risotto.** Running times of Risotto, SISMA_Smile’s stages 1 and 2, and of all SISMA_Smile’s list intersection step (during stage 2). A 0s time for list intersection means that the corresponding step took time smaller than timer resolution. The box index selection order during stage 2 is shown. In example (a) SISMA_Smile outperforms Risotto. Observe that SMILE is called once on the (10,2) pair, so that the time reported for the 6^th box is 0. In this example Risotto is 473 times slower than SISMA. In example (b) Risotto outperforms SISMA_Smile because the stage 1 performed by SMILE is slow due to the presence of a box for which a large number of simple motifs is found. In this case SISMA_Smile is 28 times slower than Risotto. In particular, the most time consuming task is the extraction of the (15,4) box (about 91.7%of total execution time), for which 21631 simple motifs are found.

**Figure 5**
**Running times on biological datasets.** Running times (in seconds) of (a) SISMA_Smile and Risotto (b) SISMA_Speller and ExMotif on biological datasets, when using an uniprocessor machine with 1GB of RAM.

See this image and copyright information in PMC

References

1. Watson JD, Baker TA, Bell SP, Gann A, Levine M, Losick R. Molecular Biology of the Gene. 6/e: Pearson International Edition; 2007.
1. Werner T. Models for prediction and recognition of eukaryotic promoters. Mammalian Genome. 1999;10:168–175. doi: 10.1007/s003359900963. - DOI - PubMed
1. Sinha S, Tompa M. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2002;30:5549–5560. doi: 10.1093/nar/gkf669. - DOI - PMC - PubMed
1. Lemon B, Tjian R. Orchestrated response: a symphony of transcription factors for gene control. Genes & Dev. 2000;14:2551–2569. doi: 10.1101/gad.831000. - DOI - PubMed
1. Wray GA. The evolutionary significance of cis-regulatory mutations. Nature Rev Genet. 2007;8:206–216. - PubMed

LinkOut - more resources

Full Text Sources
Research Materials
- NCI CPTC Antibody Characterization Program

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Direct vs 2-stage approaches to structured motif finding

Affiliation

Direct vs 2-stage approaches to structured motif finding

Authors

Affiliation

Abstract

Figures

References

LinkOut - more resources

Full Text Sources

Research Materials