. 2007 Jun 8:8:193.

doi: 10.1186/1471-2105-8-193.

Improved benchmarks for computational motif discovery

Geir Kjetil Sandve¹, Osman Abul, Vegard Walseng, Finn Drabløs

Affiliations

PMID: 17559676
PMCID: PMC1903367
DOI: 10.1186/1471-2105-8-193

Improved benchmarks for computational motif discovery

Geir Kjetil Sandve et al. BMC Bioinformatics. 2007.

. 2007 Jun 8:8:193.

doi: 10.1186/1471-2105-8-193.

Authors

Geir Kjetil Sandve¹, Osman Abul, Vegard Walseng, Finn Drabløs

Affiliation

¹ Department of Computer and Information Science, Norwegian University of Science and Technology (NTNU), Trondheim, Norway. sandve@ntnu.no

PMID: 17559676
PMCID: PMC1903367
DOI: 10.1186/1471-2105-8-193

Abstract

Background: An important step in annotation of sequenced genomes is the identification of transcription factor binding sites. More than a hundred different computational methods have been proposed, and it is difficult to make an informed choice. Therefore, robust assessment of motif discovery methods becomes important, both for validation of existing tools and for identification of promising directions for future research.

Results: We use a machine learning perspective to analyze collections of transcription factors with known binding sites. Algorithms are presented for finding position weight matrices (PWMs), IUPAC-type motifs and mismatch motifs with optimal discrimination of binding sites from remaining sequence. We show that for many data sets in a recently proposed benchmark suite for motif discovery, none of the common motif models can accurately discriminate the binding sites from remaining sequence. This may obscure the distinction between the potential performance of the motif discovery tool itself versus the intrinsic complexity of the problem we are trying to solve. Synthetic data sets may avoid this problem, but we show on some previously proposed benchmarks that there may be a strong bias towards a presupposed motif model. We also propose a new approach to benchmark data set construction. This approach is based on collections of binding site fragments that are ranked according to the optimal level of discrimination achieved with our algorithms. This allows us to select subsets with specific properties. We present one benchmark suite with data sets that allow good discrimination between positive and negative instances with the common motif models. These data sets are suitable for evaluating algorithms for motif discovery that rely on these models. We present another benchmark suite where PWM, IUPAC and mismatch motif models are not able to discriminate reliably between positive and negative instances. This suite could be used for evaluating more powerful motif models.

Conclusion: Our improved benchmark suites have been designed to differentiate between the performance of motif discovery algorithms and the power of motif models. We provide a web server where users can download our benchmark suites, submit predictions and visualize scores on the benchmarks.

PubMed Disclaimer

Figures

**Figure 1**
**Discrimination on data sets by Tompa et al**. Nucleotide-level CC-score for discrimination between binding sites and remaining sequence on data sets from Tompa *et al*. Data sets (x-axis) are sorted individually for each model in order of increasing nCC, making it easier to compare the overall distributions of discrimination scores.

**Figure 2**
**Discrimination on different data set versions by Tompa et al**. Nucleotide-level CC-score for discrimination between binding sites and remaining sequence for the three motif models on real, generic and Markov versions of data sets.

**Figure 3**
**Motif discovery scores from Tompa et al**. nCC-scores of 14 motif discovery methods given in the Tompa assesment, compared to prediction and discrimination scores with the three main motif models.

**Figure 4**
**Motif discovery versus discrimination**. Scatter plot of maximum motif discovery score versus discrimination score for the 50 data sets in the suite by Tompa *et al*.

**Figure 5**
**MEME scores after removals or erasures**. Total MEME score if the data sets with highest or lowest discrimination scores, respectively, had been incrementally removed from the Tompa benchmark, as well as total MEME score if predictions on the data sets with lowest discrimination scores had been incrementally erased.

**Figure 6**
**Discrimination on synthetic data sets**. Discrimination nCC-scores for motif models on three variants of synthetic data sets: variable mutations (4/15), fixed mutations (4_of_15) and fixed mutations with instances in 75% of the sequences (4_of_15 zoops). For each variant, the scores of each model are averaged over 10 randomly generated data sets.

**Figure 7**
**Sequences per data set**. Distribution of number of sequences per data set.

**Figure 8**
**Discrimination on all TRANSFAC-based data sets**. Nucleotide-level CC-score for discrimination between binding sites and remaining sequence on real and Markov version of TRANSFAC-based data sets. For each data set, the highest discrimination score achieved by any of the three motif models is selected. The distribution of scores are in sorted order for real and Markov versions independently.

**Figure 9**
**Discrimination on algorithm benchmark suite**. Nucleotide-level CC-score for discrimination between binding sites and remaining sequence. Results are given for our algorithm benchmark suite and the suite by Tompa *et al*., for both real and Markov versions. For each data set, the highest discrimination score achieved by any of the three motif models is selected. The distribution of scores are in sorted order for all versions independently.

**Figure 10**
**Discrimination score and number of binding sites**. Distribution of discrimination scores (nCC) and number of binding sites for each of the 25 data sets in the model benchmark suite.

**Figure 11**
**Discrimination score on model benchmark suite**. Distribution of discrimination scores (nCC) for the 25 data sets in the model benchmark suite. One curve shows the best score of the three common motif models on the data set, while the other curve shows the score possible with a more expressive model that still do not consider the context of binding sites.

**Figure 12**
**Generating positive and negative examples**. A set of upstream DNA sequences for a transcription factor where a) m binding locations are identified, b) generating positive and negative examples.

See this image and copyright information in PMC

References

1. Sandve GK, Drabløs F. A survey of motif discovery methods in an integrated framework. Biol Direct. 2006;1 - PMC - PubMed
1. Hughes JD, Estep PW, Tavazoie S, Church GM. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000;296:1205–14. - PubMed
1. Bailey TL, Elkan C. The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol. 1995;3:21–9. - PubMed
1. Marsan L, Sagot MF. RECOMB '00: Proceedings of the fourth annual international conference on Computational molecular biology. New York, NY, USA: ACM Press; 2000. Extracting structured motifs using a suffix tree-algorithms and application to promoter consensus identification; pp. 210–219. - PubMed
1. Blanchette M, Tompa M. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 2002;12:739–48. - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Improved benchmarks for computational motif discovery

Affiliation

Improved benchmarks for computational motif discovery

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

Substances

LinkOut - more resources

Full Text Sources