Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2007 Jun 8:8:193.
doi: 10.1186/1471-2105-8-193.

Improved benchmarks for computational motif discovery

Affiliations

Improved benchmarks for computational motif discovery

Geir Kjetil Sandve et al. BMC Bioinformatics. .

Abstract

Background: An important step in annotation of sequenced genomes is the identification of transcription factor binding sites. More than a hundred different computational methods have been proposed, and it is difficult to make an informed choice. Therefore, robust assessment of motif discovery methods becomes important, both for validation of existing tools and for identification of promising directions for future research.

Results: We use a machine learning perspective to analyze collections of transcription factors with known binding sites. Algorithms are presented for finding position weight matrices (PWMs), IUPAC-type motifs and mismatch motifs with optimal discrimination of binding sites from remaining sequence. We show that for many data sets in a recently proposed benchmark suite for motif discovery, none of the common motif models can accurately discriminate the binding sites from remaining sequence. This may obscure the distinction between the potential performance of the motif discovery tool itself versus the intrinsic complexity of the problem we are trying to solve. Synthetic data sets may avoid this problem, but we show on some previously proposed benchmarks that there may be a strong bias towards a presupposed motif model. We also propose a new approach to benchmark data set construction. This approach is based on collections of binding site fragments that are ranked according to the optimal level of discrimination achieved with our algorithms. This allows us to select subsets with specific properties. We present one benchmark suite with data sets that allow good discrimination between positive and negative instances with the common motif models. These data sets are suitable for evaluating algorithms for motif discovery that rely on these models. We present another benchmark suite where PWM, IUPAC and mismatch motif models are not able to discriminate reliably between positive and negative instances. This suite could be used for evaluating more powerful motif models.

Conclusion: Our improved benchmark suites have been designed to differentiate between the performance of motif discovery algorithms and the power of motif models. We provide a web server where users can download our benchmark suites, submit predictions and visualize scores on the benchmarks.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Discrimination on data sets by Tompa et al. Nucleotide-level CC-score for discrimination between binding sites and remaining sequence on data sets from Tompa et al. Data sets (x-axis) are sorted individually for each model in order of increasing nCC, making it easier to compare the overall distributions of discrimination scores.
Figure 2
Figure 2
Discrimination on different data set versions by Tompa et al. Nucleotide-level CC-score for discrimination between binding sites and remaining sequence for the three motif models on real, generic and Markov versions of data sets.
Figure 3
Figure 3
Motif discovery scores from Tompa et al. nCC-scores of 14 motif discovery methods given in the Tompa assesment, compared to prediction and discrimination scores with the three main motif models.
Figure 4
Figure 4
Motif discovery versus discrimination. Scatter plot of maximum motif discovery score versus discrimination score for the 50 data sets in the suite by Tompa et al.
Figure 5
Figure 5
MEME scores after removals or erasures. Total MEME score if the data sets with highest or lowest discrimination scores, respectively, had been incrementally removed from the Tompa benchmark, as well as total MEME score if predictions on the data sets with lowest discrimination scores had been incrementally erased.
Figure 6
Figure 6
Discrimination on synthetic data sets. Discrimination nCC-scores for motif models on three variants of synthetic data sets: variable mutations (4/15), fixed mutations (4_of_15) and fixed mutations with instances in 75% of the sequences (4_of_15 zoops). For each variant, the scores of each model are averaged over 10 randomly generated data sets.
Figure 7
Figure 7
Sequences per data set. Distribution of number of sequences per data set.
Figure 8
Figure 8
Discrimination on all TRANSFAC-based data sets. Nucleotide-level CC-score for discrimination between binding sites and remaining sequence on real and Markov version of TRANSFAC-based data sets. For each data set, the highest discrimination score achieved by any of the three motif models is selected. The distribution of scores are in sorted order for real and Markov versions independently.
Figure 9
Figure 9
Discrimination on algorithm benchmark suite. Nucleotide-level CC-score for discrimination between binding sites and remaining sequence. Results are given for our algorithm benchmark suite and the suite by Tompa et al., for both real and Markov versions. For each data set, the highest discrimination score achieved by any of the three motif models is selected. The distribution of scores are in sorted order for all versions independently.
Figure 10
Figure 10
Discrimination score and number of binding sites. Distribution of discrimination scores (nCC) and number of binding sites for each of the 25 data sets in the model benchmark suite.
Figure 11
Figure 11
Discrimination score on model benchmark suite. Distribution of discrimination scores (nCC) for the 25 data sets in the model benchmark suite. One curve shows the best score of the three common motif models on the data set, while the other curve shows the score possible with a more expressive model that still do not consider the context of binding sites.
Figure 12
Figure 12
Generating positive and negative examples. A set of upstream DNA sequences for a transcription factor where a) m binding locations are identified, b) generating positive and negative examples.

References

    1. Sandve GK, Drabløs F. A survey of motif discovery methods in an integrated framework. Biol Direct. 2006;1 - PMC - PubMed
    1. Hughes JD, Estep PW, Tavazoie S, Church GM. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000;296:1205–14. - PubMed
    1. Bailey TL, Elkan C. The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol. 1995;3:21–9. - PubMed
    1. Marsan L, Sagot MF. RECOMB '00: Proceedings of the fourth annual international conference on Computational molecular biology. New York, NY, USA: ACM Press; 2000. Extracting structured motifs using a suffix tree-algorithms and application to promoter consensus identification; pp. 210–219. - PubMed
    1. Blanchette M, Tompa M. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 2002;12:739–48. - PMC - PubMed

Publication types

Substances

LinkOut - more resources