Nucleic Acids Res. 2005 Sep 2;33(15):4899-913.

doi: 10.1093/nar/gki791. Print 2005.

Limitations and potentials of current motif discovery algorithms

Jianjun Hu¹, Bin Li, Daisuke Kihara

Affiliation

¹ Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN 47907, USA.

PMID: 16284194
PMCID: PMC1199555
DOI: 10.1093/nar/gki791

Abstract

Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved to be useful for deciphering genetic regulatory networks. However, despite the availability of a large number of algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. Factors that affect the prediction accuracy, scalability and reliability are characterized. It is revealed that the nucleotide and the binding site level accuracy are very low, while the motif level accuracy is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, a consensus ensemble algorithm has been developed, which achieved 6-45% improvement over the base algorithms by increasing both the sensitivity and specificity. Our study illustrates limitations and potentials of existing sequence-based motif discovery algorithms. Taking advantage of the revealed potentials, several promising directions for further improvements are discussed. Since the sequence-based algorithms are the baseline of most of the modern motif discovery algorithms, this paper suggests substantial improvements would be possible for them.

Figures

**Figure 1**
Two types of generated input sequences. Target binding site position information comes from ecoli.regulonDB, gene information from ecoli.genes, and genome information from ecoli.genome.

**Figure 2**
Statistics of the ECRDB62A dataset. (a) Distribution of the number of sequences for a binding site group; (b) distribution of the number of sites per sequence.

**Figure 3**
A simple consensus ensemble algorithm. Top predictions from multiple runs are aligned together to determine the boundary of the prospective motif based on over-representation. Then, a squeezing/expansion procedure will be applied to extract a motif prediction of a specified motif width starting from the center of the boundary region.
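The idea in this caption can be illustrated with a minimal sketch for a single input sequence. It is not the authors' implementation: the function name `consensus_site`, the vote threshold `min_votes`, the choice of the highest-voted region, and the reduction of the squeezing/expansion step to a fixed-width window around the region center are all assumptions made for illustration. Predictions are assumed to be half-open `(start, end)` intervals, and `width` is assumed not to exceed `seq_len`.

```python
def consensus_site(predictions, seq_len, width, min_votes=2):
    """Combine site predictions from several runs on one sequence.

    predictions: list of half-open (start, end) intervals (hypothetical format).
    Returns a (start, end) window of the requested width, or None.
    """
    # Count how many predictions cover each position.
    votes = [0] * seq_len
    for start, end in predictions:
        for pos in range(max(start, 0), min(end, seq_len)):
            votes[pos] += 1

    # Collect contiguous regions supported by at least min_votes predictions.
    regions = []
    region_start = None
    for pos, v in enumerate(votes + [0]):  # trailing 0 closes an open region
        if v >= min_votes and region_start is None:
            region_start = pos
        elif v < min_votes and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if not regions:
        return None

    # Assumption: keep the region with the largest total vote count.
    best = max(regions, key=lambda r: sum(votes[r[0]:r[1]]))

    # Squeeze or expand to the requested width around the region center.
    center = (best[0] + best[1]) // 2
    start = max(0, min(center - width // 2, seq_len - width))
    return (start, start + width)
```

For example, three overlapping predictions and one outlier yield a single window centered on the overlap, while predictions that never overlap yield no consensus.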

**Figure 4**
Measures of prediction accuracy at the nucleotide and motif levels. Accuracy scores over an input sequence set are the average accuracy scores over all its sequences. The overall accuracy scores of a motif discovery algorithm are the average accuracy scores over all M input sequence sets.
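The two-stage averaging described in this caption can be sketched as follows. The caption does not give the formulas, so the definitions below are assumptions based on definitions commonly used in motif discovery benchmarks: sensitivity as TP/(TP+FN), specificity as TP/(TP+FP), and the performance coefficient as TP/(TP+FN+FP). The function names and the representation of sites as sets of nucleotide positions are also hypothetical.

```python
from statistics import mean

def nucleotide_scores(known, predicted):
    """Scores for one sequence; known and predicted are sets of positions."""
    tp = len(known & predicted)   # correctly predicted site positions
    fn = len(known - predicted)   # site positions that were missed
    fp = len(predicted - known)   # predicted positions outside any site
    return {
        "nSn": tp / (tp + fn) if tp + fn else 0.0,            # assumed definition
        "nSp": tp / (tp + fp) if tp + fp else 0.0,            # assumed definition
        "nPC": tp / (tp + fn + fp) if tp + fn + fp else 0.0,  # assumed definition
    }

def average_scores(score_dicts):
    """Average each measure over a non-empty list of score dictionaries."""
    return {k: mean(d[k] for d in score_dicts) for k in score_dicts[0]}

def overall_scores(sequence_sets):
    """sequence_sets: list of M sets, each a list of (known, predicted) pairs.

    Scores are averaged over the sequences of each set, then over the M sets.
    """
    per_set = [
        average_scores([nucleotide_scores(k, p) for k, p in seq_set])
        for seq_set in sequence_sets
    ]
    return average_scores(per_set)
```

Averaging per set before averaging over sets means that each input sequence set contributes equally to the overall score, regardless of how many sequences it contains.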

**Figure 5**
An example of binding site misalignment in a motif in RegulonDB. The shaded columns are those with at least 80% dominance of a certain nucleotide. (a) Original binding sites of motif TrpR; (b) the shifted binding site with maximum shift of four positions to maximize the number of consensus positions.
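A rough sketch of the two ideas in this caption, counting columns with at least 80% dominance and shifting sites to increase that count, is given below. The search procedure used to produce the figure is not described here, so the greedy one-site-at-a-time search is an assumption, as are the function names, the use of `-` padding to represent a shift, and the decision to count gaps against dominance. All sites are assumed to have the same length.

```python
from collections import Counter

def dominant_columns(rows, threshold=0.8):
    """Indices of columns where one nucleotide fills >= threshold of all rows."""
    n = len(rows)
    cols = []
    for i, column in enumerate(zip(*rows)):
        counts = Counter(c for c in column if c != "-")  # gaps never dominate
        if counts and counts.most_common(1)[0][1] / n >= threshold:
            cols.append(i)
    return cols

def place(site, shift, max_shift):
    """Pad a site with gaps so it sits `shift` positions right of the center."""
    return "-" * (max_shift + shift) + site + "-" * (max_shift - shift)

def greedy_shift(sites, max_shift=4, threshold=0.8):
    """Greedily shift one site at a time to raise the dominant-column count."""
    shifts = [0] * len(sites)

    def score(candidate):
        rows = [place(s, k, max_shift) for s, k in zip(sites, candidate)]
        return len(dominant_columns(rows, threshold))

    best = score(shifts)
    improved = True
    while improved:
        improved = False
        for i in range(len(sites)):
            for k in range(-max_shift, max_shift + 1):
                trial = shifts[:i] + [k] + shifts[i + 1:]
                trial_score = score(trial)
                if trial_score > best:  # accept strict improvements only
                    best, shifts, improved = trial_score, trial, True
    return shifts, best
```

A greedy search of this kind can stop at a local optimum; an exhaustive search over all shift combinations would be exact but grows exponentially with the number of sites.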

**Figure 6**
Scalability in terms of Performance coefficient (PC) with respect to the input sequence length (margin size). (a) nPC at nucleotide level; (b) sPC at binding site level.

**Figure 7**
Motif level success rate (mSr) with respect to the sequence length (margin size).

**Figure 8**
The nucleotide level prediction accuracy in terms of sensitivity (nSn) and specificity (nSp) with respect to the sequence lengths (margin sizes). (a) nSn at nucleotide level; (b) nSp at nucleotide level.

**Figure 9**
Comparison of prediction performance in terms of the number of input sequences in a dataset. The margin size is 200. (a) Nucleotide site level accuracy (nPC); (b) Binding site level accuracy (sPC).

**Figure 10**
Correlation between motif significance scores and performance coefficient scores of MDScan.

**Figure 11**
Difference of the information content between two sequence sets. Aligned sequences S_clw and the expanded sequences S_ext; realigned motifs S_align and the original motifs S_motif.
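As an illustration of the quantity compared in this figure, the sketch below computes the information content of a set of aligned, equal-length sequences and the difference between two such sets. The caption does not specify the formula, so this assumes the common per-column definition with a uniform background, log2(4) plus the sum of p·log2(p) over the four nucleotides, with no small-sample correction. The function names are hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet="ACGT"):
    """Information content (bits) of one column, assuming uniform background."""
    counts = Counter(column)
    total = sum(counts[a] for a in alphabet)  # ignore gaps and other symbols
    if total == 0:
        return 0.0
    ic = math.log2(len(alphabet))
    for a in alphabet:
        p = counts[a] / total
        if p > 0:
            ic += p * math.log2(p)
    return ic

def information_content(sites):
    """Sum of per-column information over aligned, equal-length sequences."""
    return sum(column_information(col) for col in zip(*sites))

def information_difference(set_a, set_b):
    """Information content of set_a minus that of set_b."""
    return information_content(set_a) - information_content(set_b)
```

Under this definition a fully conserved column contributes 2 bits and a column with equal nucleotide frequencies contributes 0 bits.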
