Nucleic Acids Res. 2005 Sep 2;33(15):4899-913.
doi: 10.1093/nar/gki791. Print 2005.

Limitations and potentials of current motif discovery algorithms

Jianjun Hu et al. Nucleic Acids Res.

Abstract

Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved useful for deciphering genetic regulatory networks. However, despite the large number of available algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. We characterize the factors that affect prediction accuracy, scalability and reliability. Accuracy at the nucleotide and binding site levels is very low, while motif level accuracy is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, we developed a consensus ensemble algorithm, which achieved a 6-45% improvement over the base algorithms by increasing both sensitivity and specificity. Our study illustrates the limitations and potentials of existing sequence-based motif discovery algorithms. Building on the revealed potentials, we discuss several promising directions for further improvement. Because sequence-based algorithms are the baseline of most modern motif discovery algorithms, this paper suggests that substantial improvements are possible for them.

Figures

Figure 1
Two types of generated input sequences. Target binding site position information comes from ecoli.regulonDB, gene information from ecoli.genes, and genome information from ecoli.genome.
Figure 2
Statistics of the ECRDB62A dataset. (a) Distribution of the number of sequences for a binding site group; (b) distribution of the number of sites per sequence.
Figure 3
A simple consensus ensemble algorithm. Top predictions from multiple runs are aligned together to determine the boundary of the prospective motif based on over-representation. Then, a squeezing/expansion procedure will be applied to extract a motif prediction of a specified motif width starting from the center of the boundary region.
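
The caption's two steps (voting to find a boundary region, then squeezing or expanding to a fixed width) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `consensus_site`, the majority-vote threshold `min_votes`, and the choice of the contiguous region with the most votes are all assumptions made for illustration. Predictions are half-open `(start, end)` intervals on a single sequence.

```python
import math
from collections import Counter


def consensus_site(predictions, width, seq_len, min_votes=None):
    """Combine site predictions from several runs into one window of `width`.

    Hypothetical sketch: `predictions` holds half-open (start, end) intervals,
    one per run, all on the same sequence of length `seq_len`.
    """
    if width > seq_len:
        raise ValueError("width must not exceed seq_len")
    # Each run votes for every position its prediction covers.
    votes = Counter(p for start, end in predictions for p in range(start, end))
    if not votes:
        return None
    if min_votes is None:
        min_votes = math.ceil(len(predictions) / 2)  # assumed: simple majority
    # Fall back to the best available support if no position reaches min_votes.
    threshold = min(min_votes, max(votes.values()))
    kept = sorted(p for p, v in votes.items() if v >= threshold)
    # Group over-represented positions into contiguous regions.
    runs = []
    for p in kept:
        if runs and p == runs[-1][-1] + 1:
            runs[-1].append(p)
        else:
            runs.append([p])
    best = max(runs, key=lambda run: sum(votes[p] for p in run))
    lo, hi = best[0], best[-1] + 1  # boundary region, half-open
    # Squeeze or expand around the center of the boundary region.
    start = (lo + hi - width) // 2
    start = max(0, min(start, seq_len - width))  # keep the window in the sequence
    return (start, start + width)
```

For example, `consensus_site([(10, 20), (12, 22), (14, 24), (50, 60)], width=8, seq_len=100)` ignores the outlying fourth prediction and returns `(13, 21)`, a window squeezed around the center of the region supported by at least two runs.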
Figure 4
Measures of prediction accuracy at the nucleotide and motif levels. Accuracy scores over an input sequence set are the average accuracy scores over all its sequences. The overall accuracy scores of a motif discovery algorithm are the average accuracy scores over all M input sequence sets.
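
The nucleotide level measures and the two-stage averaging described in this caption can be sketched as follows. The formulas are assumptions, since this page does not reproduce them: nSn is taken as TP/(TP+FN), nSp as TP/(TP+FP) (the "specificity" convention common in motif-discovery benchmarks, which is really a positive predictive value), and nPC as the performance coefficient TP/(TP+FN+FP). The function names are hypothetical.

```python
def nucleotide_scores(known, predicted):
    """Score one sequence; sites are half-open (start, end) intervals."""
    known_pos = {p for s, e in known for p in range(s, e)}
    pred_pos = {p for s, e in predicted for p in range(s, e)}
    tp = len(known_pos & pred_pos)   # predicted and truly in a site
    fn = len(known_pos - pred_pos)   # in a site but missed
    fp = len(pred_pos - known_pos)   # predicted but not in a site

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "nSn": ratio(tp, tp + fn),        # assumed definition of sensitivity
        "nSp": ratio(tp, tp + fp),        # assumed: PPV-style "specificity"
        "nPC": ratio(tp, tp + fn + fp),   # performance coefficient
    }


def mean_scores(score_dicts):
    """Average a non-empty list of score dictionaries key by key."""
    keys = score_dicts[0].keys()
    return {k: sum(d[k] for d in score_dicts) / len(score_dicts) for k in keys}


def overall_scores(sequence_sets):
    """Average over the sequences of each set, then over all sets.

    `sequence_sets` is a list of sets; each set is a list of
    (known_sites, predicted_sites) pairs, one pair per sequence.
    """
    per_set = [
        mean_scores([nucleotide_scores(k, p) for k, p in seq_set])
        for seq_set in sequence_sets
    ]
    return mean_scores(per_set)
```

Averaging per set first, as the caption describes, gives every input sequence set equal weight regardless of how many sequences it contains.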
Figure 5
An example of binding site misalignment in a motif in RegulonDB. The shaded columns are those with at least 80% dominance of a certain nucleotide. (a) Original binding sites of motif TrpR; (b) the shifted binding site with maximum shift of four positions to maximize the number of consensus positions.
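
The realignment idea in this caption can be sketched as a small search: count the columns in which one nucleotide reaches 80% dominance, then shift each annotated site by up to four positions if doing so increases that count. The greedy one-site-at-a-time strategy and the names `consensus_columns` and `realign` are assumptions for illustration; the page does not say how the authors searched for the best shifts.

```python
from collections import Counter


def consensus_columns(sites, dominance=0.8):
    """Count columns where one nucleotide makes up at least `dominance` of rows."""
    n = len(sites)
    count = 0
    for column in zip(*sites):
        top = Counter(column).most_common(1)[0][1]
        if top / n >= dominance:
            count += 1
    return count


def realign(sequences, starts, width, max_shift=4, dominance=0.8):
    """Greedily shift site starts (assumed strategy) to gain consensus columns.

    `sequences` are the flanked sequences, `starts` the annotated site starts.
    Each site may move at most `max_shift` positions from its original start.
    """
    original = list(starts)
    starts = list(starts)

    def score(current):
        sites = [seq[s:s + width] for seq, s in zip(sequences, current)]
        return consensus_columns(sites, dominance)

    improved = True
    while improved:  # terminates: the score strictly increases and is bounded
        improved = False
        for i, seq in enumerate(sequences):
            best = score(starts)
            for shift in range(-max_shift, max_shift + 1):
                cand = original[i] + shift
                if cand < 0 or cand + width > len(seq):
                    continue  # shifted site would fall off the sequence
                trial = starts[:i] + [cand] + starts[i + 1:]
                trial_score = score(trial)
                if trial_score > best:
                    best, starts[i], improved = trial_score, cand, True
    return starts
```

In a toy case with four sequences where one annotation is off by a single position, moving that one start back restores full agreement in every column.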
Figure 6
Scalability in terms of Performance coefficient (PC) with respect to the input sequence length (margin size). (a) nPC at nucleotide level; (b) sPC at binding site level.
Figure 7
Motif level success rate (mSr) with respect to the sequence length (margin size).
Figure 8
The nucleotide level prediction accuracy in terms of sensitivity (nSn) and specificity (nSp) with respect to the sequence lengths (margin sizes). (a) nSn at nucleotide level; (b) nSp at nucleotide level.
Figure 9
Comparison of prediction performance in terms of the number of input sequences in a dataset. The margin size is 200. (a) Nucleotide site level accuracy (nPC); (b) Binding site level accuracy (sPC).
Figure 10
Correlation between motif significance scores and performance coefficient scores of MDScan.
Figure 11
Difference of the information content between two sequence sets. Aligned sequences Sclw and the expanded sequences Sext; realigned motifs Salign and the original motifs Smotif.
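
As a rough illustration of the quantity compared in this figure, the information content of a set of aligned, equal-length sites can be computed column by column. The sketch below assumes a uniform background over four nucleotides (so each column contributes between 0 and 2 bits) and applies no small-sample correction; the exact formula the authors used is not given on this page, and the function names are hypothetical.

```python
import math
from collections import Counter


def information_content(sites):
    """Total information content (bits) of aligned, equal-length sites.

    Assumes a uniform background: each column contributes 2 - H(column),
    where H is the Shannon entropy of its nucleotide frequencies.
    """
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


def information_difference(set_a, set_b):
    """Difference in information content between two aligned site sets."""
    return information_content(set_a) - information_content(set_b)
```

A positive difference means the first set is more conserved overall than the second under these assumptions.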

