BMC Bioinformatics. 2006 Jul 13;7:342.
doi: 10.1186/1471-2105-7-342

EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences

Jianjun Hu et al. BMC Bioinformatics.

Abstract

Background: Understanding gene regulatory networks has become one of the central research problems in bioinformatics. More than thirty algorithms have been proposed to identify DNA regulatory sites during the past thirty years. However, the prediction accuracy of these algorithms is still quite low. Ensemble algorithms have emerged as an effective strategy in bioinformatics for improving the prediction accuracy by exploiting the synergetic prediction capability of multiple algorithms.

Results: We proposed a novel clustering-based ensemble algorithm, named EMD, for de novo motif discovery. It combines the predictions from multiple runs of one or more base component algorithms. This is the first application of the ensemble approach to the motif discovery problem. The algorithm was tested on a benchmark dataset generated from E. coli RegulonDB. EMD achieved a 22.4% improvement in nucleotide-level prediction accuracy over the best stand-alone component algorithm. Its advantage is more pronounced for shorter input sequences, but most importantly, it always outperforms or at least matches the stand-alone component algorithms, even for longer sequences.

Conclusion: We proposed an ensemble approach to the motif discovery problem that takes advantage of the large number of available motif discovery programs. We have shown that the ensemble approach is an effective strategy for improving both sensitivity and specificity, and thus the accuracy of the prediction. A further advantage of the EMD algorithm is its flexibility: a powerful new algorithm can easily be added to the system.

Figures

Figure 1
Overview of the EMD algorithm. Each component algorithm is run R times on an input sequence data set, and K motifs are collected from each run. The right side of the figure illustrates the grouping phase of the algorithm for input sequence 1 and the final prediction of sites for site group 1 of that sequence. See the text for details.
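The idea of combining site predictions from repeated runs can be illustrated with a small position-voting sketch. This is not the authors' procedure: the caption only names the collection and grouping phases, so the voting rule, the function name `vote_sites`, and the parameters `site_width` and `min_votes` are all assumptions made for illustration. Each run contributes at most one vote per nucleotide position of a single input sequence, and positions with enough votes are merged into predicted sites.

```python
from collections import Counter


def vote_sites(runs, site_width, min_votes):
    """Combine predicted site starts from several runs on ONE sequence.

    runs       -- list of runs; each run is a list of predicted start positions
    site_width -- assumed common width of every predicted site (hypothetical)
    min_votes  -- minimum number of runs that must cover a position (hypothetical)

    Returns a list of (start, end) intervals, end exclusive.
    """
    votes = Counter()
    for run in runs:
        # A run votes at most once per position, even if its sites overlap.
        covered = set()
        for start in run:
            covered.update(range(start, start + site_width))
        for pos in covered:
            votes[pos] += 1

    # Keep positions supported by enough runs.
    kept = sorted(pos for pos, count in votes.items() if count >= min_votes)

    # Merge consecutive kept positions into intervals.
    sites = []
    for pos in kept:
        if sites and pos == sites[-1][1]:
            sites[-1] = (sites[-1][0], pos + 1)
        else:
            sites.append((pos, pos + 1))
    return sites
```

For example, with three runs predicting starts `[10]`, `[12]`, and `[11, 40]`, a width of 6, and `min_votes=2`, the isolated prediction at position 40 is discarded and the overlapping predictions are merged into one consensus interval.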
Figure 2
Scalability of the EMD algorithms. The nucleotide-level prediction performance is compared with that of the best base algorithm, MDscan (MD), and the best multi-restart algorithm (RS-BP). The evaluation was done on the ECRDB61B-200 data set. The y-axis shows the nucleotide-level accuracy (nPC). Error bars are not shown because the standard error is very small (less than 0.003).
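The caption does not define nPC. A minimal sketch follows, assuming it is the nucleotide-level performance coefficient commonly used in motif-discovery benchmarks, TP / (TP + FN + FP), computed over nucleotide positions. The function name `npc` and the representation of sites as sets of positions are illustrative choices, not taken from the paper.

```python
def npc(predicted, true):
    """Nucleotide-level performance coefficient: TP / (TP + FN + FP).

    predicted -- set of nucleotide positions predicted to lie in a site
    true      -- set of nucleotide positions annotated as lying in a site
    (Assumed definition; the caption itself does not spell it out.)
    """
    tp = len(predicted & true)   # positions correctly predicted
    fp = len(predicted - true)   # predicted but not annotated
    fn = len(true - predicted)   # annotated but missed
    denom = tp + fp + fn
    # With no predicted and no annotated positions, return 0.0 by convention.
    return tp / denom if denom else 0.0
```

Under this definition, the score is 1.0 only when the predicted and annotated positions coincide exactly, so both missed nucleotides and spurious ones lower it.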
