PLoS Comput Biol. 2011 Feb 10;7(2):e1001070.

doi: 10.1371/journal.pcbi.1001070.

De-novo discovery of differentially abundant transcription factor binding sites including their positional preference

Jens Keilwagen¹, Jan Grau, Ivan A Paponov, Stefan Posch, Marc Strickert, Ivo Grosse

Affiliation

¹ Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany. Jens.Keilwagen@ipk-gatersleben.de

PMID: 21347314
PMCID: PMC3037384
DOI: 10.1371/journal.pcbi.1001070

Erratum in

PLoS Comput Biol. 2011 Oct;7(10). doi: 10.1371/annotation/a0b541dc-472b-4076-a435-499ce9519335

Abstract

Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology that has not yet been fully solved. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to that of existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from microarray, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin-responsive element predominantly positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in the case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Comparison of de-novo motif discovery tools on the metazoan compendium.**
Each row of the table presents the results for one motif discovery tool, and each column corresponds to one data set of the metazoan compendium. A red cross indicates that a motif is not found, and a checkmark indicates that a motif is found by a specific tool. The color of the checkmarks represents the accuracy of the discovered motif as measured by the normalized Euclidean distance, and we use the thresholds on the normalized Euclidean distance proposed by Linhart et al. A dedicated symbol marks long execution times that were aborted. In the last row of the table, we report the total number of motifs discovered by each of the tools.

**Figure 2. Comparison of nucleotide precision recall curves for known and unknown motif length.**
Figure 2a) shows the nucleotide precision recall curves for the de-novo motif discovery tools provided with the correct motif length, and Figure 2b) shows the nucleotide precision recall curves for the de-novo motif discovery tools when the correct motif length is not provided but must be learned by the tools. For reasons of visual clarity, we do not plot the partial nucleotide precision recall curves of those tools with nucleotide precision and nucleotide recall below 0.1 for all available thresholds. These curves would be located in the lower left corner of both subfigures.

**Figure 3. Comparison of nucleotide precision recall curves for uniform and Gaussian position distribution.**
Figure 3a) shows the nucleotide precision recall curves for the de-novo motif discovery tools on the data set with uniformly placed MA0015 BSs, and Figure 3b) shows the nucleotide precision recall curves for the de-novo motif discovery tools on the data set with Gaussian distributed MA0015 BSs. Figure 3c) shows for both data sets the real distributions as histograms of start positions of the implanted BSs and the position distributions learned by Dispom. For reasons of visual clarity, we do not plot results located in the lower left corners of subfigures a) and b) (cf. Figure 2).

**Figure 4. Comparison of nucleotide precision recall curves with and without decoy motif.**
Figure 4a) shows the nucleotide precision recall curves for the de-novo motif discovery tools on the data set without implanted decoy motif, and Figure 4b) shows the nucleotide precision recall curves for the de-novo motif discovery tools on the data set with implanted decoy motif MA0052. For reasons of visual clarity, we do not plot results located in the lower left corners of both subfigures (cf. Figure 2).

**Figure 5. Overview of de-novo motif discovery results for Gaussian data sets and unknown motif length.**
Each column shows the results of one data set, and each row shows the results of one de-novo motif discovery tool. Each subfigure shows five bars that visualize the nucleotide precision for a nucleotide recall of 10%, 30%, 50%, 70%, and 90%, respectively, from left to right. Additionally, each subfigure contains gray horizontal lines marking reference values of the nucleotide precision.

**Figure 6. Auxin-dependent motif and position distribution found by Dispom.**
Figure 6a) shows the sequence logo obtained from the predictions of Dispom and the corresponding consensus sequence, where S stands for C or G, and B stands for C, G, or T. Figure 6b) shows a histogram of the predicted start positions and the position distribution learned by Dispom (red line).
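
The nucleotide precision and nucleotide recall plotted in Figures 2 to 5 can be illustrated with a minimal Python sketch. It assumes the common nucleotide-level definitions: precision is the fraction of predicted binding-site nucleotides that lie in true binding sites, and recall is the fraction of true binding-site nucleotides that are predicted. The function names, the `(sequence, start, length)` representation of a site, and the toy coordinates are hypothetical and are not taken from the paper or from Dispom.

```python
from typing import Iterable, Set, Tuple

# Hypothetical site representation: (sequence index, start position, length).
Site = Tuple[int, int, int]


def covered_nucleotides(sites: Iterable[Site]) -> Set[Tuple[int, int]]:
    """Expand sites into the set of (sequence, position) pairs they cover."""
    return {
        (seq, pos)
        for seq, start, length in sites
        for pos in range(start, start + length)
    }


def nucleotide_precision_recall(
    true_sites: Iterable[Site], predicted_sites: Iterable[Site]
) -> Tuple[float, float]:
    """Return (precision, recall) counted over single nucleotides."""
    truth = covered_nucleotides(true_sites)
    predicted = covered_nucleotides(predicted_sites)
    true_positives = len(truth & predicted)
    # Convention chosen for this sketch: an empty denominator yields 0.0.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy data: two true sites of length 8; one prediction is shifted by 2 bp,
# the other misses its site entirely.
true_sites = [(0, 10, 8), (1, 40, 8)]
predicted_sites = [(0, 12, 8), (1, 100, 8)]
precision, recall = nucleotide_precision_recall(true_sites, predicted_sites)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Recomputing this pair of values while sweeping a score threshold over the predictions yields one point per threshold, which is how a precision recall curve of the kind shown in the figures is traced.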
