BMC Bioinformatics. 2015 May 1;16:140.
doi: 10.1186/s12859-015-0573-5.

Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data

Michal Dabrowski et al. BMC Bioinformatics. 2015.

Abstract

Background: For many years, the binding preferences of transcription factors (TFs) have been described by so-called motifs, usually defined mathematically by position weight matrices (PWMs) or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of motif models in public and commercial databases, a researcher who wants to use them faces many competing methods of identifying potential binding sites in a genome of interest, and there is little published information on which choices are optimal. Given the availability of a large number of different motif models, together with experimental datasets describing actual TF binding in hundreds of TF-ChIP-seq pairs, we set out to perform a comprehensive analysis of this matter.

Results: We focus on the task of identifying potential transcription factor binding sites in the human genome. First, we provide a comprehensive comparison of the coverage and quality of the models available in different databases, showing that the public databases have comparable TF coverage and better motif performance than the commercial databases. Second, we compare different motif scanners, showing that, regardless of the database used, the tools developed by the scientific community outperform the commercial tools. Third, we calculate for each motif a detection threshold that optimizes the accuracy of prediction. Finally, we provide an in-depth comparison of different methods of choosing thresholds for all motifs a priori. Surprisingly, we show that selecting a common false-positive rate gives results that are the least biased by the information content of the motif and are therefore the most uniformly accurate.

Conclusion: We provide a guide for researchers working with transcription factor motifs. It is supplemented with detailed results of the analysis and the benchmark datasets at http://bioputer.mimuw.edu.pl/papers/motifs/.
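
The idea of choosing a detection threshold from a common false-positive rate (FPR), mentioned in the Results, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (log_odds_pwm, best_score, threshold_for_fpr, random_sequences), the uniform background, the pseudocount, the random negative sequences, and forward-strand-only scanning are all assumptions made for illustration.

```python
import math
import random

BASES = "ACGT"

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into a log-odds PWM (a list of dicts).

    Assumes a uniform background frequency; real genomes are not uniform.
    """
    pwm = []
    for column in counts:
        total = sum(column[b] + pseudocount for b in BASES)
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def best_score(pwm, sequence):
    """Highest PWM score over all windows of the sequence (forward strand only)."""
    width = len(pwm)
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )

def threshold_for_fpr(pwm, negatives, fpr):
    """Pick a cutoff so that at most a fraction `fpr` of the negative
    sequences score strictly above it (a hit is defined as score > cutoff)."""
    scores = sorted((best_score(pwm, s) for s in negatives), reverse=True)
    allowed = min(int(fpr * len(scores)), len(scores) - 1)
    return scores[allowed]

def random_sequences(n, length, seed=0):
    """Hypothetical stand-in for a negative dataset (e.g. peak flanks)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(n)]
```

In this sketch the same target FPR would be applied to every motif, so each motif gets its own score cutoff while the expected rate of false hits on the negative set stays comparable across motifs.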

Figures

Figure 1
Comparison of the performance of the dedicated commercial and public scanners. Shown are the average specificity and sensitivity ± SD for each tested database/scanner. MatInspector (A, C) or Match (B, D) were each separately compared to both matrix-scan and Bio.Motif, with either 3rd exons (A, B) or flanks of the ChIP-seq peaks (C, D) used as the negative datasets. Color encodes the scanner: matrix-scan (red), Bio.Motif (magenta), Match (green), MatInspector (blue). Straight lines through the points of average performance are lines of equal balanced accuracy. Gray ovals in A, D mark the performance obtained with Genomatix motif families.
Figure 2
Comparison of coverage of human TFs by motif databases. A. The numbers of distinct genes (Entrez Gene ID) assigned to all the vertebrate motifs from the indicated databases. For MatBase, the number of TFs as provided by Genomatix is shown. B. Venn diagram showing the overlap between human TF genes represented in the union of all the public databases and in the Transfac database. C. As in B, but for the 81 human TFs represented in Ensembl 71 funcgen; based on MatBase v.9.0.
Figure 3
AUC distributions in motif databases. Consecutive plots present distributions of AUC calculated with respect to various negative datasets, as indicated by plots’ titles. For each motif the best related TF was selected.
Figure 4
Balanced accuracies for various approaches to threshold selection. Top row: balanced accuracy vs threshold parameter. Colors represent motif information content: from blue (low), through green and yellow, to beige (high). Vertical black lines indicate optimal thresholds; black circles indicate the corresponding average balanced accuracies. Bottom row: how the (sub-)optimal parameter values of a motif (X-axis) depend on its information content. For each motif, a circle marks the parameter value yielding maximal balanced accuracy (BA) and a horizontal line marks the parameter range for which BA is at least 95% of the maximum. Colors represent motif AUC: from green (low), through yellow, to red (high). Balanced accuracies are calculated with respect to negative sequences composed of flanks of ChIP-seq peaks.
Figure 5
Balanced accuracy versus the FPR threshold for various AUC cutoffs: AUC > 0.6, AUC > 0.7, AUC > 0.8, AUC > 0.9. Colors as in Figure 4, top row.
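
The two performance measures used throughout the figure captions, AUC and balanced accuracy, can be sketched as follows. This is a generic illustration under stated assumptions rather than the evaluation code behind the figures: the function names (auc, balanced_accuracy, best_threshold) are hypothetical, a sequence counts as a predicted hit when its score is strictly above the threshold, and AUC is computed by direct pairwise comparison, which is only practical for small score lists.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive outscores a randomly chosen negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def balanced_accuracy(pos_scores, neg_scores, threshold):
    """Mean of sensitivity and specificity at a given score threshold."""
    sensitivity = sum(s > threshold for s in pos_scores) / len(pos_scores)
    specificity = sum(s <= threshold for s in neg_scores) / len(neg_scores)
    return (sensitivity + specificity) / 2

def best_threshold(pos_scores, neg_scores):
    """Observed score that maximizes balanced accuracy (first one on ties)."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    return max(
        candidates,
        key=lambda t: balanced_accuracy(pos_scores, neg_scores, t),
    )
```

Because balanced accuracy is the plain average of sensitivity and specificity, points with the same balanced accuracy satisfy sensitivity + specificity = constant, which is why they fall on a straight line in a sensitivity-specificity plot.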
