Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data

doi:10.1186/s12859-015-0573-5

BMC Bioinformatics. 2015 May 1;16:140.

Michal Dabrowski¹, Norbert Dojer², Izabella Krystkowiak³, Bozena Kaminska⁴, Bartek Wilczynski⁵

Affiliations

¹ Laboratory of Bioinformatics, Nencki Institute of Experimental Biology, Pasteura 3, Warszawa, 02-093, Poland. m.dabrowski@nencki.gov.pl.
² Institute of Informatics, University of Warsaw, Banacha 2, Warszawa, 02-097, Poland. dojer@mimuw.edu.pl.
³ Laboratory of Molecular Neurobiology, Nencki Institute of Experimental Biology, Pasteura 3, Warszawa, 02-093, Poland. i.krystkowiak@nencki.gov.pl.
⁴ Laboratory of Molecular Neurobiology, Nencki Institute of Experimental Biology, Pasteura 3, Warszawa, 02-093, Poland. b.kaminska@nencki.gov.pl.
⁵ Institute of Informatics, University of Warsaw, Banacha 2, Warszawa, 02-097, Poland. bartek@mimuw.edu.pl.

PMID: 25927199
PMCID: PMC4436866
DOI: 10.1186/s12859-015-0573-5

Michal Dabrowski et al. BMC Bioinformatics. 2015.

Abstract

Background: For many years, the binding preferences of transcription factors (TFs) have been described by so-called motifs, usually defined mathematically by position weight matrices (PWMs) or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of motif models in public and commercial databases, a researcher who wants to use them faces many competing methods of identifying potential binding sites in a genome of interest, and there is little published information on which choices are optimal. Given the availability of a large number of different motif models, as well as experimental datasets describing actual TF binding for hundreds of TF-ChIP-seq pairs, we set out to perform a comprehensive analysis of this matter.

Results: We focus on the task of identifying potential transcription factor binding sites in the human genome. First, we provide a comprehensive comparison of the coverage and quality of models available in different databases, showing that the public databases have comparable TF coverage and better motif performance than commercial databases. Second, we compare different motif scanners, showing that, regardless of the database used, the tools developed by the scientific community outperform the commercial tools. Third, we calculate for each motif a detection threshold that optimizes the accuracy of prediction. Finally, we provide an in-depth comparison of different methods of choosing thresholds for all motifs a priori. Surprisingly, we show that selecting a common false-positive rate gives results that are the least biased by the information content of the motif and therefore the most uniformly accurate.
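The per-motif threshold step can be illustrated with a minimal sketch. It is not the authors' code: it assumes each sequence has already been reduced to a single best PWM score, that the positive set comes from ChIP-seq peaks and the negative set from background sequences, and that "accuracy" means balanced accuracy (the mean of sensitivity and specificity), as in the figure captions. The function names and the scores are hypothetical.

```python
def balanced_accuracy(pos_scores, neg_scores, threshold):
    """Mean of sensitivity and specificity when score >= threshold is called a hit."""
    sensitivity = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    specificity = sum(s < threshold for s in neg_scores) / len(neg_scores)
    return (sensitivity + specificity) / 2


def best_threshold(pos_scores, neg_scores):
    """Try every observed score as a threshold; return (threshold, BA) with maximal BA."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    return max(
        ((t, balanced_accuracy(pos_scores, neg_scores, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Hypothetical best-hit scores for one motif (illustrative values only).
pos = [8.1, 7.4, 6.9, 5.2, 9.3]  # sequences under ChIP-seq peaks
neg = [3.0, 4.1, 5.5, 2.2, 6.0]  # negative (background) sequences
threshold, ba = best_threshold(pos, neg)
print(threshold, round(ba, 3))
```

Only observed scores are tried as candidates, because balanced accuracy can change only when the threshold crosses one of them.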

Conclusion: We provide a guide for researchers working with transcription factor motifs. It is supplemented with detailed results of the analysis and the benchmark datasets at http://bioputer.mimuw.edu.pl/papers/motifs/.

Figures

**Figure 1**
Comparison of performance of the dedicated commercial and public scanners. Shown are the average specificity and sensitivity ± SD for each tested database/scanner. MatInspector **(A, C)** or Match **(B, D)** were each separately compared to both matrix-scan and Bio.Motif, with either 3rd exons **(A, B)** or flanks of the ChIP-seq peaks **(C, D)** used as the negative datasets. The color encodes scanners: matrix-scan (red), Bio.Motif (magenta), Match (green), MatInspector (blue). Straight lines through the points of average performance are the lines of equal balanced accuracy. Gray ovals in A, D mark the performance obtained with Genomatix motif families.

**Figure 2**
Comparison of coverage of human TFs by motif databases. A. The numbers of distinct genes (Entrez Gene ID) assigned to all the vertebrate motifs from the indicated databases. For MatBase, the number of TFs as provided by Genomatix is represented. B. The Venn diagram showing the overlap between human TF genes represented in the union of all the public databases and in the Transfac database. C. Similar to B, but for the 81 human TFs represented in Ensembl 71 funcgen is based on MatBase v.9.0.

**Figure 3**
AUC distributions in motif databases. Consecutive plots present distributions of AUC calculated with respect to various negative datasets, as indicated by plots’ titles. For each motif the best related TF was selected.
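As an illustration of the AUC statistic in this caption, the following sketch computes it directly from its probabilistic definition: the chance that a randomly chosen positive sequence outscores a randomly chosen negative one, with ties counted as one half. It assumes one score per sequence and is an illustrative example, not the procedure used in the paper.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores: perfect separation gives 1.0, identical sets give 0.5.
print(auc([3, 4], [1, 2]))
print(auc([1, 2], [1, 2]))
```

The pairwise loop is quadratic in the number of sequences; it is chosen here for clarity, and a rank-based computation would be preferable for large datasets.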

**Figure 4**
Balanced accuracies for various approaches to threshold selection. Top row: balanced accuracy vs threshold parameter. Colors represent motif information content: from blue (low), through green and yellow to beige (high). Vertical black lines indicate optimal thresholds; black circles indicate the corresponding average balanced accuracies. Bottom row: how the (sub-)optimal parameter values of a motif (X-axis) depend on its information content. For each motif, a circle represents the parameter value yielding maximal balanced accuracy (BA) and a horizontal line represents the parameter range for which BA is at least 95% of the maximum. Colors represent motif AUC: from green (low), through yellow to red (high). Balanced accuracies are calculated with respect to negative sequences composed of flanks of ChIP-seq peaks.
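Motif information content, used for the coloring above, can be sketched as the summed relative entropy of each motif column against a background distribution. The example below assumes a position probability matrix given as a list of per-position dictionaries and a uniform background; it omits any small-sample correction, and the exact definition used in the paper may differ.

```python
import math


def information_content(ppm, background=None):
    """Total information content (bits) of a position probability matrix.

    ppm: list of dicts, one per motif position, mapping base -> probability.
    background: dict mapping base -> background probability (uniform if omitted).
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    total = 0.0
    for column in ppm:
        for base, p in column.items():
            if p > 0:  # a zero-probability base contributes nothing
                total += p * math.log2(p / background[base])
    return total


# Hypothetical two-position motif: one fully conserved column, one uniform column.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(information_content([conserved, uniform]))
```

With a uniform background, a fully conserved position contributes 2 bits and a uniform position contributes 0 bits.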

**Figure 5**
Balanced accuracy versus the FPR threshold for various AUC cutoffs (AUC > 0.6, AUC > 0.7, AUC > 0.8, AUC > 0.9). Colors etc. as in Figure 4, top row.
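Choosing a threshold from a common false-positive rate (FPR) can be sketched as follows: for each motif, take the lowest score threshold at which the fraction of negative sequences called as hits does not exceed the target FPR. This is an empirical illustration on hypothetical negative scores; the paper may instead derive the FPR from a background model, and the function names are assumptions.

```python
def false_positive_rate(neg_scores, threshold):
    """Fraction of negative sequences with score >= threshold."""
    return sum(s >= threshold for s in neg_scores) / len(neg_scores)


def threshold_for_fpr(neg_scores, target_fpr):
    """Lowest observed score threshold whose empirical FPR is <= target_fpr.

    Returns infinity when no observed score satisfies the target.
    """
    for t in sorted(set(neg_scores)):  # FPR can only decrease as t increases
        if false_positive_rate(neg_scores, t) <= target_fpr:
            return t
    return float("inf")


# Hypothetical negative-set scores for one motif.
neg_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(threshold_for_fpr(neg_scores, 0.2))
```

Applying the same `target_fpr` to every motif yields a different score threshold per motif, which is the sense in which the FPR, rather than the raw score, is held in common.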

Cited by

Integrated analysis of motif activity and gene expression changes of transcription factors.
Madsen JGS, Rauch A, Van Hauwaert EL, Schmidt SF, Winnefeld M, Mandrup S. Genome Res. 2018 Feb;28(2):243-255. doi: 10.1101/gr.227231.117. Epub 2017 Dec 12. PMID: 29233921.
Target Finder of Transcription Factor (TFoTF): a novel tool to predict transcription factor-targeted genes in cancer.
Wang F, Xu X, Li X, Yuan J, Gao X, Wang C, Guan W, Xu G. Mol Oncol. 2023 Jul;17(7):1246-1262. doi: 10.1002/1878-0261.13388. Epub 2023 Feb 11. PMID: 36734611.
Negative selection maintains transcription factor binding motifs in human cancer.
Vorontsov IE, Khimulya G, Lukianova EN, Nikolaeva DD, Eliseeva IA, Kulakovskiy IV, Makeev VJ. BMC Genomics. 2016 Jun 23;17 Suppl 2(Suppl 2):395. doi: 10.1186/s12864-016-2728-9. PMID: 27356864.
A Multireporter Bacterial 2-Hybrid Assay for the High-Throughput and Dynamic Assay of PDZ Domain-Peptide Interactions.
Ichikawa DM, Corbi-Verge C, Shen MJ, Snider J, Wong V, Stagljar I, Kim PM, Noyes MB. ACS Synth Biol. 2019 May 17;8(5):918-928. doi: 10.1021/acssynbio.8b00499. Epub 2019 Apr 18. PMID: 30969105.
Bioinformatic Prediction and High Throughput In Vivo Screening to Identify Cis-Regulatory Elements for the Development of Algal Synthetic Promoters.
Torres-Tiji Y, Sethuram H, Gupta A, McCauley J, Dutra-Molino JV, Pathania R, Saxton L, Kang K, Hillson NJ, Mayfield SP. ACS Synth Biol. 2024 Jul 19;13(7):2150-2165. doi: 10.1021/acssynbio.4c00199. Epub 2024 Jul 10. PMID: 38986010.

