BEESEM: estimation of binding energy models using HT-SELEX data

doi:10.1093/bioinformatics/btx191

Bioinformatics. 2017 Aug 1;33(15):2288-2295.

Shuxiang Ruan¹, S Joshua Swamidass², Gary D Stormo¹

Affiliations

¹ Department of Genetics.
² Department of Pathology and Immunology, Washington University School of Medicine, St. Louis 63110, USA.

PMID: 28379348
PMCID: PMC5860122
DOI: 10.1093/bioinformatics/btx191

Shuxiang Ruan et al. Bioinformatics. 2017.

Abstract

Motivation: Characterizing the binding specificities of transcription factors (TFs) is crucial to the study of gene expression regulation. Recently developed high-throughput experimental methods, including protein binding microarrays (PBM) and high-throughput SELEX (HT-SELEX), have enabled rapid measurements of the specificities for hundreds of TFs. However, few studies have developed efficient algorithms for estimating binding motifs based on HT-SELEX data. In addition, the simple method of constructing a position weight matrix (PWM) by comparing the frequency of the preferred sequence with those of its single-nucleotide variants risks generating motifs with higher information content than the true binding specificity.
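To make the "simple method" mentioned above concrete, the following is a minimal sketch of building a PWM from the counts of a preferred sequence and its single-nucleotide variants. It is an illustration under stated assumptions, not the procedure used in any published pipeline: the function name `single_variant_pwm`, the input format (a dictionary of k-mer counts), and the toy counts are all hypothetical.

```python
BASES = "ACGT"

def single_variant_pwm(counts, consensus):
    """Build a PWM from a consensus k-mer and its single-nucleotide variants.

    counts: dict mapping k-mer -> observed count (hypothetical input format).
    consensus: the preferred (most enriched) k-mer.
    Returns one dict per position mapping base -> frequency.
    """
    pwm = []
    for i in range(len(consensus)):
        column = {}
        for base in BASES:
            # Substitute one base at position i; keep the rest of the consensus.
            variant = consensus[:i] + base + consensus[i + 1:]
            column[base] = counts.get(variant, 0)
        total = sum(column.values())
        if total == 0:
            raise ValueError(f"no counts observed for position {i}")
        pwm.append({base: c / total for base, c in column.items()})
    return pwm

# Hypothetical toy counts for a 3-bp site with consensus "ACG".
toy_counts = {
    "ACG": 100,
    "CCG": 10, "GCG": 5, "TCG": 5,   # variants at position 0
    "AAG": 20, "ATG": 5,             # variants at position 1 ("AGG" unobserved)
    "ACA": 25, "ACC": 25, "ACT": 50, # variants at position 2
}
pwm = single_variant_pwm(toy_counts, "ACG")
for i, column in enumerate(pwm):
    print(i, {b: round(p, 3) for b, p in column.items()})
```

Because every column is estimated only from sequences that match the consensus at all other positions, weaker sites contribute nothing, which is one way such a matrix can end up more specific than the underlying binding preference.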

Results: We developed an algorithm called BEESEM that builds on a comprehensive biophysical model of protein-DNA interactions, which is trained using the expectation maximization method. BEESEM is capable of selecting the optimal motif length and calculating the confidence intervals of estimated parameters. By comparing BEESEM with the published motifs estimated using the same HT-SELEX data, we demonstrate that BEESEM provides significant improvements. We also evaluate several motif discovery algorithms on independent PBM and ChIP-seq data. BEESEM provides significantly better fits to in vitro data, but its performance is similar to some other methods on in vivo data under the criterion of the area under the receiver operating characteristic curve (AUROC). This highlights the limitations of the purely rank-based AUROC criterion. Using quantitative binding data to assess models, however, demonstrates that BEESEM improves on prior models.
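The remark that AUROC is purely rank-based can be illustrated with a small sketch. The `auroc` helper below uses the standard pairwise (Mann-Whitney) formulation; the labels and the two sets of scores are hypothetical toy values, not data from the paper. The second model has the same ranking as the first but very different score magnitudes, and it receives the same AUROC.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy example: 1 = bound, 0 = unbound.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.6, 0.3, 0.5, 0.2, 0.1]
# Same ordering, but magnitudes are distorted by a monotone transform.
model_b = [s ** 4 for s in model_a]

auroc_a = auroc(model_a, labels)
auroc_b = auroc(model_b, labels)
print(auroc_a, auroc_b)
```

Any strictly increasing transform of the scores leaves the value unchanged, so AUROC cannot distinguish a model that predicts quantitative binding well from one that only orders sequences correctly.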

Availability and implementation: Freely available on the web at http://stormo.wustl.edu/resources.html .

Contact: stormo@wustl.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1**
The HT-SELEX experiments and the J2013 PWMs. (a) Most of the HT-SELEX experiments have 4 cycles. By convention, the 0th SELEX cycle denotes the initial library of randomly generated DNA probes. Multiple HT-SELEX experiments share the same initial library. 27 sequencing datasets corresponding to the 1st cycle are missing from the database. (b) In 80% of the datasets, the randomized region is 20 bp long. (c) The length of the J2013 PWMs ranges from 7 to 23 bp; the mean length is 12.7 bp. (d) The average mean column information content of the J2013 PWMs is 1.20 bits. The information content is computed based on a uniform background distribution of the four nucleotides.

**Fig. 2**
The results of the PBM and ChIP-seq evaluation tests. (a) In the PBM tests, the number of binding models tested is 67 for each algorithm. The error bars mark the standard deviation of the scores. The BEEML bar is singled out because the corresponding PWMs were trained on PBM data. For the other binding models trained on HT-SELEX data, the PBM test is an external validation on *in vitro* data. (b) In the ChIP-seq tests, the number of binding models tested is 72 for each algorithm. The error bars mark the standard deviation of the scores. The y axis starts from 0.5, the expected score of a random classifier. The DiMO bar is singled out because the corresponding PWMs were trained on ChIP-seq data. For the other binding models trained on HT-SELEX data, the ChIP-seq test is an external validation on *in vivo* data.

**Fig. 3**
The median relative affinities (MRAs) predicted by different binding models. The number of PWMs tested is 73 for each algorithm. The rectangular bars mark the 50th percentile (the median) of the 73 MRAs for each algorithm, and the error bars mark the 5th and 95th percentiles. DeepBind is excluded from the HT-SELEX test because the output of its models cannot be interpreted as simple binding probabilities.
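The mean column information content reported in Fig. 1(d) can be illustrated with the standard formula for a uniform background, where each column contributes the sum over bases of p log2(p / 0.25) bits. The sketch below is a minimal illustration; the function names and the toy matrix are hypothetical and are not taken from the paper.

```python
import math

def column_information(column, background=(0.25, 0.25, 0.25, 0.25)):
    """Information content (bits) of one PWM column relative to a background.

    column: four non-negative numbers (counts or frequencies) for A, C, G, T.
    """
    total = sum(column)
    if total <= 0:
        raise ValueError("column must contain at least one positive entry")
    bits = 0.0
    for value, bg in zip(column, background):
        p = value / total
        if p > 0:  # a zero-frequency base contributes nothing
            bits += p * math.log2(p / bg)
    return bits

def mean_column_information(pwm):
    """Average information content per column of a PWM (list of columns)."""
    return sum(column_information(col) for col in pwm) / len(pwm)

# Hypothetical toy PWM with three columns (order A, C, G, T).
toy_pwm = [
    [1.0, 0.0, 0.0, 0.0],      # fully specific: 2 bits
    [0.5, 0.5, 0.0, 0.0],      # two equally likely bases: 1 bit
    [0.25, 0.25, 0.25, 0.25],  # no preference: 0 bits
]
mean_ic = mean_column_information(toy_pwm)
print(mean_ic)
```

A column ranges from 0 bits (no preference) to 2 bits (one base only), so a motif that is estimated as more specific than the true binding preference shows up as a higher mean value.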
