Brief Funct Genomics. 2017 May 1;16(3):171-180.
doi: 10.1093/bfgp/elw030.

Modeling protein-DNA binding via high-throughput in vitro technologies

Yaron Orenstein et al. Brief Funct Genomics.

Abstract

Protein-DNA binding plays a central role in gene regulation and thereby in all processes in the living cell. Novel experimental and computational approaches facilitate better understanding of protein-DNA binding preferences via high-throughput measurement of protein binding to a large number of DNA sequences and inference of binding models from these measurements. Here we review the state of the art in measuring protein-DNA binding in vitro, emphasizing the advantages and limitations of different technologies. In addition, we describe models for representing protein-DNA binding preferences and key computational approaches to learn those models from high-throughput data. Using large experimental data sets, we test the performance of different models based on different measurement techniques. We conclude with pertinent open problems.

Keywords: high-throughput SELEX; motif finding; protein-binding microarrays; protein–DNA binding.

Figures

Figure 1.
High-throughput in vitro technologies for measuring protein–DNA binding. (A) Overview of universal protein-binding microarrays (PBMs). These arrays are designed to include all possible DNA 10-mers in 36-bp-long probe sequences. Protein binding intensity is measured using a fluorescent antibody. The experimental output is a list of >41,000 probe sequences and the binding intensity of the protein to each. [Adapted by permission from Macmillan Publishers Ltd: Nature Biotechnology [16], copyright (2006).] (B) Overview of HT-SELEX. Each experiment starts from a random set of fixed-length oligonucleotides (lengths vary from 10 to 40 bp). The protein binds its binding sites (BSs) in the pool, and the bound oligonucleotides are extracted and amplified. Some of these oligonucleotides are sequenced, and the rest are used in the next iteration of the process. The experimental output is several sequence files, one per iteration. [Adapted by permission from CSHL Press: Genome Research [18], copyright (2010).]
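
The probe count quoted in panel A can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes every 10-mer window of a 36 bp probe is usable, and it ignores strand symmetry as well as any control or replicate probes, so it gives a lower bound rather than the actual array design. The function names are hypothetical.

```python
def kmer_windows(probe_len: int, k: int) -> int:
    """Number of overlapping k-mer windows in one probe of the given length."""
    return probe_len - k + 1


def min_probes_to_cover_all_kmers(probe_len: int, k: int) -> int:
    """Lower bound on the probes needed if every window held a distinct k-mer."""
    total_kmers = 4 ** k                      # four nucleotides per position
    per_probe = kmer_windows(probe_len, k)
    return -(-total_kmers // per_probe)       # ceiling division


if __name__ == "__main__":
    print(4 ** 10)                                  # all possible DNA 10-mers
    print(kmer_windows(36, 10))                     # 10-mer windows per 36 bp probe
    print(min_probes_to_cover_all_kmers(36, 10))    # lower bound on probe count
```

Under these assumptions the bound is a little under 39,000 probes, which is consistent with the >41,000 probes mentioned in the caption.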
Figure 2.
Models for protein–DNA binding preferences. (A) Position weight matrix (PWM). The matrix represents scores for each nucleotide position (eight in the example). The PWM logo plots the nucleotides by their weights and position entropy. The PWM and logo are of protein ATF1 (downloaded from CIS-BP [14], motif id M0295_1.02, generated by PWM-Align-Z). (B) K-mer model. Each k-mer is assigned a binding score. The example shows the top 16 8-mers and their E-scores for protein ATF1 (downloaded from CIS-BP, M0295_1.02). Note that multiple 8-mers correspond to different windows with respect to the ATF1 ‘core’ motif (TGACGT). (C) Performance of PWM and k-mer models in in vitro prediction. For each paired PBM experiment (two experiments performed with the same transcription factor (TF) using two arrays, each with a different probe design), a model was trained on data of one array and tested on the other. The AUC score reflects the accuracy in ranking positive probes at the top (see [37] for definitions). Two hundred fourteen experiments (covering 98 proteins) were included in the comparison. (D) Performance of PWM and k-mer models in in vivo prediction. For each ChIP-seq experiment, a model was trained on PBM data of the same TF and tested in ranking the top 500 peaks higher than a set of 500 control sequences from nearby genomic regions (see [37] for details). One hundred thirty-seven ChIP-seq experiments, covering 20 TFs, were used in the comparison. In C and D, gray lines stand for ±1 standard deviation (std) of AUC difference. P-values were calculated using Wilcoxon rank-sum paired test.
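
To make the two model types in panels A and B concrete, here is a minimal sketch of how a sequence might be scored under each. All numbers are made up for illustration and are not the ATF1 values from CIS-BP; the 4-position matrix, the 4-mer table, the default score of 0.0 for unlisted k-mers, and the choice to take the best-scoring window are assumptions, since the caption does not specify a scoring rule.

```python
from typing import Dict, List

# Hypothetical toy PWM (NOT the ATF1 matrix): one score per nucleotide per position.
TOY_PWM: List[Dict[str, float]] = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
]

# Hypothetical toy k-mer table (NOT real E-scores): one score per listed k-mer.
TOY_KMER_SCORES: Dict[str, float] = {"TGAC": 0.49, "GACG": 0.45}


def pwm_window_score(pwm: List[Dict[str, float]], window: str) -> float:
    """Sum the per-position scores of one window as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def pwm_best_score(pwm: List[Dict[str, float]], seq: str) -> float:
    """Score a sequence by its best-scoring window (an assumed convention)."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the PWM")
    return max(
        pwm_window_score(pwm, seq[i:i + width])
        for i in range(len(seq) - width + 1)
    )


def kmer_best_score(scores: Dict[str, float], seq: str, k: int,
                    default: float = 0.0) -> float:
    """Score a sequence by its best-scoring k-mer; unlisted k-mers get `default`."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    return max(scores.get(seq[i:i + k], default) for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    probe = "AATGACGT"
    print(pwm_best_score(TOY_PWM, probe))
    print(kmer_best_score(TOY_KMER_SCORES, probe, k=4))
```

The PWM sums independent per-position scores, whereas the k-mer table assigns each word its own score, which is why several overlapping 8-mers around the same core motif can all rank highly in panel B.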
Figure 3.
In vitro models predict in vivo binding. (A) Performance of PBM-derived models in predicting in vivo and in vitro binding. Boxplots of AUC for the models inferred by different methods show that in vivo binding prediction using in vitro models is less accurate than in vitro binding prediction by the same models (P-value < 10^-7 for each of the four methods, Wilcoxon rank-sum unpaired test). AUC values for in vitro were calculated for 355 paired PBM experiments (results taken from [37]), and for in vivo for 20 TFs tested by ChIP-seq experiments (downloaded from ENCODE). (B) Comparison of PBM- and HT-SELEX-derived models in predicting in vivo binding. Left: HT-SELEX models are more accurate on some proteins in predicting in vivo binding. In computing the significance, for each protein, the values for different experiments of that protein were averaged to avoid dependencies. (Without such collapsing of the data, the HT-SELEX advantage is significant, giving P = 0.006.) Right: When only the eight most informative positions are used for modeling each TF, models are less accurate than full models. One hundred sixty-seven ENCODE ChIP-seq experiments covering 28 different TFs were used to gauge the accuracy of in vitro binding models. In vitro models were downloaded from CIS-BP. (C) Logos of some PBM- and HT-SELEX-derived models. RXRA (for which PBM is more accurate in in vivo prediction), MAFK (for which HT-SELEX is more accurate) and ELK4 (where the methods perform similarly).
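
The AUC values reported in these captions measure how well a model ranks positive sequences above negative ones. The sketch below uses the standard pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as half); the exact positive and negative sets used in the paper are defined in [37] and are not reproduced here, so the scores in the example are hypothetical.

```python
from typing import Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Quadratic in the number of sequences, which is
    fine for sets of a few hundred items each.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical model scores for bound (positive) and control (negative) sequences.
    bound = [0.9, 0.3]
    control = [0.5, 0.1]
    print(auc(bound, control))
```

An AUC of 1.0 means every positive outranks every negative, and 0.5 corresponds to a random ranking.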
