PLoS Comput Biol. 2010 Sep 9;6(9):e1000916.
doi: 10.1371/journal.pcbi.1000916.

High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions

Phaedra Agius et al. PLoS Comput Biol. 2010.

Abstract

Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments. Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.
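
To make the idea of a k-mer based string kernel concrete, the sketch below computes a plain spectrum kernel between two probe sequences. This is a simplified stand-in and not the di-mismatch kernel described in the abstract: it counts exact k-mer matches only. The function names, the default k = 3, and the choice to count k-mers on both strands are illustrative assumptions.

```python
from collections import Counter

# Translation table for complementing a DNA strand (assumes A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count k-mers on both strands of a double-stranded probe."""
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            counts[strand[i:i + k]] += 1
    return counts


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the two k-mer count vectors (exact matches only)."""
    a = kmer_counts(seq_a, k)
    b = kmer_counts(seq_b, k)
    return sum(n * b[kmer] for kmer, n in a.items())


def normalized_kernel(seq_a, seq_b, k=3):
    """Spectrum kernel scaled so that a sequence has similarity 1 with itself."""
    denom = (spectrum_kernel(seq_a, seq_a, k) * spectrum_kernel(seq_b, seq_b, k)) ** 0.5
    return spectrum_kernel(seq_a, seq_b, k) / denom if denom else 0.0
```

A kernel matrix built from such pairwise similarities is the kind of input a kernel-based regression model consumes; a mismatch-tolerant kernel would replace the exact-match inner product with one that also credits near-identical k-mers.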

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Supervised learning of TF sequence specificities from protein binding microarrays.
In our approach, we directly learn the mapping from double-stranded DNA probe sequence to intensity in the PBM TF binding experiment by using support vector regression (SVR) together with novel k-mer based string kernels. Probe sequences containing high affinity binding sites have high intensity in the PBM binding experiment; such probes are shown bound by the fluorescently tagged TF (left) and are indicated by green points in the SVR training (right). The SVR predicts probe intensity from probe sequence composition. The trained SVRs can be used to scan intergenic regions to predict in vivo TF occupancy.
Figure 2. SVR models improve over E-scores and PSSMs for in vitro binding prediction.
(a) The scatter plot shows the detection of the top 100 probes using maximum E-scores (x-axis) and the SVR model (y-axis) in the prediction of in vitro TF binding preferences. Each point corresponds to one TF. The figure contains two sets of yeast TFs, one of 37 TFs and one of 33 TFs (blue), and a set of 114 mouse TFs (red). (b) This panel is similar to panel (a), but compares the SVR versus PBM-derived PSSMs for the 114 mouse TFs.
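
One plausible reading of "detection of the top 100 probes" is the fraction of the 100 highest-intensity probes that also appear among the 100 highest-scoring predictions. The sketch below computes that overlap under this assumption; the function name and argument layout are hypothetical and are not taken from the paper.

```python
def top_n_detection(measured, predicted, n=100):
    """Fraction of the n highest measured items that also rank in the top n by prediction.

    `measured` and `predicted` are parallel lists: entry i of each refers to
    the same probe. Ties are broken by original order (Python's stable sort).
    """
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    n = min(n, len(measured))
    if n == 0:
        return 0.0
    indices = range(len(measured))
    top_measured = sorted(indices, key=lambda i: measured[i], reverse=True)[:n]
    top_predicted = sorted(indices, key=lambda i: predicted[i], reverse=True)[:n]
    return len(set(top_measured) & set(top_predicted)) / n
```

Under this reading, each point in the scatter plot would pair the value obtained with one scoring method against the value obtained with the other for a single TF.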
Figure 3. SVRs improve in vivo occupancy prediction in yeast.
Predicted binding profiles for (a) yeast TF Ume6 along IGR iYFL022C and (b) yeast TF Gal4 along IGR iYFR026C using log-odds ratios for the PBM-derived PSSM motif (gold); max E-score, considering only 8-mer patterns satisfying a minimal E-score threshold of 0.35 (blue); E-score based occupancy, plotting median probe intensity for 8-mer patterns with maximal E-score (black); and SVR prediction scores (green). (c) Scatter plots showing occupancy score predictions (x-axis) versus SVR (y-axis) for yeast in vivo binding preferences as measured by detection of the top 200 IGRs by the top 200 predictions.
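
A binding profile of the kind described in this caption can be sketched as a sliding-window scan: each window of the region is scored, and the scores are plotted against position. The example below is a minimal illustration assuming a hypothetical lookup table of E-scores keyed by contiguous sequence patterns; the helper names are invented for illustration, and only the 0.35 threshold comes from the caption.

```python
def scan_profile(sequence, score_window, width):
    """Score every window of the given width along a sequence, left to right."""
    return [
        score_window(sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]


def thresholded_escore(escores, threshold=0.35):
    """Build a window scorer that keeps an E-score only if it meets the threshold.

    `escores` is a hypothetical dict mapping a sequence pattern to its E-score;
    patterns missing from the table, or below the threshold, score 0.0.
    """
    def score(pattern):
        e = escores.get(pattern, 0.0)
        return e if e >= threshold else 0.0
    return score


# Toy example with 3-mers standing in for the 8-mer patterns in the caption.
toy_escores = {"ACG": 0.45, "CGT": 0.20}
profile = scan_profile("AACGT", thresholded_escore(toy_escores), width=3)
region_score = max(profile)  # one way to summarize a whole region
```

Any other window scorer, such as a PSSM log-odds function or a trained regression model, could be passed as `score_window` to produce a comparable profile.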
Figure 4. Predicting TF occupancy in mouse and human genomes as evaluated on ChIP-seq data.
(a) SVRs trained on PBM arrays are able to capture ChIP-seq peaks better than PSSMs or the occupancy score. (b) SVMs trained on ChIP-seq data capture sequence information from the genomic context of ChIP-seq peaks and improve in vivo prediction performance.
Figure 5. Sequence feature analysis of in vitro and in vivo models.
We plot k-mers contributing to the (a) Oct4 PBM model and (b) Sox2 ChIP model, where each point represents a 13-mer and is colored according to its model weight (red for high weights, blue for low weights). Star and circle point styles indicate different clusters. For the PBM-derived model, the clusters appear to represent primary and secondary binding motifs, with the more degenerate motif perhaps arising as an artifact of the PBM experiment. For the ChIP-derived model, the clusters correspond to the motifs for Sox2 and its cofactor Oct4. (c) PBM-derived PSSMs for Sox12 and Pou2f3, downloaded from UniPROBE, and ChIP-derived PSSM for Sox2, computed using MDscan on the Sox2 ChIP-peak sequences (60 bp long).
