BMC Genomics. 2010 Feb 10;11 Suppl 1(Suppl 1):S12.
doi: 10.1186/1471-2164-11-S1-S12.

Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome

Vladimir A Kuznetsov et al. BMC Genomics. 2010.

Abstract

Background: Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated with chromatin immunoprecipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from biased information about the availability of binding loci, sample incompleteness, and diverse sources of technical and biological noise. Adequate mathematical models of ChIP-based high-throughput assays, together with statistical tools, are therefore required for robust identification of specific and reliable TF binding sites (TFBSs), precise characterization of the TFBS avidity distribution, and plausible estimation of the total number of specific TFBSs for a given TF in the genome of a given cell type.

Results: We developed an exploratory probabilistic mixture model for specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChIP-seq data sets, the statistics of specific and non-specific DNA-protein binding are described by a mixture of sample size-dependent skewed functions: the Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) for specific binding and an exponential function for non-specific binding. Using available ChIP-seq data for eleven TFs essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, Sox2, Klf4, STAT3, E2F1, Tcfcp2l1, ZFX, n-Myc, c-Myc and Esrrb), reported in Chen et al (2008), we estimated (i) the specificity and sensitivity of the ChIP-seq binding assays and (ii) the number of specific binding sites (BSs) in the genome of mouse embryonic stem cells that were not identified in the current experiments. Motif-finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes.

Conclusion: We provide a novel methodology for estimating the specificity and sensitivity of TF-DNA binding in massively parallel ChIP sequencing (ChIP-seq) binding assays. Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by ChIP-based methods. Thus, the task of identifying the binding sensitivity of a TF cannot yet be technically resolved by current ChIP-seq, compared to former experimental techniques. Given our improved measurement of the sensitivity and specificity of TF binding obtained from the ChIP-seq data, models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIP-seq data should be carefully revised.
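The mixture idea in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the K-W function is not specified in this record, so the specific component is stood in for by a generalized discrete Pareto (GDP)-like form proportional to (m + b)^-(k + 1), the noise component is taken as proportional to exp(-s m), and the mixing weight `weight_specific` is a purely hypothetical value. The parameters s, k and b are borrowed from the Figure 2 caption for Nanog only to make the numbers plausible.

```python
import numpy as np

def exponential_pmf(m, s):
    """Assumed noise component: weights proportional to exp(-s*m), normalised over m."""
    w = np.exp(-s * m)
    return w / w.sum()

def gdp_pmf(m, k, b):
    """Assumed specific component: weights proportional to (m+b)^-(k+1), normalised over m."""
    w = (m + b) ** -(k + 1.0)
    return w / w.sum()

def specific_posterior(m, weight_specific, s, k, b):
    """Probability that a locus with m binding events belongs to the specific component."""
    noise = (1.0 - weight_specific) * exponential_pmf(m, s)
    spec = weight_specific * gdp_pmf(m, k, b)
    return spec / (spec + noise)

# Peak heights 1..73, the range quoted for Nanog in the Figure 2 caption.
heights = np.arange(1, 74)
# weight_specific=0.3 is a hypothetical mixing weight, not a value from the paper.
post = specific_posterior(heights, weight_specific=0.3, s=1.05, k=1.81, b=8.0)
```

Under these assumptions, loci with few binding events are dominated by the exponential noise term, while loci with many binding events are almost certainly specific. This is the qualitative behaviour the abstract attributes to the mixture.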

Figures

Figure 1
A random sampling model for determining TF binding avidity potential on the genome, as defined in a ChIP-seq experiment. Sequencing of TFBS-enriched DNA fragments can be used to determine the specific clusters of DNA sequences bound by the TF protein. Results depend strongly on the number of reads (sample size). A: Small sample size. B: Large sample size. Blue horizontal stake: specific DNA fragment; white horizontal stake: non-specific DNA fragment forming non-specific (false-positive) clusters. C: BS1-BS6 are binding loci present in the given cells; blue vertical stakes show relative binding avidity at the loci. BS6 might be an (epigenetically) modified and suppressed BS (a stake with a triangular base) and therefore might not be detected in the ChIP-seq assay. BS1 and BS4 might not be detected in the assay because of the sample size limit. D: A scheme of the random Markov process of TF-DNA binding and dissociation realized on the genome scale. The graph illustrates the concept of the birth-death random process model used in this work (see Methods).
Figure 2
Observed and predicted statistics of TF-DNA BEs. A: Fitting and back-extrapolation analysis for the complete dataset. Decomposition of mixture model (1) for Nanog TF-DNA BEs, based on curve-fitting analysis of the model. Closed circles: number of loci with ChIP-seq extended DNA cluster overlaps of 1 to 8 BEs. Open circles: number of loci with ChIP-seq extended DNA cluster overlaps of 9 to 73 (inclusive) BEs. The noise-like data (closed circles) are fitted well by an exponential function with exponent parameter s = 1.05 ± 0.055 (p < 0.0001, t-test). The reliable set of TF BSs (at >8 BEs) is fitted equally well by the left-side truncated GDP function (k = 1.81 ± 0.15 (p < 0.001, t-test) and b = 8.00 ± 1.335 (p < 0.001, t-test)) and by the K-W function (θ = 0.999, a = 6.618, b = 8.29; Table 3). The extrapolation curve predicts the number of Nanog TFBSs in the noise-enriched binding site fraction of the empirical distribution. B: Nanog TF-DNA BEs. C: Esrrb TF-DNA BEs. D: c-Myc TF-DNA BEs. B, C and D: K-W model fitted to the observed data and to the data extrapolated from the double-truncated GDP function, to calculate p0. Vertical dotted lines represent the qPCR-defined threshold and the threshold defined from the best-fit double-truncated GDP function. Triangle symbols show the observed over-representation of TFBSs compared with the best-fit GDP function. N0, N1 and N2 are the numbers of non-detected, potentially detected and highly specific (reliable) TFBSs, respectively. More detailed information about the parameter values of the GDP and K-W models is presented in Additional Files 3, 4 and 5.
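The back-extrapolation step described in panel A can be sketched as simple arithmetic. This is an illustration under stated assumptions, not the authors' fitting procedure: the GDP function is assumed to have the form N(m) proportional to (m + b)^-(k + 1), the function names are hypothetical, and the count of reliable loci passed in is a made-up input. Only k = 1.81, b = 8.00, the cutoff of 8 BEs and the maximum of 73 BEs come from the caption.

```python
import numpy as np

def gdp_weights(m, k, b):
    """Unnormalised GDP-like weights (assumed form): (m + b)^-(k + 1)."""
    return (np.asarray(m, dtype=float) + b) ** -(k + 1.0)

def extrapolate_hidden(n_reliable, k, b, cutoff=8, m_max=73):
    """Expected number of specific loci at 1..cutoff BEs, given n_reliable
    loci observed at cutoff+1..m_max BEs and a GDP curve fitted above the cutoff."""
    low = gdp_weights(np.arange(1, cutoff + 1), k, b).sum()
    high = gdp_weights(np.arange(cutoff + 1, m_max + 1), k, b).sum()
    # Scale the fitted curve so that it reproduces the reliable count,
    # then read off the mass it assigns to the noise-enriched range.
    return n_reliable * low / high

# n_reliable=1000 is a hypothetical count, chosen only to show the scaling.
hidden = extrapolate_hidden(1000, k=1.81, b=8.0)
```

The estimate scales linearly with the number of reliable loci. Under these assumed parameters, the curve places more specific sites below the cutoff than above it, in line with the caption's point that the noise-enriched fraction still contains true binding sites.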
Figure 3
Suboptimal design of the ChIP-qPCR experiment. The statistics of BEs in the Esrrb TF ChIP-seq data follow a skewed distribution. Difference between the frequency distributions of BEs for peaks used in qPCR and for random samples chosen from the Esrrb TF ChIP-seq library at peak values >11 available for this dataset. In Chen et al [15], to determine the specificity, the critical peak height threshold was determined by a 3-fold enriched qPCR signal/noise criterion.
Figure 4
Three segments in the range of TF-DNA BE counts for 11 TFs in mouse E14 embryonic cells.
Figure 5
Validation of ChIP-seq-defined c-Myc binding loci based on motif-finding analysis. A: PWM of c-Myc TFBSs defined with the NestedMICA program, trained on loci with a ChIP-seq peak height of 12 or higher. B: Distribution of E-box sequences within ± 1 kb of the centre of ChIP-seq-defined binding loci. C: Frequency distribution of the number of overlapping ChIP-seq DNA fragments (peak height). ◊: All ChIP-seq c-Myc-bound loci for observed peak heights. o: E-box-positive loci, with E-boxes found within ± 250 bp. ∇: E-box-positive loci, with E-boxes found within ± 150 bp. D: Venn diagram of co-occurrence of E-boxes within ± 150 bp of c-Myc binding loci (left side). Pairwise Kappa correlation coefficients of co-occurrence of E-boxes in c-Myc binding loci (right side).
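The E-box check in panels B to D amounts to scanning a window around each peak centre for short motifs. The sketch below shows one way this could be done; it is not the NestedMICA workflow used in the paper. The function names are hypothetical, the sequence is assumed to be a plain string of A, C, G and T, and a hit is counted only when the whole motif lies inside the window.

```python
import re

CANONICAL_EBOX = "CACGTG"  # canonical E-box named in the Figure 6 caption
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif):
    """Reverse complement of a DNA motif written with A, C, G, T."""
    return motif.translate(_COMPLEMENT)[::-1]

def eboxes_near_centre(sequence, centre, half_window=150, motifs=(CANONICAL_EBOX,)):
    """Return sorted (position, motif) hits within +/- half_window of centre.

    Positions are 0-based offsets into `sequence`. Both strands are checked
    by also searching for each motif's reverse complement.
    """
    start = max(0, centre - half_window)
    end = min(len(sequence), centre + half_window + 1)
    window = sequence[start:end].upper()
    hits = set()
    for motif in motifs:
        for pattern in {motif, reverse_complement(motif)}:
            # A lookahead keeps overlapping occurrences from being skipped.
            for match in re.finditer(f"(?={re.escape(pattern)})", window):
                hits.add((start + match.start(), motif))
    return sorted(hits)

# Toy sequence with one canonical E-box; real input would be genomic sequence.
toy = "A" * 100 + "CACGTG" + "A" * 100
hits_150 = eboxes_near_centre(toy, centre=103)
hits_2 = eboxes_near_centre(toy, centre=103, half_window=2)
```

Because CACGTG is its own reverse complement, strand handling matters only for non-canonical motifs such as CACGCG. Changing `half_window` from 150 to 250 reproduces the two window sizes compared in panel C.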
Figure 6
Multiple occurrences of c-Myc E-boxes in the promoter region around the transcription start site (TSS) of c-Myc target genes identified in the ChIP-seq experiment. A: WEE 1 homolog 1 (Wee1). The high-avidity (peak height = 76) c-Myc binding site (indicated by the blue arrow) in the strong promoter region of the Wee1 gene is supported by two canonical E-boxes (CACGTG). The binding locus and E-boxes overlap a CpG island, which might be bound by c-Myc. B: Nucleoplasmin 3 (Npm3). As in the previous case, the moderate-avidity (peak height = 10) ChIP-seq c-Myc binding locus is located in the first-intron promoter region. The locus is supported by five E-boxes: one canonical and three non-canonical E-boxes in the first intron and another canonical E-box in the second exon. In addition, this binding region and the E-boxes are located in a CpG island. C: FK506 binding protein 5 (Fkbp5). Two relatively low-avidity c-Myc binding sites identified in the ChIP-seq experiment are confirmed by E-boxes. The first ChIP-seq locus, in the upstream gene region, is a relatively low-avidity binding site (peak height = 7) supported by a canonical E-box CACGTG. The second ChIP-seq locus is located in the first intron; it is also a relatively low-avidity peak (peak height = 7) and is supported by two non-canonical E-boxes CACGCG and two non-canonical E-boxes CGCGAG. The latter locus overlaps a CpG island, which suggests that it might be functional.
