BMC Genomics. 2010 Feb 10;11 Suppl 1(Suppl 1):S12.

doi: 10.1186/1471-2164-11-S1-S12.

Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome

Vladimir A Kuznetsov¹, Onkar Singh, Piroon Jenjaroenpun

PMID: 20158869
PMCID: PMC2822526
DOI: 10.1186/1471-2164-11-S1-S12

Vladimir A Kuznetsov et al. BMC Genomics. 2010.

Affiliation

¹ Department of Genome and Gene Expression Data Analysis, Bioinformatics Institute, 30 Biopolis str, Singapore. vladimirk@bii.a-star.edu.sg


Abstract

Background: Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated with chromatin immunoprecipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from biased information about the availability of binding loci, from sample incompleteness, and from diverse sources of technical and biological noise. Adequate mathematical models of ChIP-based high-throughput assays and statistical tools are therefore required for robust identification of specific and reliable TF binding sites (TFBS), precise characterization of the TFBS avidity distribution, and plausible estimation of the total number of specific TFBS for a given TF in the genome of a given cell type.

Results: We developed an exploratory probabilistic mixture model of specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChIP-seq datasets, the statistics of specific and non-specific DNA-protein binding are described by a mixture of sample size-dependent skewed functions: the Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) for specific binding and an exponential function for non-specific binding. Using available ChIP-seq data for eleven TFs essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, Sox2, Klf4, STAT3, E2F1, Tcfcp2l1, Zfx, n-Myc, c-Myc and Esrrb), reported in Chen et al (2008), we estimated (i) the specificity and sensitivity of the ChIP-seq binding assays and (ii) the number of specific binding sites (BSs) in the genome of mouse embryonic stem cells that were not identified in the current experiments. Motif-finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes.

Conclusion: We provide a novel methodology for estimating the specificity and sensitivity of TF-DNA binding in massively parallel ChIP sequencing (ChIP-seq) binding assays. Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by ChIP-based methods. Thus, compared with earlier experimental techniques, current ChIP-seq cannot yet technically resolve the task of identifying the binding sensitivity of a TF. Given our improved estimates of the sensitivity and specificity of TF binding obtained from the ChIP-seq data, models of transcriptional regulatory networks in embryonic cells and other cell types derived from these ChIP-seq data should be carefully revised.
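The mixture idea in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the K-W function is not specified in this abstract, so a shifted power law `(m + b) ** -(k + 1)` is assumed here as a stand-in for the skewed specific-binding component. The function names, the mixing weight `noise_weight` and the parameter values are hypothetical; `s`, `k` and `b` loosely echo the Nanog fit quoted in the Figure 2 legend.

```python
import math

def exponential_pmf(m_values, s):
    """Normalised exponential decay exp(-s * m) over the given overlap counts."""
    weights = [math.exp(-s * m) for m in m_values]
    total = sum(weights)
    return [w / total for w in weights]

def shifted_power_law_pmf(m_values, k, b):
    """Normalised (m + b)^-(k + 1); an ASSUMED stand-in for the skewed component."""
    weights = [(m + b) ** -(k + 1) for m in m_values]
    total = sum(weights)
    return [w / total for w in weights]

def specific_fraction(m_values, noise_weight, s, k, b):
    """For each overlap count m, the share of loci attributed to specific binding."""
    noise = exponential_pmf(m_values, s)
    specific = shifted_power_law_pmf(m_values, k, b)
    fractions = []
    for p_noise, p_spec in zip(noise, specific):
        mix_noise = noise_weight * p_noise
        mix_spec = (1.0 - noise_weight) * p_spec
        fractions.append(mix_spec / (mix_noise + mix_spec))
    return fractions

# Hypothetical setting: 90% of loci are noise, overlap counts range from 1 to 73.
overlaps = list(range(1, 74))
fractions = specific_fraction(overlaps, noise_weight=0.9, s=1.05, k=1.8, b=8.0)
```

Under these assumed parameters the specific share is small at low overlap counts and approaches one at high counts, which is the qualitative behaviour a peak-height threshold relies on.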

Figures

**Figure 1**
**A random sampling model for determining TF binding avidity potential on the genome as defined in a ChIP-seq experiment**. Sequencing of TFBS-enriched DNA fragments can be used to determine the specific clusters of DNA sequences bound by the TF protein. The results depend strongly on the number of reads (sample size). A: Small sample size. B: Large sample size. Blue horizontal bar: specific DNA fragment; white horizontal bar: non-specific DNA fragment forming non-specific (false-positive) clusters. C: BS1-BS6 are binding loci present in the given cells; blue vertical bars show the relative binding avidity at each locus. BS6 might be an epigenetically modified and suppressed BS (a bar with a triangular base) and therefore might not be detected in the ChIP-seq assay. BS1 and BS4 might not be detected in the assay because of the sample size limit. D: A scheme of the random Markov process of TF-DNA binding and dissociation realized on the genome scale. The graph illustrates the concept of the birth-death random process model used in this work (see Methods).

**Figure 2**
**Observed and predicted statistics of TF-DNA BEs**. A: Fitting and back-extrapolation analysis for the complete dataset. The decomposition of mixture model (1) for Nanog TF-DNA BEs is based on curve-fitting analysis of the model. Closed circles: number of loci with ChIP-seq extended DNA cluster overlaps of 1 to 8 BEs. Open circles: number of loci with ChIP-seq extended DNA cluster overlaps of 9 to 73 (inclusive) BEs. The noise-like data (closed circles) are fitted well by an exponential function with exponent parameter s = 1.05 ± 0.055 (p < 0.0001, t-test). The reliable set of TF BSs (at >8 BEs) is fitted equally well by the left-truncated GDP function (k = 1.81 ± 0.15 (p < 0.001, t-test) and b = 8.00 ± 1.335 (p < 0.001, t-test)) and by the K-W function (θ = 0.999, a = 6.618, b = 8.29; Table 3). The extrapolation curve predicts the number of Nanog TFBSs in the noise-enriched binding-site fraction of the empirical distribution. B: Nanog TF-DNA BEs. C: Esrrb TF-DNA BEs. D: c-Myc TF-DNA BEs. B, C and D: K-W model fitting to the observed data and to the data extrapolated from the double-truncated GDP function, used to calculate p₀. Vertical dotted lines represent the qPCR-defined threshold and the threshold defined from the best-fit double-truncated GDP function. Triangles show the observed over-representation of TFBSs compared with the best-fit GDP function. N₀, N₁ and N₂ are the numbers of non-detected, potentially detected and highly specific (reliable) TFBSs, respectively. More detailed information about the parameter values of the GDP and K-W models is presented in Additional Files 3, 4 and 5.
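The fitting and back-extrapolation described in this legend can be sketched on synthetic data. This is one plausible ordering of the steps under stated assumptions, not the authors' procedure: the tail is assumed to follow `(m + b) ** -(k + 1)` with the shift `b` held fixed, the counts are invented for illustration, and `fit_line`, `B_SHIFT` and the scale constants are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic counts of loci per overlap count m (hypothetical, illustration only).
B_SHIFT = 8.0  # ASSUMED fixed shift b of the tail function
tail = {m: 5.0e6 * (m + B_SHIFT) ** -2.8 for m in range(1, 74)}
noise = {m: 2.0e5 * math.exp(-1.05 * m) for m in range(1, 9)}
observed = {m: tail[m] + noise.get(m, 0.0) for m in tail}

# Step 1: fit the tail on the reliable range (m > 8) in log-log coordinates.
tail_ms = [m for m in observed if m > 8]
c_tail, slope_tail = fit_line(
    [math.log(m + B_SHIFT) for m in tail_ms],
    [math.log(observed[m]) for m in tail_ms],
)

# Step 2: back-extrapolate the fitted tail into the noise-enriched range (m <= 8).
extrapolated = {
    m: math.exp(c_tail + slope_tail * math.log(m + B_SHIFT)) for m in range(1, 9)
}

# Step 3: fit an exponential to what remains after subtracting the extrapolated tail.
residual = {m: observed[m] - extrapolated[m] for m in extrapolated}
c_noise, slope_noise = fit_line(
    list(residual), [math.log(residual[m]) for m in residual]
)

# Specific loci predicted to sit below the threshold, hidden among the noise.
hidden_specific = sum(extrapolated.values())
```

Because the synthetic data are noise-free, the fits recover the generating slopes almost exactly. Real counts would call for weighted or maximum-likelihood fitting and a check that the residuals stay positive before taking logarithms.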

**Figure 3**
**Suboptimal design of the ChIP-qPCR experiment**. The statistics of BEs in the Esrrb TF ChIP-seq data follow a skewed distribution. The frequency distributions of BEs differ between the peaks used in qPCR and random samples drawn from the Esrrb TF ChIP-seq library at peak values >11 available for this dataset. In Chen et al [15], to determine the specificity, the critical peak-height threshold was determined by a 3-fold enrichment qPCR signal/noise criterion.

**Figure 4**
**Three segments in the range of TF-DNA BEs count for 11 TFs of mouse E14 embryonic cells**.

**Figure 5**
**Validation of ChIP-seq defined c-Myc binding loci based on motif-finding analysis**. A: PWM of c-Myc TFBSs defined with the NestedMICA program, trained on loci with a peak height of 12 or higher in the ChIP-seq experiment. B: Distribution of E-box sequences within ± 1 kb of the centre of ChIP-seq defined binding loci. C: Frequency distribution of the number of overlapping ChIP-seq DNA fragments (peak height). ◊: All ChIP-seq c-Myc bound loci for the observed peak heights. o: E-box-positive loci found within ± 250 bp. ∇: E-box-positive loci found within ± 150 bp. D: Venn diagram of the co-occurrence of E-boxes within ± 150 bp of c-Myc binding loci (left side). Pairwise kappa correlation coefficients of the co-occurrence of E-boxes in c-Myc binding loci (right side).

**Figure 6**
**Multiple occurrences of c-Myc E-boxes in the promoter region around the transcription start site (TSS) of c-Myc target genes identified in the ChIP-seq experiment**. A: WEE 1 homolog 1 (Wee1). The high-avidity (peak height = 76) c-Myc binding site (indicated by the blue arrow) in the strong promoter region of the Wee1 gene is supported by two canonical E-boxes (CACGTG). The binding locus and the E-boxes overlap with a CpG island, which might be bound by c-Myc. B: Nucleoplasmin 3 (Npm3). As in the previous case, the moderate-avidity (peak height = 10) ChIP-seq c-Myc binding locus is located in the first-intron promoter region. The locus is supported by five E-boxes: one canonical and three non-canonical E-boxes in the first intron and another canonical E-box in the second exon. In addition, this binding region and the E-boxes are located in a CpG island. C: FK506 binding protein 5 (Fkbp5). Two relatively low-avidity c-Myc binding sites identified in the ChIP-seq experiment are confirmed by E-boxes. The first ChIP-seq locus, in the upstream gene region, is a relatively low-avidity binding site (peak height = 7) supported by a canonical E-box (CACGTG). The second ChIP-seq locus is located in the first intron and is also a relatively low-avidity peak (peak height = 7), supported by two non-canonical E-boxes CACGCG and two non-canonical E-boxes CGCGAG. The latter locus overlaps with a CpG island, which suggests that it might be functional.
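The E-box support described in the Figure 5 and Figure 6 legends amounts to scanning a window around each peak centre for short motifs. A minimal sketch follows, assuming plain string matching on both strands. Whether the original analysis counted reverse-complement matches is an assumption, and `find_eboxes`, the example sequence and the coordinates are hypothetical.

```python
import re

CANONICAL_EBOX = "CACGTG"
NONCANONICAL_EBOXES = ("CACGCG", "CGCGAG")  # variants named in the Figure 6 legend

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_eboxes(sequence, centre, flank, motifs):
    """Return (motif, start) hits lying fully within centre +/- flank.

    Both strands are searched (an assumption); a hit on the reverse strand
    is reported under the forward-strand motif name.
    """
    lo = max(0, centre - flank)
    hi = min(len(sequence), centre + flank)
    window = sequence[lo:hi].upper()
    hits = []
    for motif in motifs:
        # A set removes the duplicate when the motif is its own reverse complement.
        for pattern in {motif, reverse_complement(motif)}:
            # Lookahead so that overlapping occurrences are all reported.
            for match in re.finditer("(?=" + pattern + ")", window):
                hits.append((motif, lo + match.start()))
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical locus: one canonical E-box and one reverse-strand CACGCG.
locus = "T" * 100 + "CACGTG" + "A" * 50 + "CGCGTG" + "T" * 100
all_motifs = (CANONICAL_EBOX,) + NONCANONICAL_EBOXES
hits_150 = find_eboxes(locus, centre=130, flank=150, motifs=all_motifs)
hits_20 = find_eboxes(locus, centre=103, flank=20, motifs=all_motifs)
```

Narrowing `flank` from 150 to 20 drops the second hit, mirroring how the ± 250 bp and ± 150 bp windows in Figure 5C select different subsets of E-box-positive loci.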

Cited by

MYCT1-TV, a novel MYCT1 transcript, is regulated by c-Myc and may participate in laryngeal carcinogenesis.
Fu S, Guo Y, Chen H, Xu ZM, Qiu GB, Zhong M, Sun KL, Fu WN. PLoS One. 2011;6(10):e25648. doi: 10.1371/journal.pone.0025648. Epub 2011 Oct 5. PMID: 21998677. Free PMC article.
Multiple signatures of a disease in potential biomarker space: Getting the signatures consensus and identification of novel biomarkers.
Ow GS, Kuznetsov VA. BMC Genomics. 2015;16 Suppl 7(Suppl 7):S2. doi: 10.1186/1471-2164-16-S7-S2. Epub 2015 Jun 11. PMID: 26100469. Free PMC article.
Quantitative model of R-loop forming structures reveals a novel level of RNA-DNA interactome complexity.
Wongsurawat T, Jenjaroenpun P, Kwoh CK, Kuznetsov V. Nucleic Acids Res. 2012 Jan;40(2):e16. doi: 10.1093/nar/gkr1075. Epub 2011 Nov 25. PMID: 22121227. Free PMC article.
Role of IL-9 and STATs in hematological malignancies (Review).
Chen N, Wang X. Oncol Lett. 2014 Mar;7(3):602-610. doi: 10.3892/ol.2013.1761. Epub 2013 Dec 16. PMID: 24520283. Free PMC article.
Promoter hypermethylation-induced transcriptional down-regulation of the gene MYCT1 in laryngeal squamous cell carcinoma.
Yang M, Li W, Liu YY, Fu S, Qiu GB, Sun KL, Fu WN. BMC Cancer. 2012 Jun 6;12:219. doi: 10.1186/1471-2407-12-219. PMID: 22672838. Free PMC article.

References

1. Bhinge AA, Kim J, Euskirchen GM, Snyder M, Iyer VR. Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE). Genome Res. 2007;17(6):910–916. doi: 10.1101/gr.5574907.
2. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316(5830):1497–1502.
3. Loh Y-H, Wu Q, Chew J-L, Vega VB, Zhang W, Chen X, Bourque G, George J, Leong B, Liu J. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat Genet. 2006;38(4):431–440. doi: 10.1038/ng1760.
4. Mardis ER. ChIP-seq: welcome to the new frontier. Nat Methods. 2007;4(8):613–614. doi: 10.1038/nmeth0807-613.
5. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007;4(8):651–657. doi: 10.1038/nmeth1068.
