Bioinformatics. 2012 Mar 1;28(5):607-13.

doi: 10.1093/bioinformatics/bts009. Epub 2012 Jan 19.

An effective statistical evaluation of ChIPseq dataset similarity

Maria D Chikina¹, Olga G Troyanskaya

Affiliation

¹ Department of Neurology, Mount Sinai School of Medicine, New York, NY 10029, USA.

PMID: 22262674
PMCID: PMC3339511
DOI: 10.1093/bioinformatics/bts009

Abstract

Motivation: ChIPseq is rapidly becoming a common technique for investigating protein-DNA interactions. However, results from individual experiments provide a limited understanding of chromatin structure, as various chromatin factors cooperate in complex ways to orchestrate transcription. In order to quantify chromatin interactions, it is thus necessary to devise a robust similarity metric applicable to ChIPseq data. Unfortunately, moving past simple overlap calculations to give statistically rigorous comparisons of ChIPseq datasets often involves arbitrary choices of distance metrics, with significance being estimated by computationally intensive permutation tests whose statistical power may be sensitive to non-biological experimental and post-processing variation.

Results: We show that it is in fact possible to compare ChIPseq datasets through the efficient computation of exact P-values for proximity. Our method is insensitive to non-biological variation in datasets such as peak width, and can rigorously model peak location biases by evaluating similarity conditioned on a restricted set of genomic regions (such as mappable genome or promoter regions). Applying our method to the well-studied dataset of Chen et al. (2008), we elucidate novel interactions which conform well with our biological understanding. By comparing ChIPseq data in an asymmetric way, we are able to observe clear interaction differences between cofactors such as p300 and factors that bind DNA directly.

Availability: Source code is available for download at http://sonorus.princeton.edu/IntervalStats/IntervalStats.tar.gz.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
A hypothetical domain set (black), reference set (red) and query interval (blue). All possible midpoint locations for the query are shown in black dots. Locations where the minimum distance is at most 2 are denoted in blue.
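The setup in Fig. 1 can be illustrated with a small brute-force sketch. It is not the authors' implementation (the abstract describes an efficient exact computation, whereas this enumerates every placement), and it rests on assumptions not stated in the caption: intervals are closed with integer coordinates, a placement is allowed only if the query fits entirely inside one domain interval, and distance is the gap between interval edges. The function names are hypothetical.

```python
def gap(a, b):
    """Distance between closed integer intervals a and b (0 if they overlap)."""
    return max(0, a[0] - b[1], b[0] - a[1])


def min_distance(interval, reference):
    """Distance from an interval to its nearest reference interval."""
    return min(gap(interval, r) for r in reference)


def placements(width, domain):
    """Yield every position where an interval of this width fits inside the domain.

    Assumption: a placement must lie entirely within a single domain interval.
    """
    for start, end in domain:
        for s in range(start, end - width + 2):
            yield (s, s + width - 1)


def proximity_p_value(query, reference, domain):
    """Fraction of allowed placements at least as close to the reference as the query."""
    width = query[1] - query[0] + 1
    observed = min_distance(query, reference)
    distances = [min_distance(p, reference) for p in placements(width, domain)]
    return sum(d <= observed for d in distances) / len(distances)


# Toy example: one domain interval, one reference interval, one query.
domain = [(0, 19)]
reference = [(8, 10)]
query = (13, 14)
print(proximity_p_value(query, reference, domain))
```

Restricting `domain` to a subset of the genome (for example, promoter regions only) changes the set of allowed placements, which is how a sketch like this would condition the P-value on a background.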

**Fig. 2.**
(A) P-value histogram and Q–Q plot generated from intervals randomly placed on mouse chromosomes. (B) P-value histograms generated from the real data of Chen *et al.* (2008).

**Fig. 3.**
(A) The mapping of proximity to P-value for the various ChIPseq experiments from Chen *et al.* (2008). The functions differ significantly, demonstrating that proximity statistics are not comparable and that a rigorous statistical method is needed. (B) Bar graph showing total coverage for the datasets in A sorted by their P-value at 1000 bp.

**Fig. 4.**
Robustness of our method to interval expansion. (A) Reference (c-Myc) and query (n-Myc) are expanded to 500 bp on both sides, representing a more permissive peak calling parameter. If reference and query are expanded by 500 bp on both sides, the resulting P-values are exactly the same. (B) When a more realistic perturbation of random expansion (mean 500 bp) is applied, only small P-values are affected while the distribution shape remains constant.

**Fig. 5.**
Effects of applying background correction to simulated data. Two non-interacting transcription factors were simulated by choosing random binding sites along the chromosome with sites in promoters over-represented by a factor of 2.5. The two datasets were tested for association using different backgrounds: chromosome background (black), correct promoter background (red), noisy promoter background, where promoter regions are allowed to shift, expand and contract (green), and a conservative noisy promoter set which is a strict subset of the correct set (blue).

**Fig. 6.**
Effects of applying background correction to real and simulated data. Association between Suz12 and Oct4 seen using the chromosome background (A) disappears when the promoter correction is applied (B). Corrected P-values are near uniform (P = 0.104, KS-test).
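The uniformity check mentioned in the Fig. 6 caption can be sketched as a one-sample Kolmogorov–Smirnov test against the uniform distribution on [0, 1]. The source says only "KS-test", so the details here are assumptions: the function names are hypothetical, and the P-value uses a standard asymptotic series approximation, which is rough for small samples and may differ from whatever implementation the authors used.

```python
import math


def ks_statistic_uniform(values):
    """KS statistic D: largest gap between the empirical CDF and the uniform CDF."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def ks_p_value(d, n, terms=100):
    """Approximate P-value for D from the asymptotic Kolmogorov series.

    Assumption: the small-sample correction factor below is a common
    textbook approximation, not something taken from the source.
    """
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        # The series converges slowly here and the P-value is effectively 1.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


# Toy example: evenly spread P-values look uniform, so the KS P-value is large.
p_values = [0.1, 0.3, 0.5, 0.7, 0.9]
d = ks_statistic_uniform(p_values)
print(d, ks_p_value(d, len(p_values)))
```

A large KS P-value, as in the caption, means the corrected proximity P-values are statistically indistinguishable from uniform, which is what is expected when two factors are not associated.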

**Fig. 7.**
Heatmap for promoter-corrected similarity values for all factors profiled in Chen *et al.* (2008). Labels on the y-axis represent queries, whereas labels on the x-axis represent references.

**Fig. 8.**
Graph representation of interactions in Figure 7. Top 35 interactions are included. Two main clusters highlighted in red and blue have an overall hierarchical relationship.

References

1. Carstensen L., et al. Multivariate Hawkes process models of the occurrence of regulatory elements. BMC Bioinformatics. 2010;11:456.
2. Chen X., et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133:1106–1117.
3. Cuddapah S., et al. Global analysis of the insulator binding protein CTCF in chromatin barrier regions reveals demarcation of active and repressive domains. Genome Res. 2009;19:24–32.
4. Fu A.Q., Adryan B. Scoring overlapping and adjacent signals from genome-wide ChIP and DamID assays. Mol. Biosyst. 2009;5:1429–1438.
5. Guttman M., et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009;458:223–227.
