Bioinformatics. 2012 Mar 1;28(5):607-13.
doi: 10.1093/bioinformatics/bts009. Epub 2012 Jan 19.

An effective statistical evaluation of ChIPseq dataset similarity

Maria D Chikina et al. Bioinformatics. 2012.

Abstract

Motivation: ChIPseq is rapidly becoming a common technique for investigating protein-DNA interactions. However, results from individual experiments provide a limited understanding of chromatin structure, as various chromatin factors cooperate in complex ways to orchestrate transcription. In order to quantify chromatin interactions, it is thus necessary to devise a robust similarity metric applicable to ChIPseq data. Unfortunately, moving past simple overlap calculations to give statistically rigorous comparisons of ChIPseq datasets often involves arbitrary choices of distance metrics, with significance being estimated by computationally intensive permutation tests whose statistical power may be sensitive to non-biological experimental and post-processing variation.

Results: We show that it is in fact possible to compare ChIPseq datasets through the efficient computation of exact P-values for proximity. Our method is insensitive to non-biological variation in datasets such as peak width, and can rigorously model peak location biases by evaluating similarity conditioned on a restricted set of genomic regions (such as mappable genome or promoter regions). Applying our method to the well-studied dataset of Chen et al. (2008), we elucidate novel interactions which conform well with our biological understanding. By comparing ChIPseq data in an asymmetric way, we are able to observe clear interaction differences between cofactors such as p300 and factors that bind DNA directly.
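
The idea of an exact proximity P-value conditioned on a restricted domain can be illustrated with a small brute-force sketch. This is not the authors' IntervalStats implementation: the function names, the closed integer intervals, and the rule that a placement must fit entirely inside one domain interval are all assumptions made for illustration. The P-value here is the fraction of admissible placements of the query that lie at least as close to the reference set as the observed query does.

```python
# Brute-force sketch of an exact proximity P-value (illustrative only).
# Assumptions: closed integer intervals (start, end); a query placement is
# admissible only if it fits entirely inside a single domain interval.

def interval_gap(a, b):
    """Number of bases separating two closed intervals (0 if they overlap)."""
    (a_start, a_end), (b_start, b_end) = a, b
    return max(0, b_start - a_end, a_start - b_end)


def min_distance(query, references):
    """Distance from the query to its nearest reference interval."""
    return min(interval_gap(query, ref) for ref in references)


def placements(domain, width):
    """Yield every admissible placement of an interval of the given width."""
    for d_start, d_end in domain:
        for start in range(d_start, d_end - width + 2):
            yield (start, start + width - 1)


def proximity_p_value(query, references, domain):
    """Fraction of admissible placements at least as close as the query."""
    width = query[1] - query[0] + 1
    observed = min_distance(query, references)
    candidates = list(placements(domain, width))
    if not candidates:
        raise ValueError("query does not fit in any domain interval")
    hits = sum(1 for c in candidates if min_distance(c, references) <= observed)
    return hits / len(candidates)


# Hypothetical toy data: one 20-base domain and one reference peak.
domain = [(0, 19)]
references = [(10, 11)]
p = proximity_p_value((6, 7), references, domain)
print(p)  # 9 of the 19 admissible placements are within 3 bases
```

Because every admissible placement is enumerated, the result is exact rather than estimated by permutation, but the cost grows with domain size; a practical implementation would need a more efficient counting scheme than this loop.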

Availability: Source code is available for download at http://sonorus.princeton.edu/IntervalStats/IntervalStats.tar.gz.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
A hypothetical domain set (black), reference set (red) and query interval (blue). All possible midpoint locations for the query are shown in black dots. Locations where the minimum distance is at most 2 are denoted in blue.
Fig. 2.
(A) P-value histogram and Q–Q plot generated from intervals randomly placed on mouse chromosomes. (B) P-value histograms generated from real data in (Chen et al., 2008).
Fig. 3.
(A) The mapping of proximity to P-value for the various ChIPseq experiments from Chen et al. (2008). The functions differ significantly, demonstrating that raw proximity statistics are not comparable across datasets and that a rigorous statistical method is needed. (B) Bar graph showing total coverage for the datasets in A sorted by their P-value at 1000 bp.
Fig. 4.
Robustness of our method to interval expansion. (A) Reference (c-Myc) and query (n-Myc) are expanded by 500 bp on both sides, representing a more permissive peak calling parameter. If reference and query are expanded by 500 bp on both sides, the resulting P-values are exactly the same. (B) When a more realistic perturbation of random expansion (mean 500 bp) is applied, only small P-values are affected while the distribution shape remains constant.
Fig. 5.
Effects of applying background correction to simulated data. Two non-interacting transcription factors were simulated by choosing random binding sites along the chromosome with sites in promoters over-represented by a factor of 2.5. The two datasets were tested for association using different backgrounds: chromosome background (black), correct promoter background (red), noisy promoter background, where promoter regions are allowed to shift, expand and contract (green), and a conservative noisy promoter set which is a strict subset of the correct set (blue).
Fig. 6.
Effects of applying background correction to real and simulated data. Association between Suz12 and Oct4 seen using the chromosome background (A) disappears when the promoter correction is applied (B). Corrected P-values are near uniform (P = 0.104, KS-test).
Fig. 7.
Heatmap for promoter-corrected similarity values for all factors profiled in Chen et al. (2008). Labels on the y-axis represent queries, whereas labels on the x-axis represent references.
Fig. 8.
Graph representation of interactions in Figure 7. Top 35 interactions are included. Two main clusters highlighted in red and blue have an overall hierarchical relationship.
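
The Fig. 6 caption reports a KS-test for uniformity of the corrected P-values. A minimal sketch of such a check is given below, assuming a one-sample Kolmogorov-Smirnov test against Uniform(0, 1) with the standard asymptotic series and Stephens' small-sample correction for the P-value. The function name `ks_uniform` is hypothetical, and the source does not say which KS implementation the authors used.

```python
import math

# One-sample KS test against Uniform(0, 1), written without third-party
# libraries. The P-value uses the asymptotic Kolmogorov series, so it is an
# approximation, not an exact finite-sample value.

def ks_uniform(p_values):
    """Return (D, approximate P-value) for the hypothesis of uniformity."""
    xs = sorted(p_values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one P-value")
    # D is the largest gap between the empirical CDF and the line y = x,
    # checked just after and just before each sorted observation.
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    # Stephens' correction improves the asymptotic approximation for small n.
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        # The survival function is within about 1e-5 of 1 in this range.
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))


# Hypothetical inputs: evenly spread values versus values piled up near zero.
uniform_like = [(i + 0.5) / 100 for i in range(100)]
skewed = [0.01 * (i + 0.5) / 100 for i in range(100)]
print(ks_uniform(uniform_like))
print(ks_uniform(skewed))
```

A large P-value from this test, as in the promoter-corrected Suz12 and Oct4 comparison, indicates no detectable departure from uniformity, which is what is expected when two factors are not associated beyond the shared background.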

References

    1. Carstensen L., et al. Multivariate Hawkes process models of the occurrence of regulatory elements. BMC Bioinformatics. 2010;11:456.
    2. Chen X., et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133:1106–1117.
    3. Cuddapah S., et al. Global analysis of the insulator binding protein CTCF in chromatin barrier regions reveals demarcation of active and repressive domains. Genome Res. 2009;19:24–32.
    4. Fu A.Q., Adryan B. Scoring overlapping and adjacent signals from genome-wide ChIP and DamID assays. Mol. Biosyst. 2009;5:1429–1438.
    5. Guttman M., et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009;458:223–227.
