PLoS One. 2014 Jun 17;9(6):e99844.
doi: 10.1371/journal.pone.0099844. eCollection 2014.

OccuPeak: ChIP-Seq peak calling based on internal background modelling

Bouke A de Boer et al. PLoS One. 2014.

Abstract

ChIP-seq has become a major tool for the genome-wide identification of transcription factor binding or histone modification sites. Most peak-calling algorithms require input control datasets to model the occurrence of background reads and thereby account for local sequencing and GC bias. However, the GC-content of reads in Input-seq datasets deviates significantly from that in ChIP-seq datasets. Moreover, we observed that a commonly used peak-calling program performed equally well with a simulated uniform background set as with an Input-seq dataset. This contradicts the assumption that input control datasets are necessary to faithfully reflect the background read distribution. Because the GC-content of the abundant single reads in ChIP-seq datasets is similar to that of randomly sampled regions, we designed a peak-calling algorithm with a background model based on overlapping single reads. The application, OccuPeak, uses the abundant low-frequency tags present in each ChIP-seq dataset to model the background, thereby avoiding the need for additional datasets. Analysis of the performance of OccuPeak showed robust model parameters. Its measure of peak significance, the excess ratio, depends only on the tag density of a peak and the global noise levels. Compared to the commonly used peak-calling applications MACS and CisGenome, OccuPeak had the highest sensitivity in an enhancer identification benchmark test, and performed similarly in overlap tests of transcription factor occupation with DNase I hypersensitive sites and H3K27ac sites. Moreover, peaks called by OccuPeak were significantly enriched with cardiac disease-associated SNPs. OccuPeak runs as a standalone application and does not require extensive tweaking of parameters, making its use straightforward and user friendly.

Availability: http://occupeak.hfrc.nl.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Correlation between Input-seq datasets depends on repeated sequences.
A. UCSC genome browser snapshot showing tag counts (log scale) in 1 kb bins of two replicate Input-seq datasets. High tag counts are related to annotated genomic repeats. B. Correlation between tag counts in two replicate Input-seq datasets for bins without or with genomic repeats (yellow area: bins with tag counts between 1 and 8, blue: between 1 and 20, red: between 1 and infinity). Bins without any tags were excluded from the analysis because they might be the result of unmappable regions. C. The small overlap (green) between peaks called in ChIP-seq datasets (yellow) and an Input-seq dataset (blue) is significantly reduced when only uniquely mappable (um) reads are considered in peak calling. This effect is independent of the number of called peaks.
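The replicate comparison in panel B of Figure 1 (tag counts in 1 kb bins, bins without tags excluded, optionally restricted to low counts) can be illustrated with a small sketch. This is not the authors' code. The function names, the `(chromosome, position)` tag representation, the choice of the Pearson coefficient, and the rule that a bin must contain tags in both replicates are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

BIN_SIZE = 1000  # 1 kb bins, as in the Figure 1 legend


def bin_counts(tags, bin_size=BIN_SIZE):
    """Count tags per (chromosome, bin index); tags are (chrom, position) pairs."""
    return Counter((chrom, pos // bin_size) for chrom, pos in tags)


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (assumed measure)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # fails if either variance is zero


def replicate_correlation(rep1, rep2, max_count=None):
    """Correlate per-bin tag counts of two replicates.

    Only bins with at least one tag in BOTH replicates are used (assumption).
    max_count optionally keeps only bins whose count is <= max_count in both
    replicates, mimicking the '1 to 8' and '1 to 20' ranges of the legend.
    """
    c1, c2 = bin_counts(rep1), bin_counts(rep2)
    pairs = [
        (c1[b], c2[b])
        for b in c1.keys() & c2.keys()
        if max_count is None or (c1[b] <= max_count and c2[b] <= max_count)
    ]
    if len(pairs) < 2:
        raise ValueError("need at least two shared bins")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

Calling `replicate_correlation(rep1, rep2, max_count=8)` would correspond to the lowest count range of the legend under these assumptions.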
Figure 2. Reviewing evidence of GC-bias in ChIP-seq data.
The GC-content was determined for various classes of genomic regions. The GC-content distribution per class is shown in boxplots (whiskers range from 2.5 to 97.5%). A. The GC-content distribution of various classes of regulatory elements is plotted next to that of random genomic regions (genome background). B. The GC-content distribution of genomic regions covered by single tags, resulting from various ChIP-seq experiments, is plotted. The red dotted lines indicate the inter-quartile range of the genome background. C. The GC-content distribution of genomic regions covered by tag accumulations (30–40 tags), resulting from various ChIP-seq experiments, is plotted. The green dotted lines indicate the inter-quartile range of validated cardiac enhancers.
Figure 3. Performance of MACS using Input-seq and simulated input data.
MACS was used to call peaks (only chromosome 1) using the p300(1) dataset. Heart Input-seq data or a simulated uniform background dataset were used as input control. The influence of the input control set on peak-calling performance was measured using overlap with DHSs as outlined in the legend of Figure 8.
Figure 4. Effect of window size and tag density on the pattern and number of called peaks.
Peaks were called with OccuPeak in the TBX3 ChIP-seq dataset using different window sizes and tag densities. A. UCSC genome browser snapshot capturing the effects on peak calling in a region containing 2 validated cardiac enhancers. B. Mean number of peaks called per Mb of genome. Note the (almost perfect) parallelism of the profiles for different tag density (100% and 12.5%) and window size (chromosome and 0.1 Mb). C. Effect of window size on the gain or loss of peaks. When the peaks called with a chromosome-wide window are used as a reference (green), smaller windows lead to loss of peaks (blue) but hardly ever to gain of peaks (yellow).
Figure 5. Consistency of different peak-calling methods.
OccuPeak, MACS and CisGenome were used to call peaks for each of the two replicate p300 ChIP-seq experiments generated by the ENCODE consortium (GSE29184). A. Peaks are considered common (green) if they were identified in both replicates and singleton if they were only found in the current replicate (yellow and blue), as depicted in the UCSC genome browser example (B).
Figure 6. Biological Validation: overlap with cardiac enhancers.
OccuPeak, MACS and CisGenome were used to call peaks from the TBX3 and the two replicate p300 ChIP-seq datasets. Peaks were then sorted on peak significance and overlap with cardiac enhancers was determined. For visualization, the number of most significant peaks was incremented in steps of 1000 peaks. A set of 102 validated cardiac enhancers was used to assess the sensitivity of the peak-calling method and the biological relevance of the called peaks. The number of enhancers identified using the default threshold of each peak calling method is plotted in the bar graphs.
Figure 7. Visualization of overlap analysis.
Visual inspection with the UCSC genome browser can show where and why certain enhancers are missed by a particular peak-calling method. A. Relatively small local increases in input control tag density can result in a locally decreased sensitivity of the method. An enhancer on the Foxl1 locus is missed by MACS when heart Input-seq data is used as input control, but detected when a simulated uniform dataset is used as control instead. B. Similarly, an enhancer located on the Tbx20 locus is missed by MACS when an input control is used on the p300(2) data. When applying the same input control on the more abundant TBX3 data, the enhancer is marked by all methods. Abbreviations: um = dataset in which only unique tags are mapped; sim-control = dataset where simulated uniform data is used as input control for peak-calling.
Figure 8. Biological Validation: overlap with cardiac DHSs.
OccuPeak, MACS and CisGenome were used to call peaks from the TBX3 and the two replicate p300 ChIP-seq datasets. Peaks were then sorted on peak significance and overlap with cardiac enhancers was determined. For visualization, the number of most significant peaks was incremented in steps of 1000 peaks. Overlap of peaks with DNaseI hypersensitivity sites (DHSs) found in heart tissue was used to assess the positive predictive value of the peak-calling methods. In the p300(2) dataset the performance of OccuPeak was significantly better when only uniquely mappable tags were considered. The results of the statistical comparison at the maximum common number of peaks (vertical dotted line) are given as a string in which '=' indicates that the overlap is not significantly different between the methods and '>' that the overlap differs significantly at p<0.0001 or less (O = OccuPeak, all reads; OU = OccuPeak, uniquely mappable reads; M = MACS; C = CisGenome).
Figure 9. Biological Validation: overlap with cardiac H3K27ac sites.
OccuPeak, MACS and CisGenome were used to call peaks from the TBX3 and the two replicate p300 ChIP-seq datasets. Peaks were then sorted on peak significance and overlap with cardiac enhancers was determined. For visualization, the number of most significant peaks was incremented in steps of 1000 peaks. Overlap of peaks with H3K27ac sites was assessed as a measure of active enhancers. In the p300(2) dataset the performance of OccuPeak was significantly better when only uniquely mappable tags were considered. The results of the statistical comparison at the maximum common number of peaks (vertical dotted line) are given as a string in which '=' indicates that the overlap is not significantly different between the methods and '>' that the overlap differs significantly at p<0.0001 or less (O = OccuPeak, all reads; OU = OccuPeak, uniquely mappable reads; M = MACS; C = CisGenome).
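
The ranked-overlap analysis described in the legends of Figures 6, 8 and 9 (sort peaks on significance, then measure overlap with a reference set for a growing number of top peaks) can be sketched as below. This is a minimal illustration and not the authors' implementation. The interval representation, the names `overlaps` and `overlap_curve`, the convention that a higher score means a more significant peak, and the criterion that any shared base counts as overlap are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base (assumed criterion)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlap_curve(
    peaks: List[Tuple[Interval, float]],
    reference: List[Interval],
    step: int = 1000,
) -> List[Tuple[int, int, float]]:
    """Return (n_top_peaks, n_reference_sites_hit, fraction_of_peaks_hitting).

    Peaks are ranked by score, highest first (assumption). A point is
    recorded every `step` peaks and once more at the end of the list.
    The second value relates to sensitivity (as in Figure 6), the third
    to positive predictive value (as in Figure 8).
    """
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)
    hit_refs = set()      # indices of reference sites overlapped so far
    peaks_with_hit = 0    # number of ranked peaks overlapping any reference site
    curve = []
    for i, (interval, _score) in enumerate(ranked, start=1):
        # Linear scan over the reference set; fine for a sketch, slow at scale.
        matched = [j for j, ref in enumerate(reference) if overlaps(interval, ref)]
        if matched:
            peaks_with_hit += 1
            hit_refs.update(matched)
        if i % step == 0 or i == len(ranked):
            curve.append((i, len(hit_refs), peaks_with_hit / i))
    return curve
```

For genome-scale peak lists, an interval tree or a sorted sweep would replace the linear scan. The statistical comparison between methods mentioned in the legends is not covered here.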

