OccuPeak: ChIP-Seq peak calling based on internal background modelling

PLoS One. 2014 Jun 17;9(6):e99844.

doi: 10.1371/journal.pone.0099844. eCollection 2014.

Authors

Bouke A de Boer¹, Karel van Duijvenboden¹, Malou van den Boogaard¹, Vincent M Christoffels¹, Phil Barnett¹, Jan M Ruijter¹

Affiliation

¹ Department of Anatomy, Embryology & Physiology, Academic Medical Centre, Amsterdam, The Netherlands.

PMID: 24936875
PMCID: PMC4061025
DOI: 10.1371/journal.pone.0099844

Abstract

ChIP-seq has become a major tool for the genome-wide identification of transcription factor binding or histone modification sites. Most peak-calling algorithms require input control datasets to model the occurrence of background reads and to account for local sequencing and GC bias. However, the GC-content of reads in Input-seq datasets deviates significantly from that in ChIP-seq datasets. Moreover, we observed that a commonly used peak-calling program performed equally well with a simulated uniform background set as with an Input-seq dataset. This contradicts the assumption that input control datasets are necessary to faithfully reflect the background read distribution. Because the GC-content of the abundant single reads in ChIP-seq datasets is similar to that of randomly sampled regions, we designed a peak-calling algorithm with a background model based on overlapping single reads. The application, OccuPeak, uses the abundant low-frequency tags present in each ChIP-seq dataset to model the background, thereby avoiding the need for additional datasets. Analysis of the performance of OccuPeak showed robust model parameters. Its measure of peak significance, the excess ratio, depends only on the tag density of a peak and the global noise level. Compared to the commonly used peak-calling applications MACS and CisGenome, OccuPeak had the highest sensitivity in an enhancer identification benchmark test, and performed similarly in overlap tests of transcription factor occupation with DNase I hypersensitive sites and H3K27ac sites. Moreover, peaks called by OccuPeak were significantly enriched with cardiac disease-associated SNPs. OccuPeak runs as a standalone application and does not require extensive tweaking of parameters, making its use straightforward and user-friendly.

Availability: http://occupeak.hfrc.nl.
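
The idea of modelling background from the ChIP-seq data itself can be illustrated with a minimal sketch. This is not the published OccuPeak algorithm: the Poisson assumption, the estimator, the ratio used as a score, and the names `estimate_background_rate` and `excess_ratio` are all assumptions made for illustration.

```python
from collections import Counter


def estimate_background_rate(window_counts):
    """Estimate a Poisson rate from windows holding exactly 1 or 2 tags.

    Assumption (not from the paper): background tags are Poisson
    distributed, so P(2) / P(1) = lam / 2 and lam is about 2 * n2 / n1.
    Using only low counts keeps true peaks out of the estimate.
    """
    freq = Counter(window_counts)
    n1, n2 = freq[1], freq[2]
    if n1 == 0:
        raise ValueError("no single-tag windows to model the background")
    return 2.0 * n2 / n1


def excess_ratio(observed_tags, background_rate):
    """Illustrative score: observed tags divided by expected background tags.

    This definition is hypothetical; the paper's exact formula is not
    given in the abstract.
    """
    if background_rate <= 0:
        raise ValueError("background rate must be positive")
    return observed_tags / background_rate


# Toy per-window tag counts: mostly noise, one window with 35 tags.
counts = [0, 1, 1, 2, 1, 0, 1, 2, 1, 1, 35, 1, 0, 2, 1, 1]
lam = estimate_background_rate(counts)
print(lam, excess_ratio(35, lam))
```

In this toy input the high-count window does not influence the background estimate, which is the property the abstract attributes to a model built on low-frequency tags.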

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Correlation between Input-seq datasets depends on repeated sequences.**
A. UCSC genome browser snapshot showing tag counts (log scale) in 1 kb bins of two replicate Input-seq datasets. High tag counts are related to annotated genomic repeats. B. Correlation between tag counts in two replicate Input-seq datasets for bins without or with genomic repeats (yellow area: bins with tag counts between 1 and 8, blue: between 1 and 20, red: between 1 and infinity). Bins without any tags were excluded from the analysis because they might be the result of unmappable regions. C. The small overlap (green) between peaks called in ChIP-seq datasets (yellow) and an Input-seq dataset (blue) is significantly reduced when only uniquely mappable (um) reads are considered in peak calling. This effect is independent of the number of called peaks.
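
The binning and replicate comparison described in panels A and B can be sketched as follows. This is a minimal illustration, assuming 0-based tag start positions on a single chromosome, raw (not log-transformed) counts, and exclusion of bins that are empty in either replicate; the legend does not specify these details, and the function names are hypothetical.

```python
import math
from collections import Counter

BIN_SIZE = 1000  # 1 kb bins, as in the figure legend


def bin_counts(positions, bin_size=BIN_SIZE):
    """Count tags per bin from 0-based tag start positions."""
    return Counter(pos // bin_size for pos in positions)


def pearson_nonempty(counts_a, counts_b):
    """Pearson r over bins holding at least one tag in both replicates."""
    bins = [b for b in counts_a if counts_a[b] > 0 and counts_b.get(b, 0) > 0]
    if len(bins) < 2:
        raise ValueError("need at least two shared non-empty bins")
    xs = [counts_a[b] for b in bins]
    ys = [counts_b[b] for b in bins]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant counts")
    return cov / (sx * sy)


# Toy replicates; bin 4 is empty in replicate 1 and is therefore skipped.
rep1 = bin_counts([10, 20, 1500, 1600, 1700, 2100, 2200, 2300, 2400, 3500])
rep2 = bin_counts([5, 1100, 1200, 2000, 2001, 2002, 3999, 4500])
r = pearson_nonempty(rep1, rep2)
print(round(r, 3))
```

Restricting the bins to a count range (for example 1 to 8) before calling `pearson_nonempty` would mimic the coloured areas in panel B.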

**Figure 2. Reviewing evidence of GC-bias in ChIP-seq data.**
The GC-content was determined for various classes of genomic regions. The GC-content distribution per class is shown in boxplots (whiskers range from 2.5 to 97.5%). A. The GC-content distribution of various classes of regulatory elements is plotted next to that of random genomic regions (genome background). B. The GC-content distribution of genomic regions covered by single tags, resulting from various ChIP-seq experiments, is plotted. The red dotted lines indicate the inter-quartile range of the genome background. C. The GC-content distribution of genomic regions covered by tag accumulations (30–40 tags), resulting from various ChIP-seq experiments, is plotted. The green dotted lines indicate the inter-quartile range of validated cardiac enhancers.
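
A GC-content summary of this kind can be computed as in the sketch below. It assumes region sequences are already available as strings and that ambiguous bases such as N are ignored; the percentile interpolation rule and the function names are illustrative choices, not taken from the paper.

```python
def gc_content(seq):
    """Fraction of G/C bases among unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = gc + sum(seq.count(base) for base in "AT")
    if acgt == 0:
        raise ValueError("sequence has no A/C/G/T bases")
    return gc / acgt


def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def boxplot_summary(values):
    """Whiskers at 2.5 and 97.5%, box at the quartiles, line at the median."""
    return {p: percentile(values, p) for p in (2.5, 25, 50, 75, 97.5)}


regions = ["GGCCAATT", "ATATATAT", "GCGCGCNN", "ACGTACGT"]
summary = boxplot_summary([gc_content(s) for s in regions])
print(summary)
```

Comparing the 25 and 75% entries of two such summaries corresponds to the dotted inter-quartile reference lines in panels B and C.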

**Figure 3. Performance of MACS using Input-seq and simulated input data.**
MACS was used to call peaks (only chromosome 1) using the p300(1) dataset. Heart Input-seq data or a simulated uniform background dataset were used as input control. The influence of the input control set on peak-calling performance was measured using overlap with DHSs as outlined in the legend of Figure 8.
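
One way to produce a simulated uniform background set is sketched below. The legend does not describe how the simulated data were generated, so the read length, strand assignment, coordinate convention, and the function name `simulate_uniform_reads` are assumptions for illustration.

```python
import random


def simulate_uniform_reads(chrom_length, n_reads, read_length=36, seed=0):
    """Draw read start positions uniformly along one chromosome.

    Returns sorted (start, end, strand) tuples with 0-based, half-open
    coordinates. read_length and seed are illustrative defaults.
    """
    if read_length > chrom_length:
        raise ValueError("read longer than chromosome")
    rng = random.Random(seed)  # local generator keeps the output reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, chrom_length - read_length)  # bounds inclusive
        strand = rng.choice("+-")
        reads.append((start, start + read_length, strand))
    return sorted(reads)


reads = simulate_uniform_reads(chrom_length=10_000, n_reads=500)
print(len(reads), reads[0], reads[-1])
```

Written out in a read format accepted by the peak caller, such a set can stand in for the Input-seq control in a comparison like the one shown in this figure.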

**Figure 4. Effect of window size and tag density on the pattern and number of called peaks.**
Peaks were called with OccuPeak in the TBX3 ChIP-seq dataset using different window sizes and tag densities. A. UCSC genome browser snapshot capturing the effects on peak calling in a region containing 2 validated cardiac enhancers. B. Mean number of peaks called per Mb of genome. Note the (almost perfect) parallelism of the profiles for different tag density (100% and 12.5%) and window size (chromosome and 0.1 Mb). C. Effect of window size on the gain or loss of peaks. When the peaks called with a chromosome-wide window are used as a reference (green), smaller windows lead to loss of peaks (blue) but hardly ever to gain of peaks (yellow).

**Figure 5. Consistency of different peak-calling methods.**
OccuPeak, MACS and CisGenome were used to call peaks for each of the two replicate p300 ChIP-seq experiments generated by the ENCODE consortium (GSE29184). A. Peaks are considered common (green) if they were identified in both replicates and singleton if they were only found in the current replicate (yellow and blue), as depicted in the UCSC genome browser example (B).
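
The common versus singleton classification can be sketched as a simple interval-overlap test. The overlap criterion (at least 1 bp shared) and the function names are assumptions; the legend does not state the exact rule used.

```python
def overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_peaks(peaks, other_replicate):
    """Split peaks into 'common' (overlapping any peak in the other
    replicate) and 'singleton' (found only in the current replicate)."""
    common, singleton = [], []
    for peak in peaks:
        if any(overlaps(peak, other) for other in other_replicate):
            common.append(peak)
        else:
            singleton.append(peak)
    return common, singleton


rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
rep2 = [("chr1", 150, 250), ("chr2", 300, 400)]
common, singleton = classify_peaks(rep1, rep2)
print(len(common), len(singleton))
```

The pairwise scan is quadratic in the number of peaks; for genome-scale peak lists a sorted sweep or an interval tree would be the usual choice.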

**Figure 6. Biological Validation: overlap with cardiac enhancers.**
OccuPeak, MACS and CisGenome were used to call peaks from the TBX3 and the two replicate p300 ChIP-seq datasets. Peaks were then sorted on peak significance and overlap with cardiac enhancers was determined. For visualization, the number of most significant peaks was incremented in steps of 1000 peaks. A set of 102 validated cardiac enhancers was used to assess the sensitivity of the peak-calling method and the biological relevance of the called peaks. The number of enhancers identified using the default threshold of each peak calling method is plotted in the bar graphs.
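
The ranked overlap curve described in this legend can be sketched as below, assuming each peak carries a score where higher means more significant and that any 1 bp overlap counts as a hit. Both assumptions, and the function names, are illustrative rather than taken from the paper.

```python
def overlaps(a, b):
    """True if half-open intervals share at least 1 bp on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def enhancers_hit_by_rank(peaks, enhancers, step=1000):
    """Cumulative number of enhancers overlapped by the top-ranked peaks.

    peaks: (chrom, start, end, score) tuples; enhancers: (chrom, start, end).
    Returns (n_top_peaks, n_enhancers_hit) pairs, one per `step` peaks,
    plus a final pair for the full peak list.
    """
    ranked = sorted(peaks, key=lambda p: p[3], reverse=True)
    hit = set()
    curve = []
    for i, peak in enumerate(ranked, start=1):
        for j, enhancer in enumerate(enhancers):
            if j not in hit and overlaps(peak, enhancer):
                hit.add(j)
        if i % step == 0 or i == len(ranked):
            curve.append((i, len(hit)))
    return curve


peaks = [("chr1", 100, 200, 5.0), ("chr1", 1000, 1100, 9.0), ("chr2", 50, 80, 1.0)]
enhancers = [("chr1", 150, 300), ("chr1", 1050, 1200), ("chr3", 0, 10)]
curve = enhancers_hit_by_rank(peaks, enhancers, step=2)
print(curve)
```

Dividing the last count by the number of enhancers in the reference set (102 in this figure) gives the sensitivity at that rank.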

**Figure 7. Visualization of overlap analysis.**
Visual inspection with the UCSC genome browser can show where and why certain enhancers are missed by a particular peak-calling method. A. Relatively small local increases in input control tag density can result in a locally decreased sensitivity of the method. An enhancer on the Foxl1 locus is missed by MACS when heart Input-seq data is used as input control, but detected when a simulated uniform dataset is used as control instead. B. Similarly, an enhancer located on the Tbx20 locus is missed by MACS when an input control is used on the p300(2) data. When applying the same input control on the more abundant TBX3 data, the enhancer is marked by all methods. Abbreviations: um = dataset in which only unique tags are mapped; sim-control = dataset where simulated uniform data is used as input control for peak-calling.

**Figure 8. Biological Validation: overlap with cardiac DHSs.**
OccuPeak, MACS and CisGenome were used to call peaks from the TBX3 and the two replicate p300 ChIP-seq datasets. Peaks were then sorted on peak significance and overlap with cardiac enhancers was determined. For visualization, the number of most significant peaks was incremented in steps of 1000 peaks. Overlap of peaks with DNase I hypersensitive sites (DHSs) found in heart tissue was used to assess the positive predictive value of the peak-calling methods. In the p300(2) dataset the performance of OccuPeak was significantly better when only uniquely mappable tags were considered. The results of the statistical comparison at the maximum common number of peaks (vertical dotted line) are given as a string in which '=' indicates that the overlap is not significantly different between the methods and '>' that the overlap differs significantly at p<0.0001 or less (O = OccuPeak, all reads; OU = OccuPeak, uniquely mappable reads; M = MACS; C = CisGenome).

**Figure 9. Biological Validation: overlap with cardiac H3K27ac sites.**
OccuPeak, MACS and CisGenome were used to call peaks from the TBX3 and the two replicate p300 ChIP-seq datasets. Peaks were then sorted on peak significance and overlap with cardiac enhancers was determined. For visualization, the number of most significant peaks was incremented in steps of 1000 peaks. Overlap of peaks with H3K27ac sites was assessed as a measure of active enhancers. In the p300(2) dataset the performance of OccuPeak was significantly better when only uniquely mappable tags were considered. The results of the statistical comparison at the maximum common number of peaks (vertical dotted line) are given as a string in which '=' indicates that the overlap is not significantly different between the methods and '>' that the overlap differs significantly at p<0.0001 or less (O = OccuPeak, all reads; OU = OccuPeak, uniquely mappable reads; M = MACS; C = CisGenome).

Cited by

Analysis of super-enhancer using machine learning and its application to medical biology.
Hamamoto R, Takasawa K, Shinkai N, Machino H, Kouno N, Asada K, Komatsu M, Kaneko S. Brief Bioinform. 2023 May 19;24(3):bbad107. doi: 10.1093/bib/bbad107. PMID: 36960780. Review.
EMERGE: a flexible modelling framework to predict genomic regulatory elements from genomic signatures.
van Duijvenboden K, de Boer BA, Capon N, Ruijter JM, Christoffels VM. Nucleic Acids Res. 2016 Mar 18;44(5):e42. doi: 10.1093/nar/gkv1144. Epub 2015 Nov 3. PMID: 26531828.
Chromatin Conformation Links Putative Enhancers in Intracranial Aneurysm-Associated Regions to Potential Candidate Genes.
Laarman MD, Geeven G, Barnett P; Netherlands Brain Bank; Rinkel GJE, de Laat W, Ruigrok YM, Bakkers J. J Am Heart Assoc. 2019 May 7;8(9):e011201. doi: 10.1161/JAHA.118.011201. PMID: 30994044.
Genome-wide histone modification profiling of inner cell mass and trophectoderm of bovine blastocysts by RAT-ChIP.
Org T, Hensen K, Kreevan R, Mark E, Sarv O, Andreson R, Jaakma Ü, Salumets A, Kurg A. PLoS One. 2019 Nov 25;14(11):e0225801. doi: 10.1371/journal.pone.0225801. eCollection 2019. PMID: 31765427.
Spatiotemporal regulation of enhancers during cardiogenesis.
Dupays L, Mohun T. Cell Mol Life Sci. 2017 Jan;74(2):257-265. doi: 10.1007/s00018-016-2322-y. Epub 2016 Aug 6. PMID: 27497925. Review.
