A flexible ChIP-sequencing simulation toolkit

An Zheng¹, Michael Lamkin², Yutong Qiu¹,³, Kevin Ren⁴, Alon Goren⁵, Melissa Gymrek⁶,⁷

Affiliations

¹ Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA.
² Department of Bioengineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA.
³ School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA, 15213, USA.
⁴ Department of Mathematics, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, 02139, USA.
⁵ Department of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA. agoren@ucsd.edu.
⁶ Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA. mgymrek@ucsd.edu.
⁷ Department of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA. mgymrek@ucsd.edu.

PMID: 33879052
PMCID: PMC8056602
DOI: 10.1186/s12859-021-04097-5

An Zheng et al. BMC Bioinformatics. 2021 Apr 20;22(1):201. doi: 10.1186/s12859-021-04097-5.

Abstract

Background: A major challenge in evaluating quantitative ChIP-seq analyses, such as peak calling and differential binding, is a lack of reliable ground truth data. Accurate simulation of ChIP-seq data can mitigate this challenge, but existing frameworks are either too cumbersome to apply genome-wide or unable to model a number of important experimental conditions in ChIP-seq.

Results: We present ChIPs, a toolkit for rapidly simulating ChIP-seq data using statistical models of key experimental steps. We demonstrate how ChIPs can be used for a range of applications, including benchmarking analysis tools and evaluating the impact of various experimental parameters. ChIPs is implemented as a standalone command-line program written in C++ and is available from https://github.com/gymreklab/chips.

Conclusions: ChIPs is an efficient ChIP-seq simulation framework that generates realistic datasets over a flexible range of experimental conditions. It can serve as an important component in various ChIP-seq analyses where ground truth data are needed.

Keywords: Bioinformatics; ChIP-sequencing; Command-line program; Epigenomics; Simulation tool.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
ChIPs overview. a Overview of the ChIPs model. ChIPs models four steps: shearing (top), pulldown (middle), PCR (bottom), and sequencing. Top: the dark blue histogram shows an example fragment length distribution from real paired-end ChIP-seq data. The red line shows the best-fit gamma distribution. Middle: pulldown is modeled using two parameters: f (the fraction of the genome bound by the factor) and s (the probability that a pulled-down fragment is bound). Bottom: the dark blue histogram shows an example distribution of the number of PCR duplicates in real ChIP-seq data. The red line shows the best-fit geometric distribution. b Schematic of ChIPs modules. The learn module takes an existing ChIP-seq experiment (aligned reads and peaks) and learns model parameters (see Additional file 1: Supplementary Table 2). The simulation module takes as input a set of peaks and model parameters, simulates a ChIP-seq experiment, and returns raw reads in FASTQ format. Model parameters input to the simulation module may either be learned from an existing ChIP-seq dataset (dashed arrow) or manually specified to capture planned experimental conditions. Purple borders represent input or output files and black boxes denote ChIPs commands. Boxes with solid borders denote required inputs. Boxes with dashed borders denote optional inputs. “Exp. params” denotes experimental parameters including the number of reads, read length, and number of simulation rounds. “Aln reads” denotes aligned reads in BAM format. c Example coverage profiles of real versus simulated data. The bottom track shows peaks identified by ENCODE, with normalized peak scores between 0 and 1 colored on a gradient from white to red. The middle track shows coverage profiles based on aligned reads from ENCODE, and the top track shows coverage profiles based on ChIPs simulations. Coverage profiles were generated using IGV. Coverage profiles may also be viewed interactively at https://tinyurl.com/y7x6ggdq. d Concordance of read counts between simulated and real ChIP-seq data. chr22 was divided into non-overlapping 5 kb bins. The scatter plot compares read counts per bin for bins overlapping peaks (dark blue) or background regions (dark red). The x- and y-axes are on a log10 scale. The plot shown is for 100 simulated genome copies. e Read count correlation between real and simulated data as a function of the number of simulated genome copies. For each number of copies, the correlation was computed between read counts in 5 kb bins overlapping input peaks. The x-axis is on a log10 scale. f Simulation run time as a function of the number of simulated genome copies. The x-axis is on a log10 scale.
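
The three statistical components named in panel a can be illustrated with a short Python sketch. This is not the ChIPs implementation (which is written in C++); the function names, the example parameter values, and the use of Bayes' rule to turn f and s into relative pulldown odds are assumptions made here for illustration only.

```python
import math
import random


def sample_fragment_length(shape, scale, rng):
    """Draw one fragment length (bp) from a gamma distribution.

    shape and scale are hypothetical example parameters, not values from the paper.
    """
    return max(1, round(rng.gammavariate(shape, scale)))


def pulldown_odds(f, s):
    """Relative odds of pulling down a bound versus an unbound fragment.

    With P(bound) = f and P(bound | pulled down) = s, Bayes' rule gives
    P(pulled | bound) / P(pulled | unbound) = s * (1 - f) / ((1 - s) * f).
    """
    return (s * (1.0 - f)) / ((1.0 - s) * f)


def sample_pcr_copies(p, rng):
    """Draw a copy number >= 1 from a geometric distribution.

    P(copies == 1) = p, matching the reading of p as the share of
    fragments with no PCR duplicates.
    """
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return 1 + int(math.log(u) / math.log(1.0 - p))


if __name__ == "__main__":
    rng = random.Random(42)
    lengths = [sample_fragment_length(10.0, 20.0, rng) for _ in range(5)]
    copies = [sample_pcr_copies(0.8, rng) for _ in range(5)]
    print("fragment lengths (bp):", lengths)
    print("PCR copies per fragment:", copies)
    print("bound vs. unbound pulldown odds:", pulldown_odds(f=0.01, s=0.5))
```

With f = 0.01 and s = 0.5, for example, a bound fragment would be about 99 times more likely to be pulled down than an unbound one under this reading of the two parameters.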

**Fig. 2**
Example ChIPs applications. a–d Evaluation of the effects of varying experimental parameters on peak calling performance. Results are based on simulation of generic TF and HM datasets for chr21 as described in Additional file 1: Supplementary Methods. In each plot the y-axis shows the F1 score computed by comparing ground truth peaks to those inferred from simulated datasets using MACS2. a F1 score as a function of the total number of reads simulated from chr21. b F1 score as a function of read length. c F1 score as a function of PCR duplicates. The x-axis gives the parameter p, which can be interpreted as the percent of fragments with no PCR duplicates (Additional file 1: Supplementary Table 2). d F1 score as a function of mean fragment length (bp). Red = HM; blue = TF; solid lines = paired-end reads; dashed lines = single-end reads. e–f Evaluation of various peak calling methods on simulated TF (e) and HM (f) datasets with different noise levels. Noise levels are quantified using s, the fraction of pulled-down reads that originate from true binding sites. Blue = BCP; orange = GEM; green = MACS2; red = MUSIC; purple = HOMER.
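
The F1 score used on the y-axes above can be sketched as follows. The matching rule is an assumption made here for illustration (a called peak counts as correct if it overlaps any ground truth peak, and a ground truth peak counts as recovered if any called peak overlaps it); the criterion actually used in the paper is described in its Supplementary Methods and may differ. Peaks are represented as hypothetical half-open (start, end) intervals on a single chromosome.

```python
def overlaps(a, b):
    """Return True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def f1_score(truth_peaks, called_peaks):
    """Harmonic mean of precision and recall for overlap-based peak matching."""
    if not truth_peaks or not called_peaks:
        return 0.0
    # Precision: share of called peaks that overlap some ground truth peak.
    called_hits = sum(
        any(overlaps(c, t) for t in truth_peaks) for c in called_peaks
    )
    # Recall: share of ground truth peaks overlapped by some called peak.
    truth_hits = sum(
        any(overlaps(t, c) for c in called_peaks) for t in truth_peaks
    )
    precision = called_hits / len(called_peaks)
    recall = truth_hits / len(truth_peaks)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = [(100, 200), (500, 600), (900, 1000)]
    called = [(150, 250), (550, 580), (2000, 2100), (3000, 3100)]
    # precision = 2/4, recall = 2/3, so F1 = 4/7
    print(round(f1_score(truth, called), 4))
```

The quadratic all-pairs comparison keeps the sketch short; for genome-wide peak sets, a sorted sweep or an interval tree would be the usual choice.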

Grants and funding

1R21HG010070/HG/NHGRI NIH HHS/United States
