G3 (Bethesda). 2014 Feb 19;4(2):209-23.

doi: 10.1534/g3.113.008680.

Large-scale quality analysis of published ChIP-seq data

Georgi K Marinov¹, Anshul Kundaje, Peter J Park, Barbara J Wold

PMID: 24347632
PMCID: PMC3931556

Georgi K Marinov et al. G3 (Bethesda). 2014.

Affiliation

¹ Division of Biology, California Institute of Technology, Pasadena, California 91125.

Abstract

ChIP-seq has become the primary method for identifying in vivo protein-DNA interactions on a genome-wide scale, with nearly 800 publications involving the technique appearing in PubMed as of December 2012. Individually and in aggregate, these data are an important and information-rich resource. However, uncertainties about data quality confound their use by the wider research community. Recently, the Encyclopedia of DNA Elements (ENCODE) project developed and applied metrics to objectively measure ChIP-seq data quality. The ENCODE quality analysis was useful for flagging datasets for closer inspection, eliminating or replacing poor data, and for driving changes in experimental pipelines. There had been no similarly systematic quality analysis of the large and disparate body of published ChIP-seq profiles. Here, we report a uniform analysis of vertebrate transcription factor ChIP-seq datasets in the Gene Expression Omnibus (GEO) repository as of April 1, 2012. The majority (55%) of datasets scored as being highly successful, but a substantial minority (20%) were of apparently poor quality, and another ∼25% were of intermediate quality. We discuss how different uses of ChIP-seq data are affected by specific aspects of data quality, and we highlight exceptional instances for which the metric values should not be taken at face value. Unexpectedly, we discovered that a significant subset of control datasets (i.e., no immunoprecipitation and mock immunoprecipitation samples) display an enrichment structure similar to successful ChIP-seq data. This can, in turn, affect peak calling and data interpretation. Published datasets identified here as high-quality comprise a large group that users can draw on for large-scale integrated analysis. 
In the future, ChIP-seq quality assessment similar to that used here could guide experimentalists at early stages in a study, provide useful input in the publication process, and be used to stratify ChIP-seq data for different community-wide uses.

Keywords: ChIP-seq; chromatin immunoprecipitation; cross-correlation; quality assessment; transcription factor.

Figures

**Figure 1**
Sequencing library characteristics. (A) Joint distribution of library complexity and sequencing depth for all datasets examined. Vertical lines are drawn at 1 million, 5 million, and 12 million reads. Horizontal and vertical lines indicate quality classes discussed in the text. The upper right domain (number of uniquely mappable reads ≥12 million and library complexity ≥0.8) passes current quality thresholds. (B) Distribution of library complexity for ChIP-seq datasets, IgG controls, and inputs. (C) Distribution of sequencing depth for ChIP-seq datasets, IgG controls, and sonicated inputs. (D) Fraction of ChIP-seq, IgG, and input datasets exhibiting high, medium, and low complexity. (E) Fraction of studies containing libraries of high, medium, and low complexity (the distribution of the minimum library complexity observed is shown).
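
The pass criterion in the Figure 1 caption can be sketched as a simple check. The two thresholds (≥12 million uniquely mappable reads, complexity ≥0.8) come from the caption. The function names and the definition of complexity used here (distinct read positions divided by total reads) are assumptions made for illustration, and the cutoffs for the medium and low classes are not given in this text, so they are left out.

```python
# Thresholds taken from the Figure 1 caption; everything else is illustrative.
MIN_UNIQUE_READS = 12_000_000
MIN_COMPLEXITY = 0.8


def library_complexity(read_positions):
    """Fraction of reads that fall on distinct positions.

    ASSUMPTION: this definition (distinct positions / total reads) is a
    common one, but it is not stated in the text above.
    """
    reads = list(read_positions)
    if not reads:
        raise ValueError("no reads supplied")
    return len(set(reads)) / len(reads)


def passes_thresholds(unique_reads, complexity):
    """True if a library lies in the upper right domain of Figure 1A."""
    return unique_reads >= MIN_UNIQUE_READS and complexity >= MIN_COMPLEXITY


# Toy reads as (chromosome, start, strand); one position is duplicated.
toy_reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
    ("chr2", 90, "-"),
]
toy_complexity = library_complexity(toy_reads)  # 4 distinct / 5 total
```

A library with 15 million uniquely mappable reads but a complexity of 0.6 would fail this check, because both conditions must hold.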

**Figure 2**
ChIP QC assessment summary. The numbers in each box indicate the total number of datasets/studies belonging to it. SPP QC scores of +1 and +2 indicate a high degree of read clustering in a dataset. (A) Distribution of SPP QC scores for all ChIP-seq datasets examined. (B) Distribution of SPP QC scores for the best replicates for a factor/condition combination in each study. (C) Distribution of the maximum SPP QC scores for all ChIP-seq datasets in a study.

**Figure 3**
Assessment of read clustering in control datasets. The numbers in each box indicate the total number of datasets/studies belonging to it. SPP QC scores of 1 and 2 indicate a high degree of read clustering in a dataset. (A) Distribution of SPP QC scores for all control datasets (IgG + input), IgG/mock IP controls (IgG), and sonicated inputs (inputs). (B) Fraction of studies containing highly clustered inputs. The distribution of the maximum SPP QC score for all inputs in a dataset is shown. (C) Examples of a highly clustered input [mouse liver, upper two tracks (MacIsaac *et al.* 2010), QC score of 2] and an input that does not show a high extent of read clustering [mouse liver, lower two tracks (Soccio *et al.* 2011), QC score of −1]. The promoter of the *MASTL* gene is shown. All tracks are shown to the same scale, and reads mapping to the plus and minus strands are displayed separately for better visualization of the cross-correlation between the two.
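
The cross-correlation between plus- and minus-strand reads mentioned in the Figure 3 caption can be illustrated with a toy calculation. This is a minimal sketch, not the SPP implementation and not the procedure behind the QC scores above: it assumes per-base read-start counts for each strand are already available as plain lists, and the function names and example values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus[i] with minus[i + d] for every shift d in 0..max_shift."""
    profile = {}
    for d in range(max_shift + 1):
        profile[d] = pearson(plus[: len(plus) - d], minus[d:])
    return profile


# Made-up 60 bp region: two plus-strand pile-ups, each mirrored 5 bp
# downstream on the minus strand, as expected around a bound site.
plus_counts = [0] * 60
minus_counts = [0] * 60
plus_counts[10], plus_counts[40] = 5, 3
minus_counts[15], minus_counts[45] = 5, 3

profile = strand_cross_correlation(plus_counts, minus_counts, max_shift=20)
best_shift = max(profile, key=profile.get)
```

In a clustered dataset the profile peaks at a shift related to the fragment length, while an unclustered input is expected to give a comparatively flat profile.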

**Figure 4**
Effect of suboptimal datasets on combinatorial occupancy analysis. The muscle-regulatory factors MyoD and myogenin were assayed in C2C12 myocytes at 60 hr after differentiation. Shown are a single, highly successful MyoD ChIP-seq dataset and three myogenin ChIP-seq datasets: one that is similarly highly successful (“myogenin 1”), a second, weaker one (“myogenin 2”), and a third that is an experimental failure (“myogenin 3”). (A) Quality control metrics. (B, C, D) The extent of overlap of MyoD- and myogenin-binding sites as determined using each of the three myogenin datasets (see *Materials and Methods* for data processing details). MyoD and myogenin are mostly found to bind to the same sites when interactome determinations of comparable strength are used (B). A sizable group of apparently MyoD-only sites emerges when the medium-strength myogenin dataset is used because of a large number of false-negative myogenin calls (C). Finally, the unsuccessful myogenin ChIP makes it appear that most MyoD sites are not shared by myogenin (D). Numbers listed in the red blocks corresponding to each set of peak calls indicate the size of each set.
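
The effect described in the Figure 4 caption can be reproduced on toy data. The sketch below counts how many peaks in one set overlap any peak in another. It is a hypothetical illustration only: the paper's actual processing is described in its Materials and Methods, which are not part of this text, and all coordinates and names here are invented.

```python
def count_shared(peaks_a, peaks_b):
    """Number of peaks in peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chromosome, start, end) tuples with half-open coordinates.
    """
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))

    shared = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(s < end and start < e for s, e in by_chrom.get(chrom, [])):
            shared += 1
    return shared


# Invented peak calls: the weaker myogenin set misses two true sites.
myod = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
myogenin_strong = [("chr1", 120, 220), ("chr1", 480, 560), ("chr2", 40, 100)]
myogenin_weak = [("chr1", 120, 220)]

shared_strong = count_shared(myod, myogenin_strong)
shared_weak = count_shared(myod, myogenin_weak)
apparent_myod_only = len(myod) - shared_weak
```

With the weaker set, two of the three MyoD sites look MyoD-only even though the stronger set shows all three are shared, which mirrors the false-negative effect the caption describes.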

References

1. An C. I., Dong Y., Hagiwara N., 2011. Genome-wide mapping of Sox6 binding sites in skeletal muscle reveals both direct and indirect regulation of muscle terminal differentiation by Sox6. BMC Dev. Biol. 11: 59.
2. Ang Y. S., Tsai S. Y., Lee D. F., Monk J., Su J., et al., 2011. Wdr5 mediates self-renewal and reprogramming via the embryonic stem cell core transcriptional network. Cell 145: 183–197.
3. Auerbach R. K., Euskirchen G., Rozowsky J., Lamarre-Vincent N., Moqtaderi Z., et al., 2009. Mapping accessible chromatin regions using Sono-Seq. Proc. Natl. Acad. Sci. USA 106: 14926–14931.
4. Avvakumov N., Lalonde M. E., Saksouk N., Paquet E., Glass K. C., et al., 2012. Conserved molecular interactions within the HBO1 acetyltransferase complexes regulate cell proliferation. Mol. Cell. Biol. 32: 689–703.
5. Barish G. D., Yu R. T., Karunasiri M., Ocampo C. B., Dixon J., et al., 2010. Bcl-6 and NF-κB cistromes mediate opposing regulation of the innate immune response. Genes Dev. 24: 2760–2765.
