Stat Appl Genet Mol Biol. 2012 Mar 31;11(3):Article 9.
doi: 10.1515/1544-6115.1750.

Normalization, bias correction, and peak calling for ChIP-seq

Aaron Diaz et al. Stat Appl Genet Mol Biol.

Abstract

Next-generation sequencing is rapidly transforming our ability to profile the transcriptional, genetic, and epigenetic states of a cell. In particular, sequencing DNA from the immunoprecipitation of protein-DNA complexes (ChIP-seq) and methylated DNA (MeDIP-seq) can reveal the locations of protein binding sites and epigenetic modifications. These approaches contain numerous biases that may significantly influence the interpretation of the resulting data. Rigorous computational methods for detecting and removing such biases are still lacking. Multi-sample normalization also remains an important open problem. This theoretical paper systematically characterizes the biases and properties of ChIP-seq data by comparing 62 separate publicly available datasets, using rigorous statistical models and signal processing techniques. Statistical methods for separating ChIP-seq signal from background noise, as well as correcting enrichment test statistics for sequence-dependent and sonication biases, are presented. Our method effectively separates reads into signal and background components prior to normalization, improving the signal-to-noise ratio. Moreover, most peak callers currently use a generic null model which suffers from low specificity at the sensitivity level requisite for detecting subtle, but true, ChIP enrichment. The proposed method of determining a cell type-specific null model, which accounts for cell type-specific biases, is shown to be capable of achieving a lower false discovery rate at a given significance threshold than current methods.

Figures

Figure 1. ChIP-seq and MeDIP-seq
Chromatin is randomly sheared with high frequency sound waves (sonication) or digested with micrococcal nuclease (MNase). The desired Protein:DNA complex is then isolated with an antibody (yellow Y). The ends of purified DNA are then sequenced (ChIP-seq). Similarly, MeDIP-seq uses an antibody against methyl cytosine, followed by deep sequencing. Mapping the resulting short reads to the reference human genome then provides information about which genomic loci were modified or bound by a TF.
Figure 2. Comparison of scaling methods
(a) Scaling IP (top row) and Input (bottom row) samples to equalize the read counts only in the background (enclosed by parentheses) preserves the statistical significance of the IP peak shown. (b) On the other hand, forcing the total number of reads to be equal between IP and Input would artificially redistribute the counts that accumulated within the IP peak to background regions, thus inflating the noise level in Input. Some true peaks can be lost in this process.
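The contrast in Figure 2 can be illustrated with a small numeric sketch. The bin counts below are invented for illustration, the function names are hypothetical, and the background bins are simply assumed to be known in advance; this is not the authors' implementation.

```python
def scale_by_total(ip, inp):
    """Factor that makes the scaled Input total equal the IP total."""
    return sum(ip) / sum(inp)


def scale_by_background(ip, inp, background_bins):
    """Factor that equalizes read counts over the background bins only."""
    ip_bg = sum(ip[i] for i in background_bins)
    inp_bg = sum(inp[i] for i in background_bins)
    return ip_bg / inp_bg


# Hypothetical per-bin read counts: four background bins and one IP peak.
ip_counts = [2, 2, 2, 2, 40]
input_counts = [4, 4, 4, 4, 4]
background = [0, 1, 2, 3]  # assumed known here

total_factor = scale_by_total(ip_counts, input_counts)
background_factor = scale_by_background(ip_counts, input_counts, background)

# Scaled Input level that the IP peak (40 reads) is compared against.
peak_input_total = total_factor * input_counts[4]
peak_input_background = background_factor * input_counts[4]
```

With these made-up counts, total-count scaling raises the Input level in every bin to 9.6 reads, well above the 2 reads seen in the IP background, whereas background-only scaling leaves the Input level at 2 reads, so the 40-read IP peak stands out more clearly.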
Figure 3. Signal extraction scaling algorithm
We maximize the difference between the cumulative percentage tag allocation in Input (red) and IP (blue), over all partitions of the genome ordered by the IP read density. The maximizing index divides the genome into two sets of loci: high tag count bins for which, on average, percentage IP tag density will exceed percentage Input tag density, and low tag count background bins for which the opposite is true.
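A minimal sketch of the partitioning step described in this caption, assuming the genome has already been divided into bins with one read count per bin for IP and Input. The function name `ses_partition` and the choice of scaling factor (the ratio of IP to Input reads in the background bins, following the idea in Figure 2) are assumptions for illustration, not details confirmed by the abstract.

```python
from itertools import accumulate


def ses_partition(ip_counts, input_counts):
    """Split bins into background and signal by maximizing the gap between
    the cumulative Input and IP tag percentages (bins sorted by IP count)."""
    if len(ip_counts) != len(input_counts):
        raise ValueError("IP and Input must have the same number of bins")
    ip_total, input_total = sum(ip_counts), sum(input_counts)
    if ip_total == 0 or input_total == 0:
        raise ValueError("both samples need at least one read")

    # Order bins from lowest to highest IP read count.
    order = sorted(range(len(ip_counts)), key=lambda i: ip_counts[i])
    ip_cum = list(accumulate(ip_counts[i] for i in order))
    input_cum = list(accumulate(input_counts[i] for i in order))

    # Difference between cumulative Input and cumulative IP fractions.
    diffs = [c_in / input_total - c_ip / ip_total
             for c_ip, c_in in zip(ip_cum, input_cum)]
    k = max(range(len(diffs)), key=diffs.__getitem__)

    background, signal = order[:k + 1], order[k + 1:]
    if input_cum[k] == 0:
        raise ValueError("no Input reads in the background partition")
    # Assumed scaling: equalize read counts over the background bins only.
    scale = ip_cum[k] / input_cum[k]
    return background, signal, scale


# Hypothetical counts: four low-IP bins followed by two enriched bins.
bg_bins, sig_bins, factor = ses_partition([1, 1, 1, 1, 20, 30],
                                          [5, 5, 5, 5, 5, 5])
```

For these invented counts the maximizing index falls after the fourth bin, so the two enriched bins are treated as signal and the Input scaling factor is estimated from the remaining four bins.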
Figure 4. SES recovers SDS false negatives
sp100 is expressed at 2.4-fold above the median expression level of all genes in NSC. Yet, under SDS the promoter region shows no statistically significant enrichment of H3K4me3 IP over Input, despite H3K4me3 being an epigenetic mark of active transcription. Under SES, the sp110 promoter region is detected as methylated. Note that p-values were not computed at positions where Input density exceeds IP density.
Figure 5. SES preserves subtle H3K4me3 peaks
The majority of the H3K4me3 peaks detected by our method, but missed by PeakSeq, are found near transcription start sites (TSS) of known genes. The corresponding genes also show significantly higher expression levels compared to the genes that do not have any H3K4me3 peak (Wilcoxon test p-value = 2.0 × 10⁻⁸¹).
Figure 6. SES preserves subtle H3K27me3 peaks
The majority of the H3K27me3 peaks detected by our method, but missed by PeakSeq, are found near transcription start sites (TSS) of known genes. The corresponding genes also show significantly lower expression levels compared to the genes that have H3K4me3 peaks (Wilcoxon test p-value = 1.3 × 10⁻²⁵).
Figure 7. Frequency correlation and spectral energy for ChIP-seq replicates
Level 15 Coiflet wavelet decompositions were performed on the alignment densities for each replicate experiment pair in the ENCODE Yale TF Input dataset. At each level we computed the Pearson correlations of the detail coefficients, as well as their spectral energy. (A) The average correlation over all datasets, at a given wavelet level. Over all datasets, over all levels, the mean correlation was 0.89 with a standard deviation of 0.19. (B) Correlation between replicates at a given wavelet decomposition level. (C) Percentage energy allocated to a given level, averaged over all experiments. Each box summarizes the distribution of spectral energies across datasets, at a given level.
Figure 8. FDRs for the Poisson and ZINB models
FDR as a function of p-value cutoff was estimated from the distribution of p-values produced by both the Poisson and ZINB models using a beta-uniform mixture model. The ZINB model exhibits a lower FDR than the Poisson model at a given p-value cutoff.
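As a rough sketch of how an FDR curve can be read off a beta-uniform mixture, the snippet below assumes the common parameterization f(p) = λ + (1 − λ)·a·p^(a−1) with 0 < a < 1 and uses the density at p = 1 as an upper bound on the null proportion. The abstract does not state which parameterization or fitting procedure was used, so the formula, the function name `bum_fdr`, and the parameter values are all assumptions; fitting λ and a to observed p-values is not shown.

```python
def bum_fdr(tau, lam, a):
    """Estimated FDR at p-value cutoff tau under a beta-uniform mixture
    with uniform weight lam and beta shape a (assumed already fitted)."""
    if not (0.0 < tau <= 1.0 and 0.0 <= lam <= 1.0 and 0.0 < a < 1.0):
        raise ValueError("need 0 < tau <= 1, 0 <= lam <= 1, 0 < a < 1")
    # Upper bound on the proportion of true nulls: mixture density at p = 1.
    pi_ub = lam + (1.0 - lam) * a
    # Mixture CDF at tau: expected fraction of p-values at or below tau.
    cdf = lam * tau + (1.0 - lam) * tau ** a
    return pi_ub * tau / cdf


# Hypothetical fitted parameters, evaluated at two cutoffs.
fdr_strict = bum_fdr(0.01, lam=0.5, a=0.5)
fdr_loose = bum_fdr(0.05, lam=0.5, a=0.5)
```

Under this form the estimated FDR rises with the cutoff, so two models can be compared by evaluating the function at the same cutoff with each model's fitted parameters.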
Figure 9. Comparison of the distributions of (A) GC content and (B) mappability in peaks unique to individual algorithms
Peaks unique to our method generally show lower GC content and mappability. The statistical significance of the difference in distribution was assessed by the Wilcoxon rank-sum test.
