BMC Res Notes. 2011 Aug 11;4:288.
doi: 10.1186/1756-0500-4-288.

Quantification of histone modification ChIP-seq enrichment for data mining and machine learning applications

Stephen A Hoang et al. BMC Res Notes.

Abstract

Background: The advent of ChIP-seq technology has made the investigation of epigenetic regulatory networks a computationally tractable problem. Several groups have applied statistical computing methods to ChIP-seq datasets to gain insight into the epigenetic regulation of transcription. However, the methods used to estimate enrichment levels from ChIP-seq data in these computational studies are understudied and vary from study to study. Since the conclusions drawn from these data mining and machine learning applications depend strongly on the enrichment level inputs, the estimation methods need to be compared with respect to the performance of the statistical models built on them.

Results: Various methods were used to estimate the gene-wise ChIP-seq enrichment levels for 20 histone methylations and the histone variant H2A.Z. For each estimation method, the Multivariate Adaptive Regression Splines (MARS) algorithm was applied with the estimated enrichment levels as predictors and gene expression levels as responses. The estimation methods included tag counting and model-based methods, each applied to whole genes and to specific gene regions, and each applied with estimation windows of various sizes. MARS model performance was assessed with the generalized cross-validation (GCV) score. We determined that model-based methods of enrichment estimation that spatially weight enrichment based on average patterns provided an improvement over tag counting methods. In addition, methods that included information across the entire gene body provided an improvement over methods that focus on a specific sub-region of the gene (e.g., the 5' or 3' region).
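The contrast between tag counting and spatially weighted estimation can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the equal-width binning of the gene body, and the use of a normalized average profile as bin weights are all hypothetical choices, since the abstract does not specify the model-based methods in detail.

```python
from bisect import bisect_left


def count_tags(tag_positions, start, end):
    """Tag counting: number of tags in the half-open window [start, end).

    tag_positions must be sorted in ascending order.
    """
    return bisect_left(tag_positions, end) - bisect_left(tag_positions, start)


def binned_counts(tag_positions, start, end, n_bins):
    """Split [start, end) into n_bins equal-width bins and count tags per bin."""
    width = (end - start) / n_bins
    edges = [start + round(i * width) for i in range(n_bins + 1)]
    return [
        count_tags(tag_positions, edges[i], edges[i + 1]) for i in range(n_bins)
    ]


def weighted_enrichment(bin_counts, average_profile):
    """Hypothetical spatial weighting: weight each bin by the normalized
    average enrichment profile of the mark, then sum the weighted counts."""
    total = sum(average_profile)
    weights = [p / total for p in average_profile]
    return sum(w * c for w, c in zip(weights, bin_counts))


# Toy gene spanning TSS = 100 to TES = 1100 with six sorted tag positions.
tags = [100, 150, 220, 480, 900, 950]
whole_gene_count = count_tags(tags, 100, 1100)
per_bin = binned_counts(tags, 100, 1100, n_bins=4)
# Assumed average profile that favors the 5' end of the gene.
weighted = weighted_enrichment(per_bin, average_profile=[4, 2, 1, 1])
```

Plain counting treats every tag in the window equally, whereas the weighted estimate gives more influence to bins where the mark is, on average, more enriched.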

Conclusion: The performance of data mining and machine learning methods applied to histone modification ChIP-seq data can be improved by using data across the entire gene body and by incorporating the spatial distribution of enrichment. Refinement of enrichment estimation ultimately improved the accuracy of model predictions.


Figures

Figure 1
Illustration of enrichment estimation methods. Summary of the methods used to make single-value estimates of gene-wise ChIP-seq enrichment. The first column lists the enrichment estimation methods. The second column lists the window sizes for which each method is applied. The last column shows a graphical representation of the estimation region for each method/window size combination relative to the transcription start sites (TSS) and transcription end sites (TES) of genes.
Figure 2
Comparison of enrichment estimation methods by MARS model statistics. Plots of (A) GCV and (B) R-squared values for MARS models built with each enrichment estimation method. GCV scores are sorted in descending order; small GCV scores are indicative of superior model fit. R-squared values are sorted in ascending order; large R-squared values are indicative of superior model fit. Models based on whole-gene enrichment estimates group together as the best models by both metrics.
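The two fit statistics compared in Figure 2 can be computed as in the following minimal sketch. It assumes the standard GCV form used with MARS, the mean squared residual divided by (1 - C/N)^2, where C is the effective number of model parameters and N the number of observations. How C is derived from the number of basis functions and knots depends on the MARS implementation, so here it is simply passed in as the hypothetical argument `effective_params`.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def gcv(y_true, y_pred, effective_params):
    """Generalized cross-validation score: MSE / (1 - C/N)^2.

    effective_params (C) is supplied by the caller; its definition is an
    assumption that varies between MARS implementations.
    """
    n = len(y_true)
    if not 0 <= effective_params < n:
        raise ValueError("effective_params must be in [0, n)")
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    return mse / (1.0 - effective_params / n) ** 2


# Toy expression values and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
r2 = r_squared(observed, predicted)
score = gcv(observed, predicted, effective_params=2)
```

Because the denominator shrinks as the effective parameter count grows, GCV penalizes model complexity, which is why a smaller score indicates a better fit even when R-squared alone would favor a larger model.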
Figure 3
Average histone modification enrichments stratified by gene length. Plots of average enrichment profiles from the transcription start site to 6000 bp into the gene body for H3K36me3 (A), H3K79me2 (B), and H3K79me3 (C), stratified by quintiles of gene length. The variability in slope across quintiles suggests that the enrichment pattern of each of these marks scales with gene length. For example, for the shortest quintile of genes, H3K36me3 enrichment rises rapidly from the TSS to 6000 bp into the gene body; for each successive quintile of increasing gene length, the rate of increase in enrichment over the same region is diminished.
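The stratification described in the Figure 3 caption can be sketched as follows. This is an assumption-laden illustration rather than the authors' procedure: the function names are hypothetical, quintiles are assigned by rank of gene length, and each gene's profile is assumed to be a list of per-bin enrichment values covering the same region (for example, the TSS to 6000 bp downstream).

```python
def length_quintiles(gene_lengths):
    """Assign each gene a quintile index 0..4 by rank of its length
    (0 = shortest fifth of genes, 4 = longest fifth)."""
    n = len(gene_lengths)
    order = sorted(range(n), key=lambda i: gene_lengths[i])
    labels = [0] * n
    for rank, gene_index in enumerate(order):
        labels[gene_index] = rank * 5 // n
    return labels


def mean_profiles_by_quintile(gene_lengths, profiles):
    """Average the per-bin enrichment profiles within each length quintile.

    Returns a list of five mean profiles; a quintile with no genes yields None.
    """
    labels = length_quintiles(gene_lengths)
    n_bins = len(profiles[0])
    sums = [[0.0] * n_bins for _ in range(5)]
    counts = [0] * 5
    for label, profile in zip(labels, profiles):
        counts[label] += 1
        for b, value in enumerate(profile):
            sums[label][b] += value
    return [
        [s / counts[q] for s in sums[q]] if counts[q] else None
        for q in range(5)
    ]


# Toy data: ten genes of increasing length, each with a two-bin profile.
lengths = [100 * (i + 1) for i in range(10)]
toy_profiles = [[float(i), 1.0] for i in range(10)]
quintile_labels = length_quintiles(lengths)
quintile_means = mean_profiles_by_quintile(lengths, toy_profiles)
```

Comparing the slope of each quintile's mean profile is one way to check whether a mark's enrichment pattern scales with gene length.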
