Genome Biol. 2015 Jan 24;16(1):14.
doi: 10.1186/s13059-015-0581-9.

Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements

Weiwei Zhang et al. Genome Biol. 2015.

Abstract

Background: Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is critical to enable genome-wide analyses, but current approaches tackle average methylation within a locus and are often limited to specific genomic regions.

Results: We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict methylation levels at CpG site resolution using features including neighboring CpG site methylation levels and genomic distance, co-localization with coding regions, CpG islands (CGIs), and regulatory elements from the ENCODE project. Our approach achieves 92% prediction accuracy of genome-wide methylation levels at single-CpG-site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs and is robust across platform and cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels.
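The neighbor-based features mentioned above can be illustrated with a minimal sketch. It assumes assayed CpG sites on one chromosome are given as sorted positions with one β value each. The feature names follow the Figure 7 legend (UpMethyl, DownMethyl, UpDist, DownDist), while the function names and the 0.5 cutoff for binarizing methylation status are illustrative assumptions, not details stated in the abstract.

```python
def neighbor_features(positions, betas, i):
    """Build neighbor features for the CpG site at index i.

    positions: sorted genomic coordinates (bp) of assayed CpG sites
               on a single chromosome
    betas:     methylation level (beta value in [0, 1]) at each position
    """
    if not 0 < i < len(positions) - 1:
        raise ValueError("site needs both an upstream and a downstream neighbor")
    return {
        "UpMethyl": betas[i - 1],                      # upstream neighbor's beta
        "DownMethyl": betas[i + 1],                    # downstream neighbor's beta
        "UpDist": positions[i] - positions[i - 1],     # bases to upstream site
        "DownDist": positions[i + 1] - positions[i],   # bases to downstream site
    }


def methylation_status(beta, threshold=0.5):
    """Binarize a beta value; the 0.5 threshold is an assumption."""
    return 1 if beta >= threshold else 0


positions = [100, 150, 400]
betas = [0.9, 0.8, 0.1]
features = neighbor_features(positions, betas, 1)
status = methylation_status(betas[1])
```

Feature rows of this kind, extended with the positional and regulatory-element annotations, could then be passed to any random forest implementation.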

Conclusions: Our observations of DNA methylation patterns led us to develop a classifier to predict DNA methylation levels at CpG site resolution with high accuracy. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.

Figures

Figure 1
Correlation of methylation levels between neighboring CpG sites. The x-axis represents the genomic distance in bases between the neighboring CpG sites, or assayed CpG sites that are adjacent in the genome. Different colors and points represent subsets of the CpG sites genome-wide, including pairs of CpG sites that are not adjacent in the genome but that are the specified distance apart (non-adjacent). The CGI shore and shelf CpG sites are truncated at 4,000 bp, which is the length of the CGI shore and shelf regions. The solid horizontal line represents the background (absolute value correlation or mean squared Euclidean distance, MED) level from 50,000 pairs of CpG sites from different chromosomes. (A) Absolute value of the correlation between neighboring sites across all individuals (y-axis). The lines represent cubic smoothing splines fitted to the correlation data. (B) Median MED was calculated (y-axis) across pairs of CpG sites within the genomic distance window (x-axis). bp, base pair; CGI, CpG island; MED, mean squared Euclidean distance.
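The two similarity measures in this caption can be sketched for a single pair of CpG sites as follows. This is a minimal illustration, assuming each site is represented by a list of β values, one per individual, and that MED is the squared difference averaged over individuals. The function names are hypothetical.

```python
import math


def abs_pearson(x, y):
    """Absolute value of Pearson's correlation between two sites'
    methylation values across individuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant site; treated as 0 here
    return abs(cov / math.sqrt(var_x * var_y))


def med(x, y):
    """Mean squared Euclidean distance: squared difference in beta
    values, averaged over individuals (assumed definition)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


site_a = [0.1, 0.2, 0.3]
site_b = [0.6, 0.4, 0.2]
corr = abs_pearson(site_a, site_b)
dist = med(site_a, site_b)
```

The two measures capture different things: these two sites are perfectly anti-correlated, so the absolute correlation is high even though their β values differ substantially.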
Figure 2
Histogram of correlation and MED of methylation values between pairs of CpG sites. The x-axes represent the correlation or MED of methylation values between pairs of CpG sites; the left column plots show the histogram of correlation of CpG sites within 200 kb (A), 1 Mb (C) and 10 Mb (E); the right column plots show the histogram of MED of CpG sites within 200 kb (B), 1 Mb (D) and 10 Mb (F). The background distribution is calculated from 50,000 randomly selected pairs of CpG sites and is shown in blue; the distributions of correlation and MED at the corresponding distances are shown in pink. dist, distance; kb, kilobase; Mb, megabase; MED, mean squared Euclidean distance.
Figure 3
Methylation levels and correlation within CGIs. Since each CGI is a different length, each CGI was split into 40 equal-sized windows, and methylation levels and correlation were averaged within each window. (A) The x-axis is the mean β value within a window in the CGI, CGI shore, or CGI shelf regions across all sites in all individuals with a window size of 100 bp. (B) Methylation values of each CpG site in a CGI, CGI shore, or CGI shelf were compared with all other sites in the same CGI using MED. The x-axis and y-axis represent the genomic position of each CGI with a scale of 1:100, i.e., one unit in the matrix represents a distance of 100 bp. The MED of each unit cell was calculated for all pair-wise CpG sites corresponding to that matrix position and averaged over the 100 individuals. bp, base pair; CGI, CpG island; MED, mean squared Euclidean distance.
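The windowing step in this caption, splitting each CGI into 40 equal-sized windows and averaging within each, might look roughly like the sketch below. It assumes a half-open interval [start, end) for the CGI and a list of (position, β) pairs. The function name and the choice to return None for empty windows are illustrative assumptions.

```python
def window_means(start, end, sites, n_windows=40):
    """Average beta values within equal-sized windows spanning one CGI.

    start, end: CGI boundaries in bp, treated as the half-open interval [start, end)
    sites:      iterable of (position, beta) pairs
    Returns a list of n_windows means, with None where a window has no sites.
    """
    width = (end - start) / n_windows
    sums = [0.0] * n_windows
    counts = [0] * n_windows
    for pos, beta in sites:
        if not start <= pos < end:
            continue  # ignore sites outside this CGI
        idx = min(int((pos - start) / width), n_windows - 1)
        sums[idx] += beta
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


profile = window_means(0, 400, [(0, 0.2), (5, 0.4), (395, 1.0)])
```

Normalizing to a fixed number of windows puts CGIs of different lengths on a common axis, so their per-window values can be averaged across islands.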
Figure 4
Correlation matrix of prediction features with first ten PCs of methylation levels. The x-axis corresponds to one of the 122 features; the y-axis represents PCs 1 through 10. Colors correspond to Pearson’s correlation, as shown in the legend. PC, principal component.
Figure 5
Prediction performance of methylation status and level. (A) ROC curves of cross-genome validation of methylation status prediction. Colors represent classifiers trained using the feature combinations specified in the legend. Each ROC curve represents the average false positive rate and true positive rate for prediction on the held-out sets for each of the ten repeated random subsamples. (B) ROC curves for different classifiers. Colors represent predictions from the classifiers denoted in the legend. Each ROC curve represents the average false positive rate and true positive rate for prediction on the held-out sets for each of the ten repeated random subsamples. (C) Precision–recall curves for region-specific methylation status prediction. Colors represent prediction on CpG sites within specific genomic regions as denoted in the legend. Each precision–recall curve represents the average precision–recall for prediction on the held-out sets for each of the ten repeated random subsamples. (D) Two-dimensional histogram of predicted methylation levels versus experimental methylation levels. The x- and y-axes represent assayed and predicted β values, respectively. Colors represent the density of each matrix unit, averaged over all predictions for 100 individuals. CGI, CpG island; Gene_pos, genomic position; k-NN, k-nearest neighbors classifier; ROC, receiver operating characteristic; seq_property, sequence properties; SVM, support vector machine; TFBS, transcription factor binding site; HM, histone modification marks; ChromHMM, chromatin states, as defined by ChromHMM software [107].
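For readers unfamiliar with how the ROC curves in panels A and B are traced, the following sketch computes the (false positive rate, true positive rate) points for one held-out set by sweeping a threshold over predicted scores. It assumes binary labels (1 = methylated, 0 = unmethylated) and does not perform the averaging over ten subsamples described in the caption. The function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping a threshold from the
    highest score to the lowest; tied scores are handled together."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume every prediction sharing this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


points = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
area = auc(points)
```

In this toy input every methylated site scores above every unmethylated one, so the curve rises to a true positive rate of 1 before any false positives appear.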
Figure 6
Prediction performance on WGBS data and cross-platform prediction. Precision–recall curves for cross-platform and WGBS prediction. Each precision–recall curve represents the average precision–recall for prediction on the held-out sets for each of the ten repeated random subsamples. WGBS, whole-genome bisulfite sequencing.
Figure 7
Top 20 most important features by Gini index. Gini index of the top 20 features for prediction in different genomic regions. Colors represent different types of features: neighbors in red, genomic position in green, sequence properties in blue and CREs in black. (A) Gini index for whole-genome prediction. (B) Gini index for prediction in promoter regions. (C) Gini index for prediction in CGIs. CGI, CpG island; CRE, cis-regulatory element; DHS, DNase I hypersensitive; UpMethyl, upstream CpG site; DownMethyl, downstream CpG site; UpDist, distance in bases to the upstream CpG site; DownDist, distance in bases to the downstream CpG site.
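As background for the Gini index ranking, random forest implementations commonly score a feature by the decrease in Gini impurity achieved by splits on that feature, accumulated over the trees. The sketch below shows that quantity for a single split on one feature with binary labels. It is a generic illustration with hypothetical names, not necessarily the exact computation behind this figure.

```python
def gini_impurity(labels):
    """Gini impurity of a set of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)  # equals 1 - p**2 - (1 - p)**2


def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting on `values <= threshold`."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    weighted = (len(left) * gini_impurity(left)
                + len(right) * gini_impurity(right)) / n
    return gini_impurity(labels) - weighted


# Toy example: upstream-neighbor methylation separates the two classes perfectly.
up_methyl = [0.1, 0.2, 0.8, 0.9]
status = [0, 0, 1, 1]
decrease = gini_decrease(up_methyl, status, threshold=0.5)
```

A feature whose splits repeatedly yield large decreases, as the upstream-neighbor value does in this toy case, would rank highly on such a measure.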

References

    1. Barrero MJ, Boué S, Izpisúa Belmonte JC. Epigenetic mechanisms that regulate cell identity. Cell Stem Cell. 2010;7:565–70. doi: 10.1016/j.stem.2010.10.009.
    2. Scarano MI, Strazzullo M, Matarazzo MR, D’Esposito M. DNA methylation 40 years later: Its role in human health and disease. J Cell Physiol. 2005;204:21–35. doi: 10.1002/jcp.20280.
    3. Cedar H, Bergman Y. Programming of DNA methylation patterns. Annu Rev Biochem. 2012;81:97–117. doi: 10.1146/annurev-biochem-052610-091920.
    4. Kiefer JC. Epigenetics in development. Dev Dyn. 2007;236:1144–56. doi: 10.1002/dvdy.21094.
    5. Tost J. DNA methylation: an introduction to the biology and the disease-associated changes of a promising biomarker. Mol Biotechnol. 2010;44:71–81. doi: 10.1007/s12033-009-9216-2.
