Int J Mol Sci. 2017 Feb 16;18(2):420.
doi: 10.3390/ijms18020420.

Genome-Wide Prediction of DNA Methylation Using DNA Composition and Sequence Complexity in Human

Chengchao Wu et al. Int J Mol Sci.

Abstract

DNA methylation plays a significant role in transcriptional regulation by repressing transcriptional activity. Changes in DNA methylation levels are an important factor affecting the expression of target genes and downstream phenotypes. Because current experimental technologies can assay only a small proportion of CpG sites in the human genome, reliable computational models for predicting genome-wide DNA methylation are urgently needed. Here, we propose a novel algorithm that accurately extracts sequence complexity features (seven features), and we develop a support-vector-machine-based prediction model that integrates them with previously reported DNA composition features (trinucleotide frequencies and GC content, 65 features), using the methylation profiles of human embryonic stem cells. Predictions on 22 human chromosomes with windows of varying size showed that the 600-bp window achieved the best average accuracy, 94.7%. Moreover, comparisons with two existing methods showed that our model performed better, and cross-species predictions on mouse data demonstrated that it generalizes to some extent. Finally, a statistical test comparing the experimental data and the predicted data on functional regions annotated by ChromHMM found that six out of 10 regions were consistent, which supports reliable prediction of unassayed CpG sites. Accordingly, we believe that our model will be useful and reliable for predicting DNA methylation.
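The 65 DNA composition features mentioned in the abstract (64 trinucleotide frequencies plus GC content) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `composition_features`, the handling of ambiguous bases such as N, and the normalization by the number of valid trinucleotides are all assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
# All 64 possible trinucleotides in a fixed (lexicographic) order.
TRINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=3)]


def composition_features(window: str) -> list[float]:
    """Return 64 trinucleotide frequencies followed by GC content.

    Hypothetical helper: trinucleotides containing a non-ACGT symbol
    are skipped (an assumption, not taken from the paper).
    """
    seq = window.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    # Slide a 3-bp frame over the window one base at a time.
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
            total += 1
    freqs = [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]
    # GC content over unambiguous bases only.
    called = sum(seq.count(b) for b in BASES)
    gc = (seq.count("G") + seq.count("C")) / called if called else 0.0
    return freqs + [gc]


if __name__ == "__main__":
    features = composition_features("ACGTACGTTTGCGC")
    print(len(features))   # 64 frequencies + 1 GC value
    print(features[-1])    # GC content of the window
```

In a pipeline of the kind the abstract describes, one such vector would be computed for the window centered on each CpG site and concatenated with the sequence complexity features before being passed to a classifier.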

Keywords: DNA methylation; predicted model; sequence complexity.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
DNA methylation patterns and the heat map of overall prediction accuracies: (A) bimodal distributions of DNA methylation patterns; and (B) the heat map of prediction results for combinations of different chromosomes and different window sizes.
Figure 2
Receiver Operating Characteristic (ROC) curves of different comparisons: (A) ROC curves of comparisons between two groups of features using 10-fold cross-validation; (B) ROC curves of comparisons between the top 24 important features and the remaining 48 features using independent testing; (C) ROC curves of comparisons between four common classifiers using independent testing; and (D) ROC curves of three mouse chromosome predictions.
Figure 3
Top 24 important features by normalized regression coefficients in a linear-kernel SVM. Feature importance was obtained by resampling statistics, and the corresponding error bars are shown for the top 24 features. The colors represent different groups of features: DNA composition is blue, and sequence complexity is red.
Figure 4
Statistical tests of the experimental data and the predicted data on 10 functional regions: (A) the computing procedure for one of 10 regions (Strong Transcription); and (B) the semi-violin plots show the distributions of average DNA methylation levels on 10 functional genomic regions. The p values of the six regions confirmed to be statistically consistent are labeled, and *** represents p < 0.0001.
Figure 5
Distribution differences between methylated samples and unmethylated samples on two SC features: (A) box-plots of the distributions of methylated samples and unmethylated samples on SC-1 feature and corresponding statistical test (p-value = 0.00034); and (B) box-plots of the distributions of methylated samples and unmethylated samples on SC-2 feature and corresponding statistical test (p-value = 0.00093).
Figure 6
Determination of the exponentially increasing part of the complexity function graph: (A) the complexity function graph of a 100-bp window DNA fragment with a methylated CpG site in the center. The red points represent values of the complexity function, the dark red point marks the topological entropy point, and the blue points represent the result of applying the difference operation twice to the complexity function; and (B) box plots of the distributions of EIPLP for different window sizes.
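As background for the Figure 6 caption, the following sketch computes a subword complexity function and its second difference in Python. It assumes the standard combinatorics-on-words definition (the number of distinct substrings of each length n); how the paper derives the topological entropy point, EIPLP, or its seven sequence complexity features is not stated on this page and is not reproduced here. The function names are hypothetical.

```python
def complexity_function(seq: str, max_len: int) -> list[int]:
    """Number of distinct substrings of length n, for n = 1..max_len.

    Assumes the standard subword-complexity definition; this is an
    illustration, not the paper's implementation.
    """
    s = seq.upper()
    return [
        len({s[i:i + n] for i in range(len(s) - n + 1)})
        for n in range(1, max_len + 1)
    ]


def second_difference(values: list[int]) -> list[int]:
    """Apply the forward difference operation twice."""
    first = [b - a for a, b in zip(values, values[1:])]
    return [b - a for a, b in zip(first, first[1:])]


if __name__ == "__main__":
    window = "ACGTTGCACGATCGGATCCA"  # toy sequence, not real data
    p = complexity_function(window, 6)
    print(p)
    print(second_difference(p))
```

For short n the count can grow as fast as 4^n, while for long n it is capped by the number of positions in the window, which is why the graph has an initial rapidly increasing part.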
