Genome Biol. 2017 Apr 11;18(1):67.
doi: 10.1186/s13059-017-1189-z.

DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning

Christof Angermueller et al. Genome Biol. 2017.

Abstract

Recent technological advances have enabled DNA methylation to be assayed at single-cell resolution. However, current protocols are limited by incomplete CpG coverage and hence methods to predict missing methylation states are critical to enable genome-wide analyses. We report DeepCpG, a computational approach based on deep neural networks to predict methylation states in single cells. We evaluate DeepCpG on single-cell methylation data from five cell types generated using alternative sequencing protocols. DeepCpG yields substantially more accurate predictions than previous methods. Additionally, we show that the model parameters can be interpreted, thereby providing insights into how sequence composition affects methylation variability.

Keywords: Artificial neural network; DNA methylation; Deep learning; Epigenetics; Machine learning; Single-cell genomics.

Figures

Fig. 1
DeepCpG model training and applications. a Sparse single-cell CpG profiles as obtained from scBS-seq [5] or scRRBS-seq [–8]. Methylated CpG sites are denoted by ones, un-methylated CpG sites by zeros, and question marks denote CpG sites with unknown methylation state (missing data). b Modular architecture of DeepCpG. The DNA module consists of two convolutional and pooling layers to identify predictive motifs from the local sequence context and one fully connected layer to model motif interactions. The CpG module scans the CpG neighbourhood of multiple cells (rows in b) using a bidirectional gated recurrent network (GRU) [36], yielding compressed features in a vector of constant size. The Joint module learns interactions between higher-level features derived from the DNA and CpG modules to predict methylation states in all cells. c, d The trained DeepCpG model can be used for different downstream analyses, including genome-wide imputation of missing CpG sites (c) and the discovery of DNA sequence motifs that are associated with DNA methylation levels or cell-to-cell variability (d)
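The sparse profiles in panel (a) can be pictured as one sequence of states per cell. The following is a minimal sketch, not the DeepCpG input format: the string encoding, the names `parse_profile` and `coverage`, and the example cells are all hypothetical and chosen only to mirror the ones, zeros, and question marks described in the caption.

```python
from typing import Dict, List, Optional

# Hypothetical encoding for illustration only:
# '1' = methylated, '0' = unmethylated, '?' = unknown (missing data).
STATE_MAP = {"1": 1, "0": 0, "?": None}


def parse_profile(states: str) -> List[Optional[int]]:
    """Convert a string of '1', '0' and '?' into a list of states (None = missing)."""
    return [STATE_MAP[ch] for ch in states]


def coverage(profile: List[Optional[int]]) -> float:
    """Fraction of CpG sites in the profile with an observed methylation state."""
    if not profile:
        return 0.0
    observed = sum(1 for s in profile if s is not None)
    return observed / len(profile)


# Two made-up cells covering the same five CpG sites.
raw_cells: Dict[str, str] = {"cell_1": "1?0?1", "cell_2": "??01?"}
profiles = {name: parse_profile(s) for name, s in raw_cells.items()}

for name, profile in profiles.items():
    print(name, profile, f"coverage={coverage(profile):.2f}")
```

Sites that are `None` in a given cell are the ones an imputation model would be asked to predict.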
Fig. 2
DeepCpG accurately predicts single-cell CpG methylation states. a Genome-wide prediction performance for imputing CpG sites in 18 serum-grown mouse embryonic stem cells (mESCs) profiled using scBS-seq [5]. Performance is measured by the area under the receiver-operating characteristic curve (AUC), using holdout validation. Considered were DeepCpG and random forest classifiers trained either using DNA sequence and CpG features (RF) or using additional annotations from corresponding cell types (RF Zhang [12]). Additionally, two baseline methods were considered, which estimate methylation states by averaging observed methylation states, either across consecutive 3-kbp regions within individual cells (WinAvg [5]) or across cells at a single CpG site (CpGAvg). b Performance breakdown of DeepCpG and RF, comparing the full models to models trained using either only methylation features (DeepCpG CpG, RF CpG) or only DNA features (DeepCpG DNA, RF DNA). c AUC of the methods as in (a) stratified by genomic contexts with increasing CpG coverage across cells. Trend lines were fit using local polynomial regression (LOESS [72]); shaded areas denote 95% confidence intervals. d AUC for alternative sequence contexts with All corresponding to genome-wide performance as in (a). e Genome-wide prediction performance on 12 2i-grown mESCs profiled using scBS-seq [5], as well as three cell types profiled using scRRBS-seq [8], including 25 human hepatocellular carcinoma cells (HCC), six HepG2 cells, and six additional mESCs. CGI CpG island, LMR low-methylated region, TSS transcription start site
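The two averaging baselines in panel (a) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the 3-kbp regions are assumed to be fixed bins starting at position 0, the target site is assumed to be excluded from its own window average, and the example data are made up.

```python
from typing import List, Optional, Sequence

State = Optional[int]  # 1 = methylated, 0 = unmethylated, None = missing


def _mean(values: List[int]) -> Optional[float]:
    """Mean of the observed states, or None when nothing was observed."""
    return sum(values) / len(values) if values else None


def cpg_avg(states_across_cells: Sequence[State]) -> Optional[float]:
    """CpGAvg-style estimate: average the observed states of one CpG site across cells."""
    return _mean([s for s in states_across_cells if s is not None])


def win_avg(positions: Sequence[int], states: Sequence[State],
            target_pos: int, window: int = 3000) -> Optional[float]:
    """WinAvg-style estimate: average the observed states within one cell over
    the 3-kbp bin containing the target site.

    Assumptions (not from the source): bins are consecutive and start at
    position 0, and the target site itself is left out of the average.
    """
    target_bin = target_pos // window
    observed = [
        s for p, s in zip(positions, states)
        if s is not None and p != target_pos and p // window == target_bin
    ]
    return _mean(observed)


# Made-up example: one cell with five CpG sites, one of them unobserved.
positions = [100, 900, 2500, 3200, 5900]
states = [1, None, 0, 1, 1]

print(win_avg(positions, states, target_pos=900))  # uses sites at 100 and 2500
print(cpg_avg([1, None, 0, 1]))                    # one site seen in three of four cells
```

Both functions return `None` when no observed state is available, which is the case a learned model has to handle through other features such as the DNA sequence.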
Fig. 3
Discovered sequence motifs associated with DNA methylation. Clustering of 128 motifs discovered by DeepCpG. Shown are the first two principal components of the motif occurrence frequencies in sequence windows (activity). Triangles denote motifs with significant (FDR < 0.05) similarity to annotated motifs in the CIS-BP [42] or UniPROBE [43] databases. Marker size indicates the average activity; the estimated motif effect on methylation level is shown by colour. Sequence logos are shown for representative motifs with larger effects, including ten annotated motifs
Fig. 4
Effect of single-nucleotide mutations on DNA methylation. Average genome-wide effect of single-nucleotide mutations on DNA methylation estimated using DeepCpG, depending on the distance to the CpG site and genomic context. CGI CpG island, LMR low-methylated region, TSS transcription start site
Fig. 5
Prediction of methylation variability from local DNA sequence. a Difference of motif effect on cell-to-cell variability and methylation levels for different genomic contexts. Motifs associated with increased cell-to-cell variability are highlighted in brown; motifs that are primarily associated with changes in methylation level are shown in purple. b Genome-wide correlation coefficients between motif activity and DNA sequence conservation (left), as well as cell-to-cell variability (right). c Sequence logos for selected motifs identified in (a), which are highlighted with coloured text in (b). d Boxplots of the predicted and the observed cell-to-cell variability for different genomic contexts on held-out test chromosomes (left), alongside Pearson and Kendall correlation coefficients within contexts (right). CGI CpG island, LMR low-methylated region, TSS transcription start site

References

    1. Robertson KD. DNA methylation and human disease. Nat Rev Genet. 2005;6:597–610. doi: 10.1038/nrg1655.
    2. Suzuki MM, Bird A. DNA methylation landscapes: provocative insights from epigenomics. Nat Rev Genet. 2008;9:465–76. doi: 10.1038/nrg2341.
    3. Laird PW. Principles and challenges of genome-wide DNA methylation analysis. Nat Rev Genet. 2010;11:191–203. doi: 10.1038/nrg2732.
    4. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet. 2012;13:484–92. doi: 10.1038/nrg3230.
    5. Smallwood SA, Lee HJ, Angermueller C, Krueger F, Saadeh H, Peat J, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods. 2014;11:817–20. doi: 10.1038/nmeth.3035.