PLoS Comput Biol. 2020 Sep 23;16(9):e1008270.
doi: 10.1371/journal.pcbi.1008270. eCollection 2020 Sep.

Epiclomal: Probabilistic clustering of sparse single-cell DNA methylation data

Camila P E de Souza et al. PLoS Comput Biol.

Abstract

We present Epiclomal, a probabilistic clustering method arising from a hierarchical mixture model to simultaneously cluster sparse single-cell DNA methylation data and impute missing values. Using synthetic and published single-cell CpG datasets, we show that Epiclomal outperforms non-probabilistic methods and can handle the inherent missing data characteristic that dominates single-cell CpG genome sequences. Using newly generated single-cell 5mCpG sequencing data, we show that Epiclomal discovers sub-clonal methylation patterns in aneuploid tumour genomes, thus defining epiclones that can match or transcend copy number-determined clonal lineages and opening up an important form of clonal analysis in cancer. Epiclomal is written in R and Python and is available at https://github.com/shahcompbio/Epiclomal.

Conflict of interest statement

S.A. and S.P.S. are cofounders of and consultants to Canexia Health Inc. S.A. is a consultant to Sangamo Pharmaceuticals and Repare Therapeutics. Author Emma Laks was unable to confirm their authorship contributions. On their behalf, the corresponding author has reported their contributions to the best of their knowledge.

Figures

Fig 1. (a) EpiclomalBasic and (b) EpiclomalRegion graphical models.
In (a), the shaded node X_nm denotes the observed methylation state at CpG site m of cell n. In (b), we take into account the region location of each CpG and let the shaded node X_nrl denote the observed methylation state at CpG site l of region r of cell n. Both X_nm and X_nrl take values in S = {unmethylated, methylated} or simply S = {0, 1}. In (a) and (b), the unshaded Z_n node corresponds to the latent variable (with a value in {1, …, K}) indicating the true cluster population (epiclone) for cell n. The G_km and G_krl unshaded nodes in (a) and (b), respectively, are the latent variables with values in S that correspond to the true hidden CpG epigenotypes for each epiclone k. The unshaded μ, π, and ϵ nodes in both (a) and (b) correspond to the unknown model parameters, which under the Bayesian paradigm have prior distributions with fixed hyperparameters described by the shaded nodes with the 0 superscript. The distribution assumed for each variable or parameter is written within its node. The edges of the graphs depict dependencies. The plates depict repetitions. In EpiclomalBasic (a), true hidden epigenotypes share the same probability distribution across all CpG sites in the same epiclone (G_km ∼ Bernoulli(μ_k)). In EpiclomalRegion (b), true hidden epigenotypes follow a Bernoulli distribution with probability parameters that vary across regions (G_krl ∼ Bernoulli(μ_kr)).
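The structure described for EpiclomalBasic (a) can be illustrated by sampling from it. The following is a minimal sketch, not the authors' implementation: the function name, the 2x2 error matrix convention eps[s, t] = P(X = t | G = s), and the uniformly random masking used to create missing values are all assumptions made for illustration.

```python
import numpy as np

def simulate_epiclomal_basic(n_cells, n_sites, pi, mu, eps,
                             missing_prop=0.0, seed=0):
    """Sample synthetic data following the EpiclomalBasic graph structure.

    pi  : length-K epiclone probabilities (Z_n ~ Categorical(pi))
    mu  : length-K methylation probabilities (G_km ~ Bernoulli(mu_k))
    eps : 2x2 matrix, eps[s, t] = P(X = t | G = s)  (assumed convention)
    missing_prop : fraction of entries masked at random (assumption)
    """
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    eps = np.asarray(eps, dtype=float)
    n_clusters = len(pi)

    # True hidden epigenotype of each epiclone at each CpG site.
    G = (rng.random((n_clusters, n_sites)) < mu[:, None]).astype(int)

    # Latent epiclone membership of each cell.
    Z = rng.choice(n_clusters, size=n_cells, p=pi)

    # Observed state: methylated with probability eps[G, 1].
    p_methylated = eps[G[Z], 1]
    X = (rng.random((n_cells, n_sites)) < p_methylated).astype(float)

    # Mask entries to mimic sparse single-cell coverage.
    mask = rng.random((n_cells, n_sites)) < missing_prop
    X[mask] = np.nan
    return X, Z, G
```

With an identity error matrix and no masking, each cell's observed row equals the epigenotype of its epiclone, which is a convenient sanity check before adding noise and missingness.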
Fig 2. The three components of our proposed framework.
Input data and pre-processing: data from regions of interest are extracted from methylation call files, which can be filtered to keep only data from regions with a desired amount of missing data and methylation level IQR. A synthetic data pipeline is also provided to simulate data under different parameters. Clustering: cells are clustered using different non-probabilistic clustering methods, with results that will then be used as initial values for Epiclomal methods. Output and performance measures: different metrics are provided to evaluate the output of each method when true cluster assignments are known.
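The region filtering step in the pre-processing component can be sketched as follows. This is a hedged illustration only: the function name, the input layout (a cells-by-regions matrix of mean methylation levels with NaN for missing values), and the default thresholds are hypothetical and are not taken from the Epiclomal pipeline.

```python
import numpy as np

def filter_regions(region_meth, max_missing=0.95, min_iqr=0.1):
    """Return indices of regions that pass both filters.

    region_meth : cells x regions array of mean methylation levels,
                  with NaN marking regions not covered in a cell.
    max_missing : largest tolerated fraction of cells with no data.
    min_iqr     : smallest tolerated interquartile range across cells.
    """
    region_meth = np.asarray(region_meth, dtype=float)

    # Fraction of cells with no data for each region.
    missing = np.isnan(region_meth).mean(axis=0)

    # Interquartile range of the methylation level, ignoring missing cells.
    q75, q25 = np.nanpercentile(region_meth, [75, 25], axis=0)
    iqr = q75 - q25

    keep = (missing <= max_missing) & (iqr >= min_iqr)
    return np.flatnonzero(keep)
```

Regions with little variation across cells carry little clustering signal, which is the motivation for an IQR threshold alongside the missing-data threshold.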
Fig 3. Simulation results when varying the missing data proportion.
We report mean results produced by Epiclomal and the non-probabilistic methods taken over 30 randomly generated synthetic datasets: (a) V-measure; (b) Number of predicted clusters (true is 3); the top panel shows the proportion of data sets for which a method failed to produce a result; (c) Epiclone frequency (prevalence) MAE (mean absolute error); (d) Uncertainty true positive rate; and (e) Hamming distance for three variants of EpiclomalRegion inferred methylation states: unadjusted, adjusted, and naive (see Sections 1.2 and 3.4 in S1 Material). The vertical bars correspond to one standard deviation above and below the mean value.
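For readers unfamiliar with the V-measure reported in panel (a), the sketch below computes it from scratch as the harmonic mean of homogeneity and completeness (the standard definition with equal weighting). The function name is illustrative, and this is not the evaluation code used by the authors.

```python
import math
from collections import Counter

def v_measure(true_labels, pred_labels):
    """V-measure: harmonic mean of homogeneity and completeness."""
    n = len(true_labels)
    if n == 0 or n != len(pred_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    class_counts = Counter(true_labels)
    cluster_counts = Counter(pred_labels)
    joint_counts = Counter(zip(true_labels, pred_labels))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_class = entropy(class_counts)
    h_cluster = entropy(cluster_counts)

    # Conditional entropies H(class | cluster) and H(cluster | class).
    h_class_given_cluster = -sum(
        (c / n) * math.log(c / cluster_counts[k])
        for (_, k), c in joint_counts.items()
    )
    h_cluster_given_class = -sum(
        (c / n) * math.log(c / class_counts[t])
        for (t, _), c in joint_counts.items()
    )

    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster

    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

The score is invariant to how clusters are labelled, so a prediction that recovers the true partition under different cluster names still scores 1.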
Fig 4. Predicted cell-to-cluster assignments on synthetic data.
We report mean V-measures produced by Epiclomal and the non-probabilistic methods taken over 30 randomly generated synthetic data sets, when we vary by: (a) the number of regions, (b) the number of cells, (c) the cell-to-cell variability, (d) the number of clones, (e) the cluster frequencies (prevalences), and (f) the number of loci. The vertical bars correspond to one standard deviation above and below the mean value. The Epiclomal methods outperformed the other methods in all cases.
Fig 5. Imputation results on synthetic data.
Average Hamming distance for three variants of EpiclomalRegion inferred methylation states: unadjusted, adjusted, and naive (see Sections 1.2 and 3.4 in S1 Material) when varying: (a) the number of regions, (b) the number of cells, (c) the cell-to-cell variability, (d) the number of clones, (e) the cluster frequencies (prevalences), and (f) the number of loci. The vertical bars correspond to one standard deviation above and below the mean value.
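As a rough illustration of the imputation metric, the sketch below computes a Hamming distance between true and inferred binary methylation matrices, normalised by the number of entries. The normalisation and the function name are assumptions; the exact definition used in the paper is given in its S1 Material.

```python
import numpy as np

def mean_hamming_distance(true_states, inferred_states):
    """Fraction of entries where the inferred state differs from the truth."""
    true_states = np.asarray(true_states)
    inferred_states = np.asarray(inferred_states)
    if true_states.shape != inferred_states.shape:
        raise ValueError("matrices must have the same shape")
    return float(np.mean(true_states != inferred_states))
```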
Fig 6. Results on the real data sets.
(a) Dimensionality reduction visualization plots showing the clustering reported in the published papers on the ≈ 10,000 loci processed data sets. (b) Co-clustering between the real data published clusters on the rows and EpiclomalRegion predictions on the columns. Each entry a_ij is the percentage of cells in published class i that are present in predicted cluster j, with the rows summing up to 100%. A perfect agreement would result in a square matrix with a black diagonal. (c) V-measures comparing the cell assignments with the published assignments, with higher values meaning better agreement. (d) Cluster frequencies mean absolute error, comparing the inferred proportions of clusters with the published proportions, with lower values meaning better agreement. (e) Number of predicted clusters. The horizontal dashed lines correspond to the published number of clusters; bars closer to this line represent better agreement.
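The co-clustering matrix in panel (b) is a row-normalised contingency table. A minimal sketch of how such a matrix could be built is shown below; the function name and return layout are illustrative and not taken from the Epiclomal code.

```python
import numpy as np

def coclustering_percent(published, predicted):
    """Percentage of cells of each published class found in each predicted cluster.

    Returns the matrix together with the row and column labels.
    Each row sums to 100.
    """
    row_labels = sorted(set(published))
    col_labels = sorted(set(predicted))
    row_index = {label: i for i, label in enumerate(row_labels)}
    col_index = {label: j for j, label in enumerate(col_labels)}

    counts = np.zeros((len(row_labels), len(col_labels)))
    for pub, pred in zip(published, predicted):
        counts[row_index[pub], col_index[pred]] += 1

    # Every published class has at least one cell, so row sums are non-zero.
    percent = 100.0 * counts / counts.sum(axis=1, keepdims=True)
    return percent, row_labels, col_labels
```

The same construction applies to any pair of cell partitions, for example copy number clusters on the rows and methylation clusters on the columns.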
Fig 7. Visualization of the InHouse clusters.
(a) EpiclomalRegion clustering, with data filtered to include the most variable CGIs and obtain ≈ 15,000 loci (327 CGIs, cell average missing proportion 0.82, 558 cells). EpiclomalRegion obtained 3 clusters. Rows are cells, and columns are CGIs. (b) tSNE dimensionality reduction and color-coding of the Epiclomal clusters onto the tSNE 2-dimensional space.
Fig 8. Results for patient SA501.
(a) Mean methylation level for each of the 94 NMF-selected regions (CGIs) for patient SA501 across all cells ordered according to the four methylation clusters found using EpiclomalRegion. Rows are cells, and columns are CGIs. (b) Inferred genome-wide copy numbers for the same cells as in (a), clustered using the ward.D2 hierarchical clustering method and Euclidean copy number distances. Note that copy number 5 means five or more copies. To call copy number changes, we used the methylation sc-WGBS data. Only one epiclone and one copy number clone matched; the remaining clones transcended each other. (c) Pearson correlation between mean methylation data and copy number data in each of the 94 regions. There was correlation in chromosome X, but not in the autosomal chromosomes. (d) Heatmap showing the percentage of cells in the copy number clusters (rows) that are in the methylation clusters (columns); rows sum up to 100.
