Epigenetics. 2023 Dec;18(1):2137659.
doi: 10.1080/15592294.2022.2137659. Epub 2022 Dec 20.

Uncertainty quantification of reference-based cellular deconvolution algorithms

Dorothea Seiler Vellame et al. Epigenetics. 2023 Dec.

Abstract

The majority of epigenetic epidemiology studies to date have generated genome-wide profiles from bulk tissues (e.g., whole blood); however, these are vulnerable to confounding from variation in cellular composition. Proxies for cellular composition can be mathematically derived from the bulk tissue profiles using a deconvolution algorithm; however, there is no method to assess the validity of these estimates for a dataset where the true cellular proportions are unknown. In this study, we describe, validate and characterize a sample-level accuracy metric for derived cellular heterogeneity variables. The CETYGO score captures the deviation between a sample's DNA methylation profile and its expected profile given the estimated cellular proportions and cell type reference profiles. We demonstrate that the CETYGO score consistently distinguishes inaccurate and incomplete deconvolutions when applied to reconstructed whole blood profiles. By applying our novel metric to >6,300 empirical whole blood profiles, we find that the accuracy of cellular composition estimates is influenced by both technical and biological variation. In particular, we show that when using a common reference panel for whole blood, less accurate estimates are generated for females, neonates, older individuals and smokers. Our results highlight the utility of a metric to assess the accuracy of cellular deconvolution, and describe how it can enhance studies of DNA methylation that are reliant on statistical proxies for cellular heterogeneity. To facilitate incorporating our methodology into existing pipelines, we have made it freely available as an R package (https://github.com/ds420/CETYGO).
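The idea of comparing a sample's observed profile with the profile expected from its estimated proportions can be illustrated with a minimal sketch. It assumes the deviation is summarized as a root mean square error across sites; the function names, cell type labels and methylation values below are hypothetical and are not taken from the CETYGO R package.

```python
import math

def expected_profile(proportions, reference):
    """Weighted sum of cell-type reference profiles, one value per site."""
    n_sites = len(next(iter(reference.values())))
    return [
        sum(proportions[cell] * reference[cell][i] for cell in proportions)
        for i in range(n_sites)
    ]

def deviation_score(observed, proportions, reference):
    """RMSE between the observed profile and the expected profile.

    Assumption: the deviation is summarized as an RMSE across sites.
    """
    expected = expected_profile(proportions, reference)
    squared = [(o - e) ** 2 for o, e in zip(observed, expected)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical reference profiles (methylation values at four sites).
reference = {
    "cell_a": [0.9, 0.1, 0.5, 0.2],
    "cell_b": [0.1, 0.8, 0.5, 0.6],
}
# Hypothetical proportions, as if returned by a deconvolution algorithm.
proportions = {"cell_a": 0.75, "cell_b": 0.25}

# A sample that is exactly explained by the reference panel scores zero.
observed_good = expected_profile(proportions, reference)
# A sample that the panel explains poorly scores higher.
observed_poor = [0.5, 0.5, 0.9, 0.1]

score_good = deviation_score(observed_good, proportions, reference)
score_poor = deviation_score(observed_poor, proportions, reference)
print(score_good, score_poor)
```

In this toy setting a larger score indicates that the estimated proportions, combined with the reference panel, reproduce the sample less well. No true proportions are needed to compute it.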

Keywords: DNA methylation; Illumina EPIC array; cellular heterogeneity; epigenetic epidemiology; Illumina 450K array.

Conflict of interest statement

No potential conflict of interest was reported by the author(s).

Figures

Figure 1.
CETYGO captures variation in accuracy of cellular deconvolution in whole blood. Line graphs plotting the error associated with estimating the cellular proportions of reconstructed whole blood profiles with increasing proportion of noise (x-axis). The y-axis presents A) the root mean square error (RMSE) between the fixed cellular proportions used to construct the whole blood profiles and the estimated proportions generated with Houseman’s method, B) the error metric CETYGO and C) the sum of all proportions estimated. The points represent the mean value and the dashed lines the 95% confidence intervals calculated across multiple simulations. The two lines represent simulations constructed from reference data generated from two different platforms, the Illumina 450K and EPIC BeadChip microarrays.
Figure 2.
Cell-type-dependent effects on accuracy when a cell type is omitted from reference-based cellular deconvolution algorithms. Line graph of the error associated with estimating the cellular proportions of reconstructed whole blood profiles where the reference panel is missing one of six cell types. Each coloured line represents a different cell type being omitted from the reference panel, but included in the reconstructed whole blood profiles used for testing. Plotted is the proportion in the testing profile that the missing cell type is set to occupy (x-axis) against the error, measured using the CETYGO score, of the deconvolution (y-axis). The points represent the mean value and the dashed lines the 95% confidence intervals calculated across multiple simulations.
Figure 3.
The accuracy of cellular heterogeneity estimation increases as the reference panel becomes more representative. Violin plots of the error associated with estimating the cellular proportions of reconstructed whole blood profiles where the reference panel is missing between one and three cell types. Each violin plot shows the distribution of the error, measured using CETYGO, of the deconvolution (y-axis) grouped by A) the number of cell types included in the reference panel and B) the proportion of cells in the reconstructed whole blood profile that are from cell types included in the reference panel.
Figure 4.
The CETYGO score captures the tissue specificity of deconvolution reference panels. Violin plots of the error associated with estimating the cellular proportions where a reference panel consisting of six blood cell types was applied to 10,447 DNA methylation profiles, across 18 different datasets and 20 different sample types. Each violin plot shows the distribution of the error, measured using the CETYGO score, of the deconvolution (y-axis) grouped by the tissue/cell type, where the violins are coloured to highlight which samples are derived from blood, which are human-derived non-blood bulk tissue, and which are human-derived cell lines.
Figure 5.
The CETYGO score correlates with metrics of data quality. Summaries of the error associated with estimating the cellular proportions as a function of quantitative metrics of DNA methylation array signal for 725 samples from Dataset 3. A) Violin plot of the distribution of the CETYGO score, grouped by whether the sample is of sufficient quality to pass the quality control pipeline. Scatterplots of the error, measured using the CETYGO score (y-axis), for each sample against B) the median methylated (m) intensity across all sites on the microarray, C) the median unmethylated (u) intensity across all sites on the microarray, and D) the bisulphite conversion % calculated as the mean across 10 fully methylated control probes. In panels B, C and D, the points are coloured by whether or not the sample passed quality control in panel A.
Figure 6.
Error in estimation of cellular heterogeneity from DNA methylation data correlates with error from epigenetic clock algorithms. Heat scatterplot of the error, measured using the CETYGO score (y-axis), associated with estimating the cellular proportions across 6,351 whole blood profiles against the difference between the sample’s chronological age and the age predicted from the DNA methylation data using Horvath's pan-tissue algorithm (delta age; x-axis). The colour of the points represents the density of points at that location.
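Two of the quantities plotted in Figure 1 are simple to compute when the true proportions are known, as they are for reconstructed profiles: the RMSE between the fixed and estimated proportions (panel A) and the sum of the estimated proportions (panel C). The following is a minimal sketch; the cell type labels and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def proportion_rmse(true_props, est_props):
    """RMSE between true and estimated proportions across cell types."""
    cells = sorted(true_props)
    squared = [(true_props[c] - est_props[c]) ** 2 for c in cells]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical fixed proportions used to construct a profile.
true_props = {"cell_a": 0.5, "cell_b": 0.3, "cell_c": 0.2}
# Hypothetical estimates from a deconvolution algorithm.
est_props = {"cell_a": 0.4, "cell_b": 0.3, "cell_c": 0.2}

rmse = proportion_rmse(true_props, est_props)   # panel A quantity
total = sum(est_props.values())                 # panel C quantity
print(rmse, total)
```

Unlike the CETYGO score, the RMSE of the proportions requires the true proportions, so it is only available in simulations of this kind.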
