PLoS Comput Biol. 2025 Feb 6;21(2):e1012768.
doi: 10.1371/journal.pcbi.1012768. eCollection 2025 Feb.

A SuperLearner-based pipeline for the development of DNA methylation-derived predictors of phenotypic traits

Dennis Khodasevich et al. PLoS Comput Biol.

Abstract

Background: DNA methylation (DNAm) provides a window to characterize the impacts of environmental exposures and the biological aging process. Epigenetic clocks are often trained on DNAm using penalized regression of CpG sites, but recent evidence suggests potential benefits of training epigenetic predictors on principal components.

Methodology/findings: We developed a pipeline to simultaneously train three epigenetic predictors: a traditional CpG Clock, a PCA Clock, and a SuperLearner PCA Clock (SL PCA). We gathered publicly available DNAm datasets and applied the three development methodologies to generate i) a novel childhood epigenetic clock, ii) a reconstructed Hannum adult blood clock, and iii) as a proof of concept, a predictor of polybrominated biphenyl (PBB) exposure. We used correlation coefficients and median absolute error (MAE) to assess fit between predicted and observed measures, as well as agreement between duplicates. The SL PCA clocks improved fit with observed phenotypes relative to the PCA clocks or CpG clocks across several datasets. We found evidence for higher agreement between duplicate samples run on alternate DNAm arrays when using SL PCA clocks relative to traditional methods. Analyses examining associations between relevant exposures and epigenetic age acceleration (EAA) produced more precise effect estimates when using predictions derived from SL PCA clocks.
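The two fit measures named above can be illustrated with a short sketch. This is not the authors' code: the function names and the toy ages are hypothetical, and the residual-based definition of EAA (predicted age regressed on chronological age) is a common convention assumed here, since the abstract does not define it.

```python
from math import sqrt
from statistics import mean, median


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    mo, mp = mean(observed), mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / sqrt(var_o * var_p)


def median_absolute_error(observed, predicted):
    """Median of the absolute differences between predicted and observed values."""
    return median(abs(p - o) for o, p in zip(observed, predicted))


def age_acceleration(observed, predicted):
    """ASSUMED definition: residuals from a simple linear regression of
    predicted (epigenetic) age on observed (chronological) age."""
    mo, mp = mean(observed), mean(predicted)
    slope = (sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
             / sum((o - mo) ** 2 for o in observed))
    intercept = mp - slope * mo
    return [p - (intercept + slope * o) for o, p in zip(observed, predicted)]


# Toy, made-up ages in years (not data from the study).
chronological = [5.0, 7.0, 9.0, 11.0, 14.0]
epigenetic = [5.5, 6.5, 9.5, 10.0, 15.0]

r = pearson_r(chronological, epigenetic)
mae = median_absolute_error(chronological, epigenetic)  # 0.5 for these toy values
eaa = age_acceleration(chronological, epigenetic)
print(f"r = {r:.3f}, MAE = {mae:.2f} years")
```

The same two functions could be applied to pairs of duplicate samples (array A prediction versus array B prediction) to quantify agreement between replicates rather than fit to the observed phenotype.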

Conclusions: We introduce a novel method for the development of DNAm-based predictors that combines the improved reliability conferred by training on principal components with advanced ensemble-based machine learning. Coupling SuperLearner with PCA in the predictor development process may be especially relevant for studies with longitudinal designs utilizing multiple array types, as well as for the development of predictors of more complex phenotypic traits.
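As a rough illustration of the three training strategies, the sketch below fits a CpG-level penalized regression, a PCA-based penalized regression, and a PCA-based stacked ensemble on synthetic data. It rests on several assumptions: scikit-learn's `StackingRegressor` stands in for the SuperLearner algorithm, the base learners (elastic net and random forest), the non-negative meta-learner, the number of components, and all hyperparameters are illustrative choices, and the simulated methylation values are not from any dataset in the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a DNAm beta-value matrix (samples x CpGs).
rng = np.random.default_rng(0)
n_samples, n_cpgs = 120, 300
age = rng.uniform(0, 18, n_samples)
effects = np.zeros(n_cpgs)
effects[:20] = rng.normal(0, 0.01, 20)  # only 20 CpGs drift with age
noise = rng.normal(0, 0.02, (n_samples, n_cpgs))
betas = np.clip(0.5 + np.outer(age, effects) + noise, 0, 1)

train, test = slice(0, 90), slice(90, None)


def enet():
    # Hypothetical hyperparameters; a real pipeline would tune these.
    return ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)


# 1) Traditional CpG clock: penalized regression directly on CpG sites.
cpg_clock = enet()

# 2) PCA clock: penalized regression on principal components.
pca_clock = make_pipeline(PCA(n_components=20), enet())

# 3) SL PCA clock analogue: stacked ensemble trained on principal components,
#    with cross-validated base-learner predictions combined by a
#    non-negative linear meta-learner.
sl_pca_clock = make_pipeline(
    PCA(n_components=20),
    StackingRegressor(
        estimators=[
            ("enet", enet()),
            ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ],
        final_estimator=LinearRegression(positive=True),
        cv=5,
    ),
)

predictions = {}
for name, model in [("CpG", cpg_clock), ("PCA", pca_clock), ("SL PCA", sl_pca_clock)]:
    model.fit(betas[train], age[train])
    predictions[name] = model.predict(betas[test])
    mae = np.median(np.abs(predictions[name] - age[test]))
    print(f"{name}: test MAE = {mae:.2f} years")
```

Fitting PCA inside each pipeline means the components are learned from training samples only and then applied unchanged to the test samples, which avoids leaking test information into the predictor.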

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Childhood clock primary summary.
Correlation coefficients and median absolute error (MAE) relative to chronological age for each childhood clock development method. Childhood training data are shown for the traditional CpG clock (a), the PCA clock (b), and the SL PCA clock (c), followed by CHAMACOS testing data (d–f) and agreement between alternate array replicates in CHAMACOS (g–i). Training data color corresponds to GEO dataset. The 1:1 line is shown in black.
Fig 2. Adult clock primary summary.
Correlation coefficients and median absolute error (MAE) relative to chronological age for each Hannum clock development method. Hannum training data are shown for the traditional CpG clock (a), the PCA clock (b), and the SL PCA clock (c), with the 1:1 line shown in black. Correlation coefficients and MAE for the GSE84727 testing data are shown in (d–f), with color corresponding to schizophrenia case status: red indicates a schizophrenia case and blue indicates a control.
Fig 3. Correlation coefficients and median absolute error (MAE) to observed log-transformed PBB concentrations for the training data with the traditional CpG predictor (a), the PCA predictor (b), and the SL PCA predictor (c), and the testing data (d–f).
The 1:1 line is shown in black.
