Clin Epigenetics. 2023 Jul 13;15(1):114.
doi: 10.1186/s13148-023-01528-3.

Stability selection enhances feature selection and enables accurate prediction of gestational age using only five DNA methylation sites

Kristine L Haftorn et al. Clin Epigenetics. 2023.

Abstract

Background: DNA methylation (DNAm) is robustly associated with chronological age in children and adults, and with gestational age (GA) in newborns. This property has enabled the development of several epigenetic clocks that can accurately predict chronological age and GA. However, the reason for the lack of overlap in predictive CpGs across different epigenetic clocks remains elusive. Our main aim was therefore to identify and characterize CpGs that are stably predictive of GA.

Results: We applied a statistical approach called 'stability selection' to DNAm data from 2138 newborns in the Norwegian Mother, Father, and Child Cohort Study (MoBa). Stability selection combines subsampling with variable selection to restrict the number of false discoveries in the set of selected variables. Twenty-four CpGs were identified as being stably predictive of GA. Intriguingly, at most 10% of the CpGs in previous GA clocks were stably selected. Based on these results, we used generalized additive model regression to develop a new GA clock consisting of only five CpGs, whose predictive performance was similar to that of previous GA clocks (R² = 0.674, median absolute deviation = 4.4 days). These CpGs were in or near genes and regulatory regions involved in immune responses, metabolism, and developmental processes. Furthermore, accounting for nonlinear associations improved prediction performance in preterm newborns.
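
The general idea of stability selection described above (repeated subsampling, a variable-selection step on each subsample, and a threshold on how often each variable is selected) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation. The base selector `top_k_selector` (the k features most correlated with the outcome) is a hypothetical stand-in chosen to keep the example dependency-free; the abstract does not state which selector was used. The half-size subsamples, the number of subsamples, and `k` are likewise assumptions. Only the threshold of 0.73 is taken from the figure captions.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def top_k_selector(X, y, k):
    """Hypothetical base selector: indices of the k features most correlated with y."""
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]


def stability_selection(X, y, k=5, n_subsamples=100, threshold=0.73, seed=0):
    """Return per-feature selection probabilities and the stably selected features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        # Draw a random half of the samples without replacement (assumed scheme).
        idx = rng.sample(range(n), n // 2)
        Xs = [X[i] for i in idx]
        ys = [y[i] for i in idx]
        for j in top_k_selector(Xs, ys, k):
            counts[j] += 1
    # Selection probability = fraction of subsamples in which a feature was chosen.
    probs = [c / n_subsamples for c in counts]
    stable = [j for j, pr in enumerate(probs) if pr > threshold]
    return probs, stable


# Synthetic data: 30 features, of which only features 0, 1 and 2 drive the outcome.
data_rng = random.Random(1)
n_samples, n_features = 200, 30
X = [[data_rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * r[0] - 2.0 * r[1] + 2.0 * r[2] + data_rng.gauss(0, 1) for r in X]

probs, stable = stability_selection(X, y)
print("stably selected feature indices:", stable)
```

Features that are chosen only because of chance correlations in a particular subsample tend to have low selection probabilities, which is why thresholding the probability limits false discoveries.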

Conclusion: We present a methodological framework for feature selection that is broadly applicable to any trait that can be predicted from DNAm data. We demonstrate its utility by identifying CpGs that are highly predictive of GA and present a new and highly performant GA clock based on only five CpGs that is more amenable to a clinical setting.

Keywords: Cord blood; DNA methylation; Epigenetic clock; Epigenetics; Feature selection; Gestational age; Illumina MethylationEPIC BeadChip; MBRN; MoBa; Stability selection.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Selection probability of each CpG for the prediction of GA in cord-blood DNAm samples of newborns in MoBa (n = 2138). Each point represents a single CpG (n = 769,139). The x-axis displays the CpGs according to their genomic coordinate, while the y-axis represents the selection probability calculated from the stability selection analysis. The solid horizontal line denotes a selection probability of 0.5, where a given CpG has an equal probability of being selected or excluded. The dashed black line denotes the selection probability threshold of 0.73. Asterisks signify CpGs that were selected in previously published GA clocks (specifically, the Haftorn clock [9], the Bohlin clock [7], or the Knight clock [8]). Orange signifies a CpG with a selection probability above the threshold of 0.73, and blue signifies a CpG from a previously published clock with a selection probability below that threshold
Fig. 2
Selection probability of CpGs in our analyses that were selected for being predictive in three previously published GA clocks. a The CpGs that were selected in the Haftorn clock (n = 176). b The CpGs that were selected in the Bohlin clock (n = 86). c The CpGs that were selected in the Knight clock (n = 140). In each panel, the x-axis displays the beta coefficient for each CpG from the prediction model multiplied by the variance of DNAm in our samples, while the y-axis represents the selection probability calculated from the stability selection analysis. The solid horizontal line denotes a selection probability of 0.5 (i.e., a given CpG has an equal probability of being selected or excluded). The dashed black line denotes the selection probability threshold of 0.73. Orange signifies a selection probability above the threshold of 0.73, and blue signifies a clock-CpG with a selection probability below that threshold
Fig. 3
The relationship between DNAm level and GA for each of the 15 stably selected CpGs in the training set (n = 1709). In each of the panels (a–o), ultrasound-estimated GA (x-axis) is plotted against the DNAm level (β-value) (y-axis) for a given CpG. The orange line indicates the generalized additive model (GAM) regression of DNAm level on ultrasound-estimated GA. Orange CpG titles in panels a–e signify CpGs in the ‘5 stable CpG GA clock’
Fig. 4
The relationship between the number of CpGs used for prediction and predictive performance in the test set (n = 429). Panel a shows the R² for each of the clocks and panel b shows the corresponding median absolute deviation (MAD) in days. The red dot in each panel shows the predictive performance of a clock developed using the standard framework with lasso
Fig. 5
Prediction of GA in the test set (n = 429). a The scatter plot of GA predicted by DNAm against GA estimated by ultrasound for the ‘5 stable CpG GA clock.’ b The corresponding predictions for the ‘15 stable CpG GA clock.’ The orange diagonal line indicates the MM-type robust regression of ultrasound-estimated GA on DNAm-estimated GA
Fig. 6
Prediction of GA using a GAM versus a lasso model. Regression lines showing the relationship between ultrasound-estimated GA and predicted GA in the test set (n = 429) using a GAM including 15 CpGs (orange line) and a lasso model including 233 CpGs (blue line). The black line indicates the ideal fit between ultrasound-estimated GA and DNAm-predicted GA
Fig. 7
An illustrative example of the regulation map for cg18183624 on chromosome 17. The CpG, shown in red, is encompassed by the regulatory region ENSR00000095417 (blue-colored vertical bar). Below the regulatory region, all the genes are marked as black rectangles and those controlled by ENSR00000095417 are labeled by their gene symbols. The curves underneath the ideogram represent regulatory relationships between ENSR00000095417 and the genes, as predicted by GeneHancer
Fig. 8
Overview of sample selection and analysis flow. Datasets are highlighted in green, methods in blue, analysis output in orange, and epigenetic clocks in yellow. Two randomly sampled subsets from MoBa (dataset 1 and dataset 2) were included in the current study. Data from four individuals that were present in both datasets were excluded from dataset 2. The two datasets were then merged into a single dataset (‘combined dataset’). The samples from the combined dataset were randomly assigned to a training and test set. Stability selection was performed both on the combined dataset and the training set. Generalized additive model (GAM) regression was used to model the effect of the stably selected CpGs on gestational age (GA) to build clocks based on the stably selected CpGs. In parallel, lasso regression was performed directly on the training set to build a standard GA clock. The standard GA clock and the clocks based on the stably selected CpGs were used to predict GA in the test set
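
The evaluation flow in the Fig. 8 caption (random assignment to a training and a test set, then prediction of GA in the test set) and the two reported performance measures can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: R² is computed here as the coefficient of determination, and the median absolute deviation as the median of the absolute differences between predicted and ultrasound-estimated GA. The page does not spell out either definition. The helper names and the GA values in days are hypothetical; only the set sizes (2138, 1709, 429) come from the text.

```python
import random
from statistics import median


def split_indices(n, n_test, seed=0):
    """Randomly assign n sample indices to a training set and a test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def median_absolute_deviation(observed, predicted):
    """Median of |predicted - observed| (assumed definition), in the unit of the inputs."""
    return median(abs(p - o) for o, p in zip(observed, predicted))


# Set sizes from the text: 2138 newborns split into 1709 training and 429 test samples.
train_idx, test_idx = split_indices(2138, 429)

# Made-up GA values in days, purely to show the metric calls.
ga_ultrasound = [280, 266, 273, 259, 287]
ga_predicted = [277, 270, 273, 262, 280]

r2 = r_squared(ga_ultrasound, ga_predicted)
mad_days = median_absolute_deviation(ga_ultrasound, ga_predicted)
print(f"R^2 = {r2:.3f}, MAD = {mad_days} days")
```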

References

    1. Wang K, Liu H, Hu Q, Wang L, Liu J, Zheng Z, et al. Epigenetic regulation of aging: implications for interventions of aging and diseases. Signal Transduct Target Ther. 2022;7(1):374. doi: 10.1038/s41392-022-01211-8.
    2. John RM, Rougeulle C. Developmental epigenetics: phenotype and the flexible epigenome. Front Cell Dev Biol. 2018;6:130. doi: 10.3389/fcell.2018.00130.
    3. Villicaña S, Bell JT. Genetic impacts on DNA methylation: research findings and future perspectives. Genome Biol. 2021;22(1):127. doi: 10.1186/s13059-021-02347-6.
    4. Merid SK, Novoloaca A, Sharp GC, Küpers LK, Kho AT, Roy R, et al. Epigenome-wide meta-analysis of blood DNA methylation in newborns and children identifies numerous loci related to gestational age. Genome Med. 2020;12(1):25. doi: 10.1186/s13073-020-0716-9.
    5. Day K, Waite LL, Thalacker-Mercer A, West A, Bamman MM, Brooks JD, et al. Differential DNA methylation with age displays both common and dynamic features across human tissues that are influenced by CpG landscape. Genome Biol. 2013;14(9):R102. doi: 10.1186/gb-2013-14-9-r102.
