Nat Aging. 2025 Apr;5(4):709-719.
doi: 10.1038/s43587-024-00794-x. Epub 2025 Jan 13.

Somatic mutation as an explanation for epigenetic aging

Zane Koch et al. Nat Aging. 2025 Apr.

Abstract

DNA methylation marks have recently been used to build models known as epigenetic clocks, which predict calendar age. As methylation of cytosine promotes C-to-T mutations, we hypothesized that the methylation changes observed with age should reflect the accrual of somatic mutations, and the two should yield analogous aging estimates. In an analysis of multimodal data from 9,331 human individuals, we found that CpG mutations indeed coincide with changes in methylation, not only at the mutated site but with pervasive remodeling of the methylome out to ±10 kilobases. This one-to-many mapping allows mutation-based predictions of age that agree with epigenetic clocks, including which individuals are aging more rapidly or slowly than expected. Moreover, genomic loci where mutations accumulate with age also tend to have methylation patterns that are especially predictive of age. These results suggest a close coupling between the accumulation of sporadic somatic mutations and the widespread changes in methylation observed over the course of life.

Conflict of interest statement

Competing interests: T.I. is a cofounder of Serinus and Data4Cure, is on their scientific advisory boards and has an equity interest in both companies. T.I. is on the scientific advisory board of IDEAYA Biosciences and has an equity interest. The terms of these arrangements have been reviewed and approved by the University of California, San Diego, in accordance with its conflict of interest policies. The other authors declare no competing interests.

Figures

Extended Data Fig. 1 | Links among CpG mutations, methylome remodeling, and aging.
a) Various mutational processes affect the genome. Here, we show that some of these mutations associate with an aberrant DNA methylation pattern at both the mutated site and at numerous neighboring CpGs. b) An individual’s DNA mutation profile and DNA methylation profile make similar predictions of their calendar age and rate of aging. Panel a created with BioRender.com.
Extended Data Fig. 2 | Supplemental characterization of CpG mutations.
a) The distribution of methylation fraction values of each CpG site in the TCGA and PCAWG datasets separately (TCGA = 273,202 and PCAWG = 326,749 CpG sites) in each sample (TCGA = 8,680 and PCAWG = 651 samples). b) The CpG density (number of CpGs per base pair) in the 50 and 125 base pairs surrounding each of the CpG sites in (a). The central line of the inner boxplot represents the median, the edges of the box the interquartile range (IQR), and the whiskers 1.5-times the IQR. c) Violin plots of the distribution of mean methylation fraction of non-mutated individuals at the same mutated CpG sites as in Fig. 1d (n = 8,037 sites), stratified by CpG mutation type. d) As in (c), but the distribution of CpG density in the 125 bp surrounding each CpG site. e) Pie chart showing the proportion of CpG mutations (n = 467,079 mutations) that result in specific mutated nucleotides. Note that 5’-CpG-3’ sites are palindromic, corresponding to a 3’-GpC-5’ sequence on the opposite strand; thus, mutation of the C residue is equivalent to mutation of the complementary G residue. For simplicity, we refer to all CpG mutations by the status of the C residue. f) Violin plot showing the mean methylation fraction across all PCAWG samples, considering CpG sites where a mutation has occurred in at least one sample (left, n = 1,137 CpG sites), CpG sites where no mutation has occurred in any sample (middle, n = 325,614 CpG sites), and all measured CpG sites (right, n = 326,751). Significant difference of distribution (p ≤ 3.03 × 10^−50) is marked with (***) and non-significant (p > 0.05) with (n.s.), based on a two-sided Mann-Whitney test. g) Methylation fraction at the same mutated CpG sites as Fig. 1d (n = 8,037 sites). CpG sites are binned into five groups based on MAF, with violin plots summarizing the distribution of methylation fraction within each group. Vertical bars inside each violin represent the interquartile range. Two-sided p value calculated based on the exact distribution of Pearson’s r modeled as a beta function.
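The CpG density in panel b (number of CpGs per base pair around a site) can be illustrated with a minimal sketch. The function name `cpg_density`, the choice to treat the window as `flank` bases on each side of the site, and the clipping at sequence ends are assumptions made here for illustration; the caption does not specify them.

```python
def cpg_density(sequence: str, center: int, flank: int) -> float:
    """Return CpG dinucleotides per base pair in a window around `center`.

    Assumption: the window spans `flank` bases on each side of `center`
    (0-based index) and is clipped at the ends of `sequence`.
    """
    start = max(0, center - flank)
    end = min(len(sequence), center + flank + 1)
    window = sequence[start:end].upper()
    if len(window) < 2:
        return 0.0
    # Count every position where a C is immediately followed by a G.
    n_cpg = sum(1 for i in range(len(window) - 1) if window[i:i + 2] == "CG")
    return n_cpg / len(window)


if __name__ == "__main__":
    toy_sequence = "ACGTCGACGT"  # toy sequence, not real genomic data
    print(cpg_density(toy_sequence, center=4, flank=2))
```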
Extended Data Fig. 3 | Magnitude of methylation change near somatic mutations by tissue and genomic context.
a) Boxplots of the distribution of ΔMF10kb values for mutated (red) versus random control (n = 260,000, blue) sites for each tissue type separately (n = 813, 144, and 1,643 mutated sites from Pancreas, Brain, and Ovary tissues, respectively). P value shown for a two-sided Mann-Whitney test for a difference in median methylation fraction between the mutated and non-mutated random control loci. P value shown for a two-sided Mann-Whitney test for a difference in median absolute deviation (MAD) of ΔMF10kb between the mutated and non-mutated random control loci. The central line represents the median, the edges of the box the interquartile range (IQR), and the whiskers 1.5-times the IQR. b) A histogram of the median methylation fraction across comparison sites within ±10 kb of mutated (n = 2,600, red) and random control sites (n = 260,000, blue). Mutated sites are the same as Fig. 3b. Random control sites have been selected as before, with the additional criteria of having a methylation profile matched to that of the matched samples at mutated sites (as measured by the median methylation fraction of comparison sites, Methods). P value shown for a two-sided Mann-Whitney test for a difference in median methylation fraction between the mutated and random control loci. c) Probability distribution of ΔMF10kb values for mutated (red) versus random control (blue) sites. Mutated and random sites are the same as (b). P value calculated as in (a). d) Line plot depicting the fold enrichment for mutated over non-mutated random control sites as a function of ΔMF10kb, for the same sites as Fig. 3b. Sites are stratified depending on whether the site is a CpG and/or falls within a CpG island (n = 419 CpG-non-CGI, 21 CpG-CGI, 2,120 non-CpG-nonCGI, and 39 non-CpG-CGI sites). Fold enrichment is the ratio of the probability of observing a given ΔMF10kb for mutated sites versus non-mutated random control sites. ΔMF10kb is divided into equally spaced bins from –0.4 to 0.4. 
e) Bar chart showing the fold-enrichment of mutated sites with the most extreme methylation changes (absolute Z-score of ΔMF10kb > 1.96, n = 401 mutated sites) in various genomic regions, compared to all other mutated sites (n = 2,199 mutated sites). P values were calculated using a two-sided Fisher exact test. The categories ‘Upstream gene’ and ‘Downstream gene’ refer to variants located within 1 kb of the 5’ transcription start site and the 3’ transcription stop site, respectively, but outside the gene itself. f) As in (e), but comparing the mutated sites with the most extreme gains of methylation (Z-score of ΔMF10kb > 1) to those with the most extreme losses of methylation (Z-score of ΔMF10kb < −1). g) Boxplot of the ΔMF10kb value as a function of the mutated allele frequency (MAF). Same sites and samples as Fig. 3e (n = 3,880 mutated loci). The Pearson correlation is shown for the association of MAF with ΔMF10kb and the absolute value of ΔMF10kb. Two-sided p values were calculated based on the exact distribution of Pearson’s r modeled as a beta function. The central line represents the median, the edges of the box the interquartile range (IQR), the whiskers 1.5-times the IQR, and the points all ΔMF10kb values outside of these ranges.
Extended Data Fig. 4 | Mutation-associated methylation change in normal tissues.
a) Probability distribution of ΔMF1kb values for mutated (red) versus random control (blue) sites. Includes n = 463 mutated sites (n = 146 samples) with MAF ≤ 0.15, ≥10 matched individuals (individuals of same tissue type within ± 10 years of age), and ≥1 measured CpG within the window. Random control sites include n = 46,300 non-mutated sites (n = 146 samples, Methods). P value shown for a two-sided Mann-Whitney test for a difference in median absolute deviation (MAD) of ΔMF1kb between the mutated and non-mutated random control loci. b) Line plot depicting the fold enrichment for mutated over non-mutated sites as a function of ΔMF1kb. Fold enrichment is the ratio of the probability of observing a given ΔMF1kb for mutated sites versus the probability of that ΔMF1kb for nonmutated control sites. ΔMF1kb is divided into equally spaced bins from –0.45 to 0.45. c) Absolute ΔMF1kb as the window center is moved away from the mutated site (n = 463, red). This quantity is also shown for non-mutated random control sites (n = 46,300, blue) (Methods). Points indicate the mean value and error bars denote the 95% confidence interval. A significant difference in distribution of absolute ΔMF1kb values (two-sided t-test) is marked (**, p ≤ .01), (*, p ≤ .05). Other comparisons are non-significant (n.s., p > 0.05).
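The matching rule in panel a (individuals of the same tissue type within ±10 years of age, with at least 10 matched individuals required) can be sketched as below. The `Sample` record and both function names are hypothetical, and the sketch covers only the tissue and age criteria; any further filtering described in the Methods is not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical minimal sample record used only for this illustration."""
    sample_id: str
    tissue: str
    age: float


def matched_individuals(mutated, cohort, age_window=10.0):
    """Return samples of the same tissue whose age is within `age_window` years."""
    return [
        s for s in cohort
        if s.sample_id != mutated.sample_id
        and s.tissue == mutated.tissue
        and abs(s.age - mutated.age) <= age_window
    ]


def has_enough_matches(matches, min_matched=10):
    """Apply the minimum-matched-individuals filter (>= 10 in this caption)."""
    return len(matches) >= min_matched


if __name__ == "__main__":
    cohort = [
        Sample("a", "liver", 50),
        Sample("b", "liver", 58),
        Sample("c", "liver", 61),
        Sample("d", "colon", 50),
    ]
    matches = matched_individuals(cohort[0], cohort)
    print([s.sample_id for s in matches], has_enough_matches(matches))
```

The same helper would cover the ±5 year window and the ≥15 matched individuals used for the tumor analysis in Fig. 3b by passing `age_window=5.0` and `min_matched=15`.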
Extended Data Fig. 5 | Supplemental age prediction accuracy.
a) Bar plot indicating the correlation of chronological age with the age predictions of mutation clocks (left) or methylation clocks (right). Correlations are shown across all tumor tissues (n = 1,601) and in each of five TCGA tumor tissues individually: LGG (Brain), GBM (Brain-2), SARC (Bone), KIRP (Kidney), and THCA (Thyroid). b) As in (a) but for age predictions using samples from normal (that is non-cancerous) tissues (n = 40 individuals). c) Heatmap indicating the pairwise consistencies (Pearson correlation) among the mutation age in normal tissue, mutation age in tumor tissue, and chronological age. Data shown for n = 22 individuals with mutations measured in both normal and tumor tissues (the same individuals as from panel b with the exception of 11 colon samples and 7 liver samples as these were not available in the tumor samples). d) As in (c), but comparing predictions from methylation clocks. e) Scatter plot of human individuals, showing age predictions from the mutation model versus their chronological age. Shaded area denotes the 95% confidence interval of the line of best fit. Includes 40 individuals from four normal tissues (Methods). A two-sided p value was calculated based on the exact distribution of Pearson’s r modeled as a beta function. f) Similar to panel (b) but showing age predictions from the methylation rather than mutation model. g) Violin plots of the methylation age residual versus mutation age residual (Methods). Plots include the same individuals as in panels (b,c). Pearson r refers to the correlation between methylation age residual and mutation age residual, controlling for chronological age (that is, partial correlation, p = 1.76 × 10^−3). The central line of the inner boxplot represents the median, the edges of the box the interquartile range (IQR), the whiskers 1.5-times the IQR, and the points all the methylation age residual values. Statistics calculated as in (e).
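The partial correlation in panel g (correlation between the two age residuals, controlling for chronological age) can be illustrated with a short sketch. It assumes each residual is obtained by ordinary least-squares regression of the predicted age on chronological age; the paper's Methods may define the residuals differently, and all function names here are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals_on(covariate, values):
    """Residuals of `values` after a least-squares linear fit on `covariate`."""
    mc, mv = mean(covariate), mean(values)
    slope = (
        sum((c - mc) * (v - mv) for c, v in zip(covariate, values))
        / sum((c - mc) ** 2 for c in covariate)
    )
    intercept = mv - slope * mc
    return [v - (intercept + slope * c) for c, v in zip(covariate, values)]


def partial_correlation(age, mutation_age, methylation_age):
    """Correlate the two predicted ages after removing the linear effect of age."""
    return pearson_r(
        residuals_on(age, mutation_age),
        residuals_on(age, methylation_age),
    )


if __name__ == "__main__":
    # Toy values only: both clocks share the same per-individual deviation.
    age = [30, 40, 50, 60, 70]
    deviation = [2, -1, 3, -2, -2]
    mutation_age = [a + d for a, d in zip(age, deviation)]
    methylation_age = [a + 2 * d for a, d in zip(age, deviation)]
    print(partial_correlation(age, mutation_age, methylation_age))
```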
Extended Data Fig. 6 | Performance comparison to previous epigenetic clocks.
a) Pearson r between predicted and chronological age for Hannum, Horvath, and PhenoAge clocks across the same samples as Fig. 4b (n = 1,601). Predictions were done using the subset of features from each clock that existed in our methylation data after quality control (66%, 63%, and 61% of CpG sites from the Hannum, Horvath, and PhenoAge clocks, respectively). The performance of this study’s methylation clock is not shown as it is inherently fit to the TCGA dataset in 5-fold CV. b) Pearson r between predicted and chronological age for Hannum, Horvath, and PhenoAge clocks after re-fitting (Methods). Same samples as (a). The performance of the methylation clock trained in this study (‘This study’) is shown for reference.
Extended Data Fig. 7 | Mutation age prediction without whole-genome features.
a) Correlation of chronological versus predicted age, shown for mutation or methylation clocks built without whole-genome features (n = 1,601 individuals). Correlations are shown across all tissues and in each of five TCGA tissues individually: LGG (Brain), GBM (Brain-2), SARC (Bone), KIRP (Kidney), and THCA (Thyroid). b) As in (a) but for age predictions using samples from normal (that is non-cancerous) tissues (n = 40). c) The methylation age residual is plotted versus the mutation age residual, using clocks without whole-genome features (Methods). Violin plots summarize the same samples as in panel (a). Pearson r refers to the correlation between methylation age residual and mutation age residual, controlling for chronological age (that is, partial correlation, p = 6.66 × 10^−105). The central line of the inner boxplot represents the median, the edges of the box the interquartile range (IQR), and the whiskers 1.5-times the IQR. A two-sided p value was calculated based on the exact distribution of Pearson’s r modeled as a beta function. d) Similar to (c), but for the samples in (b). The central line of the inner boxplot represents the median, the edges of the box the interquartile range (IQR), the whiskers 1.5-times the IQR, and the points all the methylation age residual values. Statistics calculated as in (c).
Fig. 1 | Frequency and methylation status of CpG mutation events.
a, Percentage of genome-wide somatic mutations classified as CpG (n = 467,079 mutations) or non-CpG (n = 2,990,796 mutations). Expected percentages were calculated supposing mutation probability to be uniform across the genome (Methods). b, Diagram showing two categories of CpG sites: those where no individual is mutated (nonmutated CpG site, gray) and those where a mutation has occurred in at least one individual (mutated CpG site, red; bottom) and the remaining individuals are nonmutated (blue; top). c, Distribution of CpG methylation values for the categories of CpG sites from b. The methylation fractions of mutated individuals (red) and nonmutated individuals (blue) are shown for the 1,000 CpG sites with the highest MAF (corresponding to MAF > 0.53; Methods). d, Methylation change between mutated and nonmutated individuals at n = 8,037 mutated CpG sites. Methylation change is the difference between the median methylation fraction in mutated individuals and the median methylation fraction in nonmutated individuals of matched age and tissue. CpG sites are binned into five groups based on MAF, with violin plots summarizing the distribution of methylation changes within each group. Vertical bars inside each violin represent the interquartile range. The two-sided P value was calculated based on the exact distribution of Pearson’s r modeled as a beta function. MAF, mutant allele fraction.
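The observed percentage in panel a follows directly from the two mutation counts in the caption. The sketch below also shows one way an expected percentage under uniform mutation probability could be formed, as the share of genome bases that lie in CpG dinucleotides. That reading of "uniform", the function names, and the base counts passed to `expected_cpg_percentage` are assumptions for illustration only; the paper's Methods give the actual procedure.

```python
# Counts taken from the Fig. 1a caption.
N_CPG_MUTATIONS = 467_079
N_NON_CPG_MUTATIONS = 2_990_796


def observed_cpg_percentage(n_cpg: int, n_non_cpg: int) -> float:
    """Percentage of all somatic mutations that fall at CpG sites."""
    return 100.0 * n_cpg / (n_cpg + n_non_cpg)


def expected_cpg_percentage(cpg_bases: int, genome_bases: int) -> float:
    """Expected percentage if every base were equally likely to mutate.

    Assumption: the expectation equals the share of bases in CpG context.
    """
    return 100.0 * cpg_bases / genome_bases


if __name__ == "__main__":
    observed = observed_cpg_percentage(N_CPG_MUTATIONS, N_NON_CPG_MUTATIONS)
    # Placeholder base counts, NOT real genome statistics.
    expected = expected_cpg_percentage(cpg_bases=2, genome_bases=100)
    print(f"observed {observed:.1f}%, fold over expected {observed / expected:.2f}")
```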
Fig. 2 | Association of mutations with regional methylation patterns.
a, Example mutated site where the individual TCGA-GV-A3QI has a C > T mutation at chr16:56,642,556 of the hg19 human genome. Top, ideogram of chromosome 16, with a red bar indicating the location of the mutated site. The first underlying track shows hg19 base-pair coordinates, the second the documented genes in the region (encoding five metallothionein factors) and the third the locations of CpG sites measured on the Illumina 450k methylation array (vertical bars). Bottom, heatmap of CpG methylation fractions. Rows are samples (1 mutated, 28 matched), and columns are the measured CpGs within a ±50-kb window proximal to the mutation (n = 62 CpG sites). Color corresponds to the methylation fraction of each CpG. The mutated sample row and mutated site column are labeled in red, with the mutation event indicated by a lightning bolt. b, Calculation of the change in methylation fraction ΔMF with reference to a specific mutated site. Left, heatmap of methylation fractions of the mutated site and CpGs in the surrounding window, replicated from a. Right, heatmap of the corresponding differences in methylation between each sample (row) and all other samples in the matrix (median of other rows), computed separately for each site in the window (columns). The final ΔMF value was calculated as the overall methylation change of the mutated sample, taking the median across all sites in the window (Methods). Matched background samples were defined as those without any somatic mutations in the window and that were of the same tissue type and approximate age (±5 years) as the mutated sample. UCSC, University of California, Santa Cruz.
Fig. 3 | Magnitude and extent of methylation changes near somatic mutations.
a, Absolute ΔMF1kb as the window center is moved away from the mutated site (n = 2,600, red). This quantity is also shown for nonmutated random control sites (n = 260,000, blue) (Methods). Points indicate the mean value, and error bars denote the 95% confidence interval. A significant difference in the distribution of absolute ΔMF1kb values (two-sided t test) is marked (***P ≤ 0.001, **P ≤ 0.01). Other comparisons are nonsignificant (NS, P > 0.05). b, Probability distribution of ΔMF10kb values calculated in a ±10-kb window surrounding mutated (red) versus random control (blue) sites. Mutated sites include n = 2,600 mutated sites with MAF ≥ 0.8, ≥15 matched individuals (individuals of the same tissue type within ±5 years of age) and one or more measured CpGs within the window. Random control sites include n = 260,000 nonmutated sites (Methods). The same was found when controlling for the initial methylation state of mutated and random loci (Extended Data Fig. 3b,c). The P value is shown for a two-sided Mann–Whitney test for a difference in median absolute deviation (MAD) of ΔMF10kb between the mutated and nonmutated random control loci. c, Line plot depicting the fold enrichment for mutated over nonmutated sites as a function of ΔMF10kb. Fold enrichment is the ratio of the probability of observing a given ΔMF10kb for mutated sites versus the probability of that ΔMF10kb for nonmutated control sites. ΔMF10kb is divided into equally spaced bins from −0.4 to 0.4. d, Enrichment of extreme ΔMF10kb values at CpG sites and CpG islands. Top versus bottom bar charts show the 25% of mutations with the most positive versus most negative ΔMF10kb values in a (n = 650 mutations each). The enrichment of these mutations (bars, y axis) was considered for different types of sites, depending on whether the site is a CpG and/or falls within a CpG island (x-axis categories). 
Enrichment was compared to the genomic baseline (Methods), with significance determined by a one-sided binomial test. Significant enrichment (P ≤ 0.001) is marked with asterisks (***) and nonsignificant (P > 0.01) with ‘NS’. CpG islands are defined as genomic regions with ≥200 bp, ≥50% GC content and a high CpG occurrence. e, Boxplot of the absolute ΔMF10kb value as a function of the MAF. The plot includes all mutated sites with ≥15 matched samples and one or more measured CpGs within ±10 kb (n = 3,880 mutated loci). The two-sided P value was calculated based on the exact distribution of Pearson’s r modeled as a beta function. The central line represents the median; the edges of the box represent the interquartile range; the whiskers indicate 1.5 times the interquartile range; and the points represent all ΔMF10kb values outside these ranges.
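The fold enrichment in panel c (probability of a given ΔMF10kb among mutated sites divided by the probability among control sites, over equally spaced bins from −0.4 to 0.4) can be sketched as below. The number of bins, the treatment of values outside the range (dropped), and the function names are assumptions; the caption fixes only the range and the equal spacing.

```python
def binned_probabilities(values, lo=-0.4, hi=0.4, n_bins=8):
    """Fraction of in-range values falling into each equally spaced bin."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    kept = 0
    for v in values:
        if lo <= v <= hi:
            # The upper edge belongs to the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
            kept += 1
    return [c / kept for c in counts] if kept else [0.0] * n_bins


def fold_enrichment(mutated, control, lo=-0.4, hi=0.4, n_bins=8):
    """Per-bin ratio of mutated to control probability (None if control is 0)."""
    p_mut = binned_probabilities(mutated, lo, hi, n_bins)
    p_ctl = binned_probabilities(control, lo, hi, n_bins)
    return [pm / pc if pc > 0 else None for pm, pc in zip(p_mut, p_ctl)]


if __name__ == "__main__":
    # Toy values: mutated sites sit in the tails, controls near zero.
    mutated = [-0.35, -0.3, 0.3, 0.35]
    control = [-0.3, -0.1, -0.05, 0.05, 0.1, 0.15, 0.3, 0.05]
    print(fold_enrichment(mutated, control, n_bins=4))
```

Returning `None` for bins with no control sites avoids a division by zero; a pseudocount would be an alternative design.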
Fig. 4 | Association among mutation age, methylation age and chronological age.
a, Methylation clock: the methylation fractions of CpGs are used in a gradient boosted tree model to predict chronological age. Mutation clock: the count of mutations around the same CpGs is used in an identical model to predict chronological age. Both models incorporate similar covariates and whole-genome features (Methods). b, Scatter plot of human individuals, showing age predictions from the mutation model versus their chronological age. The plot includes n = 1,601 individuals with samples from five tissues (Methods). c, Similar to b but showing age predictions from the methylation rather than mutation model for the same individuals. d, Violin plots of the methylation age residual versus the mutation age residual (Methods). The plot includes the same individuals as in b and c. Pearson’s r refers to the correlation between the methylation age residual and the mutation age residual (that is, partial correlation, P = 1.48 × 10^−82, two-sided P value calculated based on the exact distribution of Pearson’s r modeled as a beta function). The central line of the inner boxplot represents the median; the edges of the box represent the interquartile range; and the whiskers represent 1.5 times the interquartile range. e, Distribution of methylation age residuals for the same individuals as in b and c, computed according to each of four previous methylation clocks. ‘This study’ refers to the methylation clock shown in c (Methods). For each clock, the 20% (n = 320) of individuals with the youngest mutation age for their chronological age are shown in a lighter color (low mutation age residual), and the 20% (n = 320) of individuals with the oldest mutation age for their chronological age are shown in a darker color (high mutation age residual). The asterisks (***) indicate a significant (P ≤ 10^−51) difference in distribution between the low and high mutation residual age groups, based on a two-sided Mann–Whitney U test.
f, Bar plot depicting the ratio of observed versus expected overlap between sets of age-associated CpG sites. For the same individuals and CpG sites as in c, the CpGs with maximal (top 1%, 5% and 10%) Pearson’s correlation between local mutation burden (±10 kb) and age and between methylation fraction and age were chosen. The intersection (overlap) between these sets was compared to the expected intersection assuming random selection (Methods). Significant enrichment based on a two-sided binomial test (P ≤ 10^−5, Bonferroni corrected) is marked with asterisks (***). g, Mean mutation burden (left y axis) or mean methylation fraction (right y axis) plotted versus chronological age (x axis) for CpG site cg19236454. Data were from brain (LGG, low-grade glioma) samples, considering individuals with a nonzero mutation burden (±10 kb) at this site (n = 67). Pearson’s correlation with chronological age: mutation burden = 0.18, methylation = −0.18. Error bars denote the standard error. h, Diagram summarizing the relationships among three measures of age: mutation, methylation and chronological time. The variance explained was calculated as the squared Pearson’s correlation between each pair of measures for the same individuals as in b and c. MAE, mean absolute error.
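The observed-versus-expected overlap in panel f can be sketched as follows. Under random selection, two sets of sizes a and b drawn from N sites are expected to share a × b / N sites, so the ratio is the observed intersection divided by that quantity. Ranking sites by the absolute value of their correlation with age is an assumption made here (the caption says only "maximal"), and the function names and toy scores are hypothetical.

```python
def top_fraction(scores, fraction):
    """Return the site IDs with the largest absolute scores (top `fraction`)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=lambda site: abs(scores[site]), reverse=True)
    return set(ranked[:k])


def overlap_ratio(mutation_scores, methylation_scores, fraction):
    """Observed / expected overlap of the two top-`fraction` site sets."""
    shared = mutation_scores.keys() & methylation_scores.keys()
    top_mut = top_fraction({s: mutation_scores[s] for s in shared}, fraction)
    top_meth = top_fraction({s: methylation_scores[s] for s in shared}, fraction)
    observed = len(top_mut & top_meth)
    # Expected intersection size if both sets were chosen at random.
    expected = len(top_mut) * len(top_meth) / len(shared)
    return observed / expected


if __name__ == "__main__":
    # Toy correlations with age for ten hypothetical CpG sites.
    sites = [f"site{i}" for i in range(10)]
    mutation_r = dict(zip(sites, [0.9, 0.8, 0.1, 0.1, 0.0, 0.0, 0.1, 0.0, 0.1, 0.0]))
    methylation_r = dict(zip(sites, [-0.7, -0.9, 0.2, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1]))
    print(overlap_ratio(mutation_r, methylation_r, fraction=0.2))
```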
