Nat Genet. 2022 Aug;54(8):1155-1166.
doi: 10.1038/s41588-022-01121-z. Epub 2022 Jul 14.

Genome-wide analyses of 200,453 individuals yield new insights into the causes and consequences of clonal hematopoiesis

Siddhartha P Kar et al. Nat Genet. 2022 Aug.

Abstract

Clonal hematopoiesis (CH), the clonal expansion of a blood stem cell and its progeny driven by somatic driver mutations, affects over a third of people, yet remains poorly understood. Here we analyze genetic data from 200,453 UK Biobank participants to map the landscape of inherited predisposition to CH, increasing the number of germline associations with CH in European-ancestry populations from 4 to 14. Genes at new loci implicate DNA damage repair (PARP1, ATM, CHEK2), hematopoietic stem cell migration/homing (CD164) and myeloid oncogenesis (SETBP1). Several associations were CH-subtype-specific including variants at TCL1A and CD164 that had opposite associations with DNMT3A- versus TET2-mutant CH, the two most common CH subtypes, proposing key roles for these two loci in CH development. Mendelian randomization analyses showed that smoking and longer leukocyte telomere length are causal risk factors for CH and that genetic predisposition to CH increases risks of myeloproliferative neoplasia, nonhematological malignancies, atrial fibrillation and blood epigenetic ageing.

Conflict of interest statement

G.S.V. is a consultant to STRM.BIO and holds a research grant from AstraZeneca for research unrelated to that presented here. J.M. and S.P. are current employees and/or stockholders of AstraZeneca. The remaining authors declare no competing interests.

Figures

Fig. 1. Characterization of CH in the UKB.
a, Composite plot summarizing mutations in the ten most common driver genes in 10,924 individuals with CH. Each column in the waterfall plot represents a single individual, with mutation types color-coded. Bars on the left quantify mutations per gene as a percentage of all CH mutations identified. Violin plots on the right show the distribution of VAFs, with vertical lines representing the median and dots with horizontal lines the mean ± s.d. b, Empirical cumulative distribution (ECD) of the age of individuals with CH overall (black) and stratified by the eight most common driver genes. Compared with DNMT3A, mutations in ATM were observed 3 years earlier (P = 7.2 × 10−4), while mutations in ASXL1, PPM1D, SRSF2 and SF3B1 were observed 1 (P = 2.7 × 10−8), 1 (P = 8.5 × 10−6), 2 (P = 5.7 × 10−10) and 3 (P = 6.5 × 10−6) years later, respectively. Differences were calculated using two-sided pairwise Wilcoxon rank sum tests. c, Bar plot showing the female to male (F:M) ratio of CH carriers with mutations in the ten most common driver genes. GNB1 (P = 2.3 × 10−3) and DNMT3A (P = 3.2 × 10−11) show a higher F:M ratio, while PPM1D (P = 6.4 × 10−3), TP53 (P = 1 × 10−3), JAK2 (P = 5.7 × 10−3), SF3B1 (P = 3.5 × 10−4), ASXL1 (P = 5.7 × 10−28) and SRSF2 (P = 3.8 × 10−14) show a lower F:M ratio. ‘Other’ represents the remaining driver genes grouped together and ‘Ctrl’ the ratio for individuals without CH. Dotted vertical line shows the F:M ratio observed in the full cohort (F:M = 1.2). P values are from a chi-squared test comparing the distribution for each gene with ‘Ctrl’.
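The chi-squared comparison in panel c can be illustrated with a minimal sketch. It assumes a 2 × 2 table (female/male counts for carriers of one driver gene versus 'Ctrl') and a Pearson test with 1 degree of freedom and no continuity correction; the caption does not state these details, and the counts below are hypothetical, not taken from the study.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 d.f., no continuity correction)
    for the 2 x 2 table [[a, b], [c, d]]. Returns (statistic, P value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 d.f., the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts (assumption): females and males among carriers of
# one driver gene, and among individuals without CH ('Ctrl').
gene_f, gene_m = 300, 500
ctrl_f, ctrl_m = 6000, 5000

stat, p_value = chi2_2x2(gene_f, gene_m, ctrl_f, ctrl_m)
print(f"F:M (gene) = {gene_f / gene_m:.2f}, F:M (Ctrl) = {ctrl_f / ctrl_m:.2f}")
print(f"chi-squared = {stat:.1f}, P = {p_value:.2e}")
```

A gene whose F:M ratio matches that of 'Ctrl' gives a statistic near zero and a P value near 1.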
Fig. 2. Associations between CH and diverse traits/diseases.
a,b, Heatmaps showing associations between overall CH (CH); CH with large (CH large) and small (CH small) clones; and CH driven by DNMT3A, TET2, ASXL1, JAK2 and SRSF2 + SF3B1 mutations, with prevalent (a) or incident (b) traits/diseases. Prevalent traits include baseline characteristics, blood counts and serum biochemistry, while incident traits include hematopoietic neoplasms, cancer types, CVDs and death. Red–blue color scale represents the OR or hazard ratio (HR). ORs were calculated using a logistic regression model, while HRs were calculated using competing risk models, except for the death analysis where a Cox proportional hazards model was used. Gray color represents failure of the logistic regression model (maximum likelihood estimation algorithm) to converge. Asterisks represent significant associations, and their size represents different unadjusted P value cut-offs. All ORs/HRs, 95% CIs, sample sizes, P values and FDRs (to adjust for multiple comparisons) are reported in Supplementary Tables 5–7, 10 and 13. T2DM, type 2 diabetes mellitus; RDW, red blood cell (erythrocyte) distribution width; PDW, platelet distribution width; PCT, plateletcrit; PLT, platelet count; WBC, white blood cell (leukocyte) count; HLR, high light scatter reticulocyte count; RET, reticulocyte count; NE, neutrophil count; MO, monocyte count; LY, lymphocyte count; RBC, red blood cell (erythrocyte) count; NRBC, nucleated RBC; EO, eosinophil count; HT, hematocrit percentage; HGB, hemoglobin concentration; MCH, mean corpuscular hemoglobin; MCV, mean corpuscular volume; CYS, cystatin C; PHOS, phosphate; GGT, gamma-glutamyltransferase; ALT, alanine aminotransferase; AST, aspartate aminotransferase; APOA, apolipoprotein A; HDL, high-density lipoprotein cholesterol; LDLD, low-density lipoprotein direct cholesterol; CHOL, total cholesterol; HBA1C, glycosylated hemoglobin; AML, acute myeloid leukemia; MDS, myelodysplastic syndromes; CMML, chronic myelomonocytic leukemia.
Fig. 3. Cell-type-specific enrichment of the CH polygenic signal.
a, Heritability enrichment of CH across histone marks profiled in ten cell-type groups. b, Heritability enrichment of CH across open chromatin regions identified by ATAC-seq in hematopoietic progenitor cells/lineages at different stages of differentiation. Partitioned heritability cell-type group analysis in the LDSC software was used to compute these enrichments and corresponding P values. The data underlying the figures are available in Supplementary Tables 14 and 15. CNS, central nervous system; GI, gastrointestinal; CLP, common lymphoid progenitor; CMP, common myeloid progenitor; MPP, multipotent progenitor; GMP, granulocyte/macrophage progenitor; LMPP, lymphoid-primed multipotent progenitor; NK, natural killer cell; Mono, monocyte; Erythro, erythroid progenitor.
Fig. 4. Manhattan plots displaying genome-wide associations between common germline genetic variants and each of five CH traits.
The y axes depict P values (−log10) for associations derived from the noninfinitesimal mixed model association test implemented in BOLT-LMM. The x axes depict chromosomal position on build 37 of the human genome (GRCh37). The dotted lines indicate the genome-wide significance threshold of P = 5 × 10−8. Known (previously published) and new loci are indicated by cytoband and target gene (based on the prioritization exercise described in the text). Since there were multiple independent loci at 5p15.33 (LD r² < 0.05), we also label the 5p15.33 signals using the lead variant rs number for each signal. Our prioritization exercise was focused on protein coding genes near each lead variant and since there were no protein coding genes within 1 Mb of the lead variant at 5p13.3, we labeled this association using the nearest noncoding RNA. The CH traits corresponding to each Manhattan plot are: a, Overall CH. b, CH with mutant DNMT3A. c, CH with mutant TET2. d, CH with large clones. e, CH with small clones.
Fig. 5. Gene-level association and PPI network analyses.
a, Gene-level associations in the 6p21.1 region within 250 kb of CRIP3, that is, between GRCh37 positions 43,017,448 and 43,526,535 on chromosome 6. The x axis lists all the genes in this region that were tested by both MAGMA and SMR. MAGMA uses a multiple linear principal components regression model while SMR is based on the Wald test. CRIP3 was the only gene located more than 1 Mb away from a GWAS-identified lead variant that was found to be associated with CH at gene-level genome-wide significance by both MAGMA and SMR. The y axis depicts the P value (−log10) for association in the MAGMA and SMR analyses. The gene-level genome-wide significance threshold in MAGMA (P = 2.6 × 10−6 after accounting for 19,064 genes tested) is indicated by the blue dashed line and in SMR (P = 3.2 × 10−6 after accounting for 15,672 genes tested) by the orange dotted line. Both CRIP3 and SRF had SMR HEIDI P > 0.05 indicating colocalization of the GWAS and eQTL associations. The HEIDI test is a test of heterogeneity of Wald ratio estimates. b, Largest subnetwork of genes/proteins associated with overall CH risk identified by the NetworkAnalyst tool. NetworkAnalyst uses a ‘Walktrap’ random walks search algorithm to identify the largest first-order interaction network. All genes (n = 57) with MAGMA P < 0.001 in the overall CH analysis were mapped to proteins and used as ‘seeds’ for network construction, which was done by integrating high-confidence PPIs from the STRING database. The largest subnetwork constructed contained 13 of the 57 seed proteins and included 210 nodes and 231 edges. The colored nodes indicate seed proteins that interact with at least two other proteins in this subnetwork, with the intensity of redness increasing with the number of interacting proteins. Seed proteins that interact with six or more other proteins in the subnetwork are named above their corresponding node.
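The two gene-level thresholds quoted in panel a are consistent with a Bonferroni correction at a family-wise error rate of 0.05. The caption does not name the correction, so the 0.05 level is an assumption here. A minimal sketch that reproduces the quoted values from the numbers of genes tested:

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction.
    alpha = 0.05 is an assumption; the caption gives only the result."""
    return alpha / n_tests

# Numbers of genes tested, as stated in the caption.
for tool, n_genes in (("MAGMA", 19_064), ("SMR", 15_672)):
    print(f"{tool}: P = {bonferroni_threshold(n_genes):.1e}")
```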
Fig. 6. IVW MR forest plots with CH traits as outcomes.
a–c, ORs for CH risk are represented as per (1) standard deviation unit for continuous exposures (alcohol use in drinks per week, BMI, waist-to-hip ratio adjusted for BMI (WHRadjBMI) (a); LTL, two epigenetic aging traits, and red cell, white cell and platelet counts (b); and five circulating lipid traits (c)) and (2) log-odds unit for binary exposures (smoking initiation (ever having smoked regularly) and genetic liability to T2D (a)). IVW regression was used for all MR analyses, and results were not adjusted for multiple comparisons. Details of units are provided in Supplementary Table 34. Symbols represent OR markers, and OR marker symbols with corresponding P < 0.05 are represented by filled circles. Error bars represent 95% CIs. Sample sizes for the smoking, alcohol, BMI, WHRadjBMI, T2D, apolipoproteins B and A-I, LDL, HDL and triglycerides analyses are provided in Supplementary Table 34. Sample sizes for the LTL, IEAA, Hannum and three blood cell count analyses are provided in Supplementary Table 35. Full results, including from sensitivity analyses, are presented in Supplementary Tables 36–38. WHRadjBMI, waist-to-hip ratio adjusted for BMI; LDL, low-density lipoprotein cholesterol; IEAA, intrinsic epigenetic age acceleration.
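As an illustration of the inverse-variance-weighted (IVW) estimator named in this caption, the sketch below combines per-variant Wald ratios using fixed-effect, first-order weights. Whether the study used fixed-effect or random-effects IVW is not stated here, so that choice is an assumption, and the per-variant effect sizes are hypothetical.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW Mendelian randomization estimate.
    Each variant contributes a Wald ratio beta_outcome / beta_exposure,
    weighted by beta_exposure**2 / se_outcome**2 (first-order weights)."""
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    return beta, se

# Hypothetical summary statistics for three variants (assumption):
# effects on the exposure (s.d. units) and on CH risk (log-odds units).
bx = [0.10, 0.20, 0.15]
by = [0.020, 0.050, 0.030]
se_y = [0.010, 0.015, 0.012]

beta, se = ivw_estimate(bx, by, se_y)
odds_ratio = math.exp(beta)
ci_low, ci_high = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)
print(f"OR per s.d. = {odds_ratio:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Exponentiating the estimate and its Wald interval gives an OR and 95% CI of the kind plotted in the forest plots.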
Fig. 7. IVW MR forest plots with CH traits as exposures.
Forest plots with OR markers (for cancers and cardiovascular/metabolic traits) or exponentiated beta coefficient (exp(beta)) markers (for blood cell traits, lipids, adiposity measures and epigenetic aging indices). ORs/exp(betas) are represented as per log-odds unit increase in genetic liability to overall CH (a) or DNMT3A-CH (b). OR/exp(beta) markers with corresponding P < 0.05 are represented by filled circles. IVW regression was used for all MR analyses, and results were not adjusted for multiple comparisons. Symbols represent OR markers and error bars represent 95% CIs. Red symbols and error bars represent results using genetic instruments comprised exclusively of genome-wide significant (P < 5 × 10−8) variants. Black symbols and error bars represent results when using genome-wide significant and sub-GWS (P < 10−5) variants in the genetic instrument. Large effect size estimates (ORs/exp(betas)) are shown in the lower panels. Sample sizes for all genome-wide association datasets used are provided in Supplementary Table 35. Full results, including from sensitivity analyses, are presented in Supplementary Tables 39 and 40. IS, ischemic stroke.
Extended Data Fig. 1. Characterization of CH in the UK Biobank.
a, Histogram stratified by sex showing the age distribution of individuals in the UKB cohort (n=200,453). b, Overall percentage of females and males in the UKB cohort. c, Percentage of the most common self-reported ancestry groups in the UKB cohort. Ancestry groups with a frequency lower than 1% were grouped under the ‘Other ancestry group’ category. d, Number of individuals with 1, 2, 3, and 4 somatic mutations. More than 90% of individuals with CH had only one driver mutation identified. e, Percentages of different CH mutation types identified. f, Relative prevalence of each of the six base substitution types amongst the identified CH mutations. g, Density plot showing the variant allele fraction (VAF) distribution of all CH somatic mutations. h, Density plot showing similar VAF distribution for different mutation types. Mean and median are indicated for g and h.
Extended Data Fig. 2. Age distribution of CH by mutant gene, clone size, and sex.
a, Prevalence of CH in the cohort with advancing age. The blue line represents the smoothed model fitted to a generalized additive model with 95% confidence interval (CI; gray shadow). b, Prevalence of CH by age stratified by the top eight most frequently mutated genes. Colored lines represent the smoothed model fitted to a generalized additive model with 95% CI (colored shadows). Y-axis is log-scaled. c, Clone size, estimated by the variant allele fraction (VAF), increases with age. The blue line represents the smoothed model fitted to a generalized additive model and the shadow represents the 95% CI. d, Empirical cumulative distribution (ECD) of the age of individuals with CH stratified by sex. CH was observed one year earlier in females than in males (median 61 versus 62 years; P = 1.6 × 10−4, two-sided pairwise Wilcoxon rank sum test).
Extended Data Fig. 3. Associations between CH and diseases.
a, Phenome-wide association study of CH and incident disease outcomes. Phenotypes were extracted from the International Classification of Diseases version-10 (ICD-10) disease codes and grouped in different categories. A total of 11,787 ICD-10 codes were tested using logistic regression, obtaining results for 2,378. Risk ratio (RR) of each code is represented by a single point with a size scale. The black dashed line represents the phenome-wide significant P-value threshold of 10−9. Only ICD-10 codes with false discovery rate (FDR)<10−15 are annotated to control for multiple comparisons. Full results with RRs, 95% confidence intervals (CIs), P-values and FDRs are reported in Supplementary Table 8. b, Heatmaps showing associations of overall CH (CH), CH with large (CH large) and small (CH small) clones, and CH driven by DNMT3A, TET2, ASXL1, JAK2, and SRSF2+SF3B1 mutations with incident hematopoietic neoplasms and cancer in self-reported non-smokers. Red-blue color scale represents the hazard ratio (HR). HRs were calculated using competing risks models. Gray color represents failure of the logistic regression model (maximum likelihood estimation algorithm) to converge. Asterisk represents a significant association, and its size represents different unadjusted P-value cut-offs. All HRs, 95% CIs, sample sizes, P-values, and FDRs (to adjust for multiple comparisons) are reported in Supplementary Table 11. c, Forest plot showing the hazard ratios (HRs) for cardiovascular disease (CVD) from competing risks analysis in CH using four models: univariate with CH as the only predictor, bivariate including CH and smoking status, bivariate with CH and age, and trivariate with CH, smoking status and age. HR markers with unadjusted P<0.05 are depicted in blue. Symbols represent the HRs and error bars represent 95% CIs. All HRs, 95% CIs, sample sizes, P-values, and FDRs (to adjust for multiple comparisons) are reported in Supplementary Table 12. 
Abbreviations: AML, acute myeloid leukemia; MDS, myelodysplastic syndromes; MPN, myeloproliferative neoplasms; CMML, chronic myelomonocytic leukemia; CVD, cardiovascular disease.
Extended Data Fig. 4. Multi-ancestry associations for the seven lead variants for overall CH risk.
Comparison of effect size estimates (odds ratios (ORs)) for the seven overall CH risk lead variants between (i) the 505 individuals with CH and 11,893 controls comprising the ancestrally diverse ‘All other ancestries combined’ sub-cohort; based on logistic regression and (ii) the 10,203 individuals with CH and 173,918 controls comprising the ‘European ancestry’ sub-cohort of the 200k UK Biobank cohort; based on linear mixed models (BOLT-LMM). ORs are presented with alignment to the same allele in both sub-cohorts. Symbols represent ORs and error bars represent 95% confidence intervals (CIs).
Extended Data Fig. 5. Heterogeneity of lead GWAS variants across five CH traits.
Forest plots with linear mixed model (BOLT-LMM) odds ratios (ORs) and 95% confidence intervals (CIs) based on data from Supplementary Tables 16 (a), 18 (b), 19 (c), 20 (d) and 21 (e). Results for lead variants identified at genome-wide significance (P < 5 × 10−8) for each CH trait (a, overall CH; b, DNMT3A-CH; c, TET2-CH; d, large clone CH; e, small clone CH) are plotted alongside results for the same lead variants in the four other genome-wide association analyses conducted. In all panels, symbols represent ORs and error bars represent 95% CIs. Sample sizes: 10,203 individuals with CH (‘cases’) and 173,918 individuals without CH (‘controls’) for the overall CH analysis; 5,185 cases and 173,918 controls for DNMT3A-CH; 2,041 cases and 173,918 controls for TET2-CH; 4,049 cases and 173,918 controls for large clone CH; and 6,154 cases and 173,918 controls for small clone CH.
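BOLT-LMM fits a linear model to a binary trait, so its effect estimates have to be rescaled before they can be read as ORs. The caption does not say how this was done. The sketch below uses the approximation described in the BOLT-LMM manual, log OR ≈ β / (μ(1 − μ)) with μ the case fraction, and should be read as an assumption about the conversion rather than the authors' stated procedure. The case and control counts come from the caption; the β value is hypothetical.

```python
import math

def lmm_beta_to_or(beta, n_cases, n_controls):
    """Approximate OR from a linear-mixed-model effect on a 0/1 trait,
    using log OR ~ beta / (mu * (1 - mu)), where mu is the case fraction."""
    mu = n_cases / (n_cases + n_controls)
    return math.exp(beta / (mu * (1.0 - mu)))

# Overall CH analysis sample sizes, as given in the caption.
n_cases, n_controls = 10_203, 173_918
case_fraction = n_cases / (n_cases + n_controls)

# Hypothetical linear-scale effect size for one lead variant (assumption).
beta_linear = 0.01
print(f"case fraction = {case_fraction:.4f}")
print(f"approximate OR = {lmm_beta_to_or(beta_linear, n_cases, n_controls):.2f}")
```

Because the case fraction is small (about 5.5%), even a small linear-scale effect corresponds to a noticeably larger effect on the log-odds scale.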
