Genomics Proteomics Bioinformatics. 2025 May 30;23(2):qzaf040.
doi: 10.1093/gpbjnl/qzaf040.

UNISOM: Unified Somatic Calling and Machine Learning-based Classification Enhance the Discovery of CHIP

Shulan Tian et al. Genomics Proteomics Bioinformatics.

Abstract

Clonal hematopoiesis (CH) of indeterminate potential (CHIP), driven by somatic mutations in leukemia-associated genes, confers increased risk of hematologic malignancies, cardiovascular disease, and all-cause mortality. In the blood of healthy individuals, small CH clones can expand over time to reach 2% variant allele frequency (VAF), the current threshold for CHIP. Nevertheless, reliable detection of low-VAF CHIP mutations is challenging and often relies on deep targeted sequencing. Here, we present UNISOM, a streamlined workflow for enhancing CHIP detection from whole-genome and whole-exome sequencing data, which are underpowered, especially for low VAFs. UNISOM uses a meta-caller for variant detection, coupled with machine learning models that classify variants as CHIP, germline, or artifact. In whole-exome sequencing data, UNISOM recovered nearly 80% of the CHIP mutations identified via deep targeted sequencing in the same cohort. Applied to whole-genome sequencing data from the Mayo Clinic Biobank, it recapitulated the patterns previously established in much larger cohorts, including the most frequently mutated CHIP genes and predominant mutation types and signatures, as well as strong associations of CHIP with age and smoking status. Notably, 30% of the identified CHIP mutations had VAFs < 5%, demonstrating the workflow's high sensitivity toward small mutant clones. This workflow is applicable to CHIP screening in population genomic studies. The UNISOM pipeline is freely available at https://github.com/shulanmayo/UNISOM and https://ngdc.cncb.ac.cn/biocode/tool/7816.

Keywords: Clonal hematopoiesis of indeterminate potential; Machine learning; Somatic variant calling; Whole-exome sequencing; Whole-genome sequencing.

Conflict of interest statement

The authors have declared no competing interests.

Figures

Figure 1
CHIP prediction with the UNISOM pipeline. A. Overview of the UNISOM pipeline. The pipeline has 3 major components: meta-calling, ML-based prediction, and CHIP processing. The meta-caller is an ensemble of Mutect2, VarDict, and VarTracker. It merges variants from the 3 tools and assigns a status of 1 to 3 to indicate the number of tools identifying a given variant. Raw variants predicted by CAVA to have functional effects, along with the associated features, are used as the inputs for XGBoost or random forest prediction. These features, 26 for SNVs and 24 for INDELs, cover variant quality metrics, calling status, and genomic context, as well as overlap with public variant databases. The prediction model was trained separately for WES versus WGS and for SNV versus INDEL. The predictions from the XGBoost or random forest model are subjected to further refinement and prioritization. The refinement step recovers, for manual inspection, the variants most likely to have been erroneously predicted as non-CHIP. To enable prioritization of the predicted CHIP mutations, a hierarchical Bayesian model is used to estimate their CIs. The model uses CHIP incidence in a population of healthy individuals (here with age < 40 years) as the background. B. Key steps recommended for CHIP discovery. To run UNISOM, the raw BAM is pre-processed with Picard, which includes coordinate sorting, duplicate marking, and filtering based on mapping quality. UNISOM-predicted CHIP mutations are then filtered, for example by removing common variants and extracting a subset based on user-provided lists of leukemia driver genes and driver mutations. Finally, to rule out alignment artifacts, it is often necessary to check the alignments around the retained variants in a genome browser.
CAVA, clinical annotation of variants; CHIP, clonal hematopoiesis of indeterminate potential; CI, confidence interval; XGBoost, eXtreme Gradient Boosting; INDEL, insertion and deletion; ML, machine learning; SNV, single nucleotide variant; VCF, variant call format; WES, whole-exome sequencing; WGS, whole-genome sequencing.
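The merging step described in panel A can be illustrated with a minimal sketch. It is not the UNISOM implementation: the function name `merge_calls`, the `(chrom, pos, ref, alt)` variant key, and the example coordinates are all assumptions made for illustration. Only the idea of taking the union of the three callers' variants and recording how many callers report each one (status 1 to 3) comes from the caption.

```python
from collections import defaultdict


def merge_calls(calls_by_tool):
    """Union per-tool variant sets and count the tools reporting each variant.

    calls_by_tool maps a caller name to a set of hypothetical variant keys
    of the form (chrom, pos, ref, alt).
    """
    support = defaultdict(set)
    for tool, variants in calls_by_tool.items():
        for variant in variants:
            support[variant].add(tool)
    # "status" is the number of callers that reported the variant (1 to 3).
    return {
        variant: {"status": len(tools), "tools": sorted(tools)}
        for variant, tools in support.items()
    }


# Made-up example calls; the coordinates are placeholders, not real CHIP sites.
calls = {
    "Mutect2": {("chr2", 1000, "C", "T")},
    "VarDict": {("chr2", 1000, "C", "T"), ("chr4", 2000, "G", "A")},
    "VarTracker": {
        ("chr2", 1000, "C", "T"),
        ("chr4", 2000, "G", "A"),
        ("chr17", 3000, "C", "T"),
    },
}
merged = merge_calls(calls)
```

In this toy input, the variant seen by all three callers gets status 3, and the one reported only by VarTracker gets status 1. A real workflow would also need to normalize variant representation (for example, left-aligning INDELs) before keys from different callers can be compared.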
Figure 2
Recall of 11 tools tested on simulated data. A. Simulated SNVs and INDELs in WES. B. Simulated SNVs and INDELs in WGS. The box plots in (A) and (B) use recall rates estimated from 7 WES and 11 WGS data, respectively, split into SNV and INDEL. All data are from the batch 1 simulation, which uses CHIP-specific VAFs, excluding those with > 100× coverage. The 7 WES data are from datasets 13, 18, and 19 with coverage of 20×, 50×, and 100×, while the 11 WGS data are from datasets 1, 3, 4, 11, and 12 with coverage of 20×, 29×, 50×, 84×, and 100× (Table S2). VarTracker, VarDict, and GATK Mutect2 showed the highest recall for both SNVs and INDELs. GATK HC, GATK HaplotypeCaller; GATK UG, GATK UnifiedGenotyper; VAF, variant allele frequency.
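As a reminder of what the recall values in these box plots measure, the sketch below computes recall as the fraction of spiked-in variants that a caller recovers, separately for SNVs and INDELs. The helper names, the variant key format, and the example variants are assumptions for illustration. The exact matching rules used in the study are not stated in this caption.

```python
def is_snv(variant):
    """Treat a variant as an SNV when both alleles are a single base."""
    _chrom, _pos, ref, alt = variant
    return len(ref) == 1 and len(alt) == 1


def recall(truth, called):
    """Fraction of truth variants that also appear in the call set."""
    truth = set(truth)
    if not truth:
        raise ValueError("truth set is empty")
    return len(truth & set(called)) / len(truth)


# Made-up spike-in truth set and call set (hypothetical coordinates).
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 300, "G", "A"),
    ("chr2", 400, "T", "C"),
    ("chr3", 500, "AT", "A"),
    ("chr3", 600, "G", "GCC"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 300, "G", "A"),
    ("chr3", 500, "AT", "A"),
    ("chr9", 900, "A", "T"),  # extra call; it does not affect recall
}

snv_recall = recall({v for v in truth if is_snv(v)}, called)
indel_recall = recall({v for v in truth if not is_snv(v)}, called)
```

Here 3 of 4 simulated SNVs and 1 of 2 simulated INDELs are recovered. Calls outside the truth set lower precision but leave recall unchanged.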
Figure 3
Overlap of spike-in CHIP variants recovered by the three callers. A. NA12878_02 WES simulated at 100× coverage. At least 1 read supports the alternative allele. B. NA12878_02 WGS simulated at 50× coverage. At least 1 read supports the alternative allele. C. The same WES data as in (A). At least 2 reads support the alternative allele. D. The same WGS data as in (B). At least 2 reads support the alternative allele. Both datasets are simulated with CHIP-specific VAFs. In each plot, SNVs and INDELs are combined. While VarTracker can report variants with a single supporting read, Mutect2 and VarDict only output those with ≥ 2 supporting reads.
Figure 4
Contributions of the top 13 features to the predictive performance of XGBoost. A. Feature gain in predicting SNVs from WGS. B. Feature gain in predicting SNVs from WES. C. Feature gain in predicting INDELs from WGS. D. Feature gain in predicting INDELs from WES. The gain value, which indicates a feature’s relative contribution to the model, is calculated using the xgb.importance function within the mlr package. The 13 features are selected to have gain values of ≥ 0.02 in at least 1 of the 4 predictions and are shown in ascending order based on (A). The number before each feature is from the “Feature” column in Table S4. 1000G, 1000 Genomes Project; MAF, minor allele frequency; COSMIC, Catalogue Of Somatic Mutations In Cancer; FREQ, number of samples carrying the variant in the COSMIC database.
Figure 5
CHIP mutations in the Mayo Biobank cohort. A. VAFs of CHIP mutations. The lower and upper edges of each box represent the 25th and 75th percentiles, respectively, and the horizontal line within the box represents the median. CHIP mutations are separated based on the tool(s) identifying them. The table at the bottom lists the number of CHIP mutations identified only by each of the 3 tools and by their 4 combinations, separated by VAF. No CHIP mutation is detected by Mutect2 alone (“Mutect2 only”) or by Mutect2 plus VarDict (“Mutect2 + VarDict”). Of the 3 tools, VarTracker is the most sensitive at low VAFs of < 10%, with 41 (15 + 26) unique calls. VarTracker and VarDict both identify CHIP mutations at VAFs down to 2.6%, while Mutect2 reaches 4.5% VAF. B. CHIP prevalence in the top 6 genes with the most mutations. For each gene, the Y-axis shows the number of participants who carry at least 1 CHIP mutation in that gene. C. Number of CHIP mutations in the 6 genes. The total mutations per gene are split by mutation type. For each gene, if two or more mutations occur in the same individual, or if the same mutation occurs in two or more individuals, each occurrence is counted. D. VAFs of CHIP mutations in the top 6 genes. No INDEL is identified in TP53. E. Signatures enriched in CHIP mutations. F. Prevalence of CHIP mutations in 3 age groups. CHIP prevalence tends to increase with age, as previously reported. G. VAF versus age of CHIP carriers. For both the full set of 35 mutated genes and the top 6 genes, there is a clear trend that the proportion of “large” mutated clones (VAF ≥ 10%) increases with age.
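The quantity summarized in panel G can be sketched as follows: compute each mutation's VAF from read counts, then take the proportion of mutations at or above a "large clone" cutoff within each age group. This is an illustration only. The function names, the age-group labels, the read counts, and the inclusive 10% cutoff are assumptions, and the age-group boundaries used in the study are not given in this caption.

```python
from collections import defaultdict


def vaf(alt_reads, depth):
    """Variant allele frequency: alternate-allele reads over total depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


def fraction_large(vafs, cutoff=0.10):
    """Proportion of mutations with VAF at or above the cutoff (assumed inclusive)."""
    if not vafs:
        raise ValueError("no VAFs supplied")
    return sum(v >= cutoff for v in vafs) / len(vafs)


# Made-up records: (hypothetical age-group label, alt reads, total depth).
records = [
    ("younger", 3, 100),
    ("younger", 5, 100),
    ("older", 12, 100),
    ("older", 4, 100),
    ("older", 30, 100),
    ("older", 20, 100),
]

vafs_by_group = defaultdict(list)
for group, alt_reads, depth in records:
    vafs_by_group[group].append(vaf(alt_reads, depth))

large_fraction = {g: fraction_large(v) for g, v in vafs_by_group.items()}
```

With these toy counts, none of the "younger" mutations and 3 of the 4 "older" mutations reach the cutoff, mirroring the direction of the trend described in the caption.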
