Bioinformatics. 2024 Mar 29;40(4):btae121.
doi: 10.1093/bioinformatics/btae121.

ArCH: improving the performance of clonal hematopoiesis variant calling and interpretation

Irenaeus C C Chan et al. Bioinformatics. 2024.

Abstract

Motivation: The acquisition of somatic mutations in hematopoietic stem and progenitor cells with resultant clonal expansion, termed clonal hematopoiesis (CH), is associated with increased risk of hematologic malignancies and other adverse outcomes. CH is generally present at low allelic fractions, but clonal expansion and acquisition of additional mutations lead to hematologic cancers in a small proportion of individuals. With high-depth, high-sensitivity sequencing, CH can be detected in most adults and its clonal trajectory mapped over time. However, accurate CH variant calling is challenging because low-frequency CH mutations are difficult to distinguish from sequencing artifacts. The lack of well-validated bioinformatic pipelines for CH calling may contribute to the lack of reproducibility in studies of CH.

Results: Here, we developed ArCH, an Artifact filtering Clonal Hematopoiesis variant calling pipeline that detects single nucleotide variants and short insertions/deletions by combining the output of four variant calling tools and filtering based on variant characteristics and sequencing error rate estimation. ArCH is an end-to-end cloud-based pipeline optimized to accept a variety of inputs, with customizable parameters adaptable to multiple sequencing technologies, research questions, and datasets. Using deep targeted sequencing data generated from six acute myeloid leukemia patient tumor:normal dilutions, 31 blood samples with orthogonal validation, and 26 blood samples with technical replicates, we show that ArCH improves the sensitivity and positive predictive value of CH variant detection at low allele frequencies compared to standard application of commonly used variant calling approaches.
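
The idea of combining the output of several variant callers can be illustrated with a minimal sketch. It is not ArCH's implementation: the caller names, the variant key (chromosome, position, reference allele, alternate allele), and the `min_callers` parameter are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = tuple[str, int, str, str]


def merge_calls(calls_by_caller: dict[str, set[Variant]]) -> dict[Variant, set[str]]:
    """Map each variant to the set of callers that reported it."""
    support: dict[Variant, set[str]] = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    return dict(support)


def select_variants(support: dict[Variant, set[str]], min_callers: int = 1) -> set[Variant]:
    """Keep variants reported by at least `min_callers` callers (assumed threshold)."""
    return {v for v, callers in support.items() if len(callers) >= min_callers}


# Toy call sets from two hypothetical callers.
calls = {
    "caller_a": {("chr2", 100, "C", "T"), ("chr4", 200, "G", "A")},
    "caller_b": {("chr2", 100, "C", "T")},
}
support = merge_calls(calls)
union_calls = select_variants(support, min_callers=1)
shared_calls = select_variants(support, min_callers=2)
```

With `min_callers=1` the result is the union of all call sets, which matches the "passed one or more variant callers" criterion used in the figure captions. Artifact filtering would then be applied to that union.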

Availability and implementation: The code for this workflow is available at: https://github.com/kbolton-lab/ArCH.

Conflict of interest statement

None declared.

Figures

Figure 1.
ArCH workflow for multiple variant consensus calling, sequencing artifact filtering, and putative driver annotation. ArCH consists of four general steps: (i) UMI-based error-corrected consensus sequence building, (ii) variant calling using multiple variant callers with subsequent normalization and pre-filtering, (iii) annotation of passed variants with pathogenicity and other relevant information, and (iv) false positive filtering and variant sorting into pass, fail, and review. Filtering steps and annotation are performed throughout the workflow to allow for greater optimization and parallelization. Four main annotations are applied to all passed variants: (i) FP filter: a collection of false positive filters defined in our methods section, (ii) VEP: Variant Effect Predictor, which annotates the effect of the variants on genes, transcripts, and protein sequences, (iii) PoN Fisher Test: the statistical calculation of signal-to-noise for each variant relative to our panel of normals (PoN), and (iv) CH Annotation: a custom annotation script utilizing several external data sources to define CH pathogenicity. The pipeline outputs three files that separate the variants into (i) those that pass all filters, (ii) variants that are artifacts as defined by the false positive filtering or possible germline variants, and (iii) variants that are recommended for manual review.
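
The PoN Fisher Test annotation described in the caption above can be sketched as a one-sided Fisher's exact test on read counts. This is an illustration under stated assumptions, not the pipeline's code: the 2×2 table layout (alternate and reference reads in the sample versus pooled alternate and reference reads in the panel of normals), the function name, and the read counts are hypothetical, and the caption does not specify the test direction or any significance threshold.

```python
from math import comb


def fisher_exact_greater(sample_alt: int, sample_ref: int, pon_alt: int, pon_ref: int) -> float:
    """One-sided Fisher's exact p-value for enrichment of alternate reads in the sample.

    The table is [[sample_alt, sample_ref], [pon_alt, pon_ref]]. With both margins
    fixed, the number of alternate reads in the sample follows a hypergeometric
    distribution; the p-value is the probability of observing at least `sample_alt`.
    """
    sample_total = sample_alt + sample_ref
    alt_total = sample_alt + pon_alt
    grand_total = sample_total + pon_alt + pon_ref
    max_alt_in_sample = min(sample_total, alt_total)
    tail = sum(
        comb(sample_total, k) * comb(grand_total - sample_total, alt_total - k)
        for k in range(sample_alt, max_alt_in_sample + 1)
    )
    return tail / comb(grand_total, alt_total)


# Hypothetical read counts: a variant well above the PoN background,
# and one that is indistinguishable from it.
p_signal = fisher_exact_greater(sample_alt=20, sample_ref=980, pon_alt=5, pon_ref=9995)
p_noise = fisher_exact_greater(sample_alt=1, sample_ref=999, pon_alt=10, pon_ref=9990)
```

A small p-value suggests that the alternate allele is more frequent in the sample than the background error rate seen in the panel of normals at that position. A large p-value suggests the call is consistent with sequencing noise.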
Figure 2.
Sensitivity and positive predictive value (PPV) for the AML dilution series. Sensitivity (A) and PPV (B) for the AML dilution series for variants that (i) passed one or more variant callers (square), (ii) passed one or more variant callers and our false positive filters (circle), and (iii) passed one or more variant callers and our false positive filters, and were manually reviewed (triangle). Sensitivity was calculated as the number of detected true positives over the total number of bona fide mutations within a given VAF category. PPV was calculated as the number of true positives over the number of total positives, as defined by the three criteria above, within a given VAF category. Error bars show the 95% confidence interval for sensitivity and PPV as obtained by the Clopper–Pearson method.
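
The sensitivity, PPV, and Clopper–Pearson calculations described in this caption can be sketched in plain Python. The counts below are invented for illustration, and the function names are assumptions. The interval bounds are found by bisection on the binomial distribution function rather than through a statistics library, which is a choice made here to keep the example self-contained.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(func, target: float) -> float:
    """Find p in [0, 1] with func(p) == target, for func decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid  # func is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) confidence interval for k successes in n trials."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    # Lower bound solves P(X >= k) = alpha / 2, i.e. P(X <= k - 1) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else _solve_decreasing(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound solves P(X <= k) = alpha / 2.
    upper = 1.0 if k == n else _solve_decreasing(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def proportion_with_ci(numerator: int, denominator: int) -> tuple[float, float, float]:
    """Point estimate with its 95% Clopper-Pearson interval."""
    lower, upper = clopper_pearson(numerator, denominator)
    return numerator / denominator, lower, upper


# Invented counts for one VAF category.
true_positives = 8       # bona fide mutations that were detected
bona_fide_total = 10     # all bona fide mutations in the category
total_positives = 12     # all variants called under the chosen criterion

sensitivity = proportion_with_ci(true_positives, bona_fide_total)
ppv = proportion_with_ci(true_positives, total_positives)
```

Sensitivity and PPV share the same numerator and differ only in the denominator, so one helper serves both. The exact interval is conservative for small counts, which suits VAF categories that contain few variants.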
Figure 3.
Sensitivity and positive predictive value (PPV) for the normal blood samples. Sensitivity (A) and PPV (B) for CH detected in normal blood samples for variants that (i) passed one or more variant callers (square), (ii) passed one or more variant callers and our false positive filters (circle), and (iii) passed one or more variant callers and our false positive filters, and were manually reviewed (triangle). Sensitivity was calculated as the number of detected true positives over the total number of bona fide mutations within a given VAF category. PPV was calculated as the number of true positives over the number of total positives, as defined by the three criteria above, within a given VAF category. Error bars show the 95% confidence interval for sensitivity and PPV as obtained by the Clopper–Pearson method.
Figure 4.
Replication accuracy for technical replicates using ArCH. Replication of bona fide CH variants in normal blood samples. Bona fide CH mutations that passed ArCH in both technical replicates are shown as triangles. Bona fide CH mutations that passed in only one of the two replicates are labeled according to the reason for failing.
Figure 5.
Sensitivity and positive predictive value (PPV) for the replication samples. Sensitivity (A) and PPV (B) for CH variants using ArCH in blood samples using duplex targeted sequencing of a 31-gene panel. Sensitivity was calculated as the number of bona fide mutations passed by ArCH divided by the total number of bona fide mutations within a given VAF category. PPV was calculated as the number of true positives (bona fide mutations passed by ArCH) divided by the total number of variants passed by ArCH. Error bars show the 95% confidence interval for sensitivity and PPV as obtained by the Clopper–Pearson method.
