Blood. 2023 Apr 27;141(17):2100-2113.
doi: 10.1182/blood.2022017518.

Differential diagnosis of bone marrow failure syndromes guided by machine learning

Fernanda Gutierrez-Rodrigues et al. Blood.

Abstract

The choice to postpone treatment while awaiting genetic testing can result in significant delay in definitive therapies in patients with severe pancytopenia. Conversely, the misdiagnosis of inherited bone marrow failure (BMF) can expose patients to ineffectual and expensive therapies, toxic transplant conditioning regimens, and inappropriate use of an affected family member as a stem cell donor. To predict the likelihood of patients having acquired or inherited BMF, we developed a 2-step data-driven machine-learning model using 25 clinical and laboratory variables typically recorded at the initial clinical encounter. For model development, patients were labeled as having acquired or inherited BMF depending on their genomic data. Data sets were unbiasedly clustered, and an ensemble model was trained with cases from the largest cluster of a training cohort (n = 359) and validated with an independent cohort (n = 127). Cluster A, the largest group, was mostly immune or inherited aplastic anemia, whereas cluster B comprised underrepresented BMF phenotypes and was not included in the next step of data modeling because of a small sample size. The ensemble cluster A-specific model was accurate (89%) to predict BMF etiology, correctly predicting inherited and likely immune BMF in 79% and 92% of cases, respectively. Our model represents a practical guide for BMF diagnosis and highlights the importance of clinical and laboratory variables in the initial evaluation, particularly telomere length. Our tool can be potentially used by general hematologists and health care providers not specialized in BMF, and in under-resourced centers, to prioritize patients for genetic testing or for expeditious treatment.

Conflict of interest statement

Conflict-of-interest disclosure: N.S.Y. received research funding from Novartis by way of a Cooperative Research and Development Agreement. R.M.S. received royalties from cad, Ping An, Philips, Scan Med, and Translation Holdings and his laboratory received research support from Ping An and NVIDIA. Y.T. is currently employed by Ping An. The remaining authors declare no competing financial interests.

Figures

Graphical abstract
Figure 1.
Schematic workflow of development of the 2-step machine-learning model. The model was developed with (1) collection of clinical and laboratory data routinely available for patients with BMF from 2 independent cohorts; (2) curation of germ line variants identified by genetic testing in order to assign a label (target classification) for each patient correspondent to BMF etiology: acquired or inherited. All patients identified with pathogenic and likely pathogenic variants were labeled as inherited cases. Patients without germ line variants or with only benign/likely benign variants were labeled as acquired cases. Patients with VUS were not included in the training data set; (3) data preparation; (4) K-means clustering of cases from the training cohort; (5) classification machine-learning algorithm optimized for the cluster with the highest number of cases (cluster A); and (6) validation of the model in an external data set. The predictive model was next applied to predict BMF etiology in patients with VUS.
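The labeling rule in step (2) can be sketched in a few lines of Python. This is only an illustration of the rule as the caption states it, not the authors' curation pipeline. The function name `assign_label` and the lowercase classification strings are hypothetical. The caption does not say how a patient carrying both a pathogenic variant and a VUS was handled, so letting the pathogenic call take precedence is an assumption.

```python
from typing import List, Optional

PATHOGENIC = {"pathogenic", "likely pathogenic"}
BENIGN = {"benign", "likely benign"}
VUS = "vus"


def assign_label(classifications: List[str]) -> Optional[str]:
    """Map a patient's curated germ line variant calls to a training label.

    Returns "inherited", "acquired", or None when the patient is left out
    of the training data set (VUS only, per the caption).
    """
    calls = {c.strip().lower() for c in classifications}
    unknown = calls - PATHOGENIC - BENIGN - {VUS}
    if unknown:
        raise ValueError(f"unrecognized classification(s): {sorted(unknown)}")
    if calls & PATHOGENIC:
        # Assumption: a pathogenic call outranks any co-occurring VUS.
        return "inherited"
    if VUS in calls:
        return None  # excluded from training; scored by the model later
    # No variants at all, or only benign / likely benign variants.
    return "acquired"
```

For example, `assign_label([])` and `assign_label(["Benign"])` both return `"acquired"`, whereas `assign_label(["VUS"])` returns `None`.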
Figure 2.
Genetic and clinical characterization of cases from the NIH data set. (A) Germ line variants identified in the NIH data set (n = 399) according to patients’ ages and clinical diagnosis. Variants identified at maximum population frequency of 1% in the general population (gnomAD database) were curated and classified as pathogenic/likely pathogenic (light blue), and as benign, likely benign, or of uncertain significance (VUS; purple). Patients with pathogenic variants in IBMFS genes were labeled as inherited (n = 127). Mutations in genes linked to DBA (n = 9), FA (n = 25), SDS (n = 11), and DC/Hoyeraal-Hreidarsson syndrome (n = 28) were mostly pediatric whereas patients with AA, isolated cytopenias, or MDS/HypoMDS, due to pathogenic variants in telomere biology genes (n = 46) or other genes (RUNX1, n = 1; DDX41, n = 1; and biallelic MPL, n = 1), were in a broader age spectrum. Patients with no variants or with variants classified as benign or likely benign were labeled as acquired (n = 232). In contrast, patients with variants classified as VUS were removed from analysis (n = 40). A final training cohort (n = 359) with 127 labeled as inherited and 232 cases labeled as acquired were used for data modeling. (B) Violin plots of continuous variables in the training cohort (n = 359) according to clusters. Cluster A was enriched for patients who had lower median blood counts, whereas cluster B was enriched for patients with physical anomalies, multiorgan involvement, and long histories of cytopenias or macrocytosis (supplemental Figures 2 and 3). Median ages and blood counts, from both clusters A and B, are shown in the graphic. In general, median blood counts of patients were lower in cluster A than in cluster B and RDW was higher in cluster A than in B, possibly because of enrichment of SAA, which is often transfusion dependent. Within each cluster, inherited cases had lower median ages but higher blood counts. 
(C) Clinical diagnosis of patients labeled as acquired and inherited in both the training and validation cohorts. Each dot represents a single patient that is colored according to the assigned cluster.
Figure 3.
Classification model for prediction of BMF etiology in cluster A. (A) Top predictors ranked by importance by the ReliefF method. Feature selection ranked 27 variables by importance and the top 25 variables were considered important predictors for the model. (B) Correlation coefficient (R) between a target of prediction (categorical) and continuous variables. R was calculated and plotted in order of a variable’s importance. (C) A heatmap showing correlation among continuous variables. (D) Confusion matrix with prediction results for the validation cohort. The model was validated in the USP data set. Cases labeled or predicted as acquired are represented by “A,” whereas cases labeled or predicted as inherited are represented by “I.” Model sensitivity represents the ability to correctly predict acquired cases, whereas model specificity is the ability of the model to correctly predict inherited cases. (E) Cases from the cluster A of the USP data set that were misclassified by the model. Cases labeled as acquired or inherited that were correctly predicted by the model are represented with purple circles. Cases labeled as acquired that were predicted as inherited, or labeled as inherited and predicted as acquired are indicated with pink triangles. (F) Prediction results of VUS cases. Results are shown according to clinical diagnosis and mutated genes observed in VUS cases. Germ line VUS were mostly found in TERT (n = 10), SAMD9 or SAMD9L (n = 10), RTEL1 (n = 8), SBF2 (n = 6), and GATA2 (n = 3). Cases predicted as inherited or acquired by the model are represented by red and blue circles, respectively. Of note, SAMD9/L variants are often VUSs because in silico tools do not predict the pathogenicity of gain-of-function variants and many cases are de novo without previous family history. ALC, absolute lymphocyte count; ANC, absolute neutrophil count; BM, bone marrow; Hb, hemoglobin level (g/dL); MCV, mean corpuscular volume.
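The sensitivity and specificity defined for panel D can be computed from paired labels and predictions as sketched below. The sketch follows the convention stated in this caption, with acquired ("A") as the positive class and inherited ("I") as the negative class. The function name `binary_metrics` is hypothetical, and the six example cases are invented for illustration. They are not taken from the validation cohort.

```python
def binary_metrics(labels, predictions, positive="A"):
    """Sensitivity, specificity, and accuracy for a two-class problem.

    `positive` names the class whose recall is reported as sensitivity;
    recall of the other class is reported as specificity.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(labels, predictions):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1

    def ratio(numerator, denominator):
        # Undefined when a class has no cases.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
    }


# Invented example: four acquired and two inherited cases.
labels = ["A", "A", "A", "A", "I", "I"]
predictions = ["A", "A", "A", "I", "I", "A"]
metrics = binary_metrics(labels, predictions)
```

Passing `positive="I"` swaps the two recall figures, which is useful when the inherited class is treated as the positive one.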
Figure 4.
Two-step clustering and classification model for decision making in BMF. In the first step of the model, K-means clustering grouped cases into clusters A and B, which correlated with clinical diagnosis. Cluster A was enriched for cases of FA and DC, patients who had AA at young ages, and cases with AA and single or bilineage cytopenias over a broad spectrum of age but most frequently 20 and 50 years old. In contrast, cluster B was enriched for classical inherited BMF, including early disease onset DBA and SDS, and cases of FA and DC in middle age. In the second step, a classification model specific to cluster A was developed for binary prediction of cases as acquired and inherited. The cluster A–specific algorithm accurately predicted the BMF etiology in 79% of cases with IBMFS (model sensitivity) and 92% of cases with likely immune BMF (specificity) when TL data were available. The model lost accuracy without TL, a top predictive factor. However, in the absence of TL data, IBMFSs were rarely seen in adults with SAA and no family history or a phenotype suggestive of inherited disease; presence of PNH clone >1% within this group had a specificity of 100% for acquired AA. yo, years old.
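The two-step logic in this caption (cluster first, then classify only within the largest cluster) can be sketched in plain Python. This is not the authors' implementation. The k-means routine is a textbook version, the feature vectors are made-up two-dimensional points rather than the 25 clinical variables, and `classifier` is a hypothetical stand-in for the cluster A ensemble model.

```python
import math
import random
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, k=2, iterations=100, seed=0):
    """Textbook k-means; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        assignments = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else centroids[j]
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def predict_etiology(case, centroids, main_cluster, classifier):
    """Step 2: apply the classifier only to cases in the main cluster."""
    if nearest(case, centroids) != main_cluster:
        return None  # outside the main cluster: no prediction is attempted
    return classifier(case)


# Hypothetical, already-scaled feature vectors (not patient data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (10, 11)]
centroids, assignments = kmeans(points)
# Step 1 output: the cluster with the most cases plays the role of cluster A.
main_cluster = Counter(assignments).most_common(1)[0][0]
```

A new case is routed with `predict_etiology(case, centroids, main_cluster, classifier)`. Returning `None` for cases outside the main cluster mirrors the decision to build a classifier only for cluster A.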
