Performance of Large Language Models in Diagnosing Rare Hematologic Diseases and the Impact of Their Diagnostic Outputs on Physicians: Combined Retrospective and Prospective Study

Hongbin Yu et al. J Med Internet Res. 2025 Oct 9;27:e77334. doi: 10.2196/77334.

Abstract

Background: Rare hematologic diseases are frequently underdiagnosed or misdiagnosed due to their clinical complexity. Whether new-generation large language models (LLMs), particularly those using chain-of-thought reasoning, can improve diagnostic accuracy remains unclear.

Objective: This study aimed to evaluate the diagnostic performance of new-generation commercial LLMs in rare hematologic diseases and to determine whether the LLM output enhances physicians' diagnostic accuracy.

Methods: We conducted a 2-phase study. In the retrospective phase, we evaluated 7 mainstream LLMs on 158 nonpublic real-world admission records covering 9 rare hematologic diseases, assessed diagnostic performance using top-10 accuracy and mean reciprocal rank (MRR), and evaluated ranking stability via Jaccard similarity and entropy. Spearman rank correlation was used to examine the association between physicians' diagnoses and LLM-generated outputs. In the prospective phase, 28 physicians with varying levels of experience each diagnosed 5 cases, gaining access to LLM-generated diagnoses across 3 sequential steps, to assess whether LLM output can improve diagnostic accuracy.
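The abstract does not specify how these metrics were implemented; the following minimal Python sketch (function names and data layout are illustrative assumptions, not the authors' code) shows how top-10 accuracy, MRR, and the two stability measures are conventionally computed:

# Minimal sketch of the evaluation metrics named above.
# All names here are illustrative; the study's actual code is not published in the abstract.
from collections import Counter
import math

def top_k_accuracy(ranked_lists, gold_diagnoses, k=10):
    """Fraction of cases whose reference diagnosis appears in the model's top-k list."""
    hits = sum(1 for preds, gold in zip(ranked_lists, gold_diagnoses) if gold in preds[:k])
    return hits / len(gold_diagnoses)

def mean_reciprocal_rank(ranked_lists, gold_diagnoses):
    """Mean of 1/rank of the reference diagnosis, counting 0 when it is absent."""
    total = 0.0
    for preds, gold in zip(ranked_lists, gold_diagnoses):
        if gold in preds:
            total += 1.0 / (preds.index(gold) + 1)
    return total / len(gold_diagnoses)

def jaccard_similarity(run_a, run_b):
    """Overlap of the diagnosis lists produced by two repeated runs on the same case."""
    a, b = set(run_a), set(run_b)
    return len(a & b) / len(a | b)

def rank_entropy(repeated_runs, position=0):
    """Shannon entropy of the diagnosis at a given rank across repeated runs;
    higher values indicate less stable output."""
    counts = Counter(run[position] for run in repeated_runs)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Example: 2 cases, each with a model-ranked list and a reference diagnosis.
ranked = [["POEMS syndrome", "multiple myeloma"], ["lymphoma", "Castleman disease"]]
gold = ["POEMS syndrome", "Castleman disease"]
print(top_k_accuracy(ranked, gold))        # 1.0
print(mean_reciprocal_rank(ranked, gold))  # (1/1 + 1/2) / 2 = 0.75

The Spearman rank correlation between physician performance and these metrics can then be obtained with a standard routine such as scipy.stats.spearmanr.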

Results: In the retrospective phase, ChatGPT-o1-preview demonstrated the highest top-10 accuracy (70.3%) and MRR (0.577), and DeepSeek-R1 ranked second. Diagnostic performance was low for amyloid light-chain (AL) amyloidosis; Castleman disease; Erdheim-Chester disease; and polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy, and skin changes (POEMS) syndrome. Interestingly, higher accuracy correlated with lower ranking stability across most LLMs. Physician performance showed a strong correlation with both top-10 accuracy (ρ=0.565) and MRR (ρ=0.650). In the prospective phase, LLMs significantly improved the diagnostic accuracy of less-experienced physicians; no significant benefit was observed for specialists. However, when LLMs generated biased responses, physician performance often failed to improve or even declined.

Conclusions: Without fine-tuning, new-generation commercial LLMs, particularly those with chain-of-thought reasoning, can identify diagnoses of rare hematologic diseases with high accuracy and significantly enhance the diagnostic performance of less-experienced physicians. Nevertheless, biased LLM outputs may mislead clinicians, highlighting the need for critical appraisal and cautious clinical integration with appropriate safeguard systems.

Keywords: AI; ChatGPT; LLM; artificial intelligence; hematology; large language model; rare hematologic disease.


Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Study design, including a retrospective analysis of large language model (LLM) diagnostic performance and a prospective evaluation of the impact of LLM-generated information on physician performance.

Figure 2. Top-10 accuracy of different large language models (LLMs) in diagnosing rare hematologic diseases. (A and B) Overall top-10 accuracy of selected LLMs. (C–I) Top-10 accuracy of each LLM for different rare hematologic diseases: (C) Claude 3.5 Sonnet, (D) DeepSeek-R1, (E) Doubao-1.5-Pro-256k, (F) Gemini Experimental 1206, (G) ChatGPT-4o, (H) ChatGPT-o1-preview, (I) Qwen-Max-2025-01-25. AL amyloidosis: amyloid light-chain amyloidosis; CTCL: cutaneous T-cell lymphoma; ECD: Erdheim-Chester disease; LCH: Langerhans cell histiocytosis; POEMS: polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy, and skin changes; TTP: thrombotic thrombocytopenic purpura; WM: Waldenström macroglobulinemia.

Figure 3. Mean reciprocal rank (MRR) of different large language models (LLMs) in diagnosing rare hematologic diseases. (A and B) Overall MRR of selected LLMs. (C–I) MRR of each LLM for different rare hematologic diseases: (C) Claude 3.5 Sonnet, (D) DeepSeek-R1, (E) Doubao-1.5-Pro-256k, (F) Gemini Experimental 1206, (G) ChatGPT-4o, (H) ChatGPT-o1-preview, (I) Qwen-Max-2025-01-25. AL amyloidosis: amyloid light-chain amyloidosis; CTCL: cutaneous T-cell lymphoma; ECD: Erdheim-Chester disease; LCH: Langerhans cell histiocytosis; POEMS: polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy, and skin changes; TTP: thrombotic thrombocytopenic purpura; WM: Waldenström macroglobulinemia.

Figure 4. Stability of large language model diagnostic outputs for rare hematologic diseases. Stability was evaluated using (A) Jaccard similarity and (B) entropy across models. Higher accuracy generally coincided with lower stability, except for Claude 3.5 Sonnet.

Figure 5. Spearman rank correlation between physician diagnostic performance and large language model outputs. (A) Correlation between physician scores and top-10 accuracy. (B) Correlation between physician scores and mean reciprocal rank (MRR). Physician performance showed a strong correlation with both metrics.
Figure 6. Results of the prospective study. Data points in this figure were jittered for visualization. (A) Scores from all physicians across the 3 answers. (B) Scores by physicians stratified by experience level and answer. (C) Score differences between the second and first answers and between the third and first answers among all physicians. (D) Score differences between the second and first answers and between the third and first answers, stratified by experience level. (E) Subjective ratings of large language model (LLM)–generated information by physicians with different experience levels. (F) Forest plot showing the impact of biased responses on physician performance improvement. (G) Forest plot showing the impact of biased responses on subjective ratings of LLM-generated information. *P<.05; **P<.01; ***P<.001. ns: not significant.

