Rheumatol Int. 2024 Feb;44(2):303-306. doi: 10.1007/s00296-023-05464-6. Epub 2023 Sep 24.

Diagnostic accuracy of a large language model in rheumatology: comparison of physician and ChatGPT-4

Martin Krusche et al. Rheumatol Int. 2024 Feb.

Abstract

Pre-clinical studies suggest that large language models (e.g., ChatGPT) could be used in the diagnostic process to distinguish inflammatory rheumatic diseases (IRD) from other diseases. We therefore aimed to assess the diagnostic accuracy of ChatGPT-4 in comparison to rheumatologists. For the analysis, the data set of Gräf et al. (2022) was used. Previous patient assessments were analyzed using ChatGPT-4 and compared to rheumatologists' assessments. ChatGPT-4 listed the correct diagnosis as the top diagnosis comparably often to rheumatologists (35% vs 39%, p = 0.30), as well as among the top 3 diagnoses (60% vs 55%, p = 0.38). In IRD-positive cases, ChatGPT-4 provided the correct top diagnosis in 71% of cases vs 62% in the rheumatologists' analysis; the correct diagnosis was among the top 3 in 86% (ChatGPT-4) vs 74% (rheumatologists). In non-IRD cases, ChatGPT-4 provided the correct top diagnosis in 15% vs 27% in the rheumatologists' analysis; the correct diagnosis was among the top 3 in 46% of the ChatGPT-4 group vs 45% of the rheumatologists' group. If only the first suggested diagnosis was considered, ChatGPT-4 correctly classified 58% of cases as IRD compared to 56% for the rheumatologists (p = 0.52). ChatGPT-4 showed slightly higher accuracy for the top 3 overall diagnoses compared to the rheumatologists' assessments. ChatGPT-4 was able to provide the correct differential diagnosis in a relevant number of cases and achieved better sensitivity for detecting IRDs than rheumatologists, at the cost of lower specificity. These pilot results highlight the potential of this new technology as a triage tool for the diagnosis of IRD.
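The metrics reported above (top-1 and top-3 diagnostic accuracy, plus sensitivity and specificity of the binary IRD vs non-IRD call) can be sketched in a few lines. This is a minimal illustration, not the study's analysis code; the function names, the toy case list, and the diagnosis labels are hypothetical assumptions for demonstration only.

```python
# Hypothetical sketch of the accuracy metrics described in the abstract.
# All names and toy data below are illustrative assumptions, not study data.

def top_k_accuracy(ranked_suggestions, correct, k):
    """Fraction of cases whose correct diagnosis appears among the top k suggestions."""
    hits = sum(1 for sugg, truth in zip(ranked_suggestions, correct)
               if truth in sugg[:k])
    return hits / len(correct)

def sensitivity_specificity(predicted_ird, actual_ird):
    """Sensitivity and specificity of a binary IRD vs non-IRD classification."""
    tp = sum(p and a for p, a in zip(predicted_ird, actual_ird))
    tn = sum((not p) and (not a) for p, a in zip(predicted_ird, actual_ird))
    fn = sum((not p) and a for p, a in zip(predicted_ird, actual_ird))
    fp = sum(p and (not a) for p, a in zip(predicted_ird, actual_ird))
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: four cases, each with a ranked differential-diagnosis list.
suggestions = [["RA", "SLE", "OA"], ["OA", "RA", "gout"],
               ["PsA", "RA", "SLE"], ["fibromyalgia", "OA", "RA"]]
truth = ["RA", "gout", "SLE", "OA"]
print(top_k_accuracy(suggestions, truth, 1))  # top-1 accuracy: 0.25
print(top_k_accuracy(suggestions, truth, 3))  # top-3 accuracy: 1.0

sens, spec = sensitivity_specificity([True, True, False], [True, False, False])
print(sens, spec)  # 1.0 0.5
```

Counting a case as correct when the true diagnosis appears anywhere in the top k is exactly why top-3 accuracy is always at least as high as top-1, as seen in the figures reported above.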

Keywords: Artificial intelligence; ChatGPT; Diagnostic process; Large language models; Rheumatology; Triage.


Figures

Fig. 1: Percentage of correctly classified diagnoses by rank.

References

    1. Rheumadocs und Arbeitskreis Junge Rheumatologie (AGJR); Krusche M, Sewerin P, Kleyer A, Mucke J, Vossen D, et al. Facharztweiterbildung quo vadis? [Specialty training: quo vadis?]. Z Rheumatol. 2019;78(8):692-697.
    2. Miloslavsky EM, Marston B. The challenge of addressing the rheumatology workforce shortage. J Rheumatol. 2022;49(6):555-557. doi: 10.3899/jrheum.220300.
    3. Fuchs F, Morf H, Mohn J, Mühlensiepen F, Ignatyev Y, Bohr D. Diagnostic delay stages and pre-diagnostic treatment in patients with suspected rheumatic diseases before special care consultation: results of a multicenter-based study. Rheumatol Int. 2023;43(3):495-502. doi: 10.1007/s00296-022-05223-z.
    4. Knitza J, Mohn J, Bergmann C, Kampylafka E, Hagen M, Bohr D. Accuracy, patient-perceived usability, and acceptance of two symptom checkers (Ada and Rheport) in rheumatology: interim results from a randomized controlled crossover trial. Arthritis Res Ther. 2021;23(1):112. doi: 10.1186/s13075-021-02498-8.
    5. Gräf M, Knitza J, Leipe J, Krusche M, Welcker M, Kuhn S. Comparison of physician and artificial intelligence-based symptom checker diagnostic accuracy. Rheumatol Int. 2022;42(12):2167-2176. doi: 10.1007/s00296-022-05202-4.
