Performance of GPT-4 and GPT-3.5 in generating accurate and comprehensive diagnoses across medical subspecialties

Dik Wai Anderson Luk¹, Whitney Chin Tung Ip, Yat-Fung Shea

PMID: 38305423
DOI: 10.1097/JCMA.0000000000001064

Dik Wai Anderson Luk et al. J Chin Med Assoc. 2024 Mar 1;87(3):259-260. doi: 10.1097/JCMA.0000000000001064. Epub 2024 Feb 2.

Affiliation

¹ Department of Medicine, Queen Mary Hospital, University of Hong Kong, Hong Kong, China.

Abstract

Artificial intelligence has shown promising potential for diagnosing complex medical cases, with Generative Pre-Trained Transformer 4 (GPT-4) being the most recent advancement in this field. This study evaluated the diagnostic performance of GPT-4 in comparison with that of its predecessor, GPT-3.5, using 81 complex medical case records from the New England Journal of Medicine. The cases were categorized as cognitive impairment, infectious disease, rheumatology, or drug reactions. GPT-4 achieved a primary diagnostic accuracy of 38.3%, which improved to 71.6% when differential diagnoses were included. In 84.0% of cases, the primary diagnosis was reached by conducting investigations suggested by GPT-4. GPT-4 outperformed GPT-3.5 in all subspecialties except drug reactions. GPT-4 demonstrated the highest performance in infectious diseases and drug reactions, whereas it underperformed in cases of cognitive impairment. These findings indicate that GPT-4 can provide reasonably accurate diagnoses, comprehensive differential diagnoses, and appropriate investigations. However, its performance varies across subspecialties.

Conflict of interest statement

Conflicts of interest: The authors declare that they have no conflicts of interest related to the subject matter or materials discussed in this article.
