Language Artificial Intelligence Models as Pioneers in Diagnostic Medicine? A Retrospective Analysis on Real-Time Patients

Azka Naeem et al. J Clin Med. 2025 Feb 10;14(4):1131. doi: 10.3390/jcm14041131.

Abstract

Background/Objectives: GPT-3.5 and GPT-4 have shown promise in assisting healthcare professionals with clinical questions, but their performance in real-time clinical scenarios remains underexplored. This study evaluates their precision and reliability compared with those of board-certified emergency department attendings, highlighting their potential to improve patient care. We hypothesized that board-certified emergency department attendings at Maimonides Medical Center exhibit higher accuracy and reliability than GPT-3.5 and GPT-4 in generating differential diagnoses from the history and physical examination of patients presenting to the emergency department.

Methods: Real-time patient data from Maimonides Medical Center's emergency department, collected from 1 January 2023 to 1 March 2023, were analyzed. Demographic details, symptoms, medical history, and discharge diagnoses recorded by emergency room attendings were examined. The AI models (ChatGPT-3.5 and ChatGPT-4) generated differential diagnoses, which were compared with those of the attending physicians. Accuracy was determined by comparing each rater's diagnoses with the gold-standard discharge diagnosis and calculating the proportion of correctly identified cases. Precision was assessed using Cohen's kappa coefficient and the Intraclass Correlation Coefficient to measure agreement between raters.

Results: The mean patient age was 49.12 years; 57.3% were male and 42.7% female. Chief complaints included fever/sepsis (24.7%), gastrointestinal issues (17.7%), and cardiovascular problems (16.4%). Diagnostic accuracy against discharge diagnoses was highest for ChatGPT-4 (85.5%), followed by ChatGPT-3.5 (84.6%) and ED attendings (83%). Cohen's kappa demonstrated moderate agreement (0.7) between the AI models, with lower agreement observed for ED attendings. Stratified analysis revealed higher accuracy for gastrointestinal complaints with ChatGPT-4 (87.5%) and for cardiovascular complaints with ChatGPT-3.5 (81.34%).

Conclusions: Our study demonstrates that ChatGPT-4 and ChatGPT-3.5 exhibit diagnostic accuracy comparable to that of board-certified emergency department attendings, highlighting their potential to aid decision-making in dynamic clinical settings. The stratified analysis showed comparable reliability and precision of the AI chatbots for cardiovascular complaints, which represent a significant proportion of high-risk patients presenting to the emergency department, and provided targeted insights into rater performance within specific medical domains. This study contributes to integrating AI models into medical practice, enhancing efficiency and effectiveness in clinical decision-making. Further research is warranted to explore broader applications of AI in healthcare.
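As a minimal, hypothetical sketch of the metrics described in the Methods (this is not the authors' analysis code; the diagnoses and rater lists below are invented for illustration, and the Intraclass Correlation Coefficient is omitted), accuracy against the discharge diagnosis and Cohen's kappa between two raters could be computed in Python as follows:

    # Sketch of the evaluation metrics from the abstract (illustrative only).
    from collections import Counter

    def accuracy(rater, gold):
        """Proportion of cases where the rater's diagnosis matches the gold-standard discharge diagnosis."""
        assert len(rater) == len(gold)
        return sum(r == g for r, g in zip(rater, gold)) / len(gold)

    def cohens_kappa(rater_a, rater_b):
        """Cohen's kappa: chance-corrected agreement between two raters."""
        assert len(rater_a) == len(rater_b)
        n = len(rater_a)
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n        # observed agreement
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        labels = set(counts_a) | set(counts_b)
        p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)  # agreement expected by chance
        return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

    # Hypothetical example: gold-standard discharge diagnoses and two raters on five patients.
    gold      = ["sepsis", "GI bleed", "ACS", "sepsis", "stroke"]
    gpt4      = ["sepsis", "GI bleed", "ACS", "UTI",    "stroke"]
    attending = ["sepsis", "gastritis", "ACS", "sepsis", "TIA"]

    print(accuracy(gpt4, gold))            # 0.8 (4 of 5 correct)
    print(cohens_kappa(gpt4, attending))   # ~0.32, agreement between the two raters

The same functions would be applied per rater (ChatGPT-3.5, ChatGPT-4, ED attending) over the full cohort, and again within each chief-complaint stratum for the stratified analysis.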

Keywords: artificial intelligence; cardiology; emergency department; gastrointestinal.


Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1. Flowchart showing the selection of the patients.

