Eur J Clin Invest. 2025 Nov;55(11):e70113.
doi: 10.1111/eci.70113. Epub 2025 Sep 8.

Influence of medical educational background on the diagnostic quality of ChatGPT-4 responses in internal medicine: A pilot study

Nicolò Gilardi et al. Eur J Clin Invest. 2025 Nov.

Abstract

This pilot study evaluated the influence of medical background on the diagnostic quality of ChatGPT-4's responses in Internal Medicine. Third-year students, residents and specialists summarised five complex NEJM clinical cases before querying ChatGPT-4. Diagnostic ranking, assessed by independent experts, revealed that residents significantly outperformed students (OR 2.33, p = .007), although overall performance was low. These findings indicate that user expertise and concise case summaries are critical for optimising AI diagnostics, highlighting the need for enhanced AI training and user interaction strategies.

Keywords: ChatGPT‐4; artificial intelligence; clinical decision making; diagnostic ranking; internal medicine; large language models.

Conflict of interest statement

Two study co‐authors of this manuscript are members of the Editorial Board of European Journal of Clinical Investigation. Fabrizio Montecucco is Editor in Chief of European Journal of Clinical Investigation and a co‐author of this article; Federico Carbone is an Editorial Board member of European Journal of Clinical Investigation and a co‐author of this article. To minimize bias, they were excluded from all editorial decision‐making related to the acceptance of this article for publication. The remaining authors declare no significant conflict of interest with this study.

Figures

FIGURE 1
Study design flowchart.
FIGURE 2
Interrater agreement and study results. (A) Scatter plots of pairwise rater scores for the variable ‘diagnostic ranking’; the red dashed line represents the least‐squares regression line, with grey shading showing the 95% confidence intervals of the correlations. Spearman's rank coefficients are .91, .91 and .92, respectively. (B) Violin plots of the diagnostic ranking scores according to the evaluating groups; grey dots represent individual diagnostic ranking scores assigned by the independent raters, whereas the thick red dots represent group means. Data distributions are shown by the steel blue shading. (C) Forest plot of odds ratios for each group contrast; *p‐value <.05. Residents versus Students: OR 2.33, 95% confidence interval (CI) = 1.27–4.28, p‐value = .007. Consultants versus Students: OR 1.42, 95% CI = .77–2.61, p‐value = .258.
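
As a rough consistency check on the odds ratios reported for Figure 2C, the sketch below back-calculates the standard error from each 95% CI on the log-odds scale and derives a two-sided p-value. It assumes Wald-type (normal-approximation) intervals, which the source does not state, and the function name `wald_check` is hypothetical; it is not the authors' analysis.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def wald_check(odds_ratio, ci_low, ci_high):
    """Back-calculate SE, z and a two-sided p-value from an OR and its 95% CI.

    Assumes a Wald interval, log(OR) +/- Z_95 * SE. This is an assumption
    made for illustration; the source does not describe how the CI was built.
    """
    log_or = math.log(odds_ratio)
    # Width of the CI on the log scale equals 2 * Z_95 * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    z = log_or / se
    # Two-sided tail probability of a standard normal at |z|.
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# Values copied from the Figure 2C caption: (OR, CI low, CI high, reported p).
reported = {
    "Residents vs Students": (2.33, 1.27, 4.28, 0.007),
    "Consultants vs Students": (1.42, 0.77, 2.61, 0.258),
}

for contrast, (or_, lo, hi, p_reported) in reported.items():
    se, z, p = wald_check(or_, lo, hi)
    print(f"{contrast}: SE={se:.3f}, z={z:.2f}, p={p:.3f} (reported {p_reported})")
```

Under this assumption the recomputed p-values should fall close to the reported ones; small differences are expected because the published OR and CI bounds are rounded to two decimals.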
