Surgery. 2024 Apr;175(4):936-942. doi: 10.1016/j.surg.2023.12.014. Epub 2024 Jan 20.

Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments


Brendin R Beaulieu-Jones et al. Surgery. 2024 Apr.

Abstract

Background: Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries.

Methods: We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized the reasons for model errors and assessed the stability of performance on repeat queries.
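For readers interested in automating a similar repeat-query protocol, the sketch below shows one way to pose a question in both formats and check answer stability across repeated submissions. This is a minimal illustration assuming the OpenAI Python client and a GPT-4-class model identifier; the study itself presented questions manually to the ChatGPT interface, and the sample question and prompt wording here are hypothetical.

```python
# Minimal sketch of a repeat-query consistency check.
# Assumptions: the OpenAI Python client is installed, OPENAI_API_KEY is set,
# and "gpt-4" is an available model id. The study used the ChatGPT web
# interface, so this only illustrates the protocol, not the authors' setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(question: str, choices: list[str] | None = None) -> str:
    """Pose a question open-ended (no choices) or multiple-choice and return the reply."""
    prompt = question
    if choices:
        prompt += "\n" + "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
        prompt += "\nSelect the single best answer."
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


# Repeat the same query several times and see whether the selected answer is stable.
question = "What is the most appropriate initial management of uncomplicated acute appendicitis?"  # hypothetical item
answers = [ask(question, choices=["Observation", "Antibiotics alone", "Appendectomy", "Percutaneous drainage"]) for _ in range(3)]
print("Consistent across repeats:", len(set(answers)) == 1)
```

In practice, the repeated outputs would still need review by human evaluators, since two differently worded responses can select the same answer.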

Results: A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple-choice questions and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6/16 (37.5%) of these questions.

Conclusion: Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. ChatGPT performed better on multiple-choice than open-ended questions, raising questions regarding its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants further investigation, including efforts to train large language models to provide the safe and consistent responses required for clinical application. Despite near or above human-level performance on question banks, given these observations it remains unclear whether large language models such as ChatGPT can safely assist clinicians in providing care.


Conflict of interest statement

Conflict of Interest/Disclosure: The authors have no relevant financial disclosures.

Figures

Figure 1. Accuracy of ChatGPT Output for Open-Ended and Multiple-Choice Questions
Surgical knowledge questions from SCORE and Data-B were presented to ChatGPT in two formats: open-ended (OE; left side) and multiple-choice (MC; right side). ChatGPT's outputs were assessed for accuracy by surgeon evaluators. A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% (119/167) and 67.9% (76/112) of multiple-choice SCORE and Data-B questions, respectively.

