Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments

Dana Brin et al. Sci Rep. 2023 Oct 1;13(1):16492. doi: 10.1038/s41598-023-43436-9.

Abstract

The United States Medical Licensing Examination (USMLE) has been widely used to benchmark the performance of artificial intelligence (AI) models, but their performance on questions involving USMLE soft skills remains unexplored. This study evaluated ChatGPT and GPT-4 on USMLE-style questions involving communication skills, ethics, empathy, and professionalism. We used 80 such questions, taken from the USMLE website and the AMBOSS question bank, and issued a follow-up query after each answer to assess the models' consistency. The performance of the AI models was compared to that of previous AMBOSS users. GPT-4 outperformed ChatGPT, correctly answering 90% of questions compared to ChatGPT's 62.5%. GPT-4 also showed greater confidence, revising none of its responses, whereas ChatGPT modified its original answers 82.5% of the time. GPT-4's performance also exceeded that of AMBOSS's past users. Both AI models, notably GPT-4, showed a capacity for empathy, indicating AI's potential to meet the complex interpersonal, ethical, and professional demands intrinsic to the practice of medicine.
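The evaluation described above — scoring first-pass answers and re-querying each model to measure how often it revises them — can be sketched as a small scoring helper. This is a hypothetical illustration, not the authors' code; the `Item` structure and toy data are assumptions for demonstration only.

```python
# Hypothetical scoring helper for the evaluation described in the abstract:
# each question records the correct choice, the model's first answer, and the
# answer it gives after a follow-up consistency query.
from dataclasses import dataclass


@dataclass
class Item:
    correct: str          # correct answer choice, e.g. "C"
    first_answer: str     # model's initial answer
    followup_answer: str  # answer after the consistency re-query


def score(items):
    """Return (accuracy of first answers, fraction of answers revised)."""
    n = len(items)
    accuracy = sum(it.first_answer == it.correct for it in items) / n
    revised = sum(it.first_answer != it.followup_answer for it in items) / n
    return accuracy, revised


# Toy data: 4 questions, 3 answered correctly, 1 revised on follow-up.
items = [
    Item("A", "A", "A"),
    Item("B", "B", "B"),
    Item("C", "C", "C"),
    Item("D", "A", "D"),  # wrong at first, changed after the follow-up query
]
print(score(items))  # → (0.75, 0.25)
```

With metrics like these, GPT-4's reported behavior corresponds to an accuracy of 0.90 with a revision rate of 0.0, versus 0.625 and 0.825 for ChatGPT.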


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1. Performance of ChatGPT and GPT-4 on USMLE sample exam and AMBOSS questions.
