Comparative Performance of Medical Students, ChatGPT-3.5 and ChatGPT-4.0 in Answering Questions From a Brazilian National Medical Exam: Cross-Sectional Questionnaire Study
- PMID: 40607498
- PMCID: PMC12223693
- DOI: 10.2196/66552
Abstract
Background: Artificial intelligence has advanced significantly in various fields, including medicine, where tools like ChatGPT (GPT) have demonstrated remarkable capabilities in interpreting and synthesizing complex medical data. Since its launch in 2019, GPT has evolved, with version 4.0 offering enhanced processing power, image interpretation, and more accurate responses. In medicine, GPT has been used for diagnosis, research, and education, achieving significant milestones such as passing the United States Medical Licensing Examination. Recent studies show that GPT-4.0 outperforms earlier versions, and even medical students, on medical exams.
Objective: This study aimed to evaluate and compare the performance of GPT versions 3.5 and 4.0 on Brazilian Progress Tests (PT) from 2021 to 2023, analyzing their accuracy in comparison with that of medical students.
Methods: A cross-sectional observational study was conducted using 333 multiple-choice questions from the PT, excluding questions with images and those nullified or repeated. All questions were presented sequentially without modification to their structure. The performance of the two GPT versions was compared using statistical methods, and medical students' scores were included for context.
Results: There was a statistically significant difference in total performance scores across the 2021, 2022, and 2023 exams between GPT-3.5 and GPT-4.0 (P=.03). However, this significance did not remain after Bonferroni correction. On average, GPT-3.5 scored 68.4%, whereas GPT-4.0 achieved 87.2%, reflecting an absolute improvement of 18.8 percentage points and a relative increase of 27.4% in accuracy. When broken down by subject, the average scores for GPT-3.5 and GPT-4.0, respectively, were as follows: surgery (73.5% vs 88.0%, P=.03), basic sciences (77.5% vs 96.2%, P=.004), internal medicine (61.5% vs 75.1%, P=.14), gynecology and obstetrics (64.5% vs 94.8%, P=.002), pediatrics (58.5% vs 80.0%, P=.02), and public health (77.8% vs 89.6%, P=.02). After Bonferroni correction, only basic sciences and gynecology and obstetrics retained statistically significant differences.
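As a quick check of these figures (assuming the Bonferroni correction was applied across the six subject-level comparisons at a family-wise alpha of .05): the absolute improvement is 87.2% − 68.4% = 18.8 percentage points, and the relative increase is 18.8 / 68.4 ≈ 27%. The corrected significance threshold is .05 / 6 ≈ .0083, below which only basic sciences (P=.004) and gynecology and obstetrics (P=.002) fall, consistent with these being the only subjects that retained significance.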
Conclusions: GPT-4.0 demonstrates superior accuracy compared with its predecessor in answering medical questions on the PT. These results are consistent with those of other studies, suggesting that we are approaching a new revolution in medicine.
Keywords: AI; ChatGPT; academic performance; accuracy; artificial intelligence; biomedical technology; ethics; exam questions; intelligent systems; medical data; medical education; medical ethics; medical exam; medical school; medical student; observational study.
© Mateus Rodrigues Alessi, Heitor Augusto Gomes, Gabriel Oliveira, Matheus Lopes de Castro, Fabiano Grenteski, Leticia Miyashiro, Camila do Valle, Leticia Tozzini Tavares da Silva, Cristina Okamoto. Originally published in JMIR AI (https://ai.jmir.org).
