Artif Intell Med. 2024 Sep;155:102938.
doi: 10.1016/j.artmed.2024.102938. Epub 2024 Jul 31.

MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering

Iñigo Alonso et al. Artif Intell Med. 2024 Sep.

Abstract

Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology that assists medical experts with interactive decision support. This potential is illustrated by the state-of-the-art performance LLMs have obtained in Medical Question Answering, with striking results such as passing marks in medical licensing exams. However, while impressive, the quality bar required for medical applications remains far from being met. LLMs are still challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks for assessing medical knowledge lack reference gold explanations, which makes it impossible to evaluate the reasoning behind LLMs' predictions. Finally, the situation is particularly grim for languages other than English, for which benchmarking LLMs remains, as far as we know, a totally neglected topic. To address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams for evaluating LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA is the first such benchmark to include reference gold explanations, written by medical doctors, of both the correct and incorrect options in the exams. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches shows that LLM performance, with best results of around 75% accuracy for English, still leaves large room for improvement, especially for languages other than English, where accuracy drops by 10 points. Thus, despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge in a way that positively impacts downstream Medical Question Answering evaluations. Data, code, and fine-tuned models will be made publicly available.
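
To make the evaluation setup described above concrete, the following is a minimal Python sketch of how multiple-choice exam accuracy can be scored in a RAG-style pipeline: retrieved passages are paired with each question, a model picks an option, and accuracy is the fraction of questions answered with the gold option. This is not the authors' code; the ExamQuestion fields and the retrieve and answer functions are hypothetical placeholders standing in for a real retriever and LLM.

from dataclasses import dataclass

@dataclass
class ExamQuestion:
    question: str
    options: dict[str, str]   # hypothetical: option letter -> option text
    gold: str                 # correct option letter
    explanation: str = ""     # doctor-written gold explanation (unused here)

def retrieve(question: str, k: int = 3) -> list[str]:
    # Placeholder retriever: a real RAG system would query a medical corpus
    # (e.g. via BM25 or dense retrieval) for the top-k relevant passages.
    return ["(retrieved medical passage)"] * k

def answer(q: ExamQuestion, context: list[str]) -> str:
    # Placeholder LLM call: a real system would prompt the model with the
    # retrieved context plus the question and options, then parse the option
    # letter it picks. Hard-coded to "A" so this sketch runs end to end.
    return "A"

def accuracy(exam: list[ExamQuestion]) -> float:
    # Fraction of questions where the predicted option matches the gold one.
    correct = sum(answer(q, retrieve(q.question)) == q.gold for q in exam)
    return correct / len(exam)

if __name__ == "__main__":
    toy_exam = [
        ExamQuestion(
            question="Which vitamin deficiency causes scurvy?",
            options={"A": "Vitamin C", "B": "Vitamin D"},
            gold="A",
        )
    ]
    print(f"Accuracy: {accuracy(toy_exam):.0%}")  # -> Accuracy: 100%

In this framing, swapping the placeholder retriever or model changes only retrieve and answer, which mirrors how the paper compares gold-explanation and RAG conditions over the same exam questions.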

Keywords: Large Language Models; Medical Question Answering; Multilinguality; Natural Language Processing; Retrieval Augmented Generation.


Conflict of interest statement

Declaration of competing interest The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Rodrigo Agerri reports that financial support was provided by the Spanish Ministry of Science and Innovation. The other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
