Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers
- PMID: 40658983
- PMCID: PMC12279315
- DOI: 10.2196/64452
Abstract
Background: Clinical problem-solving requires processing of semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. As large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate nonlanguage evidence-based answers to clinical questions is inherently limited by tokenization.
Objective: This study aimed to evaluate LLMs' performance on two question types: numeric (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs in medical aspects and comparing their performance to humans.
Methods: To generate straightforward multiple-choice questions and answers (Q and As) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (containing data from more than 50,000 peer-reviewed studies) and created the EBM questions and answers (EBMQA) dataset. EBMQA comprises 105,222 Q and As, categorized by medical topics (eg, medical disciplines) and nonmedical topics (eg, question length), and classified into numerical or semantic types. We benchmarked a dataset of 24,000 Q and As on two state-of-the-art LLMs, GPT-4 (OpenAI) and Claude 3 Opus (Anthropic). We evaluated the LLMs' accuracy on semantic and numerical question types and according to sublabeled topics. In addition, we examined the question-answering rate of the LLMs by allowing them to abstain from responding to questions. For validation, we compared the results for 100 unrelated numerical EBMQA questions between six human medical experts and the two LLMs.
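The paper does not publish its benchmarking harness; the following is a minimal sketch of how a multiple-choice evaluation with an abstain option might be scored. The Question fields, the prompt wording, and the query_llm() helper are assumptions for illustration, not the authors' code.

```python
# Sketch of a multiple-choice benchmarking loop with an abstain option.
# Question fields, prompt wording, and query_llm() are hypothetical.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Question:
    stem: str
    options: list[str]      # answer choices, eg ["A) ...", "B) ...", ...]
    correct_index: int      # index of the evidence-based answer
    qtype: str              # "numerical" or "semantic"
    sublabel: str           # eg, medical discipline

def benchmark(questions: list[Question],
              query_llm: Callable[[str], Optional[int]]) -> dict:
    """Tally accuracy and answering rate; query_llm returns a choice index,
    or None when the model abstains ("I do not know")."""
    answered = correct = 0
    for q in questions:
        prompt = (q.stem + "\n" + "\n".join(q.options)
                  + "\nAnswer with one option, or say you do not know.")
        choice = query_llm(prompt)
        if choice is None:          # model abstained from answering
            continue
        answered += 1
        correct += int(choice == q.correct_index)
    return {
        "answering_rate": answered / len(questions) if questions else 0.0,
        "accuracy": correct / answered if answered else float("nan"),
    }
```

Accuracy is computed over answered questions only, so the answering rate and accuracy can be compared per sublabel, as in the Results below.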
Results: In an analysis of 24,542 Q and As, Claude 3 and GPT-4 performed better on semantic Q and As (68.7%, n=1593 and 68.4%, n=1709, respectively) than on numerical Q and As (61.3%, n=8583 and 56.7%, n=12,038, respectively), with Claude 3 outperforming GPT-4 in numerical accuracy (P<.001). A median accuracy gap of 7% (IQR 5%-10%) was observed between the best and worst sublabels per topic, with different LLMs excelling in different sublabels. Among medical discipline sublabels, Claude 3 performed well in neoplastic disorders but struggled with genitourinary disorders (69%, n=676 vs 58%, n=464; P<.0001), while GPT-4 excelled in cardiovascular disorders but struggled with neoplastic disorders (60%, n=1076 vs 53%, n=704; P=.0002). Furthermore, humans (82.3%, n=82.3) surpassed both Claude 3 (64.3%, n=64.3; P<.001) and GPT-4 (55.8%, n=55.8; P<.001) in the validation test. The Spearman correlation between question-answering rate and accuracy was not significant for either Claude 3 or GPT-4 (ρ=0.12, P=.69 and ρ=0.43, P=.13, respectively).
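The reported Spearman correlations between question-answering rate and accuracy could be reproduced along the following lines; the per-sublabel arrays here are illustrative placeholders, not the study's data.

```python
# Sketch: Spearman correlation between answering rate and accuracy across
# topic sublabels. The values below are placeholders, not the study's data.
from scipy.stats import spearmanr

answering_rate = [0.91, 0.88, 0.95, 0.84, 0.90]   # per-sublabel answering rates
accuracy       = [0.63, 0.58, 0.66, 0.61, 0.59]   # per-sublabel accuracies

rho, p_value = spearmanr(answering_rate, accuracy)
print(f"Spearman rho={rho:.2f}, P={p_value:.2f}")  # not significant if P > .05
```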
Conclusions: Both LLMs performed better on semantic than on numerical Q and As, with Claude 3 surpassing GPT-4 on numerical Q and As. However, both LLMs showed inter- and intramodel gaps across medical aspects and remained inferior to humans. In addition, their tendency to answer or abstain from a question does not reliably predict how accurately they perform when they do answer. Thus, their medical advice should be treated with caution.
Keywords: benchmark; dataset; evidence-based medicine; large language models; questions and answers.
© Eden Avnat, Michal Levy, Daniel Herstain, Elia Yanko, Daniel Ben Joya, Michal Tzuchman Katz, Dafna Eshel, Sahar Laros, Yael Dagan, Shahar Barami, Joseph Mermelstein, Shahar Ovadia, Noam Shomron, Varda Shalev, Raja-Elie E Abdulnour. Originally published in the Journal of Medical Internet Research (https://www.jmir.org).
Similar articles
- Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions. J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9. PMID: 39923296
- Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910. PMID: 40392576. Free PMC article.
- Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline. J Med Internet Res. 2025 Jul 11;27:e71916. doi: 10.2196/71916. PMID: 40644686. Free PMC article. Review.
- Large Language Models and Empathy: Systematic Review. J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597. PMID: 39661968. Free PMC article.