Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers

Eden Avnat et al. J Med Internet Res. 2025 Jul 14;27:e64452. doi: 10.2196/64452.
Abstract

Background: Clinical problem-solving requires processing of semantic medical knowledge, such as illness scripts, and of numerical medical knowledge about diagnostic tests for evidence-based decision-making. Although large language models (LLMs) show promising results in many language-based aspects of clinical practice, their ability to generate nonlanguage, evidence-based answers to clinical questions may be inherently limited by tokenization.

Objective: This study aimed to evaluate LLMs' performance on two question types, numeric (correlating findings) and semantic (differentiating entities), examining differences within and between LLMs across medical aspects and comparing their performance to that of humans.

Methods: To generate straightforward multiple-choice questions and answers (Q and As) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (containing data from more than 50,000 peer-reviewed studies) and created the evidence-based medicine questions and answers (EBMQA) dataset. EBMQA comprises 105,222 Q and As, categorized by medical topics (eg, medical disciplines) and nonmedical topics (eg, question length), and classified into numerical or semantic types. We benchmarked a dataset of 24,000 Q and As on two state-of-the-art LLMs, GPT-4 (OpenAI) and Claude 3 Opus (Anthropic). We evaluated each LLM's accuracy on semantic and numerical question types and according to sublabeled topics. In addition, we examined the question-answering rate of the LLMs by allowing them to abstain from responding to questions. For validation, we compared the results for 100 unrelated numerical EBMQA questions between six human medical experts and the two LLMs.
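
To make the benchmarking setup concrete, the sketch below shows one way such a multiple-choice evaluation loop with an abstain option could be implemented. It is an illustrative sketch only, not the authors' actual pipeline: the ask_llm wrapper, the prompt wording, and the "IDK" abstain token are assumptions introduced for this example.

# Minimal sketch of a multiple-choice benchmarking loop with an abstain option.
# ask_llm() is a hypothetical placeholder for a call to GPT-4 or Claude 3 Opus.

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API call; returns the model's raw text."""
    raise NotImplementedError  # placeholder; substitute a real API call here

def benchmark(questions: list[dict]) -> tuple[float, float]:
    """Score a list of Q and As; each dict has 'stem', 'options' (list of str), 'answer' (index)."""
    answered = correct = 0
    for q in questions:
        letters = "ABCDE"[: len(q["options"])]
        option_block = "\n".join(f"{letter}. {text}" for letter, text in zip(letters, q["options"]))
        prompt = (
            f"{q['stem']}\n{option_block}\n"
            "Answer with a single option letter, or reply IDK if you prefer not to answer."
        )
        reply = ask_llm(prompt).strip().upper()
        if reply.startswith("IDK"):   # model abstained; excluded from accuracy
            continue
        answered += 1
        if reply[:1] == letters[q["answer"]]:
            correct += 1
    accuracy = correct / answered if answered else 0.0
    answering_rate = answered / len(questions) if questions else 0.0
    return accuracy, answering_rate

Under these assumptions, accuracy is computed only over the questions the model chose to answer, which is what makes a separate question-answering rate meaningful.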

Results: In an analysis of 24,542 Q and As, Claude 3 and GPT-4 performed better on semantic Q and As (68.7%, n=1593 and 68.4%, n=1709, respectively) than on numerical Q and As (61.3%, n=8583 and 56.7%, n=12,038, respectively), with Claude 3 outperforming GPT-4 in numerical accuracy (P<.001). A median accuracy gap of 7% (IQR 5%-10%) was observed between the best and worst sublabels per topic, with different LLMs excelling in different sublabels. Within the medical discipline sublabels, Claude 3 performed well in neoplastic disorders but struggled with genitourinary disorders (69%, n=676 vs 58%, n=464; P<.0001), while GPT-4 excelled in cardiovascular disorders but struggled with neoplastic disorders (60%, n=1076 vs 53%, n=704; P=.0002). Furthermore, humans (82.3%) surpassed both Claude 3 (64.3%; P<.001) and GPT-4 (55.8%; P<.001) in the validation test. The Spearman correlation between question-answering rate and accuracy was not significant for either Claude 3 or GPT-4 (ρ=0.12, P=.69 and ρ=0.43, P=.13, respectively).
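
The sketch below illustrates how comparisons of this kind could be computed in Python with scipy. The abstract does not state which test the authors used to compare accuracies, so the chi-square test on correct/incorrect counts is an assumption, and the listed rates are placeholder values, not study data.

# Illustrative statistics for the kinds of comparisons reported above.
from scipy.stats import chi2_contingency, spearmanr

def compare_accuracies(correct_a: int, total_a: int, correct_b: int, total_b: int) -> float:
    """Chi-square test on a 2x2 table of correct/incorrect counts for two models; returns the P value."""
    table = [[correct_a, total_a - correct_a],
             [correct_b, total_b - correct_b]]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

# Spearman correlation between per-sublabel answering rate and accuracy for one model.
answering_rates = [0.91, 0.85, 0.88, 0.79]  # placeholder values, one per sublabel
accuracies = [0.64, 0.58, 0.69, 0.60]       # placeholder values, one per sublabel
rho, p = spearmanr(answering_rates, accuracies)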

Conclusions: Both LLMs performed better on semantic than on numerical Q and As, with Claude 3 surpassing GPT-4 on numerical Q and As. However, both LLMs showed inter- and intramodel gaps across different medical aspects and remained inferior to humans. In addition, their decision to answer or abstain from a question does not reliably predict how accurately they perform when they do answer. Thus, their medical advice should be interpreted with caution.

Keywords: benchmark; dataset; evidence-based medicine; large language models; questions and answers.


Conflict of interest statement

Conflicts of Interest: The authors EA, ML, DH, DBJ, MTK, DE, SL, YD, SB, JM, and SO are paid employees of Kahun Ltd. All other authors declare no financial or nonfinancial competing interests.

Figures

Figure 1. Flowchart of the study: from Kahun's knowledge graph, which references source, target, and background as edges of the graph (1-2), to the evidence-based medicine question and answer dataset and the large language model benchmarking (3-4), which includes both numeric and semantic questions and answers.


