[Preprint]. 2025 Apr 20:2025.04.11.25325686.
doi: 10.1101/2025.04.11.25325686.

REASONING BEYOND ACCURACY: EXPERT EVALUATION OF LARGE LANGUAGE MODELS IN DIAGNOSTIC PATHOLOGY

Asim Waqas et al. medRxiv.

Abstract

Background: Diagnostic pathology depends on complex, structured reasoning to interpret clinical, histologic, and molecular data. Replicating this cognitive process algorithmically remains a significant challenge. As large language models (LLMs) gain traction in medicine, it is critical to determine whether they can deliver clinically useful reasoning in highly specialized domains such as pathology.

Methods: We evaluated the performance of four reasoning LLMs (OpenAI o1, OpenAI o3-mini, Gemini 2.0 Flash Thinking Experimental, and DeepSeek-R1 671B) on 15 board-style open-ended pathology questions. Responses were independently reviewed by 11 pathologists using a structured framework that assessed language quality (accuracy, relevance, coherence, depth, and conciseness) and seven diagnostic reasoning strategies. Scores were normalized and aggregated for analysis. We also evaluated inter-observer agreement to assess scoring consistency. Model comparisons were conducted using one-way ANOVA and Tukey's Honestly Significant Difference (HSD) test.
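The authors do not publish their analysis code, but the reported comparison (one-way ANOVA across models followed by Tukey's HSD post-hoc test on normalized scores) could be reproduced along these lines; the file name, column names, and long-format layout below are assumptions for illustration only.

```python
# Sketch of the reported statistical comparison: one-way ANOVA across the four
# models followed by Tukey's Honestly Significant Difference (HSD) test.
# File and column names ("normalized_scores.csv", "model", "score") are assumptions.
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One row per (rater, question, model, criterion) with a normalized score in [0, 1].
scores = pd.read_csv("normalized_scores.csv")  # hypothetical file

# One-way ANOVA: does mean reasoning quality differ across models?
groups = [g["score"].dropna().to_numpy() for _, g in scores.groupby("model")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")

# Tukey's HSD for pairwise model comparisons at alpha = 0.05.
complete = scores.dropna(subset=["score"])
tukey = pairwise_tukeyhsd(endog=complete["score"], groups=complete["model"], alpha=0.05)
print(tukey.summary())
```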

Results: Gemini and DeepSeek significantly outperformed OpenAI o1 and OpenAI o3-mini in overall reasoning quality (p < 0.05), particularly in analytical depth and coherence. While all models achieved comparable accuracy, only Gemini and DeepSeek consistently applied expert-like reasoning strategies, including algorithmic, inductive, and Bayesian approaches. Performance varied by reasoning type: models performed best in algorithmic and deductive reasoning and poorest in heuristic and pattern recognition. Inter-observer agreement was highest for Gemini (p < 0.05), indicating greater consistency and interpretability. Models with more in-depth reasoning (Gemini and DeepSeek) were generally less concise.

Conclusion: Advanced LLMs such as Gemini and DeepSeek can approximate aspects of expert-level diagnostic reasoning in pathology, particularly in algorithmic and structured approaches. However, limitations persist in contextual reasoning, heuristic decision-making, and consistency across questions. Addressing these gaps, along with trade-offs between depth and conciseness, will be essential for the safe and effective integration of AI tools into clinical pathology workflows.

Keywords: AI Evaluation; Clinical Reasoning; Generative AI; Pathology; Reasoning Large Language Models.

Figures

Figure 1: Evaluation Framework for Assessing Diagnostic Reasoning in Large Language Models.
Fifteen open-ended diagnostic pathology questions, reflecting the complexity of board licensing examinations, were independently submitted to four LLMs: OpenAI o1, OpenAI o3-mini, Gemini 2.0 Flash Thinking Experimental (Gemini), and DeepSeek-R1 671B (DeepSeek). Each response was evaluated by 11 expert pathologists using a structured rubric comprising 12 metrics across two domains: (1) language quality and response structure (relevance, coherence, accuracy, depth, and conciseness) and (2) diagnostic reasoning strategies (pattern recognition, algorithmic, deductive, inductive/hypothetico-deductive, heuristic, mechanistic, and Bayesian reasoning). Pathologists were blinded to model identity. Evaluation scores were aggregated and normalized to account for missing data and served as the basis for both model performance comparisons and inter-observer agreement analysis.
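The normalization and aggregation step described in this caption is not specified in code by the authors; a minimal sketch, assuming a long-format table of raw rubric scores with missing ratings stored as NaN, might look like this:

```python
# Sketch of per-criterion min-max normalization and NaN-aware aggregation.
# File and column names ("rubric_scores.csv", "model", "criterion", "score")
# are assumptions, not the authors' actual data layout.
import pandas as pd

ratings = pd.read_csv("rubric_scores.csv")  # hypothetical file; NaN marks missing ratings

# Rescale each criterion to the 0-1 range so scores are comparable across metrics.
def min_max(x: pd.Series) -> pd.Series:
    return (x - x.min()) / (x.max() - x.min())

ratings["norm"] = ratings.groupby("criterion")["score"].transform(min_max)

# Average per model and criterion; pandas skips NaN by default, so missing
# ratings do not penalize a model's mean.
model_means = ratings.groupby(["model", "criterion"])["norm"].mean()
print(model_means.unstack("criterion").round(2))
```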
Figure 2: Comparative Performance of LLMs Across Language Quality and Diagnostic Reasoning Domains.
Radar plots summarize the normalized mean scores (range: 0 to 1) assigned by 11 pathologists for each model across 12 evaluation criteria. Panel A: Language quality metrics, including accuracy, relevance, analytical depth, coherence, conciseness, and cumulative scores. Panel B: Diagnostic reasoning strategies, including pattern recognition, algorithmic reasoning, deductive reasoning, inductive/hypothetico-deductive reasoning, heuristic reasoning, mechanistic insights, and Bayesian reasoning. Gemini consistently outperformed other models across both domains, particularly in analytical depth and structured reasoning. OpenAI o1 showed the greatest variability in performance across metrics.
Figure 3: LLM Performance on Language Quality Metrics.
Normalized average scores (range: 0 to 1) across five core language quality dimensions—accuracy, relevance, coherence, analytical depth, and conciseness—based on ratings from 11 pathologists across 15 pathology questions. Panels A–F: Overall average performance (A), followed by Accuracy (B), Relevance (C), Coherence (D), Analytical Depth (E), and Conciseness (F). Gemini consistently achieved the highest scores across all metrics, with the greatest variability observed in coherence and analytical depth.
Figure 4: Diagnostic Reasoning Performance by Reasoning Type.
Normalized mean scores (range: 0 to 1) for each LLM across seven diagnostic reasoning strategies based on expert evaluation of pathology-related questions. Panels A–H: Cumulative reasoning performance (A), Pattern Recognition (B), Algorithmic Reasoning (C), Deductive Reasoning (D), Inductive/Hypothetico-Deductive Reasoning (E), Bayesian Reasoning (F), Heuristic Reasoning (G), and Mechanistic Insights (H). Gemini and DeepSeek consistently outperformed the OpenAI models across most reasoning types, with particularly strong performance in algorithmic, inductive, and mechanistic reasoning. Heuristic and Bayesian reasoning yielded the lowest scores across all models, reflecting challenges with uncertainty-driven and experiential inference.
Figure 5:
Percent agreement across 720 unique combinations of question, model, and evaluation criterion (Q–M–C), reflecting the proportion of raters who selected the most common score. Panel A: Distribution of percent agreement across all Q–M–C combinations. Panel B: Model-specific distributions of agreement. Gemini achieved significantly higher inter-observer agreement than all other models (p < 0.001), suggesting greater consistency and interpretability of its outputs. No statistically significant differences were observed in pairwise testing among DeepSeek, OpenAI o1, and OpenAI o3-mini.
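The percent-agreement statistic in this figure, i.e. the share of the 11 raters who chose the modal score for each question-model-criterion cell (15 questions x 4 models x 12 criteria = 720 cells), could be computed roughly as follows; the data layout and column names are assumptions.

```python
# Sketch of the percent-agreement metric: for each question-model-criterion
# (Q-M-C) combination, the fraction of raters who selected the most common score.
# File and column names are assumptions.
import pandas as pd

ratings = pd.read_csv("rubric_scores.csv")  # hypothetical file

def percent_agreement(cell_scores: pd.Series) -> float:
    """Proportion of raters choosing the modal score within one Q-M-C cell."""
    cell_scores = cell_scores.dropna()
    return cell_scores.value_counts().iloc[0] / len(cell_scores)

agreement = (
    ratings.groupby(["question", "model", "criterion"])["score"]
           .apply(percent_agreement)
)

# Model-specific agreement distributions, as in Panel B.
print(agreement.groupby(level="model").describe())
```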
