[Preprint]. 2025 Apr 20:2025.04.11.25325686.
doi: 10.1101/2025.04.11.25325686.

REASONING BEYOND ACCURACY: EXPERT EVALUATION OF LARGE LANGUAGE MODELS IN DIAGNOSTIC PATHOLOGY

Asim Waqas et al. medRxiv.

Abstract

Background: Diagnostic pathology depends on complex, structured reasoning to interpret clinical, histologic, and molecular data. Replicating this cognitive process algorithmically remains a significant challenge. As large language models (LLMs) gain traction in medicine, it is critical to determine whether they can deliver clinically useful reasoning in highly specialized domains such as pathology.

Methods: We evaluated the performance of four reasoning LLMs (OpenAI o1, OpenAI o3-mini, Gemini 2.0 Flash Thinking Experimental, and DeepSeek-R1 671B) on 15 board-style open-ended pathology questions. Responses were independently reviewed by 11 pathologists using a structured framework that assessed language quality (accuracy, relevance, coherence, depth, and conciseness) and seven diagnostic reasoning strategies. Scores were normalized and aggregated for analysis. We also evaluated inter-observer agreement to assess scoring consistency. Model comparisons were conducted using one-way ANOVA and Tukey's Honestly Significant Difference (HSD) test.
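The authors do not publish their analysis code, but the reported comparison (one-way ANOVA across models followed by Tukey's HSD post-hoc test on normalized scores) could be reproduced along these lines; the file name, column names, and long-format layout below are assumptions for illustration only.

```python
# Sketch of the reported statistical comparison: one-way ANOVA across the four
# models followed by Tukey's Honestly Significant Difference (HSD) test.
# File and column names ("normalized_scores.csv", "model", "score") are assumptions.
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# One row per (rater, question, model, criterion) with a normalized score in [0, 1].
scores = pd.read_csv("normalized_scores.csv")  # hypothetical file

# One-way ANOVA: does mean reasoning quality differ across models?
groups = [g["score"].dropna().to_numpy() for _, g in scores.groupby("model")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")

# Tukey's HSD for pairwise model comparisons at alpha = 0.05.
complete = scores.dropna(subset=["score"])
tukey = pairwise_tukeyhsd(endog=complete["score"], groups=complete["model"], alpha=0.05)
print(tukey.summary())
```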

Results: Gemini and DeepSeek significantly outperformed OpenAI o1 and OpenAI o3-mini in overall reasoning quality (p < 0.05), particularly in analytical depth and coherence. While all models achieved comparable accuracy, only Gemini and DeepSeek consistently applied expert-like reasoning strategies, including algorithmic, inductive, and Bayesian approaches. Performance varied by reasoning type: models performed best in algorithmic and deductive reasoning and poorest in heuristic and pattern recognition. Inter-observer agreement was highest for Gemini (p < 0.05), indicating greater consistency and interpretability. Models with more in-depth reasoning (Gemini and DeepSeek) were generally less concise.

Conclusion: Advanced LLMs such as Gemini and DeepSeek can approximate aspects of expert-level diagnostic reasoning in pathology, particularly in algorithmic and structured approaches. However, limitations persist in contextual reasoning, heuristic decision-making, and consistency across questions. Addressing these gaps, along with trade-offs between depth and conciseness, will be essential for the safe and effective integration of AI tools into clinical pathology workflows.

Keywords: AI Evaluation; Clinical Reasoning; Generative AI; Pathology; Reasoning Large Language Models.

Figures

Figure 1: Evaluation Framework for Assessing Diagnostic Reasoning in Large Language Models.
Fifteen open-ended diagnostic pathology questions, reflecting the complexity of board licensing examinations, were independently submitted to four LLMs: OpenAI o1, OpenAI o3-mini, Gemini 2.0 Flash Thinking Experimental (Gemini), and DeepSeek-R1 671B (DeepSeek). Each response was evaluated by 11 expert pathologists using a structured rubric comprising 12 metrics across two domains: (1) language quality and response structure (relevance, coherence, accuracy, depth, and conciseness) and (2) diagnostic reasoning strategies (pattern recognition, algorithmic, deductive, inductive/hypothetico-deductive, heuristic, mechanistic, and Bayesian reasoning). Pathologists were blinded to model identity. Evaluation scores were aggregated and normalized to account for missing data and served as the basis for both model performance comparisons and inter-observer agreement analysis.
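The normalization and aggregation step described in this caption is not specified in code by the authors; a minimal sketch, assuming a long-format table of raw rubric scores with missing ratings stored as NaN, might look like this:

```python
# Sketch of per-criterion min-max normalization and NaN-aware aggregation.
# File and column names ("rubric_scores.csv", "model", "criterion", "score")
# are assumptions, not the authors' actual data layout.
import pandas as pd

ratings = pd.read_csv("rubric_scores.csv")  # hypothetical file; NaN marks missing ratings

# Rescale each criterion to the 0-1 range so scores are comparable across metrics.
def min_max(x: pd.Series) -> pd.Series:
    return (x - x.min()) / (x.max() - x.min())

ratings["norm"] = ratings.groupby("criterion")["score"].transform(min_max)

# Average per model and criterion; pandas skips NaN by default, so missing
# ratings do not penalize a model's mean.
model_means = ratings.groupby(["model", "criterion"])["norm"].mean()
print(model_means.unstack("criterion").round(2))
```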
Figure 2: Comparative Performance of LLMs Across Language Quality and Diagnostic Reasoning Domains.
Radar plots summarize the normalized mean scores (range: 0 to 1) assigned by 11 pathologists for each model across 12 evaluation criteria. Panel A: Language quality metrics, including accuracy, relevance, analytical depth, coherence, conciseness, and cumulative scores. Panel B: Diagnostic reasoning strategies, including pattern recognition, algorithmic reasoning, deductive reasoning, inductive/hypothetico-deductive reasoning, heuristic reasoning, mechanistic insights, and Bayesian reasoning. Gemini consistently outperformed other models across both domains, particularly in analytical depth and structured reasoning. OpenAI o1 showed the greatest variability in performance across metrics.
Figure 3: LLM Performance on Language Quality Metrics.
Normalized average scores (range: 0 to 1) across five core language quality dimensions—accuracy, relevance, coherence, analytical depth, and conciseness—based on ratings from 11 pathologists across 15 pathology questions. Panels A–F: Overall average performance (A), followed by Accuracy (B), Relevance (C), Coherence (D), Analytical Depth (E), and Conciseness (F). Gemini consistently achieved the highest scores across all metrics, with the greatest variability observed in coherence and analytical depth.
Figure 4: Diagnostic Reasoning Performance by Reasoning Type.
Normalized mean scores (range: 0 to 1) for each LLM across seven diagnostic reasoning strategies based on expert evaluation of pathology-related questions. Panels A–H: Cumulative reasoning performance (A), Pattern Recognition (B), Algorithmic Reasoning (C), Deductive Reasoning (D), Inductive/Hypothetico-Deductive Reasoning (E), Bayesian Reasoning (F), Heuristic Reasoning (G), and Mechanistic Insights (H). Gemini and DeepSeek consistently outperformed the OpenAI models across most reasoning types, with particularly strong performance in algorithmic, inductive, and mechanistic reasoning. Heuristic and Bayesian reasoning yielded the lowest scores across all models, reflecting challenges with uncertainty-driven and experiential inference.
Figure 5:
Percent agreement across 720 unique combinations of question, model, and evaluation criterion (Q–M–C), reflecting the proportion of raters who selected the most common score. Panel A: Distribution of percent agreement across all Q–M–C combinations. Panel B: Model-specific distributions of agreement. Gemini achieved significantly higher inter-observer agreement than all other models (p < 0.001), suggesting greater consistency and interpretability of its outputs. No statistically significant differences were observed in pairwise testing among DeepSeek, OpenAI o1, and OpenAI o3-mini.
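The percent-agreement statistic in this figure, i.e. the share of the 11 raters who chose the modal score for each question-model-criterion cell (15 questions x 4 models x 12 criteria = 720 cells), could be computed roughly as follows; the data layout and column names are assumptions.

```python
# Sketch of the percent-agreement metric: for each question-model-criterion
# (Q-M-C) combination, the fraction of raters who selected the most common score.
# File and column names are assumptions.
import pandas as pd

ratings = pd.read_csv("rubric_scores.csv")  # hypothetical file

def percent_agreement(cell_scores: pd.Series) -> float:
    """Proportion of raters choosing the modal score within one Q-M-C cell."""
    cell_scores = cell_scores.dropna()
    return cell_scores.value_counts().iloc[0] / len(cell_scores)

agreement = (
    ratings.groupby(["question", "model", "criterion"])["score"]
           .apply(percent_agreement)
)

# Model-specific agreement distributions, as in Panel B.
print(agreement.groupby(level="model").describe())
```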
