Towards accurate differential diagnosis with large language models

Daniel McDuff et al.

Nature. 2025 Jun;642(8067):451-457. doi: 10.1038/s41586-025-08869-4. Epub 2025 Apr 9.

Abstract

A comprehensive differential diagnosis is a cornerstone of medical care that is often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by large language models present new opportunities to assist and automate aspects of this process (ref. 1). Here we introduce the Articulate Medical Intelligence Explorer (AMIE), a large language model that is optimized for diagnostic reasoning, and evaluate its ability to generate a differential diagnosis alone or as an aid to clinicians. Twenty clinicians evaluated 302 challenging, real-world medical cases sourced from published case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: assistance from search engines and standard medical resources; or assistance from AMIE in addition to these tools. All clinicians provided a baseline, unassisted differential diagnosis prior to using the respective assistive tools. AMIE exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% versus 33.6%, P = 0.04). Comparing the two assisted study arms, the differential diagnosis quality score was higher for clinicians assisted by AMIE (top-10 accuracy 51.7%) compared with clinicians without its assistance (36.1%; McNemar's test: 45.7, P < 0.01) and clinicians with search (44.4%; McNemar's test: 4.75, P = 0.03). Further, clinicians assisted by AMIE arrived at more comprehensive differential lists than those without assistance from AMIE. Our study suggests that AMIE has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
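For illustration, the paired comparison above can be reproduced in outline with a short Python sketch (not the authors' analysis code; the per-case hit indicators below are simulated, and statsmodels' mcnemar is one standard implementation). Top-10 accuracy is the fraction of cases whose final diagnosis appears among the first ten entries of a DDx list; McNemar's test then compares the two arms using only the cases on which they disagree:

    # Illustrative sketch only: paired comparison of top-10 accuracy via
    # McNemar's test. The hit indicators are simulated, not the study's data.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    rng = np.random.default_rng(0)
    n_cases = 302

    # Hypothetical per-case indicators: True if the final diagnosis appeared
    # in the top 10 of that case's DDx list, for each assisted arm.
    amie_hit = rng.random(n_cases) < 0.517    # clinicians assisted by AMIE
    search_hit = rng.random(n_cases) < 0.444  # clinicians assisted by search

    print(f"top-10 accuracy, AMIE arm:   {amie_hit.mean():.3f}")
    print(f"top-10 accuracy, search arm: {search_hit.mean():.3f}")

    # 2x2 table of paired outcomes; the test depends only on the two
    # discordant cells (hit in one arm but not the other).
    table = np.array([
        [np.sum(amie_hit & search_hit), np.sum(amie_hit & ~search_hit)],
        [np.sum(~amie_hit & search_hit), np.sum(~amie_hit & ~search_hit)],
    ])
    result = mcnemar(table, exact=False, correction=True)
    print(f"McNemar chi-square = {result.statistic:.2f}, P = {result.pvalue:.3f}")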

Conflict of interest statement

Competing interests: This study was funded by Alphabet Inc. and/or a subsidiary thereof (‘Alphabet’). All authors are employees of Alphabet and may own stock as part of the standard compensation package.

Figures

Fig. 1
Fig. 1. Evaluation of the quality of DDx lists from generalist physicians.
a, DDx quality score based on the question: “How close did the differential diagnoses (DDx) come to including the final diagnosis?” b, DDx comprehensiveness score based on the question: “Using your DDx list as a benchmark/gold standard, how comprehensive are the differential lists from each of the experts?” c, DDx appropriateness score based on the question: “How appropriate was each of the DDx lists from the different medical experts compared to the differential list that you just produced?” The colours correspond to the study arms, and the shade of the colour corresponds to different levels on the rating scales. In all cases, AMIE and clinicians assisted by AMIE scored highest overall. Numbers reflect the number of cases (out of 302). Note that the clinicians had the option of answering “I am not sure” in response to these questions; they used this option in a very small number (less than 1%) of cases.
Fig. 2
Fig. 2. Top-n accuracy in DDx lists through human and automated evaluations.
The percentage of DDx lists that included the final diagnosis, assessed through human evaluation (left) or automated evaluation (right). Points reflect the mean; shaded areas show ±1 s.d. from the mean across 10 trials.
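Concretely, a DDx list scores a top-n hit when the final diagnosis appears among its first n entries. A minimal sketch of the metric follows, assuming hypothetical lists and using exact string matching as a stand-in for the study's human and LLM-based matching of diagnoses:

    # Minimal sketch of top-n accuracy over DDx lists. Exact string matching
    # stands in for the study's human/automated matching; data are made up.
    def top_n_accuracy(ddx_lists, final_dxs, n):
        """Fraction of cases whose final diagnosis is in the first n entries."""
        hits = sum(dx in ddx[:n] for ddx, dx in zip(ddx_lists, final_dxs))
        return hits / len(final_dxs)

    ddx_lists = [
        ["sarcoidosis", "lymphoma", "tuberculosis"],
        ["pneumonia", "pulmonary embolism", "heart failure"],
    ]
    final_dxs = ["lymphoma", "aortic dissection"]

    for n in (1, 3, 10):
        print(f"top-{n} accuracy: {top_n_accuracy(ddx_lists, final_dxs, n):.2f}")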
Fig. 3
Fig. 3. Sankey diagram showing effect of assistance.
a, In the AMIE arm, the final correct diagnosis appeared in the DDx list only after assistance in 73 cases. b, In the Search arm, the final correct diagnosis appeared in the DDx list only after assistance in 37 cases. In a small minority of cases in both arms (AMIE arm: 11 (a); Search arm: 12 (b)), the final diagnosis appeared in the DDx list before assistance but was not in the list after assistance.
Fig. 4
Fig. 4. Top-n accuracy in DDx lists from different LLMs.
Comparison of the percentage of DDx lists that included the final diagnosis for AMIE versus GPT-4 on 70 cases. We used Med-PaLM 2, GPT-4 and AMIE as automated raters; all yielded similar trends. Points reflect the mean; shaded areas show ±1 s.d. from the mean across 10 trials.
Extended Data Fig. 1
Extended Data Fig. 1. NEJM Clinicopathological Conference Case Reports.
The History of Present Illness, Admission Labs and Admission Imaging sections were included in the redacted version presented to generalist clinicians for producing a DDx. The LLM had access only to the History of Present Illness. Specialist clinicians evaluating the quality of the DDx had access to the full (unredacted) case report, including the expert differential discussion.
Extended Data Fig. 2
Extended Data Fig. 2. The AMIE User Interface.
The history of the present illness (text only) was pre-populated in the user interface (A) with an initial suggested prompt to query the LLM (B). Following this prompt and response, the user was free to enter any additional follow-up questions (C). The case shown in this figure is a mock case selected for illustrative purposes only.
Extended Data Fig. 3
Extended Data Fig. 3. Experimental Design.
To evaluate the LLM’s ability to generate DDx lists and aid clinicians with their DDx generation, we designed a two-stage reader study. First, clinicians with access only to the case presentation completed DDx lists without using any assistive tools. Second, the clinicians completed DDx lists with access either to search engines and other resources (Condition I) or to the LLM in addition to these tools (Condition II). Randomization was employed such that every case was reviewed by two different clinicians, one with LLM assistance and one without. In Condition II, the clinician was given a suggested initial prompt to use in the LLM interface and was then free to try any other questions. These DDx lists were then evaluated by a specialist who had access to the full case and expert commentary on the differential diagnosis, but who was blinded to whether and which assistive tool was used.
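One way to realize this counterbalancing is sketched below (a hypothetical assignment procedure, not the study's; in particular, it does not balance workload across the 20 clinicians): each case draws two distinct readers at random, one per condition.

    # Hypothetical sketch of the counterbalanced design: every case is read by
    # two different clinicians, one assigned to each condition.
    import random

    random.seed(42)
    clinicians = [f"clinician_{i:02d}" for i in range(20)]
    cases = [f"case_{i:03d}" for i in range(302)]

    assignments = []  # (case, clinician, condition)
    for case in cases:
        search_reader, llm_reader = random.sample(clinicians, 2)  # distinct readers
        assignments.append((case, search_reader, "Condition I: search + resources"))
        assignments.append((case, llm_reader, "Condition II: search + resources + LLM"))

    print(assignments[:2])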

References

    1. Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).
    2. Szolovits, P. & Pauker, S. G. Categorical and probabilistic reasoning in medical diagnosis. Artif. Intell. 11, 115–144 (1978).
    3. Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26, 900–908 (2020).
    4. Rauschecker, A. M. et al. Artificial intelligence system approaching neuroradiologist-level differential diagnosis accuracy at brain MRI. Radiology 295, 626–637 (2020).
    5. Balas, M. & Ing, E. B. Conversational AI models for ophthalmic diagnosis: comparison of ChatGPT and the Isabel pro differential diagnosis generator. JFO Op. Ophthalmol. 1, 100005 (2023).
