Towards accurate differential diagnosis with large language models

Daniel McDuff et al.

Nature. 2025 Jun;642(8067):451-457. doi: 10.1038/s41586-025-08869-4. Epub 2025 Apr 9.

Abstract

A comprehensive differential diagnosis is a cornerstone of medical care that is often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by large language models present new opportunities to assist and automate aspects of this process (ref. 1). Here we introduce the Articulate Medical Intelligence Explorer (AMIE), a large language model that is optimized for diagnostic reasoning, and evaluate its ability to generate a differential diagnosis alone or as an aid to clinicians. Twenty clinicians evaluated 302 challenging, real-world medical cases sourced from published case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: assistance from search engines and standard medical resources; or assistance from AMIE in addition to these tools. All clinicians provided a baseline, unassisted differential diagnosis prior to using the respective assistive tools. AMIE exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% versus 33.6%, P = 0.04). Comparing the two assisted study arms, the differential diagnosis quality score was higher for clinicians assisted by AMIE (top-10 accuracy 51.7%) compared with clinicians without its assistance (36.1%; McNemar's test: 45.7, P < 0.01) and clinicians with search (44.4%; McNemar's test: 4.75, P = 0.03). Further, clinicians assisted by AMIE arrived at more comprehensive differential lists than those without assistance from AMIE. Our study suggests that AMIE has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
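For illustration, the paired comparison above can be reproduced in outline with a short Python sketch (not the authors' analysis code; the per-case hit indicators below are simulated, and statsmodels' mcnemar is one standard implementation). Top-10 accuracy is the fraction of cases whose final diagnosis appears among the first ten entries of a DDx list; McNemar's test then compares the two arms using only the cases on which they disagree:

    # Illustrative sketch only: paired comparison of top-10 accuracy via
    # McNemar's test. The hit indicators are simulated, not the study's data.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    rng = np.random.default_rng(0)
    n_cases = 302

    # Hypothetical per-case indicators: True if the final diagnosis appeared
    # in the top 10 of that case's DDx list, for each assisted arm.
    amie_hit = rng.random(n_cases) < 0.517    # clinicians assisted by AMIE
    search_hit = rng.random(n_cases) < 0.444  # clinicians assisted by search

    print(f"top-10 accuracy, AMIE arm:   {amie_hit.mean():.3f}")
    print(f"top-10 accuracy, search arm: {search_hit.mean():.3f}")

    # 2x2 table of paired outcomes; the test depends only on the two
    # discordant cells (hit in one arm but not the other).
    table = np.array([
        [np.sum(amie_hit & search_hit), np.sum(amie_hit & ~search_hit)],
        [np.sum(~amie_hit & search_hit), np.sum(~amie_hit & ~search_hit)],
    ])
    result = mcnemar(table, exact=False, correction=True)
    print(f"McNemar chi-square = {result.statistic:.2f}, P = {result.pvalue:.3f}")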

Conflict of interest statement

Competing interests: This study was funded by Alphabet Inc. and/or a subsidiary thereof (‘Alphabet’). All authors are employees of Alphabet and may own stock as part of the standard compensation package.

Figures

Fig. 1
Fig. 1. Evaluation of the quality of DDx lists from generalist physicians.
a, DDx quality score based on the question: “How close did the differential diagnoses (DDx) come to including the final diagnosis?” b, DDx comprehensiveness score based on the question: “Using your DDx list as a benchmark/gold standard, how comprehensive are the differential lists from each of the experts?” c, DDx appropriateness score based on the question: “How appropriate was each of the DDx lists from the different medical experts compared to the differential list that you just produced?” The colours correspond to the study arms, and the shade of the colour corresponds to different levels on the rating scales. In all cases, AMIE and clinicians assisted by AMIE scored highest overall. Numbers reflect the number of cases (out of 302). Note that the clinicians had the option of answering “I am not sure” in response to these questions; they used this option in a very small number (less than 1%) of cases.
Fig. 2
Fig. 2. Top-n accuracy in DDx lists through human and automated evaluations.
The percentage of DDx lists that included the final diagnosis, assessed through human evaluation (left) or automated evaluation (right). Points reflect the mean; shaded areas show ±1 s.d. from the mean across 10 trials.
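Concretely, a DDx list scores a top-n hit when the final diagnosis appears among its first n entries. A minimal sketch of the metric follows, assuming hypothetical lists and using exact string matching as a stand-in for the study's human and LLM-based matching of diagnoses:

    # Minimal sketch of top-n accuracy over DDx lists. Exact string matching
    # stands in for the study's human/automated matching; data are made up.
    def top_n_accuracy(ddx_lists, final_dxs, n):
        """Fraction of cases whose final diagnosis is in the first n entries."""
        hits = sum(dx in ddx[:n] for ddx, dx in zip(ddx_lists, final_dxs))
        return hits / len(final_dxs)

    ddx_lists = [
        ["sarcoidosis", "lymphoma", "tuberculosis"],
        ["pneumonia", "pulmonary embolism", "heart failure"],
    ]
    final_dxs = ["lymphoma", "aortic dissection"]

    for n in (1, 3, 10):
        print(f"top-{n} accuracy: {top_n_accuracy(ddx_lists, final_dxs, n):.2f}")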
Fig. 3
Fig. 3. Sankey diagram showing effect of assistance.
a, In the AMIE arm, the final correct diagnosis appeared in the DDx list only after assistance in 73 cases. b, In the Search arm, the final correct diagnosis appeared in the DDx list only after assistance in 37 cases. In a small minority of cases in both arms (AMIE arm: 11 (a); Search arm: 12 (b)), the final diagnosis appeared in the DDx list before assistance but was not in the list after assistance.
Fig. 4
Fig. 4. Top-n accuracy in DDx lists from different LLMs.
Comparison of the percentage of DDx lists that included the final diagnosis for AMIE versus GPT-4 on 70 cases. We used Med-PaLM 2, GPT-4 and AMIE as automated raters; all yielded similar trends. Points reflect the mean; shaded areas show ±1 s.d. from the mean across 10 trials.
Extended Data Fig. 1
Extended Data Fig. 1. NEJM Clinicopathological Conference Case Reports.
The History of Present Illness, Admission Labs and Admission Imaging sections were included in the redacted version presented to generalist clinicians for producing a DDx. The LLM had access only to the History of Present Illness. Specialist clinicians evaluating the quality of the DDx had access to the full (unredacted) case report, including the expert differential discussion.
Extended Data Fig. 2
Extended Data Fig. 2. The AMIE User Interface.
The history of the present illness (text only) was pre-populated in the user interface (A) with an initial suggested prompt to query the LLM (B). Following this prompt and response, the user was free to enter any additional follow-up questions (C). The case shown in this figure is a mock case selected for illustrative purposes only.
Extended Data Fig. 3
Extended Data Fig. 3. Experimental Design.
To evaluate the LLM’s ability to generate DDx lists and aid clinicians with their DDx generation, we designed a two-stage reader study. First, clinicians with access only to the case presentation completed DDx lists without using any assistive tools. Second, the clinicians completed DDx lists with access either to search engines and other resources (Condition I) or to the LLM in addition to these tools (Condition II). Randomization was employed such that every case was reviewed by two different clinicians, one with LLM assistance and one without. In Condition II, the clinician was given a suggested initial prompt to use in the LLM interface and was then free to try any other questions. These DDx lists were then evaluated by a specialist who had access to the full case and expert commentary on the differential diagnosis, but who was blinded to whether and which assistive tool was used.
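One way to realize this counterbalancing is sketched below (a hypothetical assignment procedure, not the study's; in particular, it does not balance workload across the 20 clinicians): each case draws two distinct readers at random, one per condition.

    # Hypothetical sketch of the counterbalanced design: every case is read by
    # two different clinicians, one assigned to each condition.
    import random

    random.seed(42)
    clinicians = [f"clinician_{i:02d}" for i in range(20)]
    cases = [f"case_{i:03d}" for i in range(302)]

    assignments = []  # (case, clinician, condition)
    for case in cases:
        search_reader, llm_reader = random.sample(clinicians, 2)  # distinct readers
        assignments.append((case, search_reader, "Condition I: search + resources"))
        assignments.append((case, llm_reader, "Condition II: search + resources + LLM"))

    print(assignments[:2])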

References

    1. Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).
    2. Szolovits, P. & Pauker, S. G. Categorical and probabilistic reasoning in medical diagnosis. Artif. Intell. 11, 115–144 (1978).
    3. Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26, 900–908 (2020).
    4. Rauschecker, A. M. et al. Artificial intelligence system approaching neuroradiologist-level differential diagnosis accuracy at brain MRI. Radiology 295, 626–637 (2020).
    5. Balas, M. & Ing, E. B. Conversational AI models for ophthalmic diagnosis: comparison of ChatGPT and the Isabel pro differential diagnosis generator. JFO Op. Ophthalmol. 1, 100005 (2023).
