Comparative Study
Nature. 2023 Aug;620(7972):172-180.
doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.

Large language models encode clinical knowledge

Karan Singhal et al. Nature. 2023 Aug.

Erratum in

  • Publisher Correction: Large language models encode clinical knowledge.
    Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Agüera y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Nature. 2023 Aug;620(7973):E19. doi: 10.1038/s41586-023-06455-0. PMID: 37500979.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model [1] (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM [2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA [3], MedMCQA [4], PubMedQA [5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics [6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
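Instruction prompt tuning, as described above, trains only a small set of soft prompt vectors that are prepended to the input of the otherwise frozen model. The sketch below illustrates that parameter-efficiency idea only and is not the authors' implementation; the toy FrozenLM class, the dimensions and the exemplar data are all made-up assumptions.

```python
# Minimal sketch of soft (instruction) prompt tuning: only the prepended
# prompt vectors are trained; the "model" weights stay frozen.
# FrozenLM, the sizes and the exemplars are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, PROMPT_LEN, LR = 16, 4, 0.1

class FrozenLM:
    """Stand-in for a frozen LLM: a fixed linear scorer over mean-pooled embeddings."""
    def __init__(self):
        self.w = rng.normal(size=EMBED_DIM)   # frozen; never updated below

    def score(self, embeddings):
        return float(self.w @ embeddings.mean(axis=0))

# Toy "exemplars": input embeddings plus a scalar target standing in for the desired answer.
exemplars = [(rng.normal(size=(8, EMBED_DIM)), rng.normal()) for _ in range(5)]

model = FrozenLM()
soft_prompt = 0.01 * rng.normal(size=(PROMPT_LEN, EMBED_DIM))  # the only trainable parameters

for step in range(200):
    for x, y in exemplars:
        seq = np.vstack([soft_prompt, x])      # prepend the soft prompt to the input
        err = model.score(seq) - y             # squared-error loss on this exemplar
        # Gradient of err**2 w.r.t. each soft-prompt row is 2 * err * w / len(seq).
        soft_prompt -= LR * 2 * err * model.w / len(seq)

print("trained soft-prompt norm:", round(float(np.linalg.norm(soft_prompt)), 3))
```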

Conflict of interest statement

This study was funded by Alphabet Inc. and/or a subsidiary thereof (Alphabet). K.S., S.A., T.T., V.N., A.K., S.S.M., C.S., J.W., H.W.C., N. Scales, A.T., H.C.-L., S.P., P.P., M.S., P.G., C.K., A.B., N. Schärli, A.C., P.M., B.A.A., D.W., G.S.C., Y.M., K.C., J.G., A.R., N.T., J.B. and Y.L. are employees of Alphabet and may own stock as part of the standard compensation package. D.D.-F. is affiliated with the US National Library of Medicine.

Figures

Fig. 1
Fig. 1. Overview of our contributions.
We curate MultiMedQA, a benchmark for answering medical questions spanning medical exams, medical research and consumer medical questions. We evaluate PaLM and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM exceeds state-of-the-art performance on MedQA (US Medical Licensing Examination (USMLE)), MedMCQA, PubMedQA and MMLU clinical topics. In particular, it improves over the previous state of the art on MedQA (USMLE) by over 17%. We next propose instruction prompt tuning to further align Flan-PaLM to the medical domain, producing Med-PaLM. Med-PaLM’s answers to consumer medical questions compare favourably with answers given by clinicians under our human evaluation framework, demonstrating the effectiveness of instruction prompt tuning.
Fig. 2
Fig. 2. Comparison of our method and prior state of the art.
Our Flan-PaLM 540B model exceeds the previous state-of-the-art performance (SOTA) on MedQA (four options), MedMCQA and PubMedQA datasets. The previous state-of-the-art results are from Galactica (MedMCQA), PubMedGPT (MedQA) and BioGPT (PubMedQA). The percentage accuracy is shown above each column.
Fig. 3
Fig. 3. Selective prediction analysis.
Analysis of the deferral behaviour of the Flan-PaLM 540B model with self-consistency. We observe that as the model defers more frequently, using an uncertainty threshold based on self-consistency, it becomes increasingly accurate on the questions on which it does not defer.
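As an illustration of the deferral rule described in this analysis (not the exact procedure used in the paper), the sketch below defers whenever the fraction of self-consistency samples agreeing on the top answer falls below a threshold; the sampled answers and the 0.6 threshold are made-up values.

```python
# Hypothetical self-consistency deferral: answer only when enough sampled
# answers agree, otherwise defer. Samples and threshold are illustrative.
from collections import Counter

def self_consistency_decision(sampled_answers, defer_threshold=0.6):
    """Return (majority answer, defer flag); defer when agreement < threshold."""
    votes = Counter(sampled_answers)
    top_answer, top_count = votes.most_common(1)[0]
    agreement = top_count / len(sampled_answers)
    return top_answer, agreement < defer_threshold

# Example: 11 sampled answers to one multiple-choice question.
samples = ["B", "B", "B", "C", "B", "B", "A", "B", "B", "C", "B"]
answer, defer = self_consistency_decision(samples)
print(answer, "-> defer" if defer else "-> answer directly")
```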
Fig. 4
Fig. 4. Clinician evaluation of answers.
a–f, Clinicians were asked to rate answers to questions in the HealthSearchQA, LiveQA and MedicationQA datasets for agreement with scientific and clinical consensus (a), the presence of incorrect content (b), the omission of content (c), the extent of possible harm (d), the likelihood of harm (e) and possible bias in answers (f). We compare answers from Flan-PaLM, Med-PaLM and clinicians. Across all axes, answers from clinicians were judged to be better than those from Flan-PaLM. Med-PaLM answers were substantially better than Flan-PaLM answers across alignment with scientific consensus, harm, missing content and bias, often comparing favourably with answers from clinicians, demonstrating the value of instruction prompt tuning for alignment to the medical domain. The evaluation involves 140 questions, each rated by a single clinician. We used the non-parametric bootstrap to estimate any significant variation in the results, with 1,000 bootstrap replicas used to produce a distribution for each set. We used the 95% bootstrap percentile interval to assess variations. Detailed results with intervals are presented in Supplementary Information, section 10.
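The legends above and below describe a non-parametric bootstrap with 1,000 replicas and 95% percentile intervals; the following is a minimal sketch of that interval computation on synthetic ratings (the real per-question ratings are not reproduced here).

```python
# Sketch of a 95% non-parametric bootstrap percentile interval (1,000 replicas)
# for a mean rating over 140 questions; the ratings below are synthetic.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=140)           # toy binary ratings, one per question

replicas = [rng.choice(ratings, size=ratings.size, replace=True).mean()
            for _ in range(1000)]                # resample with replacement
low, high = np.percentile(replicas, [2.5, 97.5])
print(f"observed mean {ratings.mean():.3f}, 95% bootstrap interval [{low:.3f}, {high:.3f}]")
```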
Fig. 5
Fig. 5. Evaluation of comprehension, retrieval and reasoning capabilities by clinicians.
a,b, Evaluation of correctness (a) and incorrectness (b) of reading comprehension, recall of knowledge and reasoning steps. The results indicate a gap between Flan-PaLM and clinicians, and show that Med-PaLM is able to substantially reduce the gap. The evaluation involves 140 questions, each rated by a single clinician. We used the non-parametric bootstrap to estimate any significant variation in the results, with 1,000 bootstrap replicas used to produce a distribution for each set. We used the 95% bootstrap percentile interval to assess variations.
Fig. 6
Fig. 6. Lay user assessment of answers.
a,b, Lay user assessment of answers, addressing relevance to the intent of the query (a) and helpfulness (b). Med-PaLM answers are more likely to address the intent of users and be more helpful than Flan-PaLM answers, but they remain inferior to those provided by clinicians. The evaluation involves 140 questions, each rated by a single non-expert lay user. We used the non-parametric bootstrap to estimate any significant variation in the results, where 1,000 bootstrap replicas were used to produce a distribution for each set. We used the 95% bootstrap percentile interval to assess variations.
Extended Data Fig. 1
Extended Data Fig. 1. Instruction prompt tuning for Med-PaLM.
We collect instructions and exemplars from a panel of qualified clinicians for each of the consumer medical question answering datasets and use them to instruction prompt tune Flan-PaLM. Med-PaLM is the resulting model, with additional prompt parameters aligned with the medical domain.
Extended Data Fig. 2
Extended Data Fig. 2. Comparison of SOTA LLMs on MMLU clinical topics.
Flan-PaLM achieves state-of-the-art performance on MMLU clinical topics.

References

    1. Chowdhery, A. et al. PaLM: scaling language modeling with pathways. Preprint at https://doi.org/10.48550/arXiv.2204.02311 (2022).
    2. Chung, H. W. et al. Scaling instruction-finetuned language models. Preprint at https://doi.org/10.48550/arXiv.2210.11416 (2022).
    3. Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
    4. Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning 248–260 (Proceedings of Machine Learning Research, 2022).
    5. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. Preprint at https://doi.org/10.48550/arXiv.1909.06146 (2019).
