Comparative Study
Nature. 2023 Aug;620(7972):172-180.
doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.

Large language models encode clinical knowledge

Karan Singhal et al. Nature. 2023 Aug.

Erratum in

  • Publisher Correction: Large language models encode clinical knowledge.
    Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Agüera y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Nature. 2023 Aug;620(7973):E19. doi: 10.1038/s41586-023-06455-0. PMID: 37500979.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model [1] (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM [2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA [3], MedMCQA [4], PubMedQA [5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics [6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
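Instruction prompt tuning, as described above, trains only a small set of soft prompt vectors that are prepended to the input of the otherwise frozen model. The sketch below illustrates that parameter-efficiency idea only and is not the authors' implementation; the toy FrozenLM class, the dimensions and the exemplar data are all made-up assumptions.

```python
# Minimal sketch of soft (instruction) prompt tuning: only the prepended
# prompt vectors are trained; the "model" weights stay frozen.
# FrozenLM, the sizes and the exemplars are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, PROMPT_LEN, LR = 16, 4, 0.1

class FrozenLM:
    """Stand-in for a frozen LLM: a fixed linear scorer over mean-pooled embeddings."""
    def __init__(self):
        self.w = rng.normal(size=EMBED_DIM)   # frozen; never updated below

    def score(self, embeddings):
        return float(self.w @ embeddings.mean(axis=0))

# Toy "exemplars": input embeddings plus a scalar target standing in for the desired answer.
exemplars = [(rng.normal(size=(8, EMBED_DIM)), rng.normal()) for _ in range(5)]

model = FrozenLM()
soft_prompt = 0.01 * rng.normal(size=(PROMPT_LEN, EMBED_DIM))  # the only trainable parameters

for step in range(200):
    for x, y in exemplars:
        seq = np.vstack([soft_prompt, x])      # prepend the soft prompt to the input
        err = model.score(seq) - y             # squared-error loss on this exemplar
        # Gradient of err**2 w.r.t. each soft-prompt row is 2 * err * w / len(seq).
        soft_prompt -= LR * 2 * err * model.w / len(seq)

print("trained soft-prompt norm:", round(float(np.linalg.norm(soft_prompt)), 3))
```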

Conflict of interest statement

This study was funded by Alphabet Inc. and/or a subsidiary thereof (Alphabet). K.S., S.A., T.T., V.N., A.K., S.S.M., C.S., J.W., H.W.C., N. Scales, A.T., H.C.-L., S.P., P.P., M.S., P.G., C.K., A.B., N. Schärli, A.C., P.M., B.A.A., D.W., G.S.C., Y.M., K.C., J.G., A.R., N.T., J.B. and Y.L. are employees of Alphabet and may own stock as part of the standard compensation package. D.D.-F. is affiliated with the US National Library of Medicine.

Figures

Fig. 1
Fig. 1. Overview of our contributions.
We curate MultiMedQA, a benchmark for answering medical questions spanning medical exams, medical research and consumer medical questions. We evaluate PaLM and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM exceeds state-of-the-art performance on MedQA (US Medical Licensing Examination (USMLE)), MedMCQA, PubMedQA and MMLU clinical topics. In particular, it improves over the previous state of the art on MedQA (USMLE) by over 17%. We next propose instruction prompt tuning to further align Flan-PaLM to the medical domain, producing Med-PaLM. Med-PaLM’s answers to consumer medical questions compare favourably with answers given by clinicians under our human evaluation framework, demonstrating the effectiveness of instruction prompt tuning.
Fig. 2
Fig. 2. Comparison of our method and prior state of the art.
Our Flan-PaLM 540B model exceeds the previous state-of-the-art performance (SOTA) on MedQA (four options), MedMCQA and PubMedQA datasets. The previous state-of-the-art results are from Galactica (MedMCQA), PubMedGPT (MedQA) and BioGPT (PubMedQA). The percentage accuracy is shown above each column.
Fig. 3
Fig. 3. Selective prediction analysis.
Analysis of the deferral behaviour of the Flan-PaLM 540B model with self-consistency. We observe that as the model defers more frequently, using an uncertainty threshold based on self-consistency, it becomes increasingly accurate on the questions on which it does not defer.
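As an illustration of the deferral rule described in this analysis (not the exact procedure used in the paper), the sketch below defers whenever the fraction of self-consistency samples agreeing on the top answer falls below a threshold; the sampled answers and the 0.6 threshold are made-up values.

```python
# Hypothetical self-consistency deferral: answer only when enough sampled
# answers agree, otherwise defer. Samples and threshold are illustrative.
from collections import Counter

def self_consistency_decision(sampled_answers, defer_threshold=0.6):
    """Return (majority answer, defer flag); defer when agreement < threshold."""
    votes = Counter(sampled_answers)
    top_answer, top_count = votes.most_common(1)[0]
    agreement = top_count / len(sampled_answers)
    return top_answer, agreement < defer_threshold

# Example: 11 sampled answers to one multiple-choice question.
samples = ["B", "B", "B", "C", "B", "B", "A", "B", "B", "C", "B"]
answer, defer = self_consistency_decision(samples)
print(answer, "-> defer" if defer else "-> answer directly")
```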
Fig. 4
Fig. 4. Clinician evaluation of answers.
a–f, Clinicians were asked to rate answers to questions in the HealthSearchQA, LiveQA and MedicationQA datasets for agreement with scientific and clinical consensus (a), the presence of incorrect content (b), the omission of content (c), the extent of possible harm (d), the likelihood of harm (e) and possible bias in answers (f). We compare answers from Flan-PaLM, Med-PaLM and clinicians. Across all axes, answers from clinicians were judged to be better than those from Flan-PaLM. Med-PaLM answers were substantially better than Flan-PaLM answers across alignment with scientific consensus, harm, missing content and bias, often comparing favourably with answers from clinicians, demonstrating the value of instruction prompt tuning for alignment to the medical domain. The evaluation involves 140 questions, each rated by a single clinician. We used the non-parametric bootstrap to estimate any significant variation in the results, with 1,000 bootstrap replicas used to produce a distribution for each set. We used the 95% bootstrap percentile interval to assess variations. Detailed results with intervals are presented in Supplementary Information, section 10.
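The legends above and below describe a non-parametric bootstrap with 1,000 replicas and 95% percentile intervals; the following is a minimal sketch of that interval computation on synthetic ratings (the real per-question ratings are not reproduced here).

```python
# Sketch of a 95% non-parametric bootstrap percentile interval (1,000 replicas)
# for a mean rating over 140 questions; the ratings below are synthetic.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=140)           # toy binary ratings, one per question

replicas = [rng.choice(ratings, size=ratings.size, replace=True).mean()
            for _ in range(1000)]                # resample with replacement
low, high = np.percentile(replicas, [2.5, 97.5])
print(f"observed mean {ratings.mean():.3f}, 95% bootstrap interval [{low:.3f}, {high:.3f}]")
```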
Fig. 5
Fig. 5. Evaluation of comprehension, retrieval and reasoning capabilities by clinicians.
a,b, Evaluation of correctness (a) and incorrectness (b) of reading comprehension, recall of knowledge and reasoning steps. The results indicate a gap between Flan-PaLM and clinicians, and show that Med-PaLM is able to substantially reduce the gap. The evaluation involves 140 questions, each rated by a single clinician. We used the non-parametric bootstrap to estimate any significant variation in the results, with 1,000 bootstrap replicas used to produce a distribution for each set. We used the 95% bootstrap percentile interval to assess variations.
Fig. 6
Fig. 6. Lay user assessment of answers.
a,b, Lay user assessment of answers, addressing relevance to the intent of the query (a) and helpfulness (b). Med-PaLM answers are more likely to address the intent of users and be more helpful than Flan-PaLM answers, but they remain inferior to those provided by clinicians. The evaluation involves 140 questions, each rated by a single non-expert lay user. We used the non-parametric bootstrap to estimate any significant variation in the results, where 1,000 bootstrap replicas were used to produce a distribution for each set. We used the 95% bootstrap percentile interval to assess variations.
Extended Data Fig. 1
Extended Data Fig. 1. Instruction prompt tuning for Med-PaLM.
We collect instructions and exemplars from a panel of qualified clinicians for each of the consumer medical question answering datasets and use them to instruction prompt tune Flan-PaLM. Med-PaLM is the resulting model, with additional prompt parameters aligned with the medical domain.
Extended Data Fig. 2
Extended Data Fig. 2. Comparison of SOTA LLMs on MMLU clinical topics.
Flan-PaLM achieves state-of-the-art performance on MMLU clinical topics.

References

    1. Chowdhery, A. et al. PaLM: scaling language modeling with pathways. Preprint at https://doi.org/10.48550/arXiv.2204.02311 (2022).
    2. Chung, H. W. et al. Scaling instruction-finetuned language models. Preprint at https://doi.org/10.48550/arXiv.2210.11416 (2022).
    3. Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
    4. Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning 248–260 (Proceedings of Machine Learning Research, 2022).
    5. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. Preprint at https://doi.org/10.48550/arXiv.1909.06146 (2019).
