Patterns (N Y). 2024 Mar 1;5(3):100943. doi: 10.1016/j.patter.2024.100943. eCollection 2024 Mar 8.

Can large language models reason about medical questions?

Valentin Liévin et al. Patterns (N Y). 2024.

Abstract

Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.

Keywords: GPT-3.5; Llama 2; MedQA; large language models; machine learning; medical; open source; prompt engineering; question answering; uncertainty quantification.


Figures

Graphical abstract
Figure 1
Answering a USMLE (US Medical Licensing Examination) question using zero-shot CoT prompting (“Let’s think step by step”) and InstructGPT. Selected example.
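Zero-shot CoT prompting of this kind is typically implemented as a two-stage completion: one call elicits the reasoning with the “Let’s think step by step” cue, and a second call extracts the final option letter conditioned on that reasoning. The Python sketch below illustrates the general shape under that assumption; the complete() helper, the prompt wording, and the answer-extraction cue are placeholders, not the paper’s exact templates (those are given in Figure 2).

def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call (e.g., to InstructGPT); not a real API."""
    raise NotImplementedError

def zero_shot_cot_answer(question: str, options: dict) -> str:
    # Format the multiple-choice question as a USMLE-style prompt.
    options_block = "\n".join(f"{key}) {text}" for key, text in options.items())
    base = f"Question: {question}\n{options_block}\nAnswer: Let's think step by step. "

    # Stage 1: elicit a chain of thought with the zero-shot CoT cue.
    reasoning = complete(base)

    # Stage 2: extract a single option letter, conditioned on the generated CoT.
    extraction_prompt = base + reasoning + "\nTherefore, among A through D, the answer is "
    return complete(extraction_prompt).strip()[:1]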
Figure 2
Prompt templates. In the table, we use typewriter style and brackets to represent [provided data], such as the question, additional context, or the answer, as well as text generated by GPT-3. A dedicated symbol represents an empty string.
Figure 3
Generative process and answer likelihood (ensemble model, i.e., self-consistency)
Figure 4
Frequencies of USMLE answers and InstructGPT (text-davinci-002) predictions for direct and CoT prompts (no grounding, zero-shot)
Figure 5
Sampling and combining multiple CoTs. Answering accuracy of Codex 5-shot CoT (code-davinci-002) on the USMLE (test), MedMCQA (validation), and PubMedQA (test) datasets for 100 CoTs sampled with temperature τ ∈ {0, 0.5}. We report the average accuracy for ensemble models evaluated using random subsets of k = 1–100 CoTs. We report the mean and standard deviation. We display the performances of the best fine-tuned methods along with the lower human baselines.
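Figures 3 and 5 describe an ensemble (self-consistency) built by sampling many CoTs and using the frequency of each extracted answer as its likelihood. A minimal sketch of that aggregation in Python, assuming a hypothetical sample_cot_answer() helper that runs one sampled CoT completion (e.g., 5-shot CoT at a nonzero temperature) and returns the extracted option letter:

from collections import Counter

def sample_cot_answer(question: str, temperature: float) -> str:
    """Hypothetical helper: one sampled CoT completion -> extracted option letter."""
    raise NotImplementedError

def self_consistency(question: str, options: list, k: int = 100, temperature: float = 0.5):
    # Sample k chains of thought and tally the extracted answers.
    votes = Counter(sample_cot_answer(question, temperature) for _ in range(k))
    # Empirical answer frequencies serve as the ensemble's predictive distribution.
    probs = {opt: votes.get(opt, 0) / k for opt in options}
    prediction = max(probs, key=probs.get)
    return prediction, probs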
Figure 6
Uncertainty quantification. First row: distribution of the probability assigned to the correct label for correct predictions and incorrect predictions (see Equation 1). Second row: calibration plot. The probabilities are obtained using Codex 5-shot CoT and an ensemble of k = 100 predictions sampled with temperature τ = 0.5.
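The calibration plot in the second row can be read as a reliability diagram: predictions are bucketed by the probability assigned to the chosen answer, and each bucket's mean confidence is compared with its observed accuracy. A short sketch of that computation, assuming per-question arrays of confidences and correctness flags; the equal-width binning scheme here is an assumption, not necessarily the paper's exact procedure:

import numpy as np

def reliability_bins(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10):
    # Assign each prediction to an equal-width confidence bucket over [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1])  # indices 0 .. n_bins - 1
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # (mean confidence, empirical accuracy, count) for each non-empty bucket.
            rows.append((confidences[mask].mean(), correct[mask].mean(), int(mask.sum())))
    return rows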
Figure 7
Comparing open-source LLMs against the closed-source Codex on the MedQA-USMLE benchmark (τ = 0.9, up to k = 100 samples). We report answering accuracy, model calibration, and answering bias.
Figure 8
MedQA-USMLE accuracy vs. model size. All experiments were performed using a 5-shot CoT prompting strategy and greedy decoding (τ = 0). Llama 2 70B outperforms Codex 175B (proprietary).
Figure 9
(Sample 1) Generated zero-shot CoT from InstructGPT (text-davinci-002) for three CoT prompts on a sample from the MedQA-USMLE test set.
