Nat Commun. 2024 Mar 6;15(1):2050.
doi: 10.1038/s41467-024-46411-8.

Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks

Sarah Sandmann et al. Nat Commun. 2024.

Abstract

It is likely that individuals are turning to Large Language Models (LLMs) to seek health advice, much like searching for diagnoses on Google. We evaluate clinical accuracy of GPT-3·5 and GPT-4 for suggesting initial diagnosis, examination steps and treatment of 110 medical cases across diverse clinical disciplines. Moreover, two model configurations of the Llama 2 open source LLMs are assessed in a sub-study. For benchmarking the diagnostic task, we conduct a naïve Google search for comparison. Overall, GPT-4 performed best with superior performances over GPT-3·5 considering diagnosis and examination and superior performance over Google for diagnosis. Except for treatment, better performance on frequent vs rare diseases is evident for all three approaches. The sub-study indicates slightly lower performances for Llama models. In conclusion, the commercial LLMs show growing potential for medical question answering in two successive major releases. However, some weaknesses underscore the need for robust and regulated AI models in health care. Open source LLMs can be a viable option to address specific needs regarding data privacy and transparency of training.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Performance comparison of GPT-3·5 vs GPT-4 vs Google.
a Performance of GPT-3·5 vs GPT-4 vs Google for diagnosis. b Performance of GPT-3·5 vs GPT-4 for examination (exact adjusted p-value: p = 3.2241 × 10^−6). c Performance of GPT-3·5 vs GPT-4 for treatment. Bubble plots show the pairwise comparison of two approaches. Cumulative frequency plots show the cumulative number of cases (Y-axis) and their accuracy scores (X-axis) for each disease frequency subgroup (light blue: rare, intermediate blue: less frequent, dark blue: frequent). A one-sided Mann-Whitney test was applied for statistical testing, adjusted with Bonferroni correction for multiple testing (n = 12 tests for diagnosis, n = 7 tests each for examination and treatment).
Fig. 2. Performance comparison of GPT-3·5 vs GPT-4 vs Ll2-7B vs Ll2-70B considering top-3 and bottom-3 cases.
Black dots mark the top-3 cases based on GPT-4’s cumulative score for rare, less frequent and frequent diseases. Red dots mark the bottom-3 cases. Violin plots visualize the performance of GPT-3·5 and GPT-4 for all n = 110 cases. Ll2-7B: Llama-2-7b-chat; Ll2-70B: Llama-2-70b-chat.
Fig. 3. Overview of process steps.
Cases from clinical case books were filtered and processed to generate patient queries for GPT-3·5 and GPT-4. The answers on suspected diagnosis, examination and treatment options were evaluated by two independent physicians and rated on a 5-point Likert scale.
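
The statistical procedure named in the Fig. 1 caption (a one-sided Mann-Whitney test with Bonferroni correction) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the use of the normal approximation with tie and continuity corrections, and the example ratings are all assumptions introduced here. The sample scores are invented purely for demonstration and are not data from the study.

```python
import math
from collections import Counter


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test (H1: x tends to be larger than y).

    Assumption of this sketch: normal approximation with tie and
    continuity corrections. Returns (U statistic for x, approximate p).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2

    # Midranks: tied values share the average of the positions they occupy.
    counts = Counter(list(x) + list(y))
    rank_of = {}
    position = 1
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = position + (t - 1) / 2
        position += t

    # U statistic for sample x, derived from its rank sum.
    u_x = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2

    # Mean and tie-corrected variance of U under the null hypothesis.
    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u <= 0:  # all observations identical: no evidence either way
        return u_x, 1.0

    # Upper-tail probability of the standard normal distribution.
    z = (u_x - mean_u - 0.5) / math.sqrt(var_u)
    return u_x, 0.5 * math.erfc(z / math.sqrt(2))


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical 5-point Likert ratings for two approaches (invented values).
ratings_a = [5, 5, 4, 5, 4]
ratings_b = [1, 2, 1, 2, 3]

u_stat, p_raw = mann_whitney_greater(ratings_a, ratings_b)
p_adj = bonferroni(p_raw, n_tests=7)  # n = 7 as stated for examination/treatment
print(f"U = {u_stat}, raw p = {p_raw:.4g}, adjusted p = {p_adj:.4g}")
```

With only a handful of ratings per group, an exact test would be preferable to the normal approximation; the approximation is used here only to keep the sketch dependency-free.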
