Performance of Multimodal Artificial Intelligence Chatbots Evaluated on Clinical Oncology Cases

David Chen et al. JAMA Netw Open. 2024 Oct 1;7(10):e2437711. doi:10.1001/jamanetworkopen.2024.37711

Abstract

Importance: Multimodal artificial intelligence (AI) chatbots can process complex medical image and text-based information that may improve their accuracy as a clinical diagnostic and management tool compared with unimodal, text-only AI chatbots. However, the difference in medical accuracy of multimodal and text-only chatbots in addressing questions about clinical oncology cases remains to be tested.

Objective: To evaluate the utility of prompt engineering (zero-shot chain-of-thought) and compare the competency of multimodal and unimodal AI chatbots to generate medically accurate responses to questions about clinical oncology cases.
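As a rough illustration of the zero-shot chain-of-thought prompting strategy named above, the sketch below shows one way such a prompt could be assembled in Python; the wording, function name, and message structure are illustrative assumptions, not the study's actual prompt.

# Hypothetical sketch of a zero-shot chain-of-thought prompt for an oncology case question.
def build_zero_shot_cot_prompt(case_text: str, question: str, choices: list[str]) -> str:
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCDE", choices))
    return (
        f"Clinical case:\n{case_text}\n\n"
        f"Question: {question}\n"
        f"Answer choices:\n{options}\n\n"
        # The closing cue is the standard zero-shot chain-of-thought trigger phrase.
        "Let's think step by step before selecting the single best answer."
    )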

Design, setting, and participants: This cross-sectional study benchmarked the medical accuracy of multiple-choice and free-text responses generated by AI chatbots in response to 79 questions about clinical oncology cases with images.

Exposures: A unique set of 79 clinical oncology cases from JAMA Network Learning, accessed on April 2, 2024, was posed to 10 AI chatbots.

Main outcomes and measures: The primary outcome was medical accuracy, evaluated as the number of correct responses by each AI chatbot. Multiple-choice responses were marked as correct based on the ground-truth answer. Free-text responses were rated in duplicate by a team of oncology specialists and marked as correct based on consensus, with disagreements resolved by a third oncology specialist.
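A minimal sketch of the two scoring rules described above, assuming hypothetical function and field names rather than the study's actual code:

def score_multiple_choice(response: str, ground_truth: str) -> bool:
    # A multiple-choice response is correct only if it matches the ground-truth answer.
    return response.strip().upper() == ground_truth.strip().upper()

def score_free_text(rater1_correct: bool, rater2_correct: bool,
                    adjudicator_correct: bool | None = None) -> bool:
    # Two specialists rate each free-text response in duplicate; agreement settles the
    # score, and a third oncology specialist resolves disagreements.
    if rater1_correct == rater2_correct:
        return rater1_correct
    if adjudicator_correct is None:
        raise ValueError("Raters disagree; third-specialist adjudication is required.")
    return adjudicator_correct

def accuracy(correct_flags: list[bool]) -> float:
    # Primary outcome: proportion of correct responses out of all case questions.
    return sum(correct_flags) / len(correct_flags)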

Results: This study evaluated 10 chatbots, including 3 multimodal and 7 unimodal chatbots. On the multiple-choice evaluation, the top-performing chatbot was chatbot 10 (57 of 79 [72.15%]), followed by the multimodal chatbot 2 (56 of 79 [70.89%]) and chatbot 5 (54 of 79 [68.35%]). On the free-text evaluation, the top-performing chatbots were chatbot 5, chatbot 7, and the multimodal chatbot 2 (30 of 79 [37.97%]), followed by chatbot 10 (29 of 79 [36.71%]) and chatbot 8 and the multimodal chatbot 3 (25 of 79 [31.65%]). The accuracy of multimodal chatbots decreased when tested on cases with multiple images compared with questions with single images. Nine out of 10 chatbots, including all 3 multimodal chatbots, demonstrated decreased accuracy of their free-text responses compared with multiple-choice responses to questions about cancer cases.
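The reported percentages are simple proportions of correct responses out of the 79 case questions; for example, using the counts reported above:

# Reproducing a few of the reported proportions from the raw counts.
for label, correct in [("chatbot 10, multiple choice", 57),
                       ("chatbot 2, multiple choice", 56),
                       ("chatbot 5, free text", 30)]:
    print(f"{label}: {correct}/79 = {correct / 79:.2%}")  # 72.15%, 70.89%, 37.97%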

Conclusions and relevance: In this cross-sectional study of chatbot accuracy tested on clinical oncology cases, multimodal chatbots were not consistently more accurate than unimodal chatbots. These results suggest that further research is needed to optimize multimodal chatbots to make better use of image information and improve oncology-specific medical accuracy and reliability.


Conflict of interest statement

Conflict of Interest Disclosures: Dr Hope reported grants from the Canadian Institutes of Health Research, personal fees from AstraZeneca Canada, and nonfinancial support from Elekta Inc outside the submitted work. No other disclosures were reported.

Figures

Figure. Proportion of Correct Responses to Oncology Case Questions Evaluated Based on Multiple-Choice Response and Free-Text Response

Proportion of correct responses out of the total number of responses to oncology case questions (N = 79) evaluated based on multiple-choice response (A) and free-text response (B).

