Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
- PMID: 39043988
- PMCID: PMC11266508
- DOI: 10.1038/s41746-024-01185-7
Abstract
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.
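The abstract separates two measurements: how often the final choice is correct, and how often the rationale behind a correct choice is flawed. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The Case record, the summarize function, and the toy data are hypothetical, and the sketch assumes that flawed-rationale rates are computed over correctly answered cases only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Case:
    """One annotated quiz case (hypothetical schema, not from the paper)."""
    answer_correct: bool   # final multi-choice answer matches the key
    image_ok: bool         # image comprehension rationale judged sound
    knowledge_ok: bool     # medical knowledge recall judged sound
    reasoning_ok: bool     # step-by-step reasoning judged sound


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(cases: List[Case]) -> Dict[str, float]:
    """Compute answer accuracy and flawed-rationale rates among correct answers."""
    correct = [c for c in cases if c.answer_correct]
    # A rationale counts as flawed if any of the three dimensions is unsound.
    flawed = [
        c for c in correct
        if not (c.image_ok and c.knowledge_ok and c.reasoning_ok)
    ]
    image_flawed = [c for c in correct if not c.image_ok]
    return {
        "accuracy": _ratio(len(correct), len(cases)),
        "flawed_given_correct": _ratio(len(flawed), len(correct)),
        "image_flawed_given_correct": _ratio(len(image_flawed), len(correct)),
    }


# Toy annotations for illustration only; these are NOT data from the study.
TOY_CASES = [
    Case(True, True, True, True),
    Case(True, False, True, True),
    Case(True, True, True, False),
    Case(True, True, True, True),
    Case(False, False, True, True),
]

if __name__ == "__main__":
    for name, value in summarize(TOY_CASES).items():
        print(f"{name}: {value:.3f}")
```

On the toy input, accuracy is 0.8 while half of the correct answers carry a flawed rationale, which illustrates why accuracy alone can hide the problem the abstract describes.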
© 2024. This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply.
Conflict of interest statement
The authors declare no competing interests other than the following financial interests: R.S. receives royalties for patents or software licenses from iCAD, Philips, ScanMed, PingAn, Translation Holdings, and MGB. R.S. received research support from PingAn.
Update of
- Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine. ArXiv [Preprint]. 2024 Aug 31:arXiv:2401.08396v4. ArXiv. 2024. Update in: NPJ Digit Med. 2024 Jul 23;7(1):190. doi: 10.1038/s41746-024-01185-7. PMID: 38410646