ChatGPT in Oral Pathology: Bright Promise or Diagnostic Mirage
- PMID: 41155731
- PMCID: PMC12566193
- DOI: 10.3390/medicina61101744
Abstract
Background and Objectives: Academic interest in the diagnostic capabilities of multimodal large language models such as ChatGPT-4o is growing within the biomedical sciences, yet their ability to interpret oral clinical images remains insufficiently explored. This exploratory pilot study aimed to provide preliminary observations on the diagnostic validity of ChatGPT-4o in identifying oral squamous cell carcinoma (OSCC), oral leukoplakia (OL), and oral lichen planus (OLP) from clinical photographs alone, without additional clinical data.
Materials and Methods: Two general dentists selected 23 images of oral lesions suspected to be OSCC, OL, or OLP. ChatGPT-4o was asked to provide a probable diagnosis for each image on 30 occasions, generating a total of 690 responses. The responses were evaluated against the reference diagnosis established by an expert to calculate sensitivity, specificity, predictive values, and the area under the ROC curve (AUC).
Results: ChatGPT-4o demonstrated high specificity across the three conditions (97.1% for OSCC, 100% for OL, and 96.1% for OLP) and correctly classified 90% of OSCC cases (AUC = 0.81). However, this overall accuracy was largely driven by correct negative classifications; the clinically relevant sensitivity for OSCC was only 65%. Sensitivity for the other conditions was lower still: 60% for OL and just 25% for OLP, which limits the model's usefulness for ruling out these conditions in a clinical setting. The model achieved positive predictive values of 86.7% for OSCC and 100% for OL. Given the small dataset, these findings should be interpreted only as preliminary evidence.
Conclusions: ChatGPT-4o shows potential as a complementary tool for screening OSCC in clinical oral images. However, its sensitivity remains insufficient, as a substantial proportion of true cases were missed, underscoring that the model cannot be relied upon as a standalone diagnostic tool. Moreover, the pilot nature of this study and the small sample size mean that larger, adequately powered studies (with several hundred cases per pathology) are needed to obtain robust and generalizable results.
Keywords: ChatGPT; diagnostic accuracy; multimodal large language models (LLMs); oral leukoplakia (OL); oral lichen planus (OLP); oral pathology; oral squamous cell carcinoma (OSCC).
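The abstract reports per-condition sensitivity, specificity, predictive values, and ROC AUC computed from the model's responses against the expert reference diagnosis. The short Python sketch below illustrates how such metrics can be derived from binary classifications; the diagnostic_metrics helper and the example labels are illustrative only and are not the authors' code or data.

# Minimal sketch (illustrative, not the study's code): deriving sensitivity,
# specificity, PPV, NPV, and AUC for one condition from binary labels.
from sklearn.metrics import roc_auc_score

def diagnostic_metrics(y_true, y_pred):
    """y_true: expert reference (1 = condition present), y_pred: model output."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
    }

# Hypothetical labels for one condition, one entry per model response.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # expert reference diagnosis
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]   # model's per-response classification

print(diagnostic_metrics(y_true, y_pred))
# With a single binary threshold, the AUC equals (sensitivity + specificity) / 2.
print("AUC:", roc_auc_score(y_true, y_pred))

In the study's repeated-query design, each image was classified 30 times, so labels like these would be pooled across responses (or aggregated per image) before computing the per-condition metrics.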
Conflict of interest statement
The authors declare no conflicts of interest.
