Medicina (Kaunas). 2025 Sep 25;61(10):1744. doi: 10.3390/medicina61101744.

ChatGPT in Oral Pathology: Bright Promise or Diagnostic Mirage

Ana Suárez et al. Medicina (Kaunas). 2025.

Abstract

Background and Objectives: Academic interest in the diagnostic capabilities of multimodal large language models such as ChatGPT-4o is growing across the biomedical sciences, yet their ability to interpret oral clinical images remains insufficiently explored. This exploratory pilot study aimed to provide preliminary observations on the diagnostic validity of ChatGPT-4o in identifying oral squamous cell carcinoma (OSCC), oral leukoplakia (OL), and oral lichen planus (OLP) from clinical photographs alone, without additional clinical data.

Materials and Methods: Two general dentists selected 23 images of oral lesions suspected to be OSCC, OL, or OLP. ChatGPT-4o was asked to provide a probable diagnosis for each image on 30 occasions, generating 690 responses in total. Responses were evaluated against a reference diagnosis established by an expert to calculate sensitivity, specificity, predictive values, and the area under the ROC curve (AUC).

Results: ChatGPT-4o showed high specificity for all three conditions (97.1% for OSCC, 100% for OL, and 96.1% for OLP) and correctly classified 90% of OSCC cases overall (AUC = 0.81). This overall accuracy, however, was driven largely by correct negative classifications: the clinically relevant sensitivity for OSCC was only 65%, and sensitivity for the other conditions was lower still (60% for OL and just 25% for OLP), limiting the model's usefulness for ruling out these conditions in a clinical setting. Positive predictive values were 86.7% for OSCC and 100% for OL. Given the small dataset, these findings should be interpreted only as preliminary evidence.

Conclusions: ChatGPT-4o shows potential as a complementary tool for screening OSCC in clinical oral images. However, its sensitivity remains insufficient, and a substantial proportion of true cases were missed, so the model cannot be relied upon as a standalone diagnostic tool. The pilot nature of this study and its small sample size mean that larger, adequately powered studies (with several hundred cases per pathology) are needed to obtain robust and generalizable results.
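As a purely illustrative sketch (not the authors' code, and using invented counts rather than the study's data), the per-condition figures above are standard one-vs-rest diagnostic-accuracy metrics computed from a confusion matrix:

def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic-accuracy metrics for one condition vs. the rest."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate: cases correctly flagged
        "specificity": tn / (tn + fp),  # true negative rate: non-cases correctly cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# HYPOTHETICAL counts for a single condition, invented for illustration only;
# the study scored 690 responses (23 images x 30 repetitions) in this manner.
for name, value in diagnostic_metrics(tp=14, fp=3, fn=6, tn=40).items():
    print(f"{name}: {value:.1%}")

Note that high specificity with low sensitivity, as reported here, inflates overall accuracy whenever true cases are a minority of the sample, which is why the abstract stresses the 65% OSCC sensitivity rather than the 90% overall classification rate.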

Keywords: ChatGPT; diagnostic accuracy; multimodal large language models (LLMs); oral leukoplakia (OL); oral lichen planus (OLP); oral pathology; oral squamous cell carcinoma (OSCC).

Conflict of interest statement

The authors declare no conflicts of interest.
