Comparative Study

Large language models provide discordant information compared to ophthalmology guidelines

Andrea Taloni et al. Sci Rep. 2025 Jul 1;15(1):20556. doi: 10.1038/s41598-025-06404-z.
Abstract

To evaluate the agreement of large language models (LLMs) with the Preferred Practice Patterns® (PPP) guidelines developed by the American Academy of Ophthalmology (AAO). Open questions based on the AAO PPP were submitted to five LLMs: GPT-o1 and GPT-4o by OpenAI, Claude 3.5 Sonnet by Anthropic, Gemini 1.5 Pro by Google, and DeepSeek-R1-Lite-Preview. Questions were classified as "open" or "confirmatory with positive/negative ground-truth answer". Three blinded investigators classified responses as "concordant", "undetermined", or "discordant" compared to the AAO PPP. Undetermined and discordant answers were analyzed to assess their potential to harm patients. Responses referencing peer-reviewed articles were reported. In total, 147 questions were submitted to the LLMs. Concordant answers were 135 (91.8%) for GPT-o1, 133 (90.5%) for GPT-4o, 136 (92.5%) for Claude 3.5 Sonnet, 124 (84.4%) for Gemini 1.5 Pro, and 119 (81.0%) for DeepSeek-R1-Lite-Preview (P = 0.006). The highest number of harmful answers was observed for Gemini 1.5 Pro (n = 6, 4.1%), followed by DeepSeek-R1-Lite-Preview (n = 5, 3.4%). Gemini 1.5 Pro was the most transparent model (86 references, 58.5%). The other LLMs referenced papers in 9.5-15.6% of their responses. LLMs can provide discordant answers compared to ophthalmology guidelines, potentially harming patients by delaying diagnosis or recommending suboptimal treatments.
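
The abstract reports P = 0.006 for the difference in concordance rates across the five models but does not state which statistical test was used. A minimal sketch, assuming a chi-square test of independence on the concordant versus non-concordant counts (147 questions per model) and using only the counts reported above, reproduces a P value of approximately 0.006:

# Sketch: chi-square test of independence on the reported concordance counts.
# Assumption: the reported P = 0.006 derives from this test; the abstract does not specify it.
from scipy.stats import chi2_contingency

models = ["GPT-o1", "GPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Pro", "DeepSeek-R1-Lite-Preview"]
concordant = [135, 133, 136, 124, 119]          # concordant answers per model (from the abstract)
non_concordant = [147 - c for c in concordant]  # undetermined + discordant, out of 147 questions each

chi2, p, dof, expected = chi2_contingency([concordant, non_concordant])
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p:.3f}")  # P is approximately 0.006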

Keywords: AAO; American Academy of Ophthalmology; Artificial intelligence; Guidelines; Large language model; Preferred practice patterns.

Conflict of interest statement

Competing interests: The authors declare no competing interests.
Ethical approval: The research did not involve humans or animals.

Figures

Fig. 1: Number of concordant (green), undetermined (yellow), and discordant (red) answers compared to the American Academy of Ophthalmology Preferred Practice Patterns®.

Fig. 2: Number of answers potentially harmful to patients.

Fig. 3: Number of answers containing references to peer-reviewed articles.
