Front Cell Dev Biol. 2025 Jul 23;13:1642539.
doi: 10.3389/fcell.2025.1642539. eCollection 2025.

Multimodal reasoning agent for enhanced ophthalmic decision-making: a preliminary real-world clinical validation


Yijing Zhuang et al. Front Cell Dev Biol. 2025.

Abstract

Although large language models (LLMs) show significant potential in clinical practice, accurate diagnosis and treatment planning in ophthalmology require multimodal integration of imaging, clinical history, and guideline-based knowledge. Current LLMs predominantly focus on unimodal language tasks and face limitations in specialized ophthalmic diagnosis due to domain knowledge gaps, hallucination risks, and inadequate alignment with clinical workflows. This study introduces a structured reasoning agent (ReasonAgent) that integrates a multimodal visual analysis module, a knowledge retrieval module, and a diagnostic reasoning module to address the limitations of current AI systems in ophthalmic decision-making. Validated on 30 real-world ophthalmic cases (27 common and 3 rare diseases), ReasonAgent demonstrated diagnostic accuracy comparable to that of ophthalmology residents (β = -0.07, p = 0.65). In treatment planning, however, it significantly outperformed both GPT-4o (β = 0.49, p = 0.01) and residents (β = 1.71, p < 0.001), excelling particularly in rare-disease scenarios (all p < 0.05). While GPT-4o proved vulnerable in rare cases (low diagnostic scores in 90.48% of rare-case evaluations), ReasonAgent's hybrid design mitigated such errors through structured reasoning. Statistical analysis identified significant case-level heterogeneity (diagnosis ICC = 0.28), highlighting the need for domain-specific AI solutions in complex clinical contexts. This framework establishes a novel paradigm for domain-specific AI in real-world clinical practice, demonstrating the potential of modularized architectures to advance decision fidelity through human-aligned reasoning pathways.
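
The β coefficients and the case-level ICC reported above are consistent with a mixed-effects analysis of the Likert ratings. Below is a minimal sketch of such an analysis in Python with statsmodels, assuming a hypothetical long-format table with columns score, method, and case_id; the paper's exact model specification is not shown here, and an ordinal mixed model would respect the Likert scale more faithfully than this linear approximation.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format ratings: one row per (case, method, rater) Likert score.
df = pd.read_csv("likert_ratings.csv")  # assumed columns: score, method, case_id

# Linear mixed model: fixed effects for method (GPT-4o as reference level),
# random intercept per case to capture case-level heterogeneity.
model = smf.mixedlm("score ~ C(method, Treatment('GPT-4o'))", df, groups=df["case_id"])
fit = model.fit()
print(fit.summary())  # per-method fixed-effect betas and p-values

# Intraclass correlation: share of total variance attributable to cases.
case_var = float(fit.cov_re.iloc[0, 0])  # random-intercept variance
resid_var = fit.scale                    # residual variance
icc = case_var / (case_var + resid_var)
print(f"case-level ICC = {icc:.2f}")
```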

Keywords: GPT-4o; artificial intelligence; large language models; ocular diseases; reasoning agent.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1. Flowchart of the Reasoning Agent Design and the Evaluation of Different Methods’ Responses in Clinical Ophthalmology Scenarios. Ophthalmic imaging (e.g., OCT, B-scan, SLO, FFA) and clinical history serve as input sources. The Vision Understanding Module (GPT-4o) analyzes ophthalmic images and describes abnormalities. The Evidence Retrieval Module (RAG) extracts diagnostic knowledge from guidelines based on the clinical history and ocular examination. These outputs, combined with the clinical history text, feed into the Diagnostic Reasoning Module (DeepSeek-R1) within the reasoning agent for diagnostic analysis and treatment planning. Comparison groups included standalone GPT-4o and three residents. Responses were rated on Likert scales by seven attending physicians.
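
As a concrete illustration of this three-module flow, the sketch below wires a vision call, a retrieval step, and a reasoning call together. It is a hypothetical reconstruction, not the authors' code: the retrieve_guidelines helper is a placeholder, and the DeepSeek endpoint, model name, and prompts are assumptions.

```python
import os
from openai import OpenAI

vision_client = OpenAI()  # GPT-4o via the standard OpenAI API (assumed)
reason_client = OpenAI(   # DeepSeek-R1 via its OpenAI-compatible API (assumed)
    base_url="https://api.deepseek.com",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

def describe_image(image_url: str) -> str:
    """Vision Understanding Module: GPT-4o describes abnormalities in one image."""
    resp = vision_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe any ophthalmic abnormalities in this image."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    )
    return resp.choices[0].message.content

def retrieve_guidelines(history: str) -> str:
    """Evidence Retrieval Module (RAG): placeholder for embedding-based guideline search."""
    raise NotImplementedError  # e.g., embed the history and query a guideline vector store

def reason(history: str, image_urls: list[str]) -> str:
    """Diagnostic Reasoning Module: DeepSeek-R1 combines findings, evidence, and history."""
    findings = "\n".join(describe_image(u) for u in image_urls)
    evidence = retrieve_guidelines(history)
    resp = reason_client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": (
            f"Clinical history:\n{history}\n\nImaging findings:\n{findings}\n\n"
            f"Guideline evidence:\n{evidence}\n\n"
            "Provide a diagnosis and a treatment plan."
        )}],
    )
    return resp.choices[0].message.content
```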
FIGURE 2. Distribution of Likert Scores for Different Methods in Diagnostic Tasks and Treatment Planning Tasks. (A) Violin plot of Likert scores for diagnostic tasks; (B) violin plot of Likert scores for treatment planning tasks. Embedded boxplots show the interquartile range (25th to 75th percentile) and the median (black horizontal line); whiskers represent the range of scores excluding outliers. Statistical analysis revealed no significant differences in diagnostic task scores among ReasonAgent, GPT-4o, and residents. In contrast, treatment planning scores were significantly higher for ReasonAgent than for GPT-4o and residents. *p < 0.05, **p < 0.01, ***p < 0.001.
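
A figure of this kind (violin plots with embedded boxplots over per-method Likert scores) can be sketched with seaborn. The data below are synthetic and only illustrate the plot construction; they are not the study's scores.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic 1-5 Likert scores per method (30 cases x 7 raters = 210 ratings each).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "method": np.repeat(["ReasonAgent", "GPT-4o", "Resident"], 210),
    "score": rng.integers(1, 6, 630),  # placeholder scores
})

# Violin with an embedded box showing the IQR and median, as in the figure.
ax = sns.violinplot(data=df, x="method", y="score", inner="box", cut=0)
ax.set_ylabel("Likert score")
plt.tight_layout()
plt.show()
```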
