[Preprint]. 2025 Jul 24:arXiv:2409.15087v2.

Towards Accountable AI in Eye Disease Diagnosis: Workflow, External Validation, and Development

Qingyu Chen et al. ArXiv.

Update in

  • AI Workflow, External Validation, and Development in Eye Disease Diagnosis.
    Chen Q, Keenan TDL, Agron E, Allot A, Guan E, Duong B, Elsawy A, Hou B, Xue C, Bhandari S, Broadhead G, Cousineau-Krieger C, Davis E, Gensheimer WG, Golshani CA, Grasic D, Gupta S, Haddock L, Konstantinou E, Lamba T, Maiberger M, Mantopoulos D, Mehta MC, Elnahry AG, Al-Nawaflh M, Oshinsky A, Powell BE, Purt B, Shin S, Stiefel H, Thavikulwat AT, Wroblewski KJ, Tham YC, Cheung CMG, Cheng CY, Chew EY, Hribar MR, Chiang MF, Lu Z. JAMA Netw Open. 2025 Jul 1;8(7):e2517204. doi: 10.1001/jamanetworkopen.2025.17204. PMID: 40668583. Free PMC article.

Abstract

Importance: Timely disease diagnosis is challenging due to limited clinical availability and growing burdens. Although artificial intelligence (AI) shows expert-level diagnostic accuracy, a lack of downstream accountability, including workflow integration, external validation, and further development, continues to hinder its real-world adoption.

Objective: To address gaps in the downstream accountability of medical AI through a case study on age-related macular degeneration (AMD) diagnosis and severity classification.

Design, setting, and participants: We developed and evaluated an AI-assisted diagnostic and classification workflow for AMD. Four rounds of diagnostic assessments (accuracy and time) were conducted with 24 clinicians from 12 institutions. Each round was randomized and alternated between Manual and Manual + AI, with a washout period between rounds. In total, 2,880 AMD risk features were evaluated across 960 images from 240 Age-Related Eye Disease Study patient samples, both with and without AI assistance. For further development, we enhanced the original DeepSeeNet model into DeepSeeNet+ using ~40,000 additional images from the US population and tested it on three datasets, including an external set from Singapore.
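
The abstract does not specify how DeepSeeNet was extended into DeepSeeNet+; as an illustration only, the sketch below shows one common way to fine-tune a pretrained image classifier on additional labeled fundus photographs in PyTorch. The backbone (ResNet-50), the directory layout, and all hyperparameters are hypothetical placeholders, not the authors' pipeline.

    # Illustrative sketch only: fine-tuning a pretrained CNN on additional
    # fundus photographs, in the spirit of extending DeepSeeNet to DeepSeeNet+.
    # The backbone, paths, and hyperparameters are hypothetical placeholders.
    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Standard ImageNet-style preprocessing for color fundus photographs.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Hypothetical directory of graded images, one subfolder per severity level.
    train_set = datasets.ImageFolder("amd_images/train", transform=preprocess)
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    # Start from a pretrained backbone; replace the head for the severity classes.
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(5):  # small illustrative training budget
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()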

Main outcomes and measures: We compared Manual vs. Manual + AI on diagnostic accuracy, measured by the F1-score (Wilcoxon rank-sum test), and on diagnostic time per patient (linear mixed-effects model). The further-development evaluation also used the F1-score (Wilcoxon rank-sum test).
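
As an illustration of the two named analyses, a minimal sketch in Python using scipy and statsmodels; all values, array shapes, and column names are hypothetical placeholders, not study data.

    # Illustrative sketch only: the two statistical comparisons named above,
    # run on hypothetical placeholder data (not the study's data).
    import numpy as np
    import pandas as pd
    from scipy.stats import ranksums
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)

    # Per-clinician F1-scores under each condition (hypothetical values).
    f1_manual = rng.normal(37.7, 4.0, size=24)
    f1_manual_ai = rng.normal(45.5, 4.0, size=24)

    # Wilcoxon rank-sum test comparing the two sets of F1-scores.
    stat, p = ranksums(f1_manual, f1_manual_ai)
    print(f"Wilcoxon rank-sum: statistic={stat:.2f}, p={p:.4f}")

    # Diagnostic times in long format: one row per (clinician, round, condition).
    times = pd.DataFrame({
        "clinician": np.repeat(np.arange(24), 8),
        "round_num": np.tile(np.repeat([1, 2, 3, 4], 2), 24),
        "condition": np.tile(["manual", "manual_ai"], 96),
        "seconds": rng.normal(35.0, 8.0, size=192),
    })

    # Linear mixed-effects model: condition and round as fixed effects, with a
    # random intercept per clinician to absorb individual speed differences.
    fit = smf.mixedlm("seconds ~ condition + round_num", times,
                      groups=times["clinician"]).fit()
    print(fit.summary())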

Results: Among the 240 patients (mean age, 68.5 years; 53% female), AI assistance improved accuracy for 23 of 24 clinicians, increasing the average F1-score by 20% (from 37.71 to 45.52), with some individual improvements exceeding 50%. Manual diagnosis initially took an estimated 39.8 seconds per patient, whereas Manual + AI saved 10.3 seconds and remained 1.7 to 3.3 seconds faster in later rounds. However, combining manual and AI assessment may not always yield the highest accuracy or efficiency, underscoring ongoing challenges in explainability and trust. DeepSeeNet+ outperformed the original model on all three test sets, achieving a 13% higher F1-score on the external Singapore cohort.
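
As a quick arithmetic check of the headline relative improvement (our computation from the two reported means, not a figure from the paper):

    \[
    \frac{45.52 - 37.71}{37.71} = \frac{7.81}{37.71} \approx 0.21 \approx 20\%
    \]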

Conclusions and relevance: In this diagnostic study, AI assistance improved both accuracy and time efficiency for AMD diagnosis. Further development was essential for enhancing AI generalizability across diverse populations. These findings highlight the need for downstream accountability during early-stage clinical evaluations of medical AI. All code and models are publicly available.

Figures

Figure 1. Overview of the AI-assisted diagnostic/classification workflow. (A) Dataset of 240 patients with 480 images, 40 samples per severity level, with the distribution of risk factors: drusen, pigment abnormalities, and late AMD. (B) Division of samples into batches. (C) Evaluation pipeline in which clinicians grade color fundus photographs with and without AI assistance.
Figure 2. Comparison of Diagnostic Performance (F1-Score): Manual vs. Manual + AI Assessment. Each dot represents an F1-score. (A) Final AMD scale and individual risk factors. (B) Drusen. (C) Pigment. (D) Late AMD. The cutoff line represents the performance of the AI model alone. (E) Changes in F1-score for individual clinicians. Blue dots: manual assessment; red dots: AI-assisted manual assessment; solid lines: retina specialists; dashed lines: comprehensive ophthalmologists.
Figure 3. Detailed Breakdown of F1-Score Per Scale for (A) Final AMD Scale, (B) Drusen, (C) Pigment, and (D) Late AMD. Manual and Manual + AI results represent paired comparisons from the same clinicians. The AI-only performance is from a single model and is shown for reference only; it is not directly comparable to the clinician results.
Figure 4. Diagnostic Time Efficiency (Seconds per Patient) with AI Assistance. (A) Mean and standard deviation of diagnostic times across four rounds. (B) Individual clinician diagnostic times over four rounds, contrasting manual (top row in each round) with AI-assisted (bottom row in each round) assessments. Darker shades indicate longer diagnostic times.

