PLOS Digit Health. 2025 May 15;4(5):e0000853.
doi: 10.1371/journal.pdig.0000853. eCollection 2025 May.

Development and validation of an AI algorithm to generate realistic and meaningful counterfactuals for retinal imaging based on diffusion models

Indu Ilanchezian et al. PLOS Digit Health.

Abstract

Counterfactual reasoning is often used by humans in clinical settings. For imaging-based specialties such as ophthalmology, it would be beneficial to have an AI model that can create counterfactual images, illustrating answers to questions like "If the subject had had diabetic retinopathy, how would the fundus image have looked?". Such an AI model could aid in the training of clinicians or in patient education through visuals that answer counterfactual queries. We used large-scale retinal image datasets containing color fundus photography (CFP) and optical coherence tomography (OCT) images to train ordinary and adversarially robust classifiers that distinguish healthy and disease categories. In addition, we trained an unconditional diffusion model to generate diverse retinal images, including ones with disease lesions. During sampling, we then combined the diffusion model with classifier guidance to achieve realistic and meaningful counterfactual images that maintain the subject's retinal image structure. We found that our method generated counterfactuals by introducing or removing the necessary disease-related features. We conducted an expert study to validate that generated counterfactuals are realistic and clinically meaningful. Generated color fundus images were indistinguishable from real images and were shown to contain clinically meaningful lesions. Generated OCT images appeared realistic, but could be identified by experts with higher than chance probability. This shows that combining diffusion models with classifier guidance can achieve realistic and meaningful counterfactuals even for high-resolution medical images such as CFP images. Such images could be used for patient education or training of medical professionals.
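
As a rough illustration of the classifier guidance mentioned in the abstract, the sketch below shifts the mean of a single reverse-diffusion step along the gradient of a classifier's log-probability for the target class, then samples from the shifted Gaussian. It is a minimal sketch and not the authors' implementation: the logistic classifier, the placeholder mean, and the names `grad_log_prob`, `guided_mean`, and `scale` are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_prob(x, w, b, target):
    """Gradient of log p(target | x) for a toy logistic classifier.

    The classifier is a hypothetical stand-in for a trained image classifier.
    """
    p1 = sigmoid(w @ x + b)
    # d/dx log sigmoid(z) = (1 - sigmoid(z)) * w
    # d/dx log(1 - sigmoid(z)) = -sigmoid(z) * w
    return (1.0 - p1) * w if target == 1 else -p1 * w

def guided_mean(mean, variance, x, w, b, target, scale=1.0):
    """Shift the reverse-step mean along the classifier gradient."""
    return mean + scale * variance * grad_log_prob(x, w, b, target)

rng = np.random.default_rng(0)
w, b = np.array([1.5, -0.5, 0.8]), -0.2   # toy classifier weights (assumed)
x_t = rng.normal(size=3)                  # current noisy sample
mean = 0.9 * x_t                          # placeholder for the denoiser's predicted mean
variance = 0.05                           # placeholder for the step variance

shifted = guided_mean(mean, variance, x_t, w, b, target=1, scale=5.0)
x_prev = shifted + np.sqrt(variance) * rng.normal(size=3)  # one guided reverse step

p_before = sigmoid(w @ mean + b)
p_after = sigmoid(w @ shifted + b)
print(f"p(target) at unguided mean: {p_before:.3f}, at guided mean: {p_after:.3f}")
```

In a real pipeline the mean and variance would come from the trained diffusion model, and the gradient would be obtained by automatic differentiation through the image classifier.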

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. A. Original retinal fundus image, B. Visualization of counterfactuals with the healthy counterfactual on the left, the DR counterfactual on the right and the original image in the middle, C. Method to generate diffusion counterfactuals.
The top row shows the forward and reverse diffusion for an original image x0. The bottom row shows the generation of a DR diffusion visual counterfactual (DVC) starting from time step T/2. The means of the reverse-diffusion distributions are shifted using the gradients of an adversarially robust classifier (shown in brown), projected (shown in dark orange) onto a cone around the gradients of a plain classifier (shown in light orange). D. Images from the actual forward diffusion corresponding to the time steps shown in C. Physician icon from WikiMedia Commons under Creative Commons CC0 license.
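
The cone projection described in this caption can be sketched as a Euclidean projection of one gradient onto a cone around another. The caption does not give the exact formula, so the function below is an assumption: `cone_project` and the half-angle parameter `alpha` are hypothetical names, and the two-dimensional vectors stand in for image-sized gradients.

```python
import numpy as np

def cone_project(v, w, alpha):
    """Euclidean projection of v onto the cone of half-angle alpha around w.

    alpha is in radians and assumed to lie in [0, pi/2).
    """
    v = np.asarray(v, dtype=float)
    w_hat = np.asarray(w, dtype=float)
    w_hat = w_hat / np.linalg.norm(w_hat)
    v_norm = np.linalg.norm(v)
    if v_norm == 0.0:
        return v.copy()
    theta = np.arccos(np.clip(v @ w_hat / v_norm, -1.0, 1.0))
    if theta <= alpha:
        # v already lies inside the cone: keep it unchanged.
        return v.copy()
    if theta >= alpha + np.pi / 2:
        # v lies in the polar cone: the nearest point of the cone is the origin.
        return np.zeros_like(v)
    # Otherwise project onto the nearest boundary ray, which lies in span{w, v}.
    ortho = v - (v @ w_hat) * w_hat
    e_hat = ortho / np.linalg.norm(ortho)
    u = np.cos(alpha) * w_hat + np.sin(alpha) * e_hat
    return (v @ u) * u

plain_grad = np.array([1.0, 0.0])    # stand-in for the plain classifier gradient
robust_grad = np.array([0.0, 1.0])   # stand-in for the robust classifier gradient
alpha = np.deg2rad(30.0)

projected = cone_project(robust_grad, plain_grad, alpha)
angle = np.degrees(np.arccos(projected @ plain_grad / np.linalg.norm(projected)))
print(projected, angle)
```

Under this reading, a robust gradient that already points within `alpha` of the plain gradient is left untouched, while one that deviates further is pulled back to the cone boundary.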
Fig 2. Diffusion visual counterfactuals (DVCs) show clinically meaningful changes and appear more realistic than sparse visual counterfactuals (SVCs).
A. Color fundus image with signs of diabetic retinopathy (DR). Classifier confidence pϕ(DR)=0.99. GT stands for “ground truth”, the label assigned to the image in the dataset. B. DR SVC (left) and DVC (right) with pϕ(DR)=1.00 for both images. C. Healthy SVC (left) with pϕ(healthy)=1.00 and healthy DVC (right) with pϕ(healthy)=0.99. DVCs show realistically emphasized lesions (light green arrow) and new lesions (dark green arrow). DVCs show more realistic removal of disease-related lesions, whereas SVCs introduce artifacts (yellow arrow). D-F. As A-C, but for a healthy fundus image with pϕ(healthy)=0.90. DR SVC: pϕ(DR)=0.98; DR DVC: pϕ(DR)=1.00; Healthy SVC: pϕ(healthy)=1.00; healthy DVC: pϕ(healthy)=0.99. All SVCs were generated with the ℓ4 norm and ϵ=0.3. DVCs were generated with the ℓ2 norm and regularization strength λ=0.5.
Fig 3. User study of realism of generated DVCs. We asked n = 4 AI researchers and n = 6 ophthalmologists to identify a counterfactual in an odd-one-out task with three images (two real and one counterfactual).
A. Overall fraction of correctly identified counterfactuals with binomial 95%-CI. Baseline at 33% chance level (dashed line). Grey dots and lines indicate individual graders. B. As in A. for the healthy and DR classes. C. As in A. for ophthalmologists and AI researchers.
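
To make the comparison against the 33% chance level concrete, the sketch below computes a 95% binomial confidence interval and checks whether it excludes chance. The caption does not say which interval method was used, so the Wilson score interval here is an assumption, and the counts are hypothetical placeholders rather than data from the study.

```python
import math

def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

CHANCE = 1 / 3               # one counterfactual among three images

# Hypothetical counts for illustration only; NOT taken from the study.
correct, total = 50, 100

low, high = wilson_interval(correct, total)
above_chance = low > CHANCE  # True only if the whole interval exceeds chance
print(f"fraction={correct / total:.2f}, 95% CI=({low:.3f}, {high:.3f}), above chance: {above_chance}")
```

An interval that overlaps 1/3 would be consistent with graders being unable to tell counterfactuals from real images.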
Fig 4. User study of meaningfulness of generated DVCs.
A. We asked n = 5 ophthalmologists to classify a given set of fundus images into healthy and referable DR categories. The image set contained both real fundus images (indicated by green outline) and generated DVCs (blue outline). B. Overall fraction of correctly graded images with binomial 95%-CI for each subset. The performance of clinicians on the generated subset was comparable to that on the real subset showing that the generated DVCs faithfully introduce meaningful features of each class. Grey dots and lines show individual graders. Physician icon from WikiMedia Commons under Creative Commons CC0 license.
Fig 5. Comparison of DVCs generated using the plain model (top row), robust model (middle row) and cone projection of an adversarially robust model onto a plain model (bottom row).
A. A DR fundus image with pϕ(DR)=1.00 with a zoom in on patches with lesions. B. DR DVCs for the image from A. for the three different models. C. Difference maps between the original DR image and the DR DVC show that the robust and cone projection models produce more realistic changes than the plain model.
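
A difference map of the kind referred to in this caption could be computed along the following lines. The caption does not specify how the maps were produced, so the channel-averaged absolute difference, the normalization to [0, 1], and the name `difference_map` are assumptions; the tiny arrays stand in for real fundus images.

```python
import numpy as np

def difference_map(original, counterfactual):
    """Per-pixel absolute difference, averaged over channels and scaled to [0, 1]."""
    a = np.asarray(original, dtype=np.float64)
    b = np.asarray(counterfactual, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    diff = np.abs(a - b)
    if diff.ndim == 3:
        diff = diff.mean(axis=-1)      # collapse the color channels
    peak = diff.max()
    return diff / peak if peak > 0 else diff

# Toy 4x4 RGB "images": the counterfactual changes a single pixel.
original = np.zeros((4, 4, 3))
counterfactual = original.copy()
counterfactual[1, 2] = [0.6, 0.3, 0.0]

dmap = difference_map(original, counterfactual)
print(dmap)
```

Bright regions of such a map mark where the counterfactual departs from the original, which is how localized lesion edits can be told apart from diffuse, unrealistic changes.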
Fig 6. Effect of tuning the regularization strength λd on generated DVCs.
Decreasing λd allows for more changes on the original image. A. We start with a healthy image and generate DR DVCs with decreasing λd. More lesions are generated as λd decreases (light green arrows). B. We start with a DR image and generate healthy DVCs with different λd values. Some traces of the lesions were still visible for λd={0.7,0.5} while they were completely removed for λd=0.2 (dark green arrows) at the cost of some changes to the vessel structure (red arrows). While a higher λd=0.7 is sufficient to generate the minimum number of lesions required to convert a healthy fundus to DR, it is not sufficient to remove all lesions on a DR image to convert it to a healthy fundus.
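
The trade-off controlled by λd can be illustrated with a toy problem in which a point is pushed toward a target class while being penalized for moving away from its starting position. This is only a sketch under stated assumptions: it uses plain gradient ascent on a logistic classifier with a squared ℓ2 penalty rather than a diffusion model, and `toy_counterfactual`, the weights, and the step settings are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def toy_counterfactual(x0, w, b, lam, step=0.1, n_steps=200):
    """Gradient ascent on log p(target | x) - (lam / 2) * ||x - x0||^2."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    for _ in range(n_steps):
        class_grad = (1.0 - sigmoid(w @ x + b)) * w   # pulls x toward the target class
        dist_grad = x - x0                            # pulls x back toward the original
        x = x + step * (class_grad - lam * dist_grad)
    return x

w, b = np.array([1.0, -1.0]), -2.0   # toy classifier (assumed)
x0 = np.zeros(2)                     # toy "original image"

results = {lam: toy_counterfactual(x0, w, b, lam) for lam in (0.7, 0.5, 0.2)}
distances = {lam: float(np.linalg.norm(x - x0)) for lam, x in results.items()}
probs = {lam: float(sigmoid(w @ x + b)) for lam, x in results.items()}
for lam in (0.7, 0.5, 0.2):
    print(f"lambda={lam}: distance={distances[lam]:.3f}, p(target)={probs[lam]:.3f}")
```

In this toy setting, a smaller penalty lets the point travel further from the original and reach a higher target-class probability, mirroring the caption's observation that decreasing λd allows more changes to the image.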
Fig 7. DVCs for the DR grading task with five classes: healthy, mild, moderate, severe and proliferative.
Images marked with * are original images with GT as indicated in the headline. All other images are DVCs with the headline specifying the target class. DVCs to the different classes from a A. healthy fundus, B. fundus with mild DR and C. fundus with moderate DR. Lesions that are present in the initial image are indicated with light green arrows, while lesions added by the DVC are indicated with dark green arrows. In all cases, healthy DVCs removed all lesions. While the number and types of lesions introduced in mild and moderate DVCs are consistent with those observed in real-world data, severe and proliferative DVCs did not reflect the size and intensity of lesions seen in real examples. The model fails to generate the larger lesions seen in the severe and proliferative classes due to the low representation of these classes in the dataset. Other failure cases can be seen in S9 Fig.
Fig 8. DVCs for B-scans from optical coherence tomography (OCT) from healthy to various disease classes and vice-versa.
A. DVC from healthy to choroidal neovascularization (CNV) (top) and from CNV to healthy (bottom). B, C. Same as A for the classes drusen (B) and diabetic macular edema (DME) (C). Similar to fundus DVCs, OCT DVCs show meaningful changes that are consistent with the important features of each class. DVCs from healthy images add features relevant to the disease (blue arrows). DVCs from diseased images to the healthy class remove the disease-specific features seen on the original image (green arrows). Upon visual inspection, OCT DVCs are more realistic than SVCs. For SVCs of the above OCT images, see S10 Fig.
Fig 9. Clinical evaluation of realism of generated OCT DVCs. We asked n = 4 AI researchers and n = 6 ophthalmologists to identify a DVC in an odd-one-out task with three images (two real and one DVC).
A. Overall fraction of correctly identified DVCs with binomial 95%-CI. Baseline at 33% (dashed line). B. As in A. OCT DVCs are easier for ophthalmologists to detect than fundus DVCs, mainly because the changes are to and from more progressed disease stages, which require more attention to the global image structure.
