Sci Rep. 2025 Jul 1;15(1):20655.

doi: 10.1038/s41598-025-07561-x.

Transformer attention fusion for fine grained medical image classification

Danyal Badar¹, Junaid Abbas², Raed Alsini³, Tahir Abbas⁴, Wang ChengLiang⁵, Ali Daud⁶

Affiliations

¹ College of Computer Science, Chongqing University, Chongqing, China.
² School of Big Data and Software Engineering, Chongqing University, Chongqing, China.
³ Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia.
⁴ Department of Computer Science, TIMES Institute, Multan, 60000, Pakistan.
⁵ College of Computer Science, Chongqing University, Chongqing, China. wangcl@cqu.edu.cn.
⁶ Faculty of Resilience, Rabdan Academy, Abu Dhabi, United Arab Emirates. alimsdb@gmail.com.

PMID: 40596233
PMCID: PMC12216456
DOI: 10.1038/s41598-025-07561-x

Danyal Badar et al. Sci Rep. 2025.

Abstract

Fine-grained visual classification is fundamental to medical imaging because it can distinguish subtle lesions. Diabetic retinopathy (DR) is a preventable cause of blindness that requires accurate and timely diagnosis to prevent vision loss. Automated DR classification systems face irregular lesions, imbalanced class distributions, and inconsistent image quality, all of which reduce diagnostic accuracy in early detection. To address these problems, we propose MSCAS-Net (Multi-Scale Cross and Self-Attention Network), which uses the Swin Transformer as its backbone. It extracts features at three resolutions (12 × 12, 24 × 24, and 48 × 48), allowing it to capture both subtle local details and global context. The model applies self-attention to strengthen spatial relationships within each scale and cross-attention to align feature patterns across scales, producing a comprehensive feature representation. This dual attention mechanism improves the model's ability to detect significant lesions. MSCAS-Net achieves the best performance on the APTOS, DDR, and IDRID benchmarks, with accuracies of 93.8%, 89.80%, and 86.70%, respectively. The model handles imbalanced datasets and inconsistent image quality without data augmentation because it learns stable features. MSCAS-Net marks a breakthrough in automated DR diagnostics, combining high diagnostic precision with interpretability to serve as an efficient AI-powered clinical decision support system. This research demonstrates how fine-grained visual classification methods benefit the detection and treatment of DR in its early stages.
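
The self-attention and cross-attention fusion described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: it assumes the features from each scale have already been projected to a shared dimension, omits the learned query, key, and value projections, substitutes random vectors for Swin Transformer features, and uses toy grids (2 × 2, 4 × 4, 8 × 8) instead of the 12 × 12, 24 × 24, and 48 × 48 grids so that it runs quickly. The names `attention`, `random_tokens`, and `fuse` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def random_tokens(side, dim, rng):
    """Stand-in for backbone features: side * side tokens of size dim."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(side * side)]


def fuse(scales):
    """Assumed fusion: self-attention within each scale, cross-attention
    from each scale to the tokens of all other scales, then mean pooling
    per scale and concatenation of the pooled vectors."""
    refined = [attention(t, t, t) for t in scales]  # within-scale self-attention
    pooled = []
    for i, q in enumerate(refined):
        others = [tok for j, t in enumerate(refined) if j != i for tok in t]
        crossed = attention(q, others, others)  # cross-scale attention
        dim = len(crossed[0])
        pooled.extend(sum(tok[j] for tok in crossed) / len(crossed)
                      for j in range(dim))
    return pooled


rng = random.Random(0)
DIM = 4                   # assumed shared embedding size (illustrative)
toy_sides = (2, 4, 8)     # toy grids; the abstract reports 12, 24, 48
scales = [random_tokens(s, DIM, rng) for s in toy_sides]
fused = fuse(scales)

# Token counts for the grid sizes reported in the abstract.
token_counts = [s * s for s in (12, 24, 48)]
print(len(fused), token_counts)  # prints: 12 [144, 576, 2304]
```

The token counts show why the finest scale dominates the cost: a 48 × 48 grid yields 2304 tokens, sixteen times as many as the 12 × 12 grid. How the paper actually pools and combines the attended features is not stated in the abstract, so the mean pooling and concatenation here are placeholders.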

Keywords: Attention mechanism; Deep learning; Diabetic retinopathy classification; Fine-grained visual classification; Medical images; Multi-scale feature extraction.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

**Fig. 1**
MSCAS-Net (multi-scale cross and self-attention network) architecture.

**Fig. 4**
Training and validation accuracy curves for APTOS dataset.

**Fig. 5**
Training and validation loss curves for APTOS dataset.

**Fig. 6**
Training and validation accuracy curves for DDR dataset.

**Fig. 7**
Training and validation loss curves for DDR dataset.

**Fig. 8**
Training and validation accuracy curves for IDRID dataset.

**Fig. 9**
Training and validation loss curves for IDRID dataset.

**Fig. 10**
Example heatmaps generated by the model on three datasets: (a) APTOS, (b) DDR, and (c) IDRID.
