Enhancing Bangla handwritten character recognition using Vision Transformers, VGG-16, and ResNet-50: a performance analysis
- PMID: 41322980
- PMCID: PMC12660064
- DOI: 10.3389/fdata.2025.1682984
Abstract
Bangla Handwritten Character Recognition (BHCR) remains challenging due to complex alphabets and handwriting variations. In this study, we present a comparative evaluation of three deep learning architectures, namely Vision Transformer (ViT), VGG-16, and ResNet-50, on the CMATERdb 3.1.2 dataset, which comprises 24,000 images of 50 basic Bangla characters. Our work highlights the effectiveness of ViT in capturing global context and long-range dependencies, leading to improved generalization. Experimental results show that ViT achieves a state-of-the-art accuracy of 98.26%, outperforming VGG-16 (94.54%) and ResNet-50 (93.12%). We also analyze model behavior, discuss overfitting in CNNs, and provide insights into character-level misclassifications. This study demonstrates the potential of transformer-based architectures for robust BHCR and offers a benchmark for future research.
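The accuracy gap reported above is easier to judge as an error rate. The following minimal Python sketch does that arithmetic using only the three accuracies stated in the abstract. The helper names `error_rate` and `relative_error_reduction` are illustrative and do not come from the study, and the sketch says nothing about how the authors computed or validated their figures.

```python
# Accuracies as reported in the abstract (percent). No other data is assumed.
reported_accuracy = {"ViT": 98.26, "VGG-16": 94.54, "ResNet-50": 93.12}


def error_rate(accuracy_pct):
    """Top-1 error in percent, i.e. 100 minus accuracy (hypothetical helper)."""
    return round(100.0 - accuracy_pct, 2)


def relative_error_reduction(baseline_pct, improved_pct):
    """Percentage of the baseline's errors that the improved model avoids."""
    baseline_error = 100.0 - baseline_pct
    improved_error = 100.0 - improved_pct
    return 100.0 * (baseline_error - improved_error) / baseline_error


errors = {name: error_rate(acc) for name, acc in reported_accuracy.items()}

for name in ("VGG-16", "ResNet-50"):
    gain = relative_error_reduction(reported_accuracy[name], reported_accuracy["ViT"])
    print(f"ViT vs {name}: error {errors[name]}% -> {errors['ViT']}% "
          f"({gain:.1f}% fewer errors)")
```

Read this way, a difference of 3.72 percentage points over VGG-16 corresponds to roughly two thirds fewer misclassified characters, and the difference over ResNet-50 to roughly three quarters fewer.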
Keywords: Bangla handwritten character recognition; ResNet-50; VGG-16; Vision Transformer (ViT); convolutional neural network; deep learning; optical character recognition.
Copyright © 2025 Shahariar Parvez, Samiul Islam, Al Farid, Yeasmin, Islam, Azam, Uddin and Abdul Karim.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
