Enhancing Bangla handwritten character recognition using Vision Transformers, VGG-16, and ResNet-50: a performance analysis

A H M Shahariar Parvez¹, Md Samiul Islam², Fahmid Al Farid³, Tashida Yeasmin⁴, Md Monirul Islam⁵, Md Shafiul Azam⁶, Jia Uddin⁷, Hezerul Abdul Karim³

Affiliations

¹ Department of Software Engineering, Daffodil International University, Dhaka, Bangladesh.
² Department of Computer Science and Engineering, State University of Bangladesh, Dhaka, Bangladesh.
³ Centre for Image and Vision Computing (CIVC), COE for Artificial Intelligence, Faculty of Artificial Intelligence and Engineering (FAIE), Multimedia University, Cyberjaya, Selangor, Malaysia.
⁴ Department of Computer Science and Engineering, Atish Dipankar University, Dhaka, Bangladesh.
⁵ Department of Information and Communications Engineering, Hankuk University of Foreign Studies, Seoul, Republic of Korea.
⁶ Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna, Bangladesh.
⁷ Artificial Intelligence and Big Data Department, Woosong University, Daejeon, Republic of Korea.

PMID: 41322980
PMCID: PMC12660064
DOI: 10.3389/fdata.2025.1682984

A H M Shahariar Parvez et al. Front Big Data. 2025 Nov 14:8:1682984. doi: 10.3389/fdata.2025.1682984. eCollection 2025.

Abstract

Bangla Handwritten Character Recognition (BHCR) remains challenging due to complex alphabets and handwriting variations. In this study, we present a comparative evaluation of three deep learning architectures, namely Vision Transformer (ViT), VGG-16, and ResNet-50, on the CMATERdb 3.1.2 dataset, which comprises 24,000 images of 50 basic Bangla characters. Our work highlights the effectiveness of ViT in capturing global context and long-range dependencies, leading to improved generalization. Experimental results show that ViT achieves a state-of-the-art accuracy of 98.26%, outperforming VGG-16 (94.54%) and ResNet-50 (93.12%). We also analyze model behavior, discuss overfitting in the CNNs, and provide insights into character-level misclassifications. This study demonstrates the potential of transformer-based architectures for robust BHCR and offers a benchmark for future research.
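
The accuracy figures and character-level misclassification analysis mentioned above are typically derived from a confusion matrix. The following is a minimal sketch of that computation, not the authors' evaluation code: the function names (`overall_accuracy`, `per_class_recall`, `top_confusions`) and the 3-class `toy_cm` matrix are hypothetical and for illustration only. It assumes rows are true classes and columns are predicted classes.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def overall_accuracy(cm: Matrix) -> float:
    """Share of all samples that lie on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def per_class_recall(cm: Matrix) -> List[float]:
    """For each true class, the fraction of its samples predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def top_confusions(cm: Matrix, k: int = 3) -> List[Tuple[int, int, int]]:
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    cells = [
        (cm[i][j], i, j)
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]
    # Sort by count descending, then by indices for a deterministic order.
    cells.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(i, j, count) for count, i, j in cells[:k]]


# Made-up 3-class matrix (NOT data from the study); rows = true, cols = predicted.
toy_cm = [
    [48, 2, 0],
    [5, 44, 1],
    [0, 3, 47],
]

if __name__ == "__main__":
    print(f"accuracy: {overall_accuracy(toy_cm):.4f}")
    print("recall per class:", per_class_recall(toy_cm))
    print("most frequent confusions:", top_confusions(toy_cm))
```

For a 50-class problem such as this one, the same functions would apply to a 50 x 50 matrix, and the output of `top_confusions` would point to the character pairs a model mixes up most often.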

Keywords: Bangla handwritten character recognition; ResNet-50; VGG-16; Vision Transformer (ViT); convolutional neural network; deep learning; optical character recognition.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
A block diagram of methodology.

**Figure 2**
Example of CMATERdb 3.1.2 dataset.

**Figure 3**
Sample images after preprocessing and augmentation.

**Figure 4**
Confusion matrices of the three models: **(a)** VGG-16, **(b)** ResNet-50, and **(c)** ViT.

**Figure 5**
Training and validation accuracy and loss curves for **(a)** VGG-16, **(b)** ResNet-50, and **(c)** Vision Transformer models on the CMATERdb 3.1.2 dataset.
