Front Big Data. 2025 Nov 14;8:1682984. doi: 10.3389/fdata.2025.1682984. eCollection 2025.

Enhancing Bangla handwritten character recognition using Vision Transformers, VGG-16, and ResNet-50: a performance analysis


A H M Shahariar Parvez et al. Front Big Data. 2025.

Abstract

Bangla Handwritten Character Recognition (BHCR) remains challenging due to the complexity of the Bangla alphabet and the variability of handwriting. In this study, we present a comparative evaluation of three deep learning architectures, the Vision Transformer (ViT), VGG-16, and ResNet-50, on the CMATERdb 3.1.2 dataset comprising 24,000 images of 50 basic Bangla characters. Our work highlights the effectiveness of ViT in capturing global context and long-range dependencies, leading to improved generalization. Experimental results show that ViT achieves a state-of-the-art accuracy of 98.26%, outperforming VGG-16 (94.54%) and ResNet-50 (93.12%). We also analyze model behavior, discuss overfitting in the CNNs, and provide insights into character-level misclassifications. This study demonstrates the potential of transformer-based architectures for robust BHCR and offers a benchmark for future research.
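
The comparison described in the abstract (ImageNet-pretrained backbones fine-tuned for 50 Bangla character classes) can be sketched with PyTorch and torchvision. This is not the authors' code: the data directory layout, image size, augmentation, and optimizer settings below are illustrative assumptions only.

    # Minimal sketch (not the authors' pipeline): fine-tuning ViT, VGG-16, and
    # ResNet-50 classification heads for 50 Bangla character classes.
    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    NUM_CLASSES = 50  # 50 basic Bangla characters in CMATERdb 3.1.2

    def build_model(name: str) -> nn.Module:
        """Load an ImageNet-pretrained backbone and replace its classifier head."""
        if name == "vit":
            m = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
            m.heads.head = nn.Linear(m.heads.head.in_features, NUM_CLASSES)
        elif name == "vgg16":
            m = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
            m.classifier[6] = nn.Linear(m.classifier[6].in_features, NUM_CLASSES)
        elif name == "resnet50":
            m = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
            m.fc = nn.Linear(m.fc.in_features, NUM_CLASSES)
        else:
            raise ValueError(f"unknown model: {name}")
        return m

    # 224x224 RGB inputs match the pretrained backbones' expected resolution;
    # handwritten scans are grayscale, so channels are replicated.
    train_tf = transforms.Compose([
        transforms.Grayscale(num_output_channels=3),
        transforms.Resize((224, 224)),
        transforms.RandomRotation(10),   # light augmentation (assumed)
        transforms.ToTensor(),
    ])

    # "data/train" is a placeholder path with one subfolder per character class.
    train_ds = datasets.ImageFolder("data/train", transform=train_tf)
    train_dl = torch.utils.data.DataLoader(train_ds, batch_size=64, shuffle=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = build_model("vit").to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for images, labels in train_dl:      # one epoch shown for brevity
        images, labels = images.to(device), labels.to(device)
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()

Swapping "vit" for "vgg16" or "resnet50" in build_model() keeps the rest of the loop unchanged, which is the sense in which the three architectures can be compared under a common training setup.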

Keywords: Bangla handwritten character recognition; ResNet-50; VGG-16; Vision Transformer (ViT); convolutional neural network; deep learning; optical character recognition.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1. A block diagram of the methodology.
Figure 2. Example images from the CMATERdb 3.1.2 dataset.
Figure 3. Sample images after preprocessing and augmentation.
Figure 4. Confusion matrices of the three models: (a) VGG-16, (b) ResNet-50, and (c) ViT.
Figure 5. Training and validation accuracy and loss curves for (a) VGG-16, (b) ResNet-50, and (c) Vision Transformer models on the CMATERdb 3.1.2 dataset.

