PLoS One. 2026 Feb 25;21(2):e0342786.

doi: 10.1371/journal.pone.0342786. eCollection 2026.

ANCHOLIK-NER: A benchmark dataset for Bangla regional named entity recognition

Bidyarthi Paul¹, Faika Fairuj Preotee¹, Shuvashis Sarker¹, Shamim Rahim Refat², Shifat Islam³, Tashreef Muhammad¹, Mohammad Ashraful Hoque¹, Shahriar Manzoor¹

Affiliations

¹ Department of CSE, Southeast University, Dhaka, Bangladesh.
² Department of CSE, Ahsanullah University of Science and Technology, Dhaka, Bangladesh.
³ Department of CSE, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh.

PMID: 41739882
PMCID: PMC12935308
DOI: 10.1371/journal.pone.0342786

Bidyarthi Paul et al. PLoS One. 2026.

Abstract

Named Entity Recognition (NER) in regional dialects is a critical yet underexplored area in Natural Language Processing (NLP), especially for low-resource languages like Bangla. While NER systems for Standard Bangla have made progress, no existing resources or models specifically address the challenge of regional dialects such as Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet, which exhibit unique linguistic features that existing models fail to handle effectively. To fill this gap, we introduce ANCHOLIK-NER, the first benchmark dataset for NER in Bangla regional dialects, comprising 17,405 sentences and 101,817 words annotated with 10 entity tags across 5 regions. The dataset was sourced from publicly available resources and supplemented with manual translations, ensuring alignment of named entities across dialects. We evaluate three transformer-based models (Bangla BERT, Bangla Bert Base, and BERT Base Multilingual Cased) on this dataset. Bangla BERT achieved the highest performance overall, with F1-scores of 82.27% (Mymensingh), 81.48% (Barishal), 78.75% (Sylhet), 78.50% (Noakhali), and 75.31% (Chittagong). These results highlight strong recognition capability in Mymensingh and Barishal, while dialectal variation in Chittagong remains challenging. As no prior NER resources exist for Bangla regional dialects, this work provides a foundational dataset and baseline benchmarks to facilitate future research. Future work will focus on dialect-aware model adaptation and expanding coverage to additional regions.
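
The figures reported in the abstract can be summarized with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the names `reported_f1` and `summarize` are hypothetical, and the mean is unweighted because the abstract does not give per-dialect test-set sizes.

```python
# F1-scores (%) reported in the abstract for Bangla BERT, keyed by dialect.
reported_f1 = {
    "Mymensingh": 82.27,
    "Barishal": 81.48,
    "Sylhet": 78.75,
    "Noakhali": 78.50,
    "Chittagong": 75.31,
}

# Dataset size reported in the abstract (all five dialects combined).
num_sentences = 17_405
num_words = 101_817


def summarize(scores):
    """Return (best dialect, worst dialect, unweighted mean F1, best-worst gap).

    Assumption: every dialect counts equally; the abstract does not say
    how many test sentences each dialect contributes.
    """
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    mean_f1 = sum(scores.values()) / len(scores)
    gap = scores[best] - scores[worst]
    return best, worst, round(mean_f1, 2), round(gap, 2)


best, worst, mean_f1, gap = summarize(reported_f1)
words_per_sentence = round(num_words / num_sentences, 2)

print(f"best={best}, worst={worst}, mean F1={mean_f1}%, gap={gap} points")
print(f"average sentence length: {words_per_sentence} words")
```

Under these assumptions, the gap between the strongest and weakest dialect is just under seven F1 points, and the sentences are short on average (roughly six words each).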

Copyright: © 2026 Paul et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Regional NER examples along with Standard Bangla and English.**

**Fig 2. Development of ANCHOLIK-NER: A systematic pipeline for dataset creation.**

**Fig 3. Inter-annotator agreement (Cohen’s Kappa) across different regions.**

**Fig 4. Average tagging speed (time per 1000 tokens) by region in minutes.**

**Fig 10. Frequency of named entities in the Chittagong dialect.**

**Fig 11. Frequency of named entities in the Barishal dialect.**

**Fig 12. Frequency of named entities in the Mymensingh dialect.**

**Fig 13. Frequency of named entities in the Sylhet dialect.**

**Fig 14. Frequency of named entities in the Noakhali dialect.**

**Fig 16. Confusion matrices for the best-performing model on the Barishal regional dialect.**

**Fig 17. Confusion matrices for the best-performing model on the Mymensingh regional dialect.**

**Fig 18. Confusion matrices for the best-performing model on the Chittagong regional dialect.**

**Fig 19. Confusion matrices for the best-performing model on the Noakhali regional dialect.**

**Fig 20. Confusion matrices for the best-performing model on the Sylhet regional dialect.**
