J Healthc Inform Res. 2023 Sep 19;7(4):433-446. doi: 10.1007/s41666-023-00140-7. eCollection 2023 Dec.

BioBERTurk: Exploring Turkish Biomedical Language Model Development Strategies in Low-Resource Setting



Hazal Türkmen et al. J Healthc Inform Res. 2023.

Abstract

Pretrained language models augmented with in-domain corpora show impressive results in biomedical and clinical Natural Language Processing (NLP) tasks in English. However, there has been minimal work in low-resource languages. Although some pioneering studies have shown promising results, many scenarios remain to be explored to engineer effective pretrained language models for biomedicine in low-resource settings. This study introduces the BioBERTurk family, four pretrained language models for Turkish biomedical text. To evaluate the models, we also introduce a labeled dataset for classifying radiology reports of head CT examinations. The two parts of each report, impressions and findings, are evaluated separately to observe model performance on longer and less informative text. We compare the models with Turkish BERT (BERTurk) pretrained on general-domain text, multilingual BERT (mBERT), and LSTM+attention-based baseline models. The first model, initialized from BERTurk and further pretrained on a biomedical corpus, performs statistically better than BERTurk, mBERT, and the baseline on both datasets. The second model continues pretraining BERTurk using only radiology Ph.D. theses to test the effect of task-related text; it slightly outperforms all other models on the impression dataset, showing that continual pretraining on radiology-related data alone can be effective. The third model continues pretraining on the biomedical corpus augmented with the radiology theses but shows no statistically meaningful difference on either dataset. The final model combines the radiology and biomedical corpora with the BERTurk corpus and pretrains a BERT model from scratch; it is the worst-performing model of the BioBERTurk family, performing worse than even BERTurk and mBERT.
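
The continual-pretraining strategy described above (warm-starting from BERTurk and continuing masked-language-model pretraining on in-domain text) can be sketched with standard tooling. The snippet below is a minimal illustration, not the authors' code: it assumes the Hugging Face Transformers and Datasets libraries, uses the dbmdz/bert-base-turkish-cased checkpoint commonly distributed as BERTurk, and the corpus file name is a hypothetical placeholder.

# Minimal sketch of continual MLM pretraining from the BERTurk checkpoint.
# Assumes Hugging Face Transformers/Datasets; the corpus path is a placeholder.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "dbmdz/bert-base-turkish-cased"   # public BERTurk general-domain checkpoint
CORPUS = "turkish_biomedical_corpus.txt"     # hypothetical plain-text corpus, one document per line

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)  # warm start from BERTurk weights

raw = load_dataset("text", data_files={"train": CORPUS})["train"]
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard masked-language-modeling objective with dynamic 15% masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bioberturk-continual",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
        save_steps=10_000,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()

The same recipe would apply to the other corpus combinations reported in the abstract by swapping the training text; only the from-scratch variant would additionally require training a new tokenizer and randomly initialized model.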

Keywords: Biomedicine; Pretrained language model; Radiology reports; Transformer.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1: Vocabulary overlap ratio (%) between domains
Fig. 2: Dataset class distribution

