J Biomed Inform. 2023 Aug;144:104431. doi: 10.1016/j.jbi.2023.104431. Epub 2023 Jun 28.

Localizing in-domain adaptation of transformer-based biomedical language models


Tommaso Mario Buonocore et al.

Abstract

In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that models stemming from broad-coverage checkpoints can largely benefit from additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. To reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus written natively in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but that concatenating high-quality data can improve model performance even with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the lessons learned from this study constitute valuable insights towards building biomedical language models that generalize to other less-resourced languages and different domain settings.
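The adaptation recipe the abstract describes, continued masked-language-model pretraining of a broad-coverage checkpoint on an in-domain corpus, can be sketched with the Hugging Face transformers and datasets libraries as below. This is a minimal illustration, not the paper's actual pipeline: the base checkpoint, corpus path, and all hyperparameters are placeholder assumptions.

```python
# Minimal sketch of in-domain adaptation via continued MLM pretraining.
# Assumptions (not from the paper): base checkpoint, corpus file, and
# hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Broad-coverage Italian checkpoint to adapt (hypothetical choice).
base_checkpoint = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# In-domain corpus: one biomedical document per line (placeholder path).
dataset = load_dataset("text", data_files={"train": "it_biomedical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of tokens for prediction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-italian-biomed",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
trainer.save_model("bert-italian-biomed")
```

The resulting checkpoint can then be fine-tuned on downstream clinical tasks in the usual way; the translated-corpus and native-corpus variants the abstract contrasts differ only in what fills the training file, not in this adaptation loop.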

Keywords: Biomedical text mining; Deep learning; Language model; Natural language processing; Transformer.


Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
