A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models

Affiliations

¹ Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland. hossein.rouhizadeh@unige.ch.
² Department of Informatics, University of Hamburg, Hamburg, Germany.
³ Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland.
⁴ Division of Medical Information Sciences, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland.
⁵ Laboratoire Interdisciplinaire des Sciences du Numerique, CNRS, Paris-Saclay University, Orsay, France.
⁶ Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland. douglas.teodoro@unige.ch.

PMID: 38704422
PMCID: PMC11069517
DOI: 10.1038/s41597-024-03317-w

Hossein Rouhizadeh et al. Sci Data. 2024 May 4;11(1):455. doi: 10.1038/s41597-024-03317-w.

Abstract

Because of the complexity of the biomedical domain, capturing semantically meaningful representations of terms in context is a long-standing challenge. Despite important progress in recent years, no benchmark has been developed to evaluate how well language models represent biomedical concepts according to their context. Inspired by the Word-in-Context (WiC) benchmark, in which word sense disambiguation is reformulated as a binary classification task, we propose a novel dataset, BioWiC, to evaluate the ability of language models to encode biomedical terms in context. BioWiC comprises 20,156 instances covering over 7,400 unique biomedical terms, making it the largest WiC dataset in the biomedical domain. We evaluate BioWiC both intrinsically and extrinsically and show that it could be used as a reliable benchmark for evaluating context-dependent embeddings in biomedical corpora. In addition, we conduct several experiments using a variety of discriminative and generative large language models to establish robust baselines that can serve as a foundation for future research.
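The WiC-style formulation described in the abstract can be illustrated with a minimal Python sketch. It is not the dataset's actual schema or the authors' evaluation code: the class name `WiCInstance`, its field names, and the helper `accuracy` are assumptions made for illustration, and the example sentences are invented. Each instance pairs two sentences that each contain a target term, and the binary label says whether the two mentions denote the same concept.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class WiCInstance:
    """One WiC-style example. Field names are hypothetical, not the BioWiC schema."""
    term1: str
    sentence1: str
    term2: str
    sentence2: str
    label: bool  # True if both target mentions denote the same concept


def accuracy(instances: Sequence[WiCInstance],
             predict: Callable[[WiCInstance], bool]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    if not instances:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(predict(x) == x.label for x in instances)
    return correct / len(instances)


if __name__ == "__main__":
    # Invented toy sentences around the term "delivery"; not taken from BioWiC.
    toy = [
        WiCInstance("delivery", "The delivery was complicated by preterm labour.",
                    "delivery", "Nanoparticles improved drug delivery to the tumour.",
                    label=False),
        WiCInstance("delivery", "Vaginal delivery occurred at 39 weeks.",
                    "delivery", "She was admitted for delivery of her second child.",
                    label=True),
    ]
    # A trivial baseline that always answers "same concept".
    print(accuracy(toy, lambda x: True))
```

Framing the task this way means any model that can produce a yes/no decision for a sentence pair, whether discriminative or generative, can be scored with the same accuracy measure.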

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Illustration of concept ambiguity in the biomedical domain. Left: Example of the UMLS 2021AB data structure, where one term may refer to different concepts and one concept may be represented by different mentions. Right: Example of a paragraph with numerous polysemous acronyms and abbreviations from a biomedical journal. Acronyms and abbreviations are highlighted in bold.

**Fig. 2**
The overall pipeline of the BioWiC construction process. Step 1: Pre-process the source documents to a consistent format. Step 2: Identify and retrieve sentences including the term “delivery” linked to UMLS. Step 3: Pair the retrieved sentences to generate BioWiC instances. In Step 3, the green box shows an example of a BioWiC instance with the same target concept, while the red boxes show examples of different target concepts.

**Fig. 3**
Impact of different thresholds for max sentence repetition in the training set. Left: Impact on the training set size; Center: Impact on the frequency of unique concepts; Right: Impact on the frequency of unique UMLS semantic types.

**Fig. 4**
Distribution of UMLS semantic types and semantic groups in BioWiC. Left: Top 10 semantic types; Right: Top 10 semantic groups.

**Fig. 5**
Accuracy of the baseline models on the BioWiC test set. ++ indicates that data from WiC was added to the training set. Min, mean, median, and max statistics exclude the random performance.
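The pairing step in Fig. 2 and the repetition threshold in Fig. 3 can be sketched together as follows. This is a simplified illustration under stated assumptions, not the authors' construction procedure: the `Mention` type, the `pair_mentions` function, the `max_repetition` parameter name, and the greedy order in which pairs are accepted are all hypothetical. The sketch pairs mentions from different sentences, labels a pair positive when both mentions share a concept identifier, and skips a pair once either sentence has already been used `max_repetition` times.

```python
from collections import Counter
from itertools import combinations
from typing import List, NamedTuple, Sequence, Tuple


class Mention(NamedTuple):
    """A target term in a sentence, linked to a concept (hypothetical structure)."""
    sentence: str
    term: str
    concept_id: str  # placeholder for a UMLS concept identifier


def pair_mentions(mentions: Sequence[Mention],
                  max_repetition: int) -> List[Tuple[Mention, Mention, bool]]:
    """Greedily pair mentions, capping how often each sentence is reused."""
    usage: Counter = Counter()  # how many accepted pairs each sentence appears in
    pairs = []
    for a, b in combinations(mentions, 2):
        if a.sentence == b.sentence:
            continue  # a pair needs two distinct contexts
        if usage[a.sentence] >= max_repetition or usage[b.sentence] >= max_repetition:
            continue  # one of the sentences has reached the cap
        usage[a.sentence] += 1
        usage[b.sentence] += 1
        pairs.append((a, b, a.concept_id == b.concept_id))
    return pairs
```

In this sketch a lower cap yields fewer pairs, which mirrors the trade-off between training set size and sentence reuse that Fig. 3 examines.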
