Sci Data. 2024 May 4;11(1):455.
doi: 10.1038/s41597-024-03317-w.

A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models


Hossein Rouhizadeh et al. Sci Data. 2024.

Abstract

Due to the complexity of the biomedical domain, capturing semantically meaningful representations of terms in context is a long-standing challenge. Despite important progress in recent years, no benchmark has been developed to evaluate how well language models represent biomedical concepts according to their context. Inspired by the Word-in-Context (WiC) benchmark, in which word sense disambiguation is reformulated as a binary classification task, we propose a novel dataset, BioWiC, to evaluate the ability of language models to encode biomedical terms in context. BioWiC comprises 20,156 instances covering over 7,400 unique biomedical terms, making it the largest WiC dataset in the biomedical domain. We evaluate BioWiC both intrinsically and extrinsically and show that it could be used as a reliable benchmark for evaluating context-dependent embeddings in biomedical corpora. In addition, we conduct several experiments using a variety of discriminative and generative large language models to establish robust baselines that can serve as a foundation for future research.
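The WiC-style formulation mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the dataset's actual schema: the field names, the example sentences, and the surface-form baseline are all assumptions made for this sketch. Each instance pairs two sentences containing a target mention, and the binary label states whether both mentions refer to the same concept.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class WiCInstance:
    # Hypothetical field names; the released dataset may use different ones.
    term_1: str
    sentence_1: str
    term_2: str
    sentence_2: str
    label: bool  # True if both mentions refer to the same concept


def accuracy(instances: Sequence[WiCInstance],
             predict: Callable[[WiCInstance], bool]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    if not instances:
        raise ValueError("need at least one instance")
    correct = sum(predict(inst) == inst.label for inst in instances)
    return correct / len(instances)


def same_surface_form_baseline(inst: WiCInstance) -> bool:
    """Naive context-free baseline: identical terms are assumed to share a concept."""
    return inst.term_1.lower() == inst.term_2.lower()


# Invented example sentences for illustration only (not taken from BioWiC).
examples = [
    WiCInstance("delivery", "Vaginal delivery was uneventful.",
                "delivery", "Delivery occurred at 39 weeks of gestation.",
                True),
    WiCInstance("delivery", "Delivery occurred at 39 weeks of gestation.",
                "delivery", "Nanoparticles can improve drug delivery to tumour cells.",
                False),
]

baseline_accuracy = accuracy(examples, same_surface_form_baseline)
print(baseline_accuracy)
```

Because the baseline ignores context, it cannot separate the two invented pairs above, which is the kind of failure a context-sensitive benchmark is meant to expose.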


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Illustration of concept ambiguity in the biomedical domain. Left: Example of the UMLS 2021AB data structure, where one term may refer to different concepts and one concept may be represented by different mentions. Right: Example of a paragraph from a biomedical journal containing numerous polysemous acronyms and abbreviations, which are highlighted in bold.
Fig. 2
The overall pipeline of the BioWiC construction process. Step 1: Pre-process the source documents into a consistent format. Step 2: Identify and retrieve sentences that include the term “delivery” linked to UMLS. Step 3: Pair the retrieved sentences to generate BioWiC instances. In Step 3, the green box shows an example of a BioWiC instance with the same target concept, while the red boxes show examples with different target concepts.
Fig. 3
Impact of different thresholds for the maximum sentence repetition in the training set. Left: Impact on the training set size; Center: Impact on the frequency of unique concepts; Right: Impact on the frequency of unique UMLS semantic types.
Fig. 4
Distribution of UMLS semantic types and semantic groups in BioWiC. Left: Top 10 semantic types; Right: Top 10 semantic groups.
Fig. 5
Accuracy of the baseline models on the BioWiC test set. ++ indicates that data from WiC was added to the training set. Min, mean, median, and max statistics exclude the random performance.
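
The pairing step in Fig. 2 and the repetition threshold in Fig. 3 can be sketched together as below. This is only a rough sketch under stated assumptions: the function name `pair_sentences`, the greedy order in which pairs are accepted, and the placeholder concept identifiers are hypothetical, and the captions do not specify how the authors actually apply the threshold.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

Mention = Tuple[str, str]        # (sentence, concept identifier)
Pair = Tuple[str, str, bool]     # (sentence 1, sentence 2, same concept?)


def pair_sentences(mentions: List[Mention], max_repetition: int) -> List[Pair]:
    """Pair sentences that mention one term, capping reuse of each sentence.

    Assumption: pairs are accepted greedily in input order, and a sentence
    stops being paired once it appears in `max_repetition` accepted pairs.
    """
    usage: Counter = Counter()
    pairs: List[Pair] = []
    for (s1, c1), (s2, c2) in combinations(mentions, 2):
        if s1 == s2:
            continue  # never pair a sentence with itself
        if usage[s1] >= max_repetition or usage[s2] >= max_repetition:
            continue  # one of the sentences has reached the cap
        pairs.append((s1, s2, c1 == c2))
        usage[s1] += 1
        usage[s2] += 1
    return pairs


# Placeholder sentences and concept identifiers (not real UMLS CUIs).
mentions = [
    ("sentence_a", "CONCEPT_X"),
    ("sentence_b", "CONCEPT_X"),
    ("sentence_c", "CONCEPT_Y"),
]

strict = pair_sentences(mentions, max_repetition=1)
loose = pair_sentences(mentions, max_repetition=2)
print(len(strict), len(loose))
```

A lower cap yields fewer pairs but limits how often any single sentence recurs, which is the trade-off the Fig. 3 caption describes in terms of training set size and concept coverage.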
