JMIR Med Inform. 2021 Jun 24;9(6):e29667.

doi: 10.2196/29667.

A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation

Yunjin Yum^{1,2}, Jeong Moon Lee², Moon Joung Jang², Yoojoong Kim², Jong-Ho Kim^{2,3}, Seongtae Kim⁴, Unsub Shin⁴, Sanghoun Song^#⁴, Hyung Joon Joo^#^{3,5,6}

Affiliations

¹ Department of Biostatistics, Korea University College of Medicine, Seoul, Republic of Korea.
² Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea.
³ Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea.
⁴ Department of Linguistics, Korea University, Seoul, Republic of Korea.
⁵ Korea University Research Institute for Medical Bigdata Science, Korea University Anam Hospital, Seoul, Republic of Korea.
⁶ Department of Medical Informatics, Korea University College of Medicine, Seoul, Republic of Korea.

^# Contributed equally.

PMID: 34185005
PMCID: PMC8277378
DOI: 10.2196/29667

Yunjin Yum et al. JMIR Med Inform. 2021.

Abstract

Background: Medical terms require special expertise and are becoming increasingly complex, which makes it difficult to employ natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, there are very few reference standards in non-English languages. In addition, because the existing reference standards were developed a long time ago, an updated standard is needed to represent recent findings in medical sciences.

Objective: We propose a new Korean word pair reference set to verify embedding models.

Methods: From January 2010 to December 2020, 518 medical textbooks, 72,844 health information news articles, and 15,698 medical research articles were collected, and the top 10,000 medical terms were selected to develop medical word pairs. Attending physicians (n=16) participated in the verification of the developed set of 607 word pairs.

Results: The proportion of word pairs answered by all participants was 90.8% (551/607) for the similarity task and 86.5% (525/605) for the relatedness task. The similarity and relatedness scores of the word pairs were highly correlated (ρ=0.70, P<.001). The intraclass correlation coefficients used to assess interrater agreement on the word pair sets were 0.47 for the similarity task and 0.53 for the relatedness task. After excluding word pairs with outlier answers and word pairs answered by fewer than 50% of all respondents, the final reference standard comprised 604 word pairs for the similarity task and 599 word pairs for the relatedness task. When FastText models were applied to the final reference standard word pair sets, the cosine similarity scores from embedding models that had learned medical documents correlated more strongly with the human-judged similarity and relatedness scores (similarity task: namu, ρ=0.12 vs with medical text, ρ=0.47; relatedness task: namu, ρ=0.02 vs with medical text, ρ=0.30).
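The evaluation described in the Results can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the toy vectors, the toy ratings, the rating scale, and every function name are hypothetical, and plain Python lists stand in for real FastText embeddings. It drops word pairs answered by fewer than 50% of raters, averages the remaining human scores, and computes Spearman's ρ against the cosine similarity of the two word vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_human_score(ratings, min_response_rate=0.5):
    """Mean of answered ratings (None = unanswered).

    Returns None when fewer than min_response_rate of raters answered,
    mirroring the 50% exclusion rule stated in the abstract.
    """
    answered = [r for r in ratings if r is not None]
    if len(answered) < min_response_rate * len(ratings):
        return None
    return sum(answered) / len(answered)


def evaluate(vectors, ratings_by_pair):
    """Spearman's rho between model cosine scores and mean human scores."""
    model_scores, human_scores = [], []
    for (w1, w2), ratings in ratings_by_pair.items():
        human = mean_human_score(ratings)
        if human is None:
            continue  # pair excluded: too few raters answered
        model_scores.append(cosine_similarity(vectors[w1], vectors[w2]))
        human_scores.append(human)
    return spearman_rho(model_scores, human_scores)


# Hypothetical toy data: 2-dimensional "embeddings" and four raters per pair.
toy_vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [0.5, 0.5]}
toy_ratings = {
    ("A", "B"): [4, 4, 3, None],
    ("A", "C"): [0, 1, 0, 0],
    ("A", "D"): [2, 3, 2, 2],
    ("C", "D"): [None, None, None, 2],  # only 25% answered, so excluded
}

rho = evaluate(toy_vectors, toy_ratings)
print(f"Spearman rho on toy data: {rho:.2f}")
```

In practice the vectors would come from a trained embedding model, and a statistics library would normally supply the rank correlation and its P value; the hand-written version here only keeps the sketch self-contained.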

Conclusions: Korean medical word pair reference standard sets for semantic similarity and relatedness were developed based on medical documents from the past 10 years. We expect these word pair reference sets to be actively used in the future development of medical and multilingual natural language processing technology.

Keywords: Korean; fastText; medical word pair; relatedness; similarity; word embedding.

©Yunjin Yum, Jeong Moon Lee, Moon Joung Jang, Yoojoong Kim, Jong-Ho Kim, Seongtae Kim, Unsub Shin, Sanghoun Song, Hyung Joon Joo. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 24.06.2021.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

**Figure 1**
Score distribution plots of the participants (n=16 attending physicians from a tertiary hospital); the score distributions of participants 1, 5, and 13 were heavily skewed, so these participants were excluded from further analyses.

**Figure 2**
Scatter plot of the correlation between the similarity and relatedness tasks.

**Figure 3**
Correlation between cosine distance from FastText and the human evaluations from 13 attending physicians.

**Figure 4**
Correlation between the reference scores for the original University of Minnesota Semantic Relatedness Set (UMNSRS) word pair sets (English version) and the scores from 12 health information managers for the Korean translation version.

**Figure 5**
Correlation between the cosine distance of FastText embedding models and the human evaluations by 12 health information managers of the Korean version of the University of Minnesota Semantic Relatedness Set (UMNSRS) word pair sets.

Cited by

Fine-tuned Sentiment Analysis of COVID-19 Vaccine-Related Social Media Data: Comparative Study.
Melton CA, White BM, Davis RL, Bednarczyk RA, Shaban-Nejad A. J Med Internet Res. 2022 Oct 17;24(10):e40408. doi: 10.2196/40408. PMID: 36174192. Free PMC article.
A pre-trained BERT for Korean medical natural language processing.
Kim Y, Kim JH, Lee JM, Jang MJ, Yum YJ, Kim S, Shin U, Kim YM, Joo HJ, Song S. Sci Rep. 2022 Aug 16;12(1):13847. doi: 10.1038/s41598-022-17806-8. PMID: 35974113. Free PMC article.
