A method of inferring the relationship between Biomedical entities through correlation analysis on text
- PMID: 30396345
- PMCID: PMC6218997
- DOI: 10.1186/s12938-018-0583-4
Abstract
Background: Word representation is one of the most important steps in machine learning-based natural language processing. The commonly used one-hot representation produces large vectors and assumes that the features making up the vector are independent of each other. Word embedding, by contrast, captures the meaning of words well and is therefore known to be effective for estimating the similarity between words. In this study, we use this ability to clarify the correlation between various terms in biomedical texts, and we apply word embedding to find new biomarkers and microorganisms related to a specific disease.
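The contrast drawn in the Background can be illustrated with a minimal sketch. The vocabulary and the 2-dimensional dense vectors below are made up for illustration and are not taken from the study; the point is only that distinct one-hot vectors are orthogonal, so their cosine similarity is always zero, whereas dense vectors can place related words close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical three-word vocabulary (assumption, for illustration only).
vocab = ["cancer", "tumor", "bacteria"]

def one_hot(word):
    """One-hot vector: one dimension per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical dense vectors; the values are invented, not learned.
dense = {
    "cancer": [0.9, 0.1],
    "tumor": [0.8, 0.2],
    "bacteria": [0.1, 0.9],
}

# Distinct one-hot vectors share no non-zero dimension.
one_hot_sim = cosine(one_hot("cancer"), one_hot("tumor"))

# Dense vectors can express graded similarity.
dense_near = cosine(dense["cancer"], dense["tumor"])
dense_far = cosine(dense["cancer"], dense["bacteria"])

print(one_hot_sim, round(dense_near, 3), round(dense_far, 3))
```

The one-hot vectors also grow with the vocabulary size, while the dimension of the dense vectors is chosen independently of it.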
Methods: We analyze the correlation within disease-marker pairs and disease-microorganism pairs. First, we construct a corpus likely to be related to them by extracting the titles and abstracts of biomedical texts on PubMed. Second, we represent disease, marker, and microorganism terms as word embeddings using Canonical Correlation Analysis (CCA), a statistical method that performs very well at vector dimension reduction. Finally, we estimate the relationship within disease-marker and disease-microorganism pairs by measuring the similarity of their embeddings.
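The final step of the Methods, scoring pairs by the similarity of their embeddings, might look like the sketch below. It assumes the embeddings already exist (the CCA step is not shown), uses cosine similarity as the measure (the abstract does not name one), and uses placeholder term names and invented vectors. `rank_pairs` is a hypothetical helper, not a function from the study.

```python
import math
from itertools import product

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder embeddings (assumption): in practice these would come
# from the CCA-based word embedding trained on the PubMed corpus.
disease_vecs = {
    "disease_A": [0.9, 0.1, 0.2],
    "disease_B": [0.1, 0.8, 0.3],
}
marker_vecs = {
    "marker_X": [0.8, 0.2, 0.1],
    "marker_Y": [0.2, 0.9, 0.2],
}

def rank_pairs(left, right):
    """Score every (left, right) term pair and sort from most to least similar."""
    scored = [
        (cosine(lv, rv), lname, rname)
        for (lname, lv), (rname, rv) in product(left.items(), right.items())
    ]
    return sorted(scored, reverse=True)

ranked = rank_pairs(disease_vecs, marker_vecs)
top_pairs = ranked[:2]      # most strongly correlated pairs
bottom_pairs = ranked[-2:]  # least correlated pairs

for score, disease, marker in ranked:
    print(f"{disease} - {marker}: {score:.3f}")
```

The same function applies unchanged to disease-microorganism pairs by passing a dictionary of microorganism vectors as the second argument.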
Results: We checked the correlations derived through word embedding against Google Scholar search results. Of the 20 most highly correlated disease-marker pairs, about 85% turned out to be the subject of extensive research according to Google Scholar. Conversely, for 85% of the 20 pairs with the lowest correlation, we could not find any study addressing the relationship between the disease and the marker. The trend was similar for disease-microorganism pairs.
Conclusions: The correlations between diseases and markers, and between diseases and microorganisms, calculated through word embedding reflect actual research trends. If the word-embedding correlation for a pair is high but few studies have been published on it, additional research on that pair can be proposed.
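The proposal rule in the Conclusions can be sketched as a simple filter. The thresholds, field names, pair names, and counts below are all hypothetical; the abstract does not state how "high" correlation or "not many" studies would be quantified.

```python
def propose_candidates(pairs, min_similarity, max_publications):
    """Return pairs with high embedding similarity but few published studies."""
    return [
        p for p in pairs
        if p["similarity"] >= min_similarity
        and p["publications"] <= max_publications
    ]

# Invented records: similarity from the embedding, publications from a
# literature search. None of these numbers come from the study.
pairs = [
    {"pair": ("disease_A", "marker_X"), "similarity": 0.91, "publications": 240},
    {"pair": ("disease_A", "marker_Z"), "similarity": 0.88, "publications": 3},
    {"pair": ("disease_B", "marker_Y"), "similarity": 0.12, "publications": 0},
]

# Assumed cut-offs, chosen only for this example.
candidates = propose_candidates(pairs, min_similarity=0.8, max_publications=10)

for c in candidates:
    print(c["pair"], c["similarity"], c["publications"])
```

Only the pair that is both strongly correlated and sparsely studied is returned; well-studied pairs and weakly correlated pairs are filtered out.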
Keywords: Bio-marker; Canonical Correlation Analysis (CCA); Lexical similarity; Microorganisms; Word embedding; t-distributed stochastic neighbor embedding (t-SNE).
Similar articles
- An Unsupervised Graph Based Continuous Word Representation Method for Biomedical Text Mining. IEEE/ACM Trans Comput Biol Bioinform. 2016 Jul-Aug;13(4):634-42. doi: 10.1109/TCBB.2015.2478467. Epub 2015 Sep 14. PMID: 26390497
- Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts. Comput Intell Neurosci. 2023 Feb 15;2023:2989791. doi: 10.1155/2023/2989791. eCollection 2023. PMID: 39262497 Free PMC article.
- A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation. JMIR Med Inform. 2021 Jun 24;9(6):e29667. doi: 10.2196/29667. PMID: 34185005 Free PMC article.
- Word embedding for social sciences: an interdisciplinary survey. PeerJ Comput Sci. 2024 Dec 5;10:e2562. doi: 10.7717/peerj-cs.2562. eCollection 2024. PMID: 39896392 Free PMC article. Review.
- Application of Machine Learning in Microbiology. Front Microbiol. 2019 Apr 18;10:827. doi: 10.3389/fmicb.2019.00827. eCollection 2019. PMID: 31057526 Free PMC article. Review.
Cited by
- Challenges in the construction of knowledge bases for human microbiome-disease associations. Microbiome. 2019 Sep 5;7(1):129. doi: 10.1186/s40168-019-0742-2. PMID: 31488215 Free PMC article. Review.
- Development of a dynamic network biomarkers method and its application for detecting the tipping point of prior disease development. Comput Struct Biotechnol J. 2022 Feb 24;20:1189-1197. doi: 10.1016/j.csbj.2022.02.019. eCollection 2022. PMID: 35317238 Free PMC article. Review.
- How can natural language processing help model informed drug development?: a review. JAMIA Open. 2022 Jun 11;5(2):ooac043. doi: 10.1093/jamiaopen/ooac043. eCollection 2022 Jul. PMID: 35702625 Free PMC article. Review.