Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings
- PMID: 30654030
- PMCID: PMC6557457
- DOI: 10.1016/j.jbi.2019.103096
Abstract
Neural embeddings are a popular set of methods for representing words, phrases or text as a low-dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low-dimensional vector, in which the meaning and relative importance of the dimensions are transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector representation is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated with them (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html under a CC-BY-NC license. Several public web query interfaces are also available at the same site, including one that allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
Keywords: Dimensional reduction; Implicit features; Natural language processing; Pvtopic; Semantic similarity; Text mining; Vector representation; Word2vec.
Copyright © 2019 Elsevier Inc. All rights reserved.
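The abstract evaluates its metrics by correlating vector-based word-pair similarity with human judgements, and reports rank correlations (rho) against word2vec-based metrics. The following is a minimal sketch of that kind of evaluation, not the authors' implementation: it assumes cosine similarity as the vector comparison and Spearman rank correlation as rho, and the vectors, word pairs, and human scores are hypothetical toy values rather than data from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 3-dimensional word vectors (illustrative only).
vectors = {
    "heart": [0.9, 0.1, 0.0],
    "cardiac": [0.8, 0.2, 0.1],
    "kidney": [0.2, 0.9, 0.1],
    "renal": [0.1, 0.8, 0.2],
    "aspirin": [0.1, 0.1, 0.9],
}

# Hypothetical human similarity ratings for word pairs (illustrative only).
rated_pairs = [
    ("heart", "cardiac", 3.8),
    ("kidney", "renal", 3.9),
    ("heart", "kidney", 2.0),
    ("cardiac", "aspirin", 1.5),
    ("renal", "aspirin", 1.0),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in rated_pairs]
human_scores = [score for _, _, score in rated_pairs]
print("Spearman rho vs. human ratings:", spearman(model_scores, human_scores))
```

The same `spearman` helper could be applied to two lists of model scores over the same word pairs to estimate how redundant two similarity metrics are with each other.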