Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data

Andrew L Beam et al. Pac Symp Biocomput. 2020;25:295-306.

Abstract

Word embeddings are a popular approach to unsupervised learning of word relationships and are widely used in natural language processing. In this article, we present a new set of embeddings for medical concepts learned from an extremely large collection of multimodal medical data. Leaning on recent theoretical insights, we demonstrate how an insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full-text biomedical journal articles can be combined to embed concepts into a common space, resulting in the largest-ever set of embeddings, covering 108,477 medical concepts. To evaluate our approach, we present a new benchmark methodology based on statistical power, specifically designed to test embeddings of medical concepts. Our approach, called cui2vec, attains state-of-the-art performance relative to previous methods in most instances. Finally, we provide a downloadable set of pre-trained embeddings for other researchers to use, as well as an online tool for interactive exploration of the cui2vec embeddings.
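The released embeddings map UMLS concept unique identifiers (CUIs) to dense vectors, so similarity queries reduce to simple vector arithmetic. The sketch below is illustrative only: the file name cui2vec_pretrained.csv and its layout (one row per CUI, embedding dimensions in the remaining columns) are assumptions about the distributed format, not documented facts about the release.

```python
import numpy as np
import pandas as pd

# Assumed file name and layout for the released embeddings:
# one row per UMLS CUI (index column), with the embedding
# dimensions as the remaining columns.
emb = pd.read_csv("cui2vec_pretrained.csv", index_col=0)

def cosine_sim(cui_a: str, cui_b: str) -> float:
    """Cosine similarity between two concept vectors."""
    a = emb.loc[cui_a].to_numpy()
    b = emb.loc[cui_b].to_numpy()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: C0011849 (diabetes mellitus) vs. C0011860 (type 2 diabetes);
# related concepts should score higher than random concept pairs.
print(cosine_sim("C0011849", "C0011860"))
```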


Figures

Fig. 1. UpSet visualization of the intersection of medical concepts found in the insurance claims, clinical notes, and biomedical journal articles (PMC).
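The power-based evaluation described in the abstract can be approximated as follows. This is a hedged sketch, not the paper's exact bootstrap procedure: it scores an embedding by the fraction of known related concept pairs whose cosine similarity exceeds the (1 - alpha) quantile of similarities among randomly drawn concept pairs. The emb DataFrame is the one loaded in the earlier sketch, and known_pairs is a hypothetical list of CUI pairs drawn from some external source of known relationships (e.g., comorbid conditions).

```python
import numpy as np

def detection_power(emb, known_pairs, n_null=10_000, alpha=0.05, seed=0):
    """Fraction of known concept pairs whose cosine similarity exceeds
    the (1 - alpha) quantile of a null distribution built from random
    concept pairs. A sketch of a power-style benchmark, not the paper's
    exact procedure."""
    rng = np.random.default_rng(seed)
    cuis = np.array(emb.index)

    def sim(a, b):
        va = emb.loc[a].to_numpy()
        vb = emb.loc[b].to_numpy()
        return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

    # Null distribution: similarities of randomly drawn concept pairs.
    null = np.array([sim(*rng.choice(cuis, size=2, replace=False))
                     for _ in range(n_null)])
    threshold = np.quantile(null, 1 - alpha)

    hits = sum(sim(a, b) > threshold for a, b in known_pairs)
    return hits / len(known_pairs)

# Usage with the hypothetical example pair from above:
# detection_power(emb, [("C0011849", "C0011860")])
```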

References

    1. Bengio Y, Ducharme R, Vincent P and Janvin C, A Neural Probabilistic Language Model, The Journal of Machine Learning Research 3, 1137 (2003).
    1. Berry MW, Dumais ST and O’Brien GW, Using Linear Algebra for Intelligent Information Retrieval, SIAM Review 37, 573 (1995).
    1. Lund K and Burgess C, Producing high-dimensional semantic spaces from lexical co-occurrence, Behavior Research Methods, Instruments, and Computers 28, 203 (1996).
    1. Harris ZS, Distributional Structure, WORD 10, 146 (1954).
    1. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J, Chen K, Dean J, Mikolov T and Chen K, Distributed Representations of Words and Phrases and their Compositionality., in NIPS’ 14, 2013.

Publication types

LinkOut - more resources