NPJ Digit Med. 2021 Oct 27;4(1):151.
doi: 10.1038/s41746-021-00519-z.

Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data

Chuan Hong et al. NPJ Digit Med.

Abstract

The increasing availability of electronic health record (EHR) systems has created enormous potential for translational research. However, it is difficult to know all the relevant codes related to a phenotype because of the large number of codes available. Traditional data mining approaches often require patient-level data, which hinders data sharing across institutions. In this project, we demonstrate that multi-center large-scale code embeddings can be used to efficiently identify relevant features related to a disease of interest. We constructed large-scale code embeddings for a wide range of codified concepts from the EHRs of two large medical centers. We developed knowledge extraction via sparse embedding regression (KESER) for feature selection and integrative network analysis. We evaluated the quality of the code embeddings and assessed the performance of KESER in feature selection for eight diseases. We also developed an integrated clinical knowledge map combining embedding data from both institutions. The features selected by KESER were comprehensive compared with lists of codified data generated by domain experts. Algorithms built on features identified via KESER performed comparably to those built upon features selected manually or with patient-level data. The knowledge map created using an integrative analysis identified disease-disease and disease-drug pairs more accurately than maps built from single-institution data. Analysis of code embeddings via KESER can effectively reveal clinical knowledge and infer relatedness among codified concepts. KESER bypasses the need for patient-level data in individual analyses, providing a significant advance in enabling multi-center studies using EHR data.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of KESER procedure.
The KESER procedure includes four steps: (i) data pre-processing; (ii) representation learning using co-occurrence data and pointwise mutual information; (iii) feature selection at a single site; (iv) building a knowledge network across multiple sites.
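Step (ii) can be illustrated with a minimal sketch, assuming the embeddings are obtained by factorizing a shifted positive pointwise mutual information (SPPMI) matrix built from code co-occurrence counts. The function names, the shift parameter `shift_k`, the square-root scaling of singular values, and the toy counts are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def sppmi_matrix(cooc, shift_k=1.0):
    """Shifted positive PMI from a symmetric co-occurrence count matrix.

    shift_k is a hypothetical shift parameter; shift_k=1 gives plain positive PMI.
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    # Zero counts give -inf or NaN; treat them as "no association".
    pmi[~np.isfinite(pmi)] = -np.inf
    return np.maximum(pmi - np.log(shift_k), 0.0)

def svd_embeddings(sppmi, dim):
    """Rank-`dim` embeddings: left singular vectors scaled by sqrt of singular values."""
    u, s, _ = np.linalg.svd(sppmi)
    return u[:, :dim] * np.sqrt(s[:dim])

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy co-occurrence counts for four hypothetical codes:
# codes 0 and 1 co-occur often, as do codes 2 and 3.
cooc = np.array([[10, 8, 1, 0],
                 [8, 12, 0, 1],
                 [1, 0, 9, 7],
                 [0, 1, 7, 11]])

sppmi = sppmi_matrix(cooc)
emb = svd_embeddings(sppmi, dim=2)
print("similarity(code 0, code 1):", cosine(emb[0], emb[1]))
print("similarity(code 0, code 2):", cosine(emb[0], emb[2]))
```

Only aggregated co-occurrence counts enter this computation, which is consistent with the abstract's point that patient-level data need not be shared.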
Fig. 2. Word cloud for KESER_MGB selected features.
(a) Selected features for Rheumatoid Arthritis (RA); (b) selected features for Ulcerative Colitis (UC). The size of the words is proportional to the absolute coefficients from the embedding regression.
Fig. 3. Comparison of AUCROCs, AUCPRCs, and F-scores against gold standard labels for adaptive lasso phenotyping algorithms for eight diseases, using the main PheCode only (PheCode), all features (FULL), SAFE-selected features (SAFE), KESER_MGB and KESER_INT selected features based on SVD-SPPMI embeddings, and KESER_MGB and KESER_INT selected features based on GloVe embeddings.
F-scores are calculated at the cutoff points with the estimated prevalence equal to the population prevalence. The bootstrap-based 95% confidence intervals (bars) are shown.
Fig. 4. Clinical knowledge network for Etanercept.
(a) Knowledge network learned based on KESER_INT; (b) knowledge network learned based on KESER_VA; (c) knowledge network learned based on KESER_MGB.
Fig. 5. The left panel describes the key steps for learning the embedding vectors: we conduct singular value decomposition (SVD) on the SPPMI matrix.
The right panel describes the statistical model: the embedding vectors follow a Gaussian graphical model where each node of the graph is represented by the vectors.
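Under a Gaussian graphical model, the neighbours of a node can be estimated by a sparse regression of that node on all the others. The sketch below applies this idea to embedding vectors: the target code's embedding is regressed on the other codes' embeddings with an L1 penalty, and codes with nonzero coefficients are treated as selected features. It assumes a plain lasso solved by coordinate descent; the penalty form, the value of `lam`, the function names, and the synthetic embeddings are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * b[j]          # remove coordinate j's contribution
            rho = X[:, j] @ resid / n        # correlation with the partial residual
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * b[j]          # add the updated contribution back
    return b

def select_features(embeddings, target, lam):
    """Regress the target code's embedding on all other codes' embeddings.

    Rows of `embeddings` are codes; embedding coordinates act as observations.
    Returns {code index: coefficient} for codes with nonzero coefficients.
    """
    others = [i for i in range(embeddings.shape[0]) if i != target]
    X = embeddings[others].T                 # shape (dim, n_codes - 1)
    y = embeddings[target]
    coef = lasso_cd(X, y, lam)
    return {others[j]: coef[j] for j in range(len(others)) if coef[j] != 0.0}

# Synthetic embeddings for eight hypothetical codes; code 0 is constructed
# as a noisy combination of codes 1 and 2.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 200))
emb[0] = 0.8 * emb[1] + 0.6 * emb[2] + 0.01 * rng.normal(size=200)

selected = select_features(emb, target=0, lam=0.1)
for code, coef in sorted(selected.items(), key=lambda kv: -abs(kv[1])):
    print(f"code {code}: coefficient {coef:.3f}")
```

The absolute coefficients play the same role as the word sizes in Fig. 2: larger values indicate codes more strongly related to the target in the regression.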
