ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis

Ziming Gan et al. medRxiv [Preprint]. 2023 May 21:2023.05.14.23289955. doi: 10.1101/2023.05.14.23289955

Abstract

Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes, covering hundreds of thousands of clinical concepts available for research and clinical care. The complex, massive, heterogeneous, and noisy nature of EHR data imposes significant challenges for feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed the efficient Aggregated naRrative Codified Health (ARCH) records analysis algorithm to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.

Methods: The ARCH algorithm first derives embedding vectors from a co-occurrence matrix of all EHR concepts and then generates cosine similarities, along with associated p-values, to measure the strength of relatedness between clinical features with statistical uncertainty quantification. In the final step, ARCH performs a sparse embedding regression to remove indirect linkages between entity pairs. We validated the clinical utility of the ARCH knowledge graph, generated from 12.5 million patients in the Veterans Affairs (VA) healthcare system, through downstream tasks including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer's disease patients.
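To make the pipeline concrete, the sketch below derives concept embeddings from a co-occurrence matrix and scores concept pairs by cosine similarity with approximate p-values. It is a minimal illustration, not the authors' implementation: the SPPMI weighting, truncated-SVD factorization, and Fisher-z normal approximation for the p-values are assumptions, and ARCH's actual statistics and sparse-regression step may differ.

```python
# Minimal sketch of an ARCH-style embedding and similarity step.
# Assumptions (not from the paper): SPPMI weighting, truncated SVD,
# and a Fisher-z normal approximation for the pair-level p-values.
import numpy as np
from scipy.sparse.linalg import svds
from scipy.stats import norm

def embed_from_cooccurrence(C, dim=300):
    """Derive concept embeddings from an n x n co-occurrence matrix C."""
    C = np.asarray(C, dtype=float)
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    # Positive pointwise mutual information (a common weighting for
    # co-occurrence embeddings; ARCH's exact weighting may differ).
    pmi = np.log(np.maximum(C * total, 1e-12) / np.maximum(row * col, 1e-12))
    sppmi = np.maximum(pmi, 0.0)
    U, S, _ = svds(sppmi, k=dim)
    return U * np.sqrt(S)  # low-dimensional concept embeddings

def cosine_with_pvalues(E):
    """Cosine similarity for all concept pairs plus an approximate p-value
    against a null of unrelated concepts (normal approximation; assumed)."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    cos = En @ En.T
    z = np.arctanh(np.clip(cos, -0.999, 0.999)) * np.sqrt(max(E.shape[1] - 3, 1))
    pvals = 2 * norm.sf(np.abs(z))
    return cos, pvals
```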

Results: ARCH produces high-quality clinical embeddings and a KG for over 60,000 EHR concepts, as visualized in the R Shiny-powered web API (https://celehs.hms.harvard.edu/ARCH/). The ARCH embeddings attained an average area under the ROC curve (AUC) of 0.926 and 0.861 for detecting pairs of similar EHR concepts when the concepts are mapped to codified data and to NLP data, respectively, and 0.810 (codified) and 0.843 (NLP) for detecting related pairs. Based on the p-values computed by ARCH, the sensitivities of detecting similar and related entity pairs are 0.906 and 0.888 under false discovery rate (FDR) control of 5%. For detecting drug side effects, the cosine similarity based on the ARCH semantic representations achieved an AUC of 0.723, which improved to 0.826 after few-shot training via minimizing the loss function on the training data set. Incorporating NLP data substantially improved the ability to detect side effects in the EHR. For example, based on unsupervised ARCH embeddings, the power of detecting drug-side effect pairs using codified data only was 0.15, much lower than the power of 0.51 when using both codified and NLP concepts. Compared to existing large-scale representation learning methods, including PubMedBERT, BioBERT, and SapBERT, ARCH attains the most robust performance and substantially higher accuracy in detecting these relationships. Incorporating ARCH-selected features in weakly supervised phenotyping algorithms can improve the robustness of algorithm performance, especially for diseases that benefit from NLP features as supporting evidence. For example, the phenotyping algorithm for depression attained an AUC of 0.927 when using ARCH-selected features but only 0.857 when using codified features selected via the KESER network [1]. In addition, embeddings and knowledge graphs generated from the ARCH network were able to cluster AD patients into two subgroups, where the fast progression subgroup had a much higher mortality rate.
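The FDR-controlled selection of similar and related pairs reported above can be illustrated with a standard Benjamini-Hochberg procedure applied to the pair-level p-values. This is a generic sketch under that assumption; ARCH's actual multiple-testing procedure may differ.

```python
# Generic Benjamini-Hochberg selection of concept pairs at a target FDR of 5%
# (illustrative; ARCH's actual FDR-control procedure may differ).
import numpy as np

def bh_select(pvals, fdr=0.05):
    """Return a boolean mask of pairs declared significant at the given FDR."""
    p = np.asarray(pvals, dtype=float).ravel()
    m = p.size
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])  # largest k with p_(k) <= (k/m) * fdr
        keep[order[:cutoff + 1]] = True
    return keep

# Usage: flag candidate drug-side-effect pairs from ARCH pair-level p-values
# (pair_pvalues here is a hypothetical array of p-values).
# significant = bh_select(pair_pvalues, fdr=0.05)
```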

Conclusions: The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, which are useful for a wide range of predictive modeling tasks.

Keywords: Electronic health records; knowledge graph; natural language processing; representation learning.


Figures

Figure 1:
Data generation process of the EHR occurrence data. Concept embeddings are generated from a graphical model, and the occurrences are then driven by a Markov process.
Figure 2:
Data source, including codified data and narrative notes, and data analytics pipeline.
Figure 3:
Sensitivity of detecting drug-side effect pairs with ARCH using codified data only versus using both codified data and NLP, under a target FDR of 0.05.
Figure 4:
The word clouds of the side effects of two sample drugs: (a) Levothyroxine on the left and (b) Hydrocodone on the right. The surrounding words describe side effects. Words colored red are detected using codified data only, while words colored orange or red are detected using both codified data and NLP codes. Words colored grey are undetected. The size of each word is determined by its cosine similarity with the target drug code.
Figure 5:
The AUC of different phenotyping algorithms trained with different feature sets across 8 diseases.
Figure 6:
The Kaplan-Meier survival curves for the fast and slow progression groups identified via k-means clustering of the ARCH patient-level embeddings.
Figure 7:
The word cloud of (a) phenotype features; and (b) drug features that drive the differences between the two subgroups. The size of the feature is determined by the between-group difference in the average intensity of such a feature. Red-colored features represent higher average intensity in the fast progression group and blue-colored features represent higher intensity in the slow progression group.
Figure 8:
The network of (a) phenotype features; and (b) drug features that drive the differences between the two subgroups. The size of the feature is determined by the between-group difference in the average intensity of such a feature. Red-colored features represent higher average intensity in the fast progression group and blue-colored features represent higher intensity in the slow progression group.

References

    1. Hong C. et al. Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data. NPJ Digital Medicine 4, 1–11 (2021).
    2. Halpern Y., Horng S., Choi Y. & Sontag D. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association 23, 731–740 (2016).
    3. Choi E., Schuetz A., Stewart W. F. & Sun J. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association 24, 361–370 (2017).
    4. Christopoulou F., Tran T. T., Sahu S. K., Miwa M. & Ananiadou S. Adverse drug events and medication relation extraction in electronic health records with ensemble deep learning methods. Journal of the American Medical Informatics Association 27, 39–46 (2020).
    5. Jin B. et al. Predicting the risk of heart failure with EHR sequential data modeling. IEEE Access 6, 9256–9261 (2018).
