J Biomed Inform. 2025 Feb:162:104761.
doi: 10.1016/j.jbi.2024.104761. Epub 2025 Jan 23.

ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis

Ziming Gan et al. J Biomed Inform. 2025 Feb.

Abstract

Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored both as codified data and as free-text narrative notes, the latter extracted via natural language processing (NLP). The complexity of EHR data presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we propose an efficient Aggregated naRrative Codified Health (ARCH) records analysis that generates a large-scale knowledge graph (KG) for a comprehensive set of codified and narrative EHR features.

Methods: Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors and generates similarities, along with associated p-values, to measure the strength of relatedness between clinical features with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkages between features and build a sparse KG. Finally, ARCH is validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer's disease patients.
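The two-step idea in the Methods (pairwise similarity first, then a sparse regression that drops indirect links) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state which similarity measure or regression ARCH uses, so cosine similarity and a lasso fitted by coordinate descent stand in for them, the p-value step is omitted, and the concept names `A`, `B`, `C` and their 3-dimensional vectors are hypothetical toy values, not ARCH embeddings.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed measure)."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def sparse_edges(embeddings, target, penalty=0.05, n_sweeps=200):
    """Lasso-regress the target embedding on all other embeddings.

    Minimizes 0.5 * ||y - sum_j b_j x_j||^2 + penalty * sum_j |b_j| by
    coordinate descent and returns only the concepts with nonzero weight.
    This is a stand-in for "sparse embedding regression", not ARCH itself.
    """
    names = [n for n in embeddings if n != target]
    coef = {n: 0.0 for n in names}
    residual = list(embeddings[target])
    for _ in range(n_sweeps):
        for n in names:
            x = embeddings[n]
            # Add this concept's current contribution back to the residual.
            partial = [r + coef[n] * xi for r, xi in zip(residual, x)]
            new = soft_threshold(dot(x, partial), penalty) / dot(x, x)
            residual = [p - new * xi for p, xi in zip(partial, x)]
            coef[n] = new
    return {n: c for n, c in coef.items() if c != 0.0}

# Hypothetical toy chain: C is built from B, and B shares a direction with A,
# so A and C look similar even though C is fully explained by B.
toy = {
    "A": [1.0, 0.0, 0.0],
    "B": [1.0, 1.0, 0.0],
    "C": [1.0, 1.0, 0.2],
}

print("cos(A, C):", round(cosine(toy["A"], toy["C"]), 3))
print("cos(B, C):", round(cosine(toy["B"], toy["C"]), 3))
print("edges kept for C:", sparse_edges(toy, "C"))
```

In this toy setting the similarity step alone would link C to both A and B, while the penalized regression keeps only the B to C edge, which is the kind of pruning the sparse KG step is described as performing.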

Results: ARCH produces high-quality clinical embeddings and a KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in an R-shiny powered web-API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with an AUC of 0.723, which improved to 0.826 after fine-tuning. Combining codified and NLP features significantly increased detection power. Compared to other methods, ARCH has superior accuracy and enhances the performance of weakly supervised phenotyping algorithms. Notably, it successfully categorized Alzheimer's patients into two subgroups with differing mortality rates.
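For readers unfamiliar with how an AUC for relationship detection is obtained, the following is a minimal sketch, assuming each concept pair receives a similarity score and a label saying whether it is a known relationship. The AUC is then the probability that a randomly chosen known pair scores higher than a randomly chosen unrelated pair. The function name `pairwise_auc` and the scores below are hypothetical and are not taken from the paper.

```python
def pairwise_auc(known_pair_scores, random_pair_scores):
    """AUC as the fraction of (known, random) score pairs ranked correctly.

    Ties count as half a correct ranking (the Mann-Whitney formulation).
    """
    wins = 0.0
    for known in known_pair_scores:
        for rand in random_pair_scores:
            if known > rand:
                wins += 1.0
            elif known == rand:
                wins += 0.5
    return wins / (len(known_pair_scores) * len(random_pair_scores))

# Hypothetical similarity scores, for illustration only.
known_scores = [0.9, 0.8, 0.4]
random_scores = [0.7, 0.3, 0.2]
print(pairwise_auc(known_scores, random_scores))
```

An AUC of 0.5 corresponds to scores that do not separate known from random pairs at all, and 1.0 to perfect separation.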

Conclusion: The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.

Keywords: Electronic health records; Knowledge graph; Natural language processing; Representation learning.

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
