ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis

Ziming Gan¹, Doudou Zhou², Everett Rush³, Vidul A Panickan⁴, Yuk-Lam Ho⁵, George Ostrouchov³, Zhiwei Xu⁶, Shuting Shen⁷, Xin Xiong⁸, Kimberly F Greco⁸, Chuan Hong⁷, Clara-Lea Bonzel⁴, Jun Wen⁹, Lauren Costa⁵, Tianrun Cai¹⁰, Edmon Begoli³, Zongqi Xia¹¹, J Michael Gaziano¹², Katherine P Liao¹⁰, Kelly Cho¹², Tianxi Cai¹³, Junwei Lu¹⁴

Affiliations

¹ Department of Statistics, University of Chicago, 5801 S Ellis Ave, Chicago, 60615, IL, USA.
² Department of Statistics and Data Science, National University of Singapore, 117546, Singapore.
³ Oak Ridge National Laboratory, Bethel Valley Rd, Oak Ridge, 37830, TN, USA.
⁴ Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA.
⁵ VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA.
⁶ Department of Statistics, University of Michigan, 500 S State St, Ann Arbor, 48109, MI, USA.
⁷ Department of Biostatistics & Bioinformatics, Duke University, 1121 West Main St, Durham, 27708, NC, USA.
⁸ Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, 02115, MA, USA.
⁹ Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA.
¹⁰ VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Brigham and Women's Hospital, 60 Fenwood Rd, Boston, 02115, MA, USA.
¹¹ Clinical and Translational Science, University of Pittsburgh, 3501 Fifth Avenue, Pittsburgh, 15260, PA, USA.
¹² Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Brigham and Women's Hospital, 60 Fenwood Rd, Boston, 02115, MA, USA.
¹³ Harvard Medical School, 25 Shattuck St, Boston, 02115, MA, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, 02115, MA, USA.
¹⁴ VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA; Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, 02115, MA, USA. Electronic address: junweilu@hsph.harvard.edu.

PMID: 39863245
PMCID: PMC12066163 (available on 2026-02-01)
DOI: 10.1016/j.jbi.2024.104761

J Biomed Inform. 2025 Feb;162:104761. doi: 10.1016/j.jbi.2024.104761. Epub 2025 Jan 23.

Abstract

Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored both as codified data and as free-text narrative notes, from which features are extracted by natural language processing (NLP). The complexity of EHR data presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we propose an efficient Aggregated naRrative Codified Health (ARCH) records analysis that generates a large-scale knowledge graph (KG) for a comprehensive set of codified and narrative EHR features.

Methods: Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors for clinical features and computes pairwise similarities, each with an associated p-value, so that the strength of relatedness between features is measured with statistical uncertainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkages between features and build a sparse KG. Finally, ARCH is validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer's disease patients.
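
The sparse regression step can be illustrated with a minimal sketch. The abstract does not give ARCH's exact formulation, so the following assumes a plain lasso regression of one embedding on all the others, solved by coordinate descent. The toy vectors, the penalty `lam`, and the function names are hypothetical, and the p-value computation is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_threshold(x, lam):
    """Lasso shrinkage: move x toward zero by lam, clipping at zero."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def sparse_neighbors(embeddings, target, lam=0.1, n_iter=200):
    """Lasso-regress one embedding on all others by coordinate descent.

    Returns {feature: coefficient} for the nonzero coefficients, which
    this sketch reads as the direct edges of `target` in a sparse graph.
    """
    y = embeddings[target]
    others = [name for name in embeddings if name != target]
    beta = {name: 0.0 for name in others}
    for _ in range(n_iter):
        for name in others:
            x = embeddings[name]
            # Residual of y after removing every predictor except `name`.
            partial = [
                yi - sum(beta[o] * embeddings[o][i] for o in others if o != name)
                for i, yi in enumerate(y)
            ]
            rho = sum(xi * ri for xi, ri in zip(x, partial))
            beta[name] = soft_threshold(rho, lam) / sum(xi * xi for xi in x)
    return {name: b for name, b in beta.items() if b != 0.0}


# Hypothetical toy embeddings: d is built from c, and c from a,
# so a and d are linked only indirectly, through c.
toy = {
    "a": [1.0, 0.0, 0.0, 0.0],
    "b": [0.0, 0.0, 0.0, 1.0],
    "c": [1.0, 1.0, 0.0, 0.0],
    "d": [1.0, 1.0, 1.0, 0.0],
}

sim_a_d = cosine(toy["a"], toy["d"])   # marginal similarity between a and d
edges_d = sparse_neighbors(toy, "d")   # features kept by the sparse regression
print(round(sim_a_d, 3), edges_d)
```

In this toy example, a and d have a clearly nonzero cosine similarity, yet the regression for d retains only c, because the similarity between a and d is fully explained by c.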

Results: ARCH produces high-quality clinical embeddings and a KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in an R Shiny-powered web API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with an AUC of 0.723, which improved to 0.826 after fine-tuning. Detection power increased significantly when both codified and NLP features were used. ARCH is more accurate than the other methods it was compared with, and it improves the performance of weakly supervised phenotyping algorithms. Notably, it successfully categorized Alzheimer's disease patients into two subgroups with different mortality rates.
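
For readers unfamiliar with the metric, AUC in this setting can be read as the probability that a randomly chosen known pair receives a higher similarity score than a randomly chosen unrelated pair. A minimal sketch follows, using made-up scores and labels rather than ARCH output; the function and variable names are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive scores higher, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up similarity scores for six concept pairs;
# label 1 marks a known relationship, label 0 an unrelated pair.
toy_scores = [0.91, 0.75, 0.40, 0.62, 0.35, 0.10]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_auc = auc(toy_scores, toy_labels)
print(round(toy_auc, 3))
```

Here eight of the nine positive-negative comparisons favor the known pair, so the toy AUC is 8/9.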

Conclusion: The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, which are useful for a wide range of predictive modeling tasks.

Keywords: Electronic health records; Knowledge graph; Natural language processing; Representation learning.

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Update of

ARCH: Large-scale Knowledge Graph via Aggregated Narrative Codified Health Records Analysis.
Gan Z, Zhou D, Rush E, Panickan VA, Ho YL, Ostrouchov G, Xu Z, Shen S, Xiong X, Greco KF, Hong C, Bonzel CL, Wen J, Costa L, Cai T, Begoli E, Xia Z, Gaziano JM, Liao KP, Cho K, Cai T, Lu J. medRxiv [Preprint]. 2023 May 21:2023.05.14.23289955. doi: 10.1101/2023.05.14.23289955. Update in: J Biomed Inform. 2025 Feb;162:104761. doi: 10.1016/j.jbi.2024.104761. PMID: 37293026.
