An Implemented Real-World-Data Pipeline for Standardization of Electronic Health Records in Precision Oncology
- PMID: 40502240
- PMCID: PMC12150718
Abstract
Several use cases in precision oncology require accurate extraction and standardization of Real-World Data from Electronic Health Records (EHRs). We developed an infrastructure and a toolset incorporating data mining and natural language processing scripts to automatically retrieve selected descriptive and common endpoint variables from EHRs. This toolset was evaluated against a reference dataset of 106 lung cancer and 45 sarcoma patient cases pulled from two databases complying with the Precision Oncology Core Data Model (Precision-DM) and maintained by the Johns Hopkins Molecular Tumor Board and a research team. We accurately retrieved most descriptive EHR fields but less accurately extracted the Date of Diagnosis and Treatment Start Date, which supported calculating the Age at Diagnosis, Overall Survival, and Time to First Treatment (accuracy range 50%-86%). Our infrastructure and Precision-DM-based standardization could inspire similar efforts in other cancer centers; however, the toolset should be enhanced to improve accuracy for certain variables.
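To illustrate how the two extracted dates feed the derived variables named above, here is a minimal Python sketch. The abstract does not give the exact definitions used by the toolset, so everything below is an assumption: the `PatientDates` record, its field names, and the three helper functions are hypothetical. Overall Survival is taken here as days from diagnosis to death or last follow-up, and Time to First Treatment as days from diagnosis to treatment start.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PatientDates:
    """Hypothetical record of the dates needed for the derived variables."""
    birth_date: date
    diagnosis_date: date
    treatment_start_date: Optional[date]  # None if no treatment was recorded
    death_or_last_followup_date: date


def age_at_diagnosis(p: PatientDates) -> int:
    """Age in completed years on the date of diagnosis."""
    years = p.diagnosis_date.year - p.birth_date.year
    # Subtract one year if the birthday had not yet occurred in the diagnosis year.
    if (p.diagnosis_date.month, p.diagnosis_date.day) < (p.birth_date.month, p.birth_date.day):
        years -= 1
    return years


def overall_survival_days(p: PatientDates) -> int:
    """Days from diagnosis to death or last follow-up (assumed definition)."""
    return (p.death_or_last_followup_date - p.diagnosis_date).days


def time_to_first_treatment_days(p: PatientDates) -> Optional[int]:
    """Days from diagnosis to treatment start, or None if untreated (assumed definition)."""
    if p.treatment_start_date is None:
        return None
    return (p.treatment_start_date - p.diagnosis_date).days


# Made-up example patient, for illustration only.
example = PatientDates(
    birth_date=date(1950, 6, 15),
    diagnosis_date=date(2015, 3, 10),
    treatment_start_date=date(2015, 4, 9),
    death_or_last_followup_date=date(2016, 3, 10),
)
```

Because all three variables depend on the Date of Diagnosis, an error in that one extracted field propagates to each of them, which is consistent with the lower accuracy reported for these endpoints.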
©2025 AMIA - All rights reserved.