AMIA Jt Summits Transl Sci Proc. 2025 Jun 10:2025:242-249. eCollection 2025.

An Implemented Real-World-Data Pipeline for Standardization of Electronic Health Records in Precision Oncology


Kory Kreimeyer et al. AMIA Jt Summits Transl Sci Proc.

Abstract

Several use cases in precision oncology require accurately extracting and standardizing Real-World Data from Electronic Health Records (EHRs). We developed the infrastructure and a toolset incorporating data mining and natural language processing scripts to automatically retrieve selected descriptive and common endpoint variables from EHRs. This toolset was evaluated against a reference dataset of 106 lung cancer and 45 sarcoma patient cases pulled from two databases that comply with the Precision Oncology Core Data Model (Precision-DM) and are maintained by the Johns Hopkins Molecular Tumor Board and a research team. We accurately retrieved most descriptive EHR fields but were less accurate in extracting the Date of Diagnosis and Treatment Start Date, which support calculating the Age at Diagnosis, Overall Survival, and Time to First Treatment (accuracy range 50%-86%). Our infrastructure and Precision-DM-based standardization could inspire similar efforts in other cancer centers; however, the toolset should be enhanced to improve accuracy for certain variables.
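The three endpoints named in the abstract are all derived from a handful of dates, which is why errors in the Date of Diagnosis or Treatment Start Date carry over into them. The following is a minimal sketch of that derivation, not the authors' toolset: the `PatientDates` record, its field names, the function names, and the choice to censor Overall Survival at the last contact date are all assumptions made for illustration, since the abstract does not give the exact definitions used.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class PatientDates:
    """Hypothetical per-patient record; field names are illustrative only."""
    birth_date: date
    diagnosis_date: date
    treatment_start_date: Optional[date] = None
    death_date: Optional[date] = None
    last_contact_date: Optional[date] = None


def age_at_diagnosis(p: PatientDates) -> int:
    """Age in completed years on the Date of Diagnosis."""
    years = p.diagnosis_date.year - p.birth_date.year
    # Subtract one year if the birthday had not yet occurred in the diagnosis year.
    if (p.diagnosis_date.month, p.diagnosis_date.day) < (p.birth_date.month, p.birth_date.day):
        years -= 1
    return years


def time_to_first_treatment(p: PatientDates) -> Optional[int]:
    """Days from diagnosis to first treatment, or None if no treatment date is known."""
    if p.treatment_start_date is None:
        return None
    return (p.treatment_start_date - p.diagnosis_date).days


def overall_survival(p: PatientDates, baseline: str = "diagnosis") -> Optional[Tuple[int, bool]]:
    """Return (days, death_observed) measured from the chosen baseline date.

    Assumption: patients without a death date are censored at last contact.
    """
    if baseline not in ("diagnosis", "treatment"):
        raise ValueError("baseline must be 'diagnosis' or 'treatment'")
    start = p.diagnosis_date if baseline == "diagnosis" else p.treatment_start_date
    end = p.death_date if p.death_date is not None else p.last_contact_date
    if start is None or end is None:
        return None
    return (end - start).days, p.death_date is not None


# Made-up example patient, not taken from the reference dataset.
example = PatientDates(
    birth_date=date(1960, 5, 20),
    diagnosis_date=date(2020, 5, 19),
    treatment_start_date=date(2020, 6, 18),
    death_date=date(2021, 5, 19),
)
```

Under these assumptions, an extraction error of a few weeks in either date shifts Time to First Treatment and Overall Survival by the same number of days, while Age at Diagnosis changes only when the error crosses a birthday.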


Figures

Figure 1. The descriptive EHR data fields and common endpoints listed in the NCI’s mechanism and their collection status in the selected use cases.
Figure 2. The survival plots based on the reference versus the calculated Overall Survival values when either the Date of Diagnosis or the Treatment Start Date was used as the baseline.
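Survival plots like those described for Figure 2 are commonly drawn from a Kaplan-Meier estimate. The abstract does not say which estimator the authors used, so the sketch below is an assumption: a plain-Python product-limit estimator with a hypothetical `kaplan_meier` function, applied to toy values that are not from the paper.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[float], events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Product-limit estimate of the survival function.

    durations: follow-up time per patient (for example, Overall Survival in days).
    events:    True if death was observed, False if the patient was censored.
    Returns (time, survival probability) pairs at each time with at least one death.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    deaths = Counter(t for t, observed in zip(durations, events) if observed)
    exits = Counter(durations)  # deaths and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy values for illustration only; not data from the study.
reference_curve = kaplan_meier([5, 8, 8, 12], [True, True, False, True])
calculated_curve = kaplan_meier([6, 8, 9, 12], [True, True, False, True])
```

Plotting the two curves as step functions on the same axes gives a visual comparison of reference and calculated Overall Survival, once per choice of baseline date.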
