J Biomed Inform. 2022 Jul:131:104095.

doi: 10.1016/j.jbi.2022.104095. Epub 2022 May 20.

Deep learning on time series laboratory test results from electronic health records for early detection of pancreatic cancer

Jiheum Park¹, Michael G Artin¹, Kate E Lee¹, Yoanna S Pumpalova¹, Myles A Ingram¹, Benjamin L May², Michael Park³, Chin Hur⁴, Nicholas P Tatonetti⁵

Affiliations

¹ Department of Medicine, Columbia University Irving Medical Center, New York, NY, United States.
² Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, United States.
³ Applied Info Partners Inc, Worlds Fair Drive, Somerset, NJ, United States; X-Mechanics LLC, Cresskill, NJ, United States.
⁴ Department of Medicine, Columbia University Irving Medical Center, New York, NY, United States. Electronic address: ch447@cumc.columbia.edu.
⁵ Department of Biomedical Informatics, Columbia University, New York, NY, United States.

PMID: 35598881
PMCID: PMC10286873
DOI: 10.1016/j.jbi.2022.104095

Jiheum Park et al. J Biomed Inform. 2022 Jul.

Abstract

The multi-modal and unstructured nature of observational data in Electronic Health Records (EHR) is currently a significant obstacle to applying machine learning to risk stratification. In this study, we develop a deep learning framework for incorporating longitudinal clinical data from EHR to infer risk for pancreatic cancer (PC). This framework includes a novel training protocol, which enforces an emphasis on early detection by applying an independent Poisson-random mask on proximal-time measurements for each variable. Data fusion for irregular multivariate time-series features is enabled by a "grouped" neural network (GrpNN) architecture, which uses representation learning to generate a dimensionally reduced vector for each measurement set before making a final prediction. These models were evaluated using EHR data from Columbia University Irving Medical Center-New York Presbyterian Hospital. Our framework demonstrated better performance on early detection (AUROC 0.671, 95% CI 0.667-0.675, p < 0.001) at 12 months prior to diagnosis compared to logistic regression, XGBoost, and feedforward neural network baselines. We demonstrate that our masking strategy results in greater improvements at distal times prior to diagnosis, and that our GrpNN model improves generalizability by reducing overfitting relative to the feedforward baseline. The results were consistent across reported race. Our proposed algorithm is potentially generalizable to other diseases, including but not limited to cancer, where early detection can improve survival.
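
The masking idea in the abstract can be illustrated with a minimal sketch. The abstract does not say exactly how the Poisson draw is used, so this sketch assumes that, for each patient and each lab variable, a count k is drawn from a Poisson distribution and the k most recent observed measurements (those closest to diagnosis) are hidden. The function name `poisson_random_mask`, the `(patients, variables, time steps)` array layout, and the use of NaN for missing values are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def poisson_random_mask(batch, lam, rng):
    """Hide the most recent measurements of each variable (illustrative only).

    batch: float array of shape (n_patients, n_vars, n_steps), ordered from
           oldest to newest along the last axis, with NaN marking missing values.
    lam:   assumed Poisson rate controlling how many recent values are hidden.
    """
    masked = batch.copy()
    n_patients, n_vars, _ = batch.shape
    # One independent draw per (patient, variable) pair.
    k = rng.poisson(lam, size=(n_patients, n_vars))
    for i in range(n_patients):
        for j in range(n_vars):
            if k[i, j] == 0:
                continue
            # Indices of values that are actually observed, oldest first.
            observed = np.flatnonzero(~np.isnan(batch[i, j]))
            # Drop the k most recent observed values (all of them if k is larger).
            masked[i, j, observed[-k[i, j]:]] = np.nan
    return masked
```

In a training loop, such a function would be called once per epoch on each batch so that the model sees a different mask every time, which matches the description of fresh random samples being generated every training epoch.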

Keywords: Early detection of cancer; Electronic Health Records; Machine learning; Pancreatic cancer.

Conflict of interest statement

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Chin Hur reports financial support was provided by National Cancer Institute of the National Institutes of Health.

Figures

**Fig. 1.**
Data preprocessing flow chart. We obtained 458,252 patient samples with 30,195 lab variables from CUIMC-NYP EHR data using our data retrieval criteria (Table 1) and processed them into the final dataset, a patient cohort that excludes PC patients who received a pancreatitis diagnosis before PC and nonPC patients with pancreatitis (9,057 patients in total, of whom 834 are PC patients).

**Fig. 2.**
The evaluation of data structure discrepancies between PC and nonPC. (A) Among the selected PC (1,200) and nonPC (161,849) patients, we examined data structure in terms of data completeness and the average number of measurements for each variable, along with their discrepancies between PC and nonPC, which can introduce bias into the prediction model. We sorted the 418 lab variables by data completeness in the PC group; panel A shows the data structure of the top 50 lab variables. (B) We selected the first 33 lab variables (gray dotted line in panel A), where the measurement system changes from the Cerner system to NYP. The discrepancies in data completeness between PC and nonPC decreased when we restricted the cohort to patients who have at least one of the selected 33 lab variables. We then assigned random diagnosis dates to nonPC patients and configured the nonPC dataset into pre-diagnosis data based on the average percentage reduction calculated while configuring the PC dataset into pre-diagnosis data. This brought the average number of measurements in the nonPC group close to that of the PC group, as shown in panel B.

**Fig. 3.**
A grouped deep neural network (GrpNN) combined with a random masking strategy. (A) In GrpNN, the time series measurements of each of the thirty-three lab variables pass individually through the embedding network, producing a new 4-dimensional representation of each variable; these representations are then merged and passed through the prediction network to make a prediction. (B) The random masking process applied to each variable during GrpNN training. The random samples are generated every training epoch and applied to the batch data before it is fed into the GrpNN model. (C) The resulting masked histograms for two example lab variables (‘Creatinine’ and ‘Magnesium level’) show reduced discrepancies in the number of lab measurements between PC and nonPC.
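
The grouped architecture in panel A can be sketched as a forward pass in NumPy. Only the number of lab variables (33) and the embedding size (4) come from the caption. Everything else is an assumption made for illustration: the fixed series length `N_STEPS`, the hidden width `HIDDEN`, a single dense layer with ReLU per variable as the embedding network, zero-filling of missing values, and the names `init_params` and `grpnn_forward`. The weights are random, so this shows the data flow only, not a trained model.

```python
import numpy as np

N_VARS, EMBED_DIM = 33, 4   # taken from the Fig. 3 caption
N_STEPS, HIDDEN = 16, 8     # assumed values, not stated in the source

def init_params(rng):
    """Random weights for an illustrative (untrained) grouped network."""
    return {
        # One separate embedding layer per lab variable.
        "W_embed": rng.normal(0.0, 0.1, (N_VARS, N_STEPS, EMBED_DIM)),
        "b_embed": np.zeros((N_VARS, EMBED_DIM)),
        # Prediction network applied to the merged embeddings.
        "W_hidden": rng.normal(0.0, 0.1, (N_VARS * EMBED_DIM, HIDDEN)),
        "b_hidden": np.zeros(HIDDEN),
        "w_out": rng.normal(0.0, 0.1, HIDDEN),
        "b_out": 0.0,
    }

def grpnn_forward(x, params):
    """x: (n_patients, N_VARS, N_STEPS) array; NaN marks missing values."""
    x = np.nan_to_num(x, nan=0.0)  # assumed imputation: zero-fill
    # Each variable's series goes through its own embedding weights:
    # (n, v, t) with (v, t, e) gives (n, v, e).
    emb = np.einsum("nvt,vte->nve", x, params["W_embed"]) + params["b_embed"]
    emb = np.maximum(emb, 0.0)
    # Merge: concatenate the 33 four-dimensional vectors per patient.
    merged = emb.reshape(x.shape[0], -1)
    hidden = np.maximum(merged @ params["W_hidden"] + params["b_hidden"], 0.0)
    logits = hidden @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logits))  # one risk score per patient
```

Keeping a separate embedding per variable means each lab test is compressed on its own before any cross-variable interaction, which is the "grouped" aspect the caption describes.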

**Fig. 4.**
Propensity score matching. The baseline characteristics are potential confounders because they can be reflected in lab measurements. In the dataset composed of 126,655 nonPC and 835 PC patients (Fig. 1) before the propensity matching procedure, the separability resulting from the baseline characteristics (i.e., race, ethnicity, sex, zip code, patient language, age, smoking, obesity, diabetes) was 72.9%. By applying propensity score matching, we reduced the separability of our final dataset to 54.6%.
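
For readers unfamiliar with the technique, the matching step can be sketched as follows. The caption does not describe the matching algorithm, the matching ratio, or how the propensity scores were estimated, so this sketch assumes the scores are already available (for example from a classifier fit on the baseline characteristics) and uses greedy nearest-neighbor matching without replacement. The function name `greedy_match` and its parameters are hypothetical.

```python
import numpy as np

def greedy_match(scores, is_case, n_controls_per_case=1):
    """Greedy nearest-neighbor matching on propensity scores (illustrative).

    scores:  1-D array of propensity scores, one per patient.
    is_case: 1-D boolean array, True for cases (e.g. PC), False for controls.
    Returns a dict mapping each case index to its matched control indices.
    """
    scores = np.asarray(scores, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    available = list(np.flatnonzero(~is_case))  # controls not yet used
    matches = {}
    for c in np.flatnonzero(is_case):
        chosen = []
        for _ in range(n_controls_per_case):
            if not available:
                break
            # Pick the unused control whose score is closest to this case.
            dists = np.abs(scores[available] - scores[c])
            best = int(np.argmin(dists))
            chosen.append(int(available.pop(best)))
        matches[int(c)] = chosen
    return matches
```

After matching, cases and their selected controls have similar score distributions, so a classifier given only the baseline characteristics should find the two groups harder to tell apart.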

**Fig. 5.**
Early detection performance. We evaluated the early detection performance of the GrpNN model with and without the random masking strategy (GrpNN vs GrpNN+RM). (A) A flowchart describing dataset preparation for estimating the prediction score at x months prior to the diagnosis date. (B) The early detection performance significantly improves with the random masking strategy. (C) The post hoc analysis of GrpNN+RM on the hold-out set stratified by race. The early detection performance was consistent across reported race. The race abbreviations [W, A, B, O, U] represent [White, Asian, Black, Other Combinations not described, Unknown].
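
The evaluation at x months prior to diagnosis can be sketched in two parts: dropping measurements taken too close to the diagnosis date, and scoring predictions with AUROC. The flowchart itself is not reproduced here, so the truncation rule below (keep only measurements taken at least x months before diagnosis, with a 30-day month) is an assumption, and the function names `truncate_before` and `auroc` are hypothetical. AUROC is computed directly from its definition as the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half.

```python
import numpy as np

def truncate_before(days_before_dx, values, months_prior, days_per_month=30):
    """Keep only measurements taken at least `months_prior` months before diagnosis.

    days_before_dx: days between each measurement and the diagnosis date.
    values:         the corresponding measurement values.
    """
    days_before_dx = np.asarray(days_before_dx)
    values = np.asarray(values)
    keep = days_before_dx >= months_prior * days_per_month
    return values[keep]

def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())
```

The pairwise form is quadratic in the number of patients, which is fine for a small illustration; a rank-based implementation would be preferable for cohorts of the size reported here.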

Grants and funding

R21 CA265400/CA/NCI NIH HHS/United States
