Nat Med. 2023 May;29(5):1113-1122.
doi: 10.1038/s41591-023-02332-5. Epub 2023 May 8.

A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories

Davide Placido et al. Nat Med. 2023 May.

Abstract

Pancreatic cancer is an aggressive disease that typically presents late with poor outcomes, indicating a pronounced need for early detection. In this study, we applied artificial intelligence methods to clinical data from 6 million patients (24,000 pancreatic cancer cases) in Denmark (Danish National Patient Registry (DNPR)) and from 3 million patients (3,900 cases) in the United States (US Veterans Affairs (US-VA)). We trained machine learning models on the sequence of disease codes in clinical histories and tested prediction of cancer occurrence within incremental time windows (CancerRiskNet). For cancer occurrence within 36 months, the performance of the best DNPR model has area under the receiver operating characteristic (AUROC) curve = 0.88 and decreases to AUROC (3m) = 0.83 when disease events within 3 months before cancer diagnosis are excluded from training, with an estimated relative risk of 59 for 1,000 highest-risk patients older than age 50 years. Cross-application of the Danish model to US-VA data had lower performance (AUROC = 0.71), and retraining was needed to improve performance (AUROC = 0.78, AUROC (3m) = 0.76). These results improve the ability to design realistic surveillance programs for patients at elevated risk, potentially benefiting lifespan and quality of life by early detection of this aggressive cancer.


Conflict of interest statement

S.B. has ownership in Intomics A/S, Hoba Therapeutics Aps, Novo Nordisk A/S, Lundbeck A/S and ALK Abello and has managing board memberships in Proscion A/S and Intomics A/S. B.M.W. notes grant funding from Celgene and Eli Lilly and consulting fees from BioLineRx, Celgene and GRAIL. A.R. is a co-founder and equity holder of Celsius Therapeutics, an equity holder in Immunitas and was a scientific advisory board member of Thermo Fisher Scientific, Syros Pharmaceuticals, Neogene Therapeutics and Asimov until 31 July 2020. D.S.M. is an advisor for Dyno Therapeutics, Octant, Jura Bio, Tectonic Therapeutic and Genentech and is a co-founder of Seismic Therapeutic. C.S. is on the scientific advisory board of CytoReason. From 1 August 2020, A.R. is an employee of Genentech and has equity in Roche. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Training and prediction of pancreatic cancer risk from disease trajectories.
a, Learning: The general ML workflow starts with partitioning the data into a training set (Train), a development set (Dev) and a test set (Test). The trajectories for training input are generated by sampling continuous subsequences of diagnoses from each patient’s diagnosis history, each starting with the first record but with different endpoints. The training and development sets are used for training so as to minimize the prediction error, that is, the difference between a risk score function (prediction) and a step function (observation), summed over all instances. Prediction: A model’s ability to predict accurately is evaluated using the withheld test set. The prediction model, depending on the prediction threshold selected from among possible operational points, discriminates between patients at higher and lower risk of pancreatic cancer. The risk model can guide the development of surveillance initiatives. b, The model trained with real-world clinical data has three steps: embedding, encoding and prediction. The embedding machine transforms categorical disease codes and their timestamps into a lower-dimensional continuous real-valued space. The encoding machine extracts information from a disease history and summarizes each sequence in a characteristic fingerprint in the latent space (vertical vector). The prediction machine then uses the fingerprint to generate predictions for cancer occurrence within different time intervals after the time of assessment (3, 6, 12, 36 and 60 months). The model parameters are trained by minimizing the difference between the predicted and the observed cancer occurrence. c, Terminology for timepoints and intervals. The last event of a disease trajectory coincides with the time of assessment. From the time of assessment, cancer risk is assessed within 3, 6, 12, 36 and 60 months. To test the influence of close-to-cancer diagnosis codes on the prediction of cancer occurrence, exclusion intervals are used to remove diagnoses in the last 3, 6 and 12 months before cancer diagnosis.
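The embedding/encoding/prediction pipeline in panel b can be sketched in miniature. This is a toy illustration, not the authors' CancerRiskNet: the embedding table and logistic prediction weights are random stand-ins, and simple mean pooling replaces the Transformer encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2,000 level-3 ICD codes embedded in 16 dimensions,
# with one prediction per interval (3, 6, 12, 36 and 60 months).
N_CODES, EMB_DIM, N_INTERVALS = 2000, 16, 5

# Embedding: map each categorical disease code to a continuous vector.
embedding = rng.normal(size=(N_CODES, EMB_DIM))

def encode(trajectory):
    """Summarize a sequence of disease codes as one fingerprint vector.

    The paper uses a Transformer encoder; mean pooling over the embedded
    codes stands in for it here as the simplest possible summary.
    """
    return embedding[trajectory].mean(axis=0)

# Prediction: one logistic output per prediction interval.
W = rng.normal(size=(EMB_DIM, N_INTERVALS))
b = np.zeros(N_INTERVALS)

def predict_risk(trajectory):
    fingerprint = encode(trajectory)
    logits = fingerprint @ W + b
    return 1.0 / (1.0 + np.exp(-logits))  # one risk score per interval

risks = predict_risk([101, 250, 571])  # toy trajectory of 3 disease codes
print(risks.shape)
```

In the real model, the parameters of all three stages are fit jointly by minimizing the prediction error described above, rather than drawn at random.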
Fig. 2
Fig. 2. Characteristics of the Danish and US-VA patient registries.
a, Distributions of age at pancreatic cancer diagnosis in the two cohorts. b,c, The Danish (DK) dataset has a longer median length of disease trajectories but a lower median number of disease codes per patient compared to the US-VA dataset, so the ML process, run independently in each dataset, has to cope with very different distributions of disease trajectories in terms of trajectory length and density of disease codes. Color level indicates the number of patients in a given bin. d,e, Background check on the distribution of disease codes in the clinical records: prevalence of known risk factors in cancer versus non-cancer patients in the DK (d) and US-VA (e) datasets, counting whether a disease code occurred at least once in a patient’s history before their pancreatic cancer code (cancer) or more than 2 years before the end of data (no cancer).
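The prevalence comparison in d,e amounts to counting, per cohort, the fraction of patients whose record contains a given risk-factor code at least once before the reference point. A minimal sketch with made-up code lists (the histories and the choice of code here are illustrative, not the study's data or risk-factor set):

```python
def prevalence(patients, code):
    """Fraction of patients whose history contains `code` at least once
    before the reference point (cancer diagnosis, or a cutoff before the
    end of data for non-cancer patients)."""
    hits = sum(1 for history in patients if code in history)
    return hits / len(patients)

# Toy example: hypothetical pre-cutoff ICD code lists per patient.
cancer_histories = [["K86.1", "E11"], ["K86.1"], ["I10"]]
control_histories = [["I10"], ["E11"], ["I10"], ["K86.1"]]
print(prevalence(cancer_histories, "K86.1"))   # chronic pancreatitis code
print(prevalence(control_histories, "K86.1"))
```

Comparing the two fractions per code gives the cancer-versus-non-cancer prevalence contrast plotted in the figure.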
Fig. 3
Fig. 3. Performance of the ML model on clinical record trajectories in predicting pancreatic cancer occurrence in the Danish dataset.
For each model and prediction evaluation, performance is better for larger AUROC (a,c,e,g) and for higher RR (relative risk) for the n (horizontal axis) highest-risk patients (b,d,f,h). a,b, Choice of algorithm: The Transformer algorithm is best, with AUROC = 0.879 (no data exclusion, 36-month prediction interval). c,d, Choice of input data: Prediction performance declines with exclusion, in training, of k = 3, 6 and 12 months of data between the end of a disease trajectory and cancer occurrence (best model for each exclusion interval, for the 36-month prediction interval). e,f, Choice of input data: Prediction is better for all 2,000 ICD level-3 disease codes used throughout in training (Methods) compared to only the subset of 23 known risk factors, using a Transformer, all data (Exclusion 0), for the 36-month prediction interval. g,h, Choice of prediction task: Prediction of cancer is more difficult for larger prediction intervals, the time interval within which cancer is predicted to occur after assessment (Transformer model, all data). We report prediction performance for the 36-month prediction interval (orange in g and h) in the above panels (a-f), as this is a reasonable choice for design of a surveillance program in clinical practice. b,d,f,h, Prediction performance at a particular operational point—for example (d), for n = 1,000 highest-risk patients (vertical dotted line) out of 1 million (1M) patients, the RR is 104.7 for the 36-month prediction interval using all data and 47.6 with 3-month data exclusion.
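The relative-risk metric used in panels b,d,f,h can be computed directly from risk scores and outcomes: the cancer incidence among the n highest-scoring patients divided by the incidence in the whole evaluated population. A sketch with toy numbers (not the study's data):

```python
import numpy as np

def relative_risk_top_n(scores, labels, n):
    """Relative risk of cancer among the n highest-risk patients,
    compared with the incidence in the whole evaluated population."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    top = np.argsort(scores)[::-1][:n]       # indices of n highest scores
    incidence_top = labels[top].mean()       # cancer rate in top-n group
    incidence_all = labels.mean()            # baseline cancer rate
    return incidence_top / incidence_all

# Toy example: 1,000 patients, 10 cancers, scores that rank cancers highly.
rng = np.random.default_rng(1)
labels = np.zeros(1000, dtype=int)
labels[:10] = 1
scores = labels + rng.normal(scale=0.1, size=1000)
print(relative_risk_top_n(scores, labels, n=10))
```

Sweeping n over the horizontal axis reproduces curves of the kind shown in the figure; a perfect ranking of 10 cancers out of 1,000 patients gives RR = 100 at n = 10.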
Fig. 4
Fig. 4. Estimated performance of a surveillance program for high-risk patients in different health systems and with different operational choices.
Estimated relative risk (RR) for the top n (horizontal axis) high-risk patients is based on evaluating the accuracy of prediction on the withheld test set (a,c,d) and on a full external dataset (b). a,c, In designing surveillance programs, one can choose between models trained on all data (Exclusion 0) versus models trained excluding data from the last 3 months before cancer occurrence (Exclusion 3) and between prediction for cancer within 12 months or 36 months of assessment (legend top right in each panel). b, Estimated performance is somewhat lower for cross-application of a model trained on Danish (DK) data applied to US-VA patient data, illustrating the challenge of deriving globally valid prediction tools without independent localized or system-specific training. d, A proposed practical choice for a surveillance program with good estimated accuracy of prediction, in either system, would involve application of independently trained models with 3-month data exclusion for a prediction interval of 12 months for patients older than age 50 years. 1M, 1 million.
Fig. 5
Fig. 5. Predictive capacity and feature contributions of disease trajectories.
a,c, Distribution of recall (sensitivity) values at the F1 operational point (Methods) as a function of time to cancer (time between the end of a disease trajectory and cancer diagnosis). As expected, recall levels decrease with longer time to cancer, from 8% for cancer occurring about 1 year after assessment to a recall of 4% for cancer occurring about 3 years after assessment (DNPR). This suggests that the model learns not only from symptoms very close to pancreatic cancer but also from longer disease histories, albeit at lower accuracy. a, Danish system (DK), for models trained on all data (no data exclusion). c, US-VA system, for models trained on all data. b,d, Top 10 features that contribute to the cancer prediction in time-to-cancer intervals of 0–6, 6–12, 12–24 and 24–36 months for the Danish (DK) (b) and US-VA (d) systems. The features are sorted by the contribution score (Supplementary Table 5). We used an integrated gradients (IG) method to calculate the contribution score for each input feature for each trajectory and then summed over all trajectories with cancer diagnosis within the indicated time interval.
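The contribution scores in panels b and d come from integrated gradients (IG), which averages the model's gradient along a straight path from a baseline input to the actual input. A minimal numerical sketch, using a hand-written gradient for a hypothetical linear risk score (for a linear model, IG reduces exactly to w_i * (x_i - baseline_i), which makes the result easy to check):

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Approximate IG attributions: (x_i - baseline_i) times the mean
    gradient along the path baseline + alpha * (x - baseline)."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint Riemann sum
    grads = np.stack([f_grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# For a linear risk score f(x) = w @ x the gradient is constant (= w),
# so IG should recover exactly w_i * (x_i - baseline_i).
w = np.array([0.5, -1.0, 2.0])
f_grad = lambda x: w
x = np.array([1.0, 1.0, 0.0])  # e.g. binary indicators of disease codes
baseline = np.zeros(3)
attributions = integrated_gradients(f_grad, x, baseline)
print(attributions)  # [0.5, -1.0, 0.0]
```

In the study, per-feature attributions of this kind are computed per trajectory (against the trained deep model's gradient) and then summed over all trajectories with a cancer diagnosis in each time-to-cancer interval.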
Extended Data Fig. 1
Extended Data Fig. 1. Preprocessing and filtering of the DK disease trajectory datasets.
Filtering of the Danish (DK-DNPR) patient registries prior to training. In the Danish dataset, patient status codes were used to remove discontinuous disease histories, such as those of patients living in Greenland, patients with alterations in their patient ID or patients who lacked a stable residence in Denmark. We also removed referral and temporary diagnosis codes, which are not final diagnosis codes and can be misleading for training. Patients with short trajectories (<5 diagnosis codes) were removed. The final set of patients was split into training (80%), validation (10%) and test (10%) sets.
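The final two steps, dropping short trajectories and making the 80/10/10 split, can be sketched as below; the seed and helper name are illustrative, not taken from the paper's code.

```python
import random

def filter_and_split(trajectories, min_codes=5, seed=0):
    """Drop trajectories with fewer than min_codes diagnosis codes,
    then split patients 80/10/10 into train/validation/test sets."""
    kept = [t for t in trajectories if len(t) >= min_codes]
    rng = random.Random(seed)
    rng.shuffle(kept)
    n = len(kept)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = kept[:n_train]
    val = kept[n_train:n_train + n_val]
    test = kept[n_train + n_val:]
    return train, val, test

# Toy check: 100 patients, 20 of whom have short (<5 code) trajectories.
patients = [["code"] * 5 for _ in range(80)] + [["code"] * 3 for _ in range(20)]
train, val, test = filter_and_split(patients)
print(len(train), len(val), len(test))  # 64 8 8
```

Splitting by patient (not by sampled subsequence) keeps all subsequences of one patient's history inside a single partition, which avoids leakage between training and test.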
Extended Data Fig. 2
Extended Data Fig. 2. Distribution of disease codes as a function of age in the DNPR (Denmark) database.
Distribution of disease codes for a representative subset of diseases known to contribute to the risk of pancreatic cancer, as a fraction of all pancreatic cancer patients (orange) and all non-cancer patients (blue). The similarity of the distributions for some of these diseases with the distribution of occurrence of pancreatic cancer (red line, Gaussian fit to cancer diagnosis data) is consistent with either a direct or an indirect contribution to cancer risk, but it is not taken as evidence in this work. The disease codes are ICD-10/ICD-8.
Extended Data Fig. 3
Extended Data Fig. 3. Preprocessing and filtering of US-VA disease trajectory datasets.
Filtering of the US-VA patient registries prior to training. For the US-VA dataset, around 3 million patients were randomly sampled due to computational limitations, and patients with ICD-9/10 codes for pancreatic cancer but without entries in the US-VA cancer registry were excluded. As in the Danish dataset filtering, short trajectories (<5 diagnosis codes) were removed, and patients were split into training (80%), validation (10%) and test (10%) sets.
Extended Data Fig. 4
Extended Data Fig. 4. Age as a contributing factor.
The integrated gradients method was used to extract the contribution (arbitrary units) of patient age to the prediction at the time of assessment. This confirmed that the positive contribution to risk rises strongly from age 50. As for the disease contributions, the age contribution was calculated in relation to the 3-year cancer risk after the time of assessment/prediction.
Extended Data Fig. 5
Extended Data Fig. 5. Risk factor for patients without chronic pancreatitis.
To assess to what extent the inclusion of people with chronic pancreatitis might artificially boost model performance, we evaluated model performance for predicting cancer within 12 months for all patients above the age of 50, excluding data from the last 3 months before the pancreatic cancer diagnosis, for all cases without chronic pancreatitis and closely related conditions (ICD-10 code K86), for comparison with Fig. 4d in the paper. Result: the relative risk remains nearly the same, indicating that including patients with chronic pancreatitis does not affect model performance. This also supports the robustness of a model that bases its prediction not on single diagnoses but on the entire set of codes in disease trajectories.
Extended Data Fig. 6
Extended Data Fig. 6. Survival curves.
Overall survival in each dataset, stratified by cancer stage. Cases were ascertained using the methods described in Methods, and cancer stage was obtained from the respective dataset’s cancer registry. Stage was available only for a subset of patients.
