Nature. 2024 Dec;636(8043):728-736.
doi: 10.1038/s41586-024-08167-5. Epub 2024 Nov 6.

Automated real-world data integration improves cancer outcome prediction

Justin Jee #1; Christopher Fong #1; Karl Pichotta #1; Thinh Ngoc Tran #1; Anisha Luthra #1; Michele Waters 1; Chenlian Fu 1; Mirella Altoe 1; Si-Yang Liu 1; Steven B Maron 1,2; Mehnaj Ahmed 1; Susie Kim 1; Mono Pirun 1; Walid K Chatila 1; Ino de Bruijn 1; Arfath Pasha 1; Ritika Kundra 1; Benjamin Gross 1; Brooke Mastrogiacomo 1; Tyler J Aprati 2; David Liu 2; JianJiong Gao 3; Marzia Capelletti 3; Kelly Pekala 1; Lisa Loudon 1; Maria Perry 1; Chaitanya Bandlamudi 1; Mark Donoghue 1; Baby Anusha Satravada 1; Axel Martin 1; Ronglai Shen 1; Yuan Chen 1; A Rose Brannon 1; Jason Chang 1; Lior Braunstein 1,2; Anyi Li 1; Anton Safonov 1; Aaron Stonestrom 1; Pablo Sanchez-Vela 1; Clare Wilhelm 1; Mark Robson 1,2; Howard Scher 1,2; Marc Ladanyi 1; Jorge S Reis-Filho 1; David B Solit 1; David R Jones 1; Daniel Gomez 1; Helena Yu 1; Debyani Chakravarty 1; Rona Yaeger 1,4; Wassim Abida 1,4; Wungki Park 1,4; Eileen M O'Reilly 1,4; Julio Garcia-Aguilar 1,4; Nicholas Socci 1; Francisco Sanchez-Vega 1; Jian Carrot-Zhang 1; Peter D Stetson 1; Ross Levine 1,4; Charles M Rudin 1,4; Michael F Berger 1; Sohrab P Shah 1; Deborah Schrag 1,4; Pedram Razavi 1,4; Kenneth L Kehl 2; Bob T Li 1,4; Gregory J Riely 1,4; Nikolaus Schultz 5; MSK Cancer Data Science Initiative Group
Justin Jee et al. Nature. 2024 Dec.

Abstract

The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations (refs. 1,2) with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research.

Conflict of interest statement

Competing interests: S.B.M. declares professional services and activities for Amgen, Clinical Care Options, Daiichi Sankyo, Elevation Oncology, MedPage Today, Novartis, Physicians’ Education Resource, Pinetree Therapeutics, Purple Biotech and Vindico Medical Education; and equity in McKesson. L.B. declares professional services and activities for the Cancer Prevention & Research Institute of Texas. M.R. declares professional services and activities (uncompensated) for Artios Pharma, AstraZeneca, Foundation Medicine, Pfizer and Tempus Labs; and professional services and activities for Change Healthcare, Clinical Education Alliance, Genome Quebec, MJH Associates and myMedEd. M.L. declares equity in and professional services and activities (uncompensated) for Paige.AI. D.B.S. declares professional services and activities for American Association for Cancer Research, BridgeBio, Fog Pharmaceuticals, Paige.AI, Pfizer, Rain Therapeutics; and equity in and professional services and activities for Elsie Biotechnologies, Fore Biotherapeutics, Function Oncology, Pyramid Biosciences and Scorpion Therapeutics. D.R.J. declares professional services and activities for AstraZeneca, Dava Oncology and MORE Health; and professional services and activities (uncompensated) for Merck & Co. D.G. declares professional services and activities for AstraZeneca, Grail, Johnson & Johnson, Med Learning Group, Medtronic and Varian Medical Systems. H.Y. declares professional services and activities for AbbVie, AstraZeneca, Black Diamond Therapeutics, Blueprint Medicines, C4 Therapeutics, Daiichi Sankyo, Ipsen Pharma, Janssen Pharmaceuticals, Taiho and Takeda Pharmaceuticals. R.Y. declares professional services and activities for Mirati Therapeutics and Zai Lab. W.A. declares professional services and activities for AstraZeneca, Clinical Education Alliance, Janssen Oncology and Touch Independent Medical Education. W.P. declares professional services and activities for Astellas. J.G.-A. declares professional services and activities for Ethicon; and equity in and professional services and activities for Intuitive Surgical. P.D.S. declares professional services and activities for the National Comprehensive Cancer Network and the National Institutes of Health. R. Levine declares equity, a fiduciary role or position and intellectual property rights in and professional services and activities (uncompensated) for Ajax Therapeutics; equity in Anovia Biosciences, Bakx Therapeutics, Epiphanes, Imago Biosciences and Syndax; professional services and activities for AstraZeneca, Genome Quebec, Goldman Sachs, Incyte, Janssen Pharmaceuticals and Jubilant Therapeutics; equity in and professional services and activities (uncompensated) for Auron Therapeutics and the Isoplexis Corporation; equity in and professional services and activities for C4 Therapeutics, Kurome Therapeutics, Mana Therapeutics, Mission Bio, Prelude Therapeutics, Scorpion Therapeutics, Zentalis Pharmaceuticals; intellectual property rights in the Cure Breast Cancer Foundation and Epizyme; professional services and activities (uncompensated) for the ECOG-ACRIN Cancer Research Group; equity and a fiduciary role or position in and professional services and activities (uncompensated) for Qiagen; and a fiduciary role or position in and professional services and activities for The Mark Foundation. C.M.R. declares professional services and activities for Amgen, AstraZeneca, Bridge Medicines, D2G Oncology, Harpoon Therapeutics and Jazz Pharmaceuticals; intellectual property rights in Daiichi Sankyo; and equity in Earli. M.F.B. declares professional services and activities for AstraZeneca and Paige.AI; professional services and activities (uncompensated) for JCO Precision Oncology and the Journal of Molecular Diagnostics; and intellectual property rights in SOPHiA GENETICS. P.R. declares professional services and activities for Biovica, Inivata, Novartis, Prelude Therapeutics and SAGA Diagnostics; professional services and activities (uncompensated) for Guardant Health, Paige.AI and Tempus Labs; and equity, a fiduciary role or position and intellectual property rights in Odyssey Biosciences. B.T.L. declares professional services and activities (uncompensated) for Amgen, the Asia Society, AstraZeneca, Bolt Biotherapeutics and Daiichi Sankyo; and intellectual property rights in Karger Publishers and Shanghai Jiao Tong University Press. G.J.R. declares professional services and activities (uncompensated) for the American Association for Cancer Research, the American Society of Clinical Oncology, Mirati Therapeutics, Pfizer, Takeda Pharmaceuticals and Verastem; and professional services and activities for Harborside Press, MJH Associates, the National Comprehensive Cancer Network, Phillips Gilmore Oncology Communications, Research to Practice and Triptych Health Partners. H.S. declares professional services and activities for Bayer, Pfizer, Regeneron Pharmaceuticals, Sanofi and WCG Oncology; and intellectual property rights in Elucida Oncology. J.S.R.-F. is an employee of AstraZeneca, has served as a consultant for Goldman Sachs, Paige.AI and REPARE Therapeutics; and has served as an adviser for Roche, Genentech, Roche Tissue Diagnostics, Ventana, Novartis, InVicro, GRAIL, Goldman Sachs, Paige.AI and Volition RX. J. Gao and M.C. are employees of Caris.

Figures

Fig. 1. Study overview.
a, Creating MSK-CHORD. p, probability; DFCI, Dana Farber Cancer Institute; UHN, University Health Network; VICC, Vanderbilt-Ingram Cancer Center. b, NLP model library performance assessed by either cross-validation or held-out validation in the MSK-BPC cohort (Methods). Source text includes radiology impressions (R), medical oncology notes (M) or histopathology reports (P). Randomly selected false positive (FP) and false negative (FN) cases were independently reviewed to audit reasons for model failure; in several cases (purple), the original curation labels were incorrect. Raw numbers are given in Supplementary Table 1. NA, not applicable; that is, an independent curator determined that the source document did not actually contain sufficient information to determine the status of the variable in question. c, MSK-CHORD characteristics overview. Age box plots show median, quartiles and ±95th percentile. Bar charts show proportion of patients with a given feature. Genomic alterations include only those annotated as oncogenic by OncoKB and were derived from tumour biopsy sequencing by MSK-IMPACT. Age, sex (male reference) and survival outcomes were derived from structured data. Kaplan–Meier survival curves for the individual cohorts are shown with median survival denoted by a red hash mark. Bar charts represent the percentage of patients with a given characteristic at time of cohort entry. Additional characteristics in MSK-CHORD such as tumour stage, specific institutional treatments and tumour markers are not shown. d, Visualizing patient-level data in cBioPortal, in this case a patient (P-0050196) with prostate adenocarcinoma who was treated with definitive radiation for stage III disease, and then developed metastatic recurrence in the lung and received treatment with multiple lines of therapy including pembrolizumab for MSI found on MSK-IMPACT. m, months; PSA, prostate-specific antigen; AJCC, American Joint Committee on Cancer; RT, radiation therapy.
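The median survival marked on a Kaplan–Meier curve can be read off the product-limit estimate as the first event time at which the estimated survival drops to 0.5 or below. The following is a minimal sketch of that calculation, assuming right-censored data only (no left truncation); the function names and the toy follow-up times are hypothetical and are not taken from MSK-CHORD or the authors' code.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival) at each distinct event time.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


def median_survival(curve):
    """Smallest event time with S(t) <= 0.5, or None if the curve never gets there."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical toy cohort (months of follow-up, event indicator).
toy_times = [2, 3, 3, 5, 8, 9, 12]
toy_events = [1, 1, 0, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
toy_median = median_survival(toy_curve)
```

Returning None when the curve stays above 0.5 mirrors the usual convention that the median is "not reached" in cohorts with few events.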
Fig. 2. Using MSK-CHORD for adequately powered clinicogenomic analysis.
a, Kaplan–Meier curves depicting OS and hazard ratios for patients with NSCLC treated with immune checkpoint blockade at time of cohort entry to time of death, stratified by PDL1 status in the MSK-BPC cohort and MSK-CHORD cohort. b, Left: odds ratios ± 95% CI for known post-treatment alterations in the smaller, manually curated MSK-BPC cohort. Right: odds ratios ± 95% CI for known post-treatment alterations in the MSK-CHORD cohort, stratified by NLP-identified or institutionally given prior treatment (tx) or both. inst., institutional. Clonal haematopoiesis (CH) analyses are performed using a subset of MSK-CHORD with previously published clonal haematopoiesis calls. *0/34 patients with breast cancer without prior treatment in MSK-BPC had ESR1 alterations, and 0/30 patients with EGFR-mutant (EGFRm) NSCLC without prior treatment had MET alterations; hence, the odds ratio is infinity for these groups. c, Proportion of patients with prostate cancer with the listed gene alterations (oncogenic by OncoKB) as a function of Gleason score (NLP-derived) in the MSK-BPC cohort (n = 561) and MSK-CHORD cohort (n = 3,211). Volcano plots show slope coefficients and two-sided P values from linear regression; red dots mark relationships whose multiple-hypothesis-corrected q values fall below a false discovery rate of 0.05 (Benjamini–Hochberg method), and insets show proportions of the total cohort for selected genes ± binomial 95% CI.
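The Benjamini–Hochberg correction named in this caption converts a set of P values into q values by scaling each P value by (number of tests / rank) and enforcing monotonicity from the largest P value downward. A minimal sketch follows; the function name and the example P values are hypothetical illustrations, not results from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q values in the same order as the input P values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest P first
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so each q value is the
    # minimum of its own scaled P value and every q value ranked above it.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical P values for five gene-level tests.
example_p = [0.001, 0.008, 0.039, 0.041, 0.6]
example_q = benjamini_hochberg(example_p)
significant = [q < 0.05 for q in example_q]
```

In this toy input, two tests with nominal P < 0.05 no longer pass after correction, which is the behaviour the red and non-red dots in a volcano plot distinguish.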
Fig. 3. Integrated multimodal models predict OS.
a, c indices from RSFs by cancer type, stage and data modality (x axis) validated in fivefold cross-validation. ‘All’ denotes incorporation of all listed modalities into the model. *P < 0.05, unadjusted for multiple hypotheses, by one-sided t-test compared to next best-performing model. For stage IV NSCLC, prostate, CRC, pancreas and breast cancer, P values are 2 × 10^−7, 0.0003, 0.005, 0.003 and 3 × 10^−5, respectively. For stage I–III disease, P values are 0.003, 0.049, 0.008, 0.001 and 0.002, respectively. b, Scatter plot comparing mean c indices from fivefold cross-validation within MSK-CHORD (error bars represent ±95% CI) versus c indices from the same models trained on the entire MSK-CHORD cohort of a given cancer type and tested on an external validation dataset (the corresponding non-MSK-BPC cohorts). Colour legend is the same in both plots. For a,b, total number of patients in each cohort is given in Supplementary Table 6. c, Risk score distribution for the non-MSK-BPC cohorts and survival curves based on computed risk quartiles for patients with NSCLC.
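The c index reported here measures how often a model ranks two patients' risks in the same order as their observed survival. The sketch below computes Harrell's concordance index for right-censored data; it is an illustration only, with hypothetical names and toy values, it ignores tied survival times for simplicity, and it is not the authors' evaluation code.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event and a
    shorter follow-up time than patient j. The pair is concordant when the
    model gave patient i the higher risk score; tied scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs; c index is undefined")
    return concordant / comparable


# Hypothetical toy data: follow-up in months, event indicator, model risk.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
perfect = concordance_index(toy_times, toy_events, [0.9, 0.7, 0.1, 0.4])
partial = concordance_index(toy_times, toy_events, [0.9, 0.1, 0.7, 0.4])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why differences of a few hundredths between data modalities can still be meaningful.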
Fig. 4. Analysis of time to metastatic site colonization.
a, Time to metastatic site colonization among patients with LUAD, hormone-receptor-positive breast cancer, CRC with MSS, pancreatic adenocarcinoma (pancreas) and prostate cancer. Cohorts included the manually curated BPC and NLP-derived MSK-CHORD cohorts. b, Hazard ratios (colour), number of patients with alteration before site colonization (size) and statistical significance (Benjamini–Hochberg false discovery rate of 0.01, black outline) within MSK-CHORD. Analyses are adjusted for prior treatment, stage and histologic subtype. Only genes with at least one significant association in at least one cancer type (Benjamini–Hochberg q < 0.01) are shown. The inset depicts Kaplan–Meier curves of the cancer type and metastatic site highlighted in the grey rectangle stratified by RB1 status. WT, wild type.
Fig. 5. SETD2 in LUAD.
a, log2[Odds ratio] and P value (from two-sided Fisher’s exact test) for associations of SETD2 oncogenic alterations with other oncogenic gene alterations. Red indicates q < 0.05 by Benjamini–Hochberg. Inset: frequencies of associated genes ± binomial 95% CI. b, Proportion with features ± binomial 95% CI. P values by two-sided Fisher’s exact test for mixed adenocarcinoma, mucinous, acinar, PDL1 and smoker variables were 0.63, 0.02, 0.87, 0.16 and 0.54, respectively. c, TMB with P value from two-sided Mann–Whitney U test (P = 2 × 10^−9). Box plots show medians and interquartile ranges with ±95th percentile whiskers. For a–c, n = 199 SETD2 mutant cases and n = 5,766 wild-type cases. d, OS from time of tumour sequencing and time to next treatment or death by treatment. Groups compared with Cox proportional hazards. e, Hazard ratios (mean ± 95% CI) for time to next treatment or death for patients with TMB < 10 mutations per megabase treated with immunotherapy based on SETD2 status. Left dashed line, hazard ratio for all cohorts in meta-analysis. Right dashed line, hazard ratio of 1.0.
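Co-alteration analyses of this kind reduce each gene pair to a 2 × 2 table, from which an odds ratio and a two-sided Fisher's exact P value are computed. The sketch below shows one common definition of the two-sided test (summing all tables no more probable than the observed one); the function names and counts are hypothetical, and the study may have used a library routine with slightly different tie handling.

```python
import math
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the table [[a, b], [c, d]]."""
    if b * c == 0:
        # A zero cell in the denominator gives an infinite odds ratio,
        # or an undefined one when the numerator is zero as well.
        return math.inf if a * d > 0 else math.nan
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table at least as extreme (no more probable) than the observed
    # one; the small tolerance guards against floating-point ties.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts: rows = gene A altered / wild type, columns = gene B altered / wild type.
example_or = odds_ratio(3, 1, 1, 3)
example_p = fisher_exact_two_sided(3, 1, 1, 3)
```

The infinite branch of `odds_ratio` corresponds to the situation where no patient in one group carries the alteration, so no finite ratio can be estimated.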
Extended Data Fig. 1. Mismatch repair in immunohistochemistry and genomics.
a. Relationship between mismatch repair (MMR) proficiency (pMMR)/deficiency (dMMR) on immunohistochemistry as annotated by NLP and microsatellite instability (MSI) as determined by MSK-IMPACT (MSISensor cutoff of 10, excluding indeterminate cases). Boxplots depict median and interquartile ranges (IQRs) with whiskers corresponding to 1.5 × IQR. b. Kaplan-Meier curves show time to next treatment for patients with stage IV colorectal cancer treated with immunotherapy (IO), stratified by MMR/MSI type.
Extended Data Fig. 2. Clinical and genomic representations of smoking.
a. Proportion of patients with NSCLC (of the whole cohort) and oncogenic EGFR or KRAS alterations by clinical (NLP-derived) smoking status and smoking mutational signature status (+/− binomial 95% CI) in MSK-CHORD. Inset shows the distribution of dominant mutational signatures for the clinical smoking NLP +, SigMA smoking signature − subgroup. b. Scatterplot showing SBS4 observed from whole exome sequencing vs. pack years smoked at time of initial visit based on manual curation. c. Scatterplot showing tumor mutational burden (TMB) vs. pack years smoked in the exome cohort. d. Bar charts showing proportion and binomial 95% CI with a driver EGFR or KRAS mutation among patients with a significant clinical smoking history (≥15 pack years) and a non-dominant smoking signature in the exome cohort. e. Boxplots showing median, Q1-Q3, and 5th–95th percentile tumor purity among patients with ≥15 pack year smoking history, stratified by SBS4 status in the exome cohort.
Extended Data Fig. 3. Comparison of survival analyses between The Cancer Genome Atlas (TCGA) and MSK-CHORD.
a. Volcano plots showing Cox proportional hazards models for specific oncogenic (by OncoKB) gene alterations (for all genes altered in at least 2% of the respective cohort) from time of diagnosis to time of death, right censored at last follow-up. For MSK-CHORD, data are left truncated at time of sequencing (cohort entry), and only patients with stage I-III disease at diagnosis are shown. b. Selected representative survival curves stratified by oncogenic gene alteration presence. For example, STK11 mutation is associated with worse survival in both cohorts, although a sufficiently large cohort is required to show statistical robustness. EGFR mutation is associated with better OS only in MSK-CHORD, as these patients were treated following the advent of EGFR-targeted therapy, which was not standard of care during the timeframe of TCGA.
Extended Data Fig. 4. Augmenting MSK-CHORD for predictive modeling.
Mean c-indices from random survival forests by cancer type, stage and data modality (x axis) validated in 5-fold cross-validation using (a) secondary genomic data and performance status within the MSK-CHORD pancreatic cancer cohort and (b) radLongformer. Dots correspond to results from individual validation folds.
Extended Data Fig. 5. Risk modeling.
Risk score distribution for the non-MSK BPC cohorts and Kaplan-Meier survival curves based on computed risk quartiles for patients with pancreatic cancer.
Extended Data Fig. 6. RB1 alterations in metastatic samples.
Frequency (proportion of total cohort) of oncogenic RB1 alterations (+/− binomial 95% CI) in sequenced samples taken from the listed sites across the five studied cancer types. *P < 0.05 by two-sided Fisher's exact test.
Extended Data Fig. 7. Derived genomic features and risk of future metastasis.
Bubble plots showing hazard ratios (color), number of patients with alteration prior to site colonization (size) and statistical significance (Benjamini Hochberg FDR 0.01, black outline) for (a) pathway-level oncogenic alterations and (b) chromosome arm-level amplifications or deletions.
Extended Data Fig. 8. Metastatic potential of SETD2 mutant lung adenocarcinoma across multiple datasets.
Hazard ratios +/−95%CI from Cox proportional hazards models as described in Methods. Combined hazard ratios are from random effects meta-analyses for (a) CNS metastasis, (b) overall survival (OS), and (c) time to next treatment or death from immunotherapy start for patients with lung adenocarcinoma and TMB>10 mut/Mb.
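A combined hazard ratio from a random-effects meta-analysis can be obtained by pooling per-cohort log hazard ratios with weights that account for between-cohort heterogeneity. The caption does not state which estimator was used; the sketch below assumes the DerSimonian–Laird estimator and recovers each cohort's standard error from its 95% CI. The function name and the input values are hypothetical and do not correspond to any cohort in the study.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def pool_hazard_ratios(studies):
    """Random-effects (DerSimonian-Laird, assumed) pooling of hazard ratios.

    studies -- list of (hazard_ratio, ci_lower, ci_upper) tuples
    Returns (pooled_hr, pooled_lower, pooled_upper).
    """
    log_hr = [math.log(hr) for hr, _, _ in studies]
    # Standard error on the log scale, recovered from the width of the 95% CI.
    var = [((math.log(hi) - math.log(lo)) / (2 * Z95)) ** 2 for _, lo, hi in studies]
    w = [1 / v for v in var]

    # Fixed-effect estimate and Cochran's Q quantify heterogeneity.
    fixed = sum(wi * yi for wi, yi in zip(w, log_hr)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_hr))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(studies) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add the between-study variance to each study.
    w_re = [1 / (v + tau2) for v in var]
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_hr)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return tuple(math.exp(x) for x in (pooled, pooled - Z95 * se, pooled + Z95 * se))


# Hypothetical example: two cohorts reporting the same hazard ratio and interval.
pooled_hr, pooled_lo, pooled_hi = pool_hazard_ratios([(0.5, 0.25, 1.0), (0.5, 0.25, 1.0)])
```

When cohorts agree, the between-study variance is zero and the pooled interval is narrower than any single cohort's interval; when they disagree, the interval widens accordingly.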
Extended Data Fig. 9. Further SETD2 genomic correlations.
Volcano plot showing co-alteration or mutual exclusivity with SETD2 driver mutations in patients with lung adenocarcinoma in a large cohort with exome sequencing (Caris).

References

    1. Kehl, K. L. et al. Artificial intelligence-aided clinical annotation of a large multi-cancer genomic dataset. Nat. Commun. 12, 7304 (2021).
    2. Fries, J. A. et al. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat. Commun. 12, 2017 (2021).
    3. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 6000–6010 (2017).
    4. Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 10.1038/s41586-023-06160-y (2023).
    5. Li, Y., Wehbe, R. M., Ahmad, F. S., Wang, H. & Luo, Y. A comparative study of pretrained language models for long clinical text. J. Am. Med. Inform. Assoc. 30, 340–347 (2023).
