2025 Jul 2;8(1):397.
doi: 10.1038/s41746-025-01780-2.

Large language model trained on clinical oncology data predicts cancer progression


Menglei Zhu et al. NPJ Digit Med.

Abstract

Subspecialty knowledge barriers have limited the adoption of large language models (LLMs) in oncology. We introduce Woollie, an open-source, oncology-specific LLM trained on real-world data from Memorial Sloan Kettering Cancer Center (MSK) across lung, breast, prostate, pancreatic, and colorectal cancers, with external validation using University of California, San Francisco (UCSF) data. Woollie surpasses ChatGPT in medical benchmarks and excels in eight non-medical benchmarks. Analyzing 39,319 radiology impression notes from 4002 patients, it achieved an overall area under the receiver operating characteristic curve (AUROC) of 0.97 for cancer progression prediction on MSK data, including a notable 0.98 AUROC for pancreatic cancer. On UCSF data, it achieved an overall AUROC of 0.88, excelling in lung cancer detection with an AUROC of 0.95. As the first oncology-specific LLM validated across institutions, Woollie demonstrates high accuracy and consistency across cancer types, underscoring its potential to enhance cancer progression analysis.
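The AUROC figures above summarize how well per-report progression scores rank true progression cases above non-progression cases. As a minimal illustration (the scores and labels below are hypothetical, not data from the study), AUROC can be computed directly via the Mann-Whitney rank identity:

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney identity: the probability that a randomly
    chosen positive (progression) report is scored above a randomly chosen
    negative one, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one report of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model probabilities for six radiology impressions
y = [1, 1, 1, 0, 0, 0]
p = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(auroc(y, p))  # 8/9 ≈ 0.889
```

A perfect ranker scores 1.0 and a random one 0.5, which is why the jump from 0.50 (baseline Llama) to 0.97 (Woollie MSK 33B) reported below is substantial.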


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Stacked refinement of Woollie models, from pre-training to oncology domain fine-tuning.
a Overview of the pre-training process for the baseline Llama models, highlighting unsupervised training on a 1.4-trillion-token dataset, including the Common Crawl corpus and knowledge from medicine, engineering, mathematics, biology, etc. This stage encodes foundational knowledge across domains into models available in four sizes: 7 billion (B), 13B, 33B, and 65B parameters. b The domain knowledge alignment process for the baseline Llama models using the Chain of Thought (CoT), Alpaca, OpenAssistant, and InstructionWild datasets, creating the Woollie Foundation models. This supervised learning step trains the models to answer questions correctly, enhancing their reasoning, logic, and conversational abilities. c Further alignment with the general medical and oncology domains using the MedQuAD, PubMedQA, MedMCQA, and USMLE datasets, yielding Woollie Medicine and Woollie. Woollie Medicine is an intermediate model for evaluating the effectiveness of the stacked alignment method. d Final fine-tuning within the oncology domain using a proprietary MSK dataset of 38,719 radiology reports, manually curated by radiologists, from 3402 patients across five cancer types: breast, colorectal, lung, pancreatic, and prostate. This step trains the Woollie MSK models (7B and 33B versions) to determine tumor progression; the models are benchmarked against a test dataset and cross-institutionally validated on a UCSF dataset covering lung, breast, and prostate cancers. e Summary of the fourteen Woollie models, detailing the datasets used for alignment and fine-tuning. The table categorizes the datasets into three groups: reasoning, logic, and conversation; general medical domain; and oncology domain, including the proprietary MSK dataset of radiology impressions.
An asterisk marks the selective use of 10,000 high-quality examples from the OpenAssistant dataset (OASST1), which contains 160,000 human-created and annotated conversations in various languages.
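The key property of the stacked alignment described in panels a through d is ordering: each stage fine-tunes the checkpoint produced by the previous stage rather than mixing all datasets into one run (the contrast later drawn with Woollie All). A schematic sketch of that schedule, where `finetune` stands in for an unspecified training step and all names are illustrative rather than the authors' code:

```python
# Stacked-alignment schedule from Fig. 1: each stage starts from the
# checkpoint of the stage before it. Stage 0 is the pretrained baseline.
STAGES = [
    ("pretrain", ["Common Crawl"]),                                    # Llama baseline
    ("reasoning/conversation", ["CoT", "Alpaca", "OASST1",
                                "InstructionWild"]),                   # Woollie Foundation
    ("general medicine", ["MedQuAD", "PubMedQA", "MedMCQA", "USMLE"]), # Woollie Medicine / Woollie
    ("oncology", ["MSK radiology impressions"]),                       # Woollie MSK
]

def run_stacked(finetune, model):
    """Apply each alignment stage in order; `finetune` is a user-supplied
    training step taking (model, datasets, stage) and returning a model."""
    for name, datasets in STAGES[1:]:  # skip stage 0: pretraining is given
        model = finetune(model, datasets, stage=name)
    return model
```

Training all three alignment corpora jointly instead of sequentially is exactly the Woollie All configuration, which panel b of Fig. 2 shows suffers from catastrophic forgetting.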
Fig. 2
Fig. 2. Performance comparisons of Woollie models on benchmarks, the influence of model size, and improvements in the medical domain.
a The stacked alignment strategy mitigates catastrophic forgetting in LLMs. Comparing the baseline Llama 65B model to the Woollie 65B model, modest improvements appear on standard benchmarks testing reasoning and logic in non-medical domains, while significant improvements are evident in the medical domain. Medical-domain performance is accentuated in the shaded sections, illustrating how stacked alignment enhances performance while preserving capabilities in general domains. b Comparison of all stacked-aligned 33B models (Woollie Foundation, Woollie Medicine, and Woollie) with the non-stacked-aligned 33B model (Woollie All). The results demonstrate that stacked alignment improves model performance incrementally, whereas non-stacked alignment leads to catastrophic forgetting and poorer performance. c A scaling study plots the performance of Woollie models against model sizes ranging from 7B to 65B parameters across 11 tests. Larger models generally achieve better performance, though the improvement plateaus noticeably between the 33B and 65B models. This informs model selection for clinical applications, balancing performance against resource consumption. d A detailed performance comparison among twelve Woollie and four Llama models across 11 tests, depicted as a heatmap. The color intensity of each cell reflects the mean relative performance on each test. The heatmap is divided into non-medical domains on the left and medical domains on the right, categorizing the models into Llama, Woollie Foundation, Woollie Medicine, and Woollie. This visualization underscores the performance improvements achieved through stacked alignment, with a clear left-to-right transition highlighting advancements in the medical and oncology domains across the models.
Fig. 3
Fig. 3. Sociodemographic characteristics and patient cohort distribution in the MSK radiology impression dataset.
a Sociodemographic distribution of the MSK radiology impression dataset, categorized by “Age at Procedure,” “Birth Sex,” “Marital Status,” “Race,” “Religion,” and “Ethnicity.” b Tables provide the number of reports and unique patients for each cancer type. c The MSK radiology impression dataset, manually curated by radiologists, classifies cancer progression into five categories: Progressing/Worsening/Enlarging, Stable/No change, Improving/Responding, Not Stated/Indeterminate, and Mixed. For each cancer type—colorectal, pancreatic, breast, prostate, lung—we detail the number of reports and the distribution percentage of these five labels within each type. d Distribution of the MSK radiology impression dataset, featuring labels for progression and non-progression across five cancers: breast, colorectal, lung, prostate, and pancreatic, comprising 38,719 reports from 3402 patients.
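Panel d reduces the five curated labels of panel c to binary progression versus non-progression targets. The exact collapse rule is not stated in the legend; the mapping below is an illustrative assumption, treating only the first category as progression:

```python
# Hypothetical collapse of the five radiologist-curated labels into the
# binary targets of panel d. This mapping is an assumption for
# illustration; the paper's legend does not specify it.
PROGRESSION = {"Progressing/Worsening/Enlarging"}
NON_PROGRESSION = {"Stable/No change", "Improving/Responding",
                   "Not Stated/Indeterminate", "Mixed"}

def binarize(label: str) -> int:
    """Return 1 for progression, 0 for non-progression."""
    if label in PROGRESSION:
        return 1
    if label in NON_PROGRESSION:
        return 0
    raise ValueError(f"unknown label: {label!r}")
```

Ambiguous categories such as Mixed or Not Stated/Indeterminate could plausibly be handled differently (e.g., excluded), which would change the class balance shown in panel d.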
Fig. 4
Fig. 4. High-performance cancer progression prediction by Woollie MSK models fine-tuned on MSK oncology data, including their performance metrics and comparisons with other models.
a Comparison of Llama, Woollie Foundation, Woollie Medicine, Woollie, Woollie MSK, and non-stacked-aligned Woollie All models on the MSK radiology impression dataset (rad_imp) for binary classification of cancer progression. The Woollie 33B model achieves an accuracy of 0.79, outperforming the 65B model at 0.77. Fine-tuned Woollie MSK models achieve superior accuracies of 0.86 (7B) and 0.90 (33B) across all five cancer types. Non-stacked-aligned Woollie All models lag behind the stacked-aligned Woollie models. b Woollie MSK models, fine-tuned on top of existing Woollie models, improve in the general medical domain on tests such as PubMedQA, USMLE, and MedMCQA. Fine-tuning on the MSK oncology dataset raised Woollie MSK 33B's accuracy to 0.83 from 0.80 on PubMedQA, to 0.48 from 0.45 on MedMCQA, and to 0.53 from 0.49 on USMLE. The Woollie MSK 7B model shows similar gains across all tests. c ROC plot illustrating the performance of Woollie MSK 7B on the MSK dataset, with AUROC rising from 0.50 for the baseline Llama model to 0.94 for Woollie MSK 7B. The right panel shows the larger Woollie MSK 33B model reaching an AUROC of 0.97, compared to 0.87 for the Llama model. d Comparative performance between the Llama 7B and Woollie MSK 7B models across the five labels reveals a significantly higher (p < 0.001) micro-average AUROC of 0.93 for Woollie MSK 7B (right) versus 0.63 for Llama 7B (left). A comparison between the Llama 33B and Woollie MSK 33B models on the same dataset and labels shows an AUROC of 0.97 for Woollie MSK 33B versus 0.80 for Llama 33B. Furthermore, the Woollie MSK 33B model demonstrates enhanced performance in the confusion matrix, with a higher accuracy of 0.82 compared to 0.76 for Woollie MSK 7B.
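The micro-average AUROC reported in panel d pools every (report, label) one-vs-rest decision across the five labels into a single binary ranking problem, then scores that pooled problem with one ROC curve. A self-contained sketch on hypothetical five-label scores (not data from the study):

```python
def auroc(labels, scores):
    """Binary AUROC via the Mann-Whitney identity (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def micro_average_auroc(y_onehot, y_score):
    """Flatten the (report x label) indicator and score matrices into one
    pooled binary problem; its AUROC is the micro-average."""
    flat_y = [y for row in y_onehot for y in row]
    flat_s = [s for row in y_score for s in row]
    return auroc(flat_y, flat_s)

# Hypothetical probabilities for three reports over five progression labels
y_true = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 1, 0]]
y_prob = [[0.7, 0.1, 0.1, 0.05, 0.05],
          [0.2, 0.6, 0.1, 0.05, 0.05],
          [0.1, 0.1, 0.2, 0.5, 0.1]]
print(micro_average_auroc(y_true, y_prob))  # 1.0
```

Micro-averaging weights every decision equally, so frequent labels such as Stable/No change dominate the score, unlike a macro average, which weights each label equally.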
Fig. 5
Fig. 5. Cross-institution validation of model performance in predicting cancer progression on MSK and UCSF datasets.
a Sociodemographic distribution of the UCSF radiology impression dataset, used exclusively as an independent validation dataset and not for Woollie MSK fine-tuning. The UCSF dataset has a different sociodemographic distribution from the MSK dataset. b The UCSF dataset includes 600 reports from 600 unique patients, covering prostate, lung, and breast cancers, distinct from the MSK data. c ROC curves for the Woollie MSK 7B and 33B models, fine-tuned on MSK oncology data, demonstrate performance on the UCSF dataset. The Woollie MSK 7B model achieves an AUROC of 0.89, slightly better than the 0.88 of Woollie MSK 33B, suggesting that smaller models may outperform larger ones on this dataset due to less bias but increased variance. d Comparison of Woollie MSK models on both MSK and UCSF datasets shows superior performance on MSK data, though the knowledge transfer to UCSF is clear. Despite lagging behind the MSK performance, the trend is consistent, indicating effective cross-institutional validation of cancer progression prediction with an open-source LLM. e Precision scores are visualized on a heatmap, with color intensity indicating effectiveness in detecting each cancer type; colorectal and pancreatic cancers are absent from the UCSF dataset. Precision scores are crucial for closely monitoring progressive cases. While Woollie MSK 7B shows higher AUROC scores on the UCSF dataset, Woollie MSK 33B excels in precision, notably with a score of 0.99 in detecting lung cancer. The fine-tuned Woollie models significantly outperform Llama models in accurately tracking cancer progression, underscoring the practical value of inter-institutionally transferred knowledge.
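Precision, the metric mapped in panel e, answers the question most relevant to monitoring progressive cases: of the reports the model flags as progressing, what fraction actually progressed? A minimal sketch on hypothetical predictions:

```python
def precision(y_true, y_pred):
    """Precision for the positive (progression) class:
    true positives / (true positives + false positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical labels and predictions for four reports:
# the model flags three as progressing, two of which truly are.
print(precision([1, 1, 0, 0], [1, 1, 1, 0]))  # 0.666...
```

High precision (e.g., the 0.99 reported for Woollie MSK 33B on lung cancer) means flagged progression cases are rarely false alarms, though precision alone says nothing about progressions that were missed (recall).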
Fig. 6
Fig. 6. Model parsing of disease trajectories and biology among different malignancies.
Salient topics extracted with Woollie MSK 33B from radiology reports of MSK patients with a breast, b lung, c colorectal, d prostate, and e pancreatic cancer. f Summary of the salient topics across all diseases, notably enriched for sites of distant metastatic seeding. g Sankey plots demonstrating trajectories of metastatic disease across the five cancer types.
