Comparative Study

J Am Med Inform Assoc. 2024 Oct 1;31(10):2315-2327.

doi: 10.1093/jamia/ocae146.

A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports

Madhumita Sushil¹, Travis Zack¹,², Divneet Mandair¹,², Zhiwei Zheng³, Ahmed Wali³, Yan-Ning Yu³, Yuwei Quan³, Dmytro Lituiev¹, Atul J Butte¹,²,⁴,⁵

Affiliations

¹ Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States.
² Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA 94158, United States.
³ University of California, Berkeley, Berkeley, CA 94720, United States.
⁴ Center for Data-driven Insights and Innovation, University of California, Office of the President, Oakland, CA 94607, United States.
⁵ Department of Pediatrics, University of California, San Francisco, San Francisco, CA 94158, United States.

PMID: 38900207
PMCID: PMC11413420
DOI: 10.1093/jamia/ocae146

Abstract

Objective: Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations.

Materials and methods: We curated a dataset of 769 breast cancer pathology reports, manually labeled with 12 categories, to compare zero-shot classification capability of the following LLMs: GPT-4, GPT-3.5, Starling, and ClinicalCamel, with task-specific supervised classification performance of 3 models: random forests, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model.

Results: Across all 12 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, LSTM-Att (average macro F1-score of 0.86 vs 0.75), with an advantage on tasks with high label imbalance. Other LLMs demonstrated poor performance. Frequent GPT-4 error categories included incorrect inferences from multiple samples and from history, and complex task design; several LSTM-Att errors were related to poor generalization to the test set.
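The macro F1-score reported above averages the per-class F1 scores with equal weight, which is why it is sensitive to performance on rare labels. The following is a minimal sketch of that metric in plain Python; it is not the authors' evaluation code, and the function name `macro_f1`, the toy labels, and the toy predictions are assumptions chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, highly imbalanced task: 9 "negative" reports, 1 "positive".
y_true = ["negative"] * 9 + ["positive"]
# A classifier that always predicts the majority class.
y_pred = ["negative"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

In this toy case the majority-class predictor reaches 90% accuracy but scores zero F1 on the rare class, so its macro F1 is roughly half of the majority-class F1. A model that handles rare labels well is rewarded by this metric in a way that accuracy would hide.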

Discussion: On tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of data labeling. However, if the use of LLMs is prohibitive, the use of simpler models with large annotated datasets can provide comparable results.

Conclusions: GPT-4 demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in clinical studies.

Keywords: breast cancer; electronic health records; large language models; natural language processing; pathology.

Conflict of interest statement

M.S. reports no financial associations or conflicts of interest. T.Z. is a medical advisor and minor shareholder at OpenEvidence.com. D.M. is a consultant to Third Rock Ventures. Z.Z. reports no financial associations or conflicts of interest. A.W. is currently an employee of Abbott. Y.-N.Y. is currently an employee of City of Hope. Y.Q. is currently an employee of X-camp Academy. D.L. is currently an employee and minor shareholder of Johnson and Johnson and a co-founder and major shareholder of Synthez AI Corp. A.J.B. is a co-founder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation, and in the recent past, Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, and Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson and Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease specific foundations and associations, and health systems. A.J.B. receives royalty payments through Stanford University for several patents and other disclosures licensed to NuMedii and Personalis.
A.J.B.’s research has been funded by NIH, Peraton (as the prime on an NIH contract), Genentech, Johnson and Johnson, FDA, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor’s Office of Planning and Research, California Institute for Regenerative Medicine, L’Oreal, and Progenity. None of these entities had any bearing on this research or interpretation of the findings.

Figures

**Figure 1.**
Flow diagram representing inclusion and exclusion criteria for breast cancer pathology report selection before data annotation. Number of patients and number of clinical notes is represented at each stage. The final annotated subset represents a random sample of the final representative dataset obtained in this manner.

**Figure 2.**
Sample of an annotated pathology report, along with the corresponding document-level annotation schema. The *Unknown* labels refer to the cases where a label could not be inferred based on the information provided in the pathology report.

**Figure 3.**
Class distribution for all tasks in the training data for supervised classification.

**Figure 4.**
Classification performance, as measured by % Macro F1 score, for different models for each classification task. The LSTM model, the UCSF-BERT model, and the Random Forests model were trained in a supervised setup on task-specific training data. All other models (GPT-3.5, GPT-4, Starling-7B-beta, and ClinicalCamel-70B) were queried and evaluated in a zero-shot setup, ie, without any further task-specific training.

Update of

A comparative study of zero-shot inference with large language models and supervised modeling in breast cancer pathology classification.
Sushil M, Zack T, Mandair D, Zheng Z, Wali A, Yu YN, Quan Y, Butte AJ. Res Sq [Preprint]. 2024 Feb 6:rs.3.rs-3914899. doi: 10.21203/rs.3.rs-3914899/v1. Update in: J Am Med Inform Assoc. 2024 Oct 1;31(10):2315-2327. doi: 10.1093/jamia/ocae146. PMID: 38405831.
