[Preprint]. 2024 Feb 6:rs.3.rs-3914899.
doi: 10.21203/rs.3.rs-3914899/v1.

A comparative study of zero-shot inference with large language models and supervised modeling in breast cancer pathology classification

Madhumita Sushil et al. Res Sq. 2024.

Abstract

Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs can reduce the need for large-scale data annotations. We curated a manually labeled dataset of 769 breast cancer pathology reports, labeled with 13 categories, to compare the zero-shot classification capability of the GPT-4 and GPT-3.5 models with the supervised classification performance of three model architectures: a random forests classifier, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model. Across all 13 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, the LSTM-Att model (average macro F1 score of 0.83 vs. 0.75). The differences were more prominent on tasks with a high imbalance between labels. Frequent sources of GPT-4 errors included inferences from multiple samples and complex task design. On complex tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of large-scale data labeling. However, if the use of LLMs is prohibitive, simpler supervised models trained on large annotated datasets can provide comparable results. LLMs demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for curating large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in observational clinical studies.
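The abstract reports results as macro F1 and notes that differences were larger on tasks with imbalanced labels. The sketch below illustrates why macro F1 is sensitive to minority classes in a way that accuracy is not. It is only an illustration, not the authors' evaluation code: the function names, the label names, and the toy data are all assumptions made for this example.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced task: 9 "negative" reports and 1 "positive" report.
y_true = ["negative"] * 9 + ["positive"]
# A degenerate classifier that always predicts the majority class.
always_negative = ["negative"] * 10

print(f"accuracy = {accuracy(y_true, always_negative):.3f}")
print(f"macro F1 = {macro_f1(y_true, always_negative):.3f}")
```

In this toy case the majority-class predictor scores well on accuracy but poorly on macro F1, because the missed minority class contributes an F1 of zero with the same weight as the majority class.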

Conflict of interest statement

Financial Disclosures and Conflicts of Interest: MS, TZ, DM, ZZ, AW, YY, and YQ report no financial associations or conflicts of interest. AJB is a co-founder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation, and in the recent past, Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, and Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson and Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease specific foundations and associations, and health systems. Atul Butte receives royalty payments through Stanford University for several patents and other disclosures licensed to NuMedii and Personalis. Atul Butte’s research has been funded by NIH, Peraton (as the prime on an NIH contract), Genentech, Johnson and Johnson, FDA, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor’s Office of Planning and Research, California Institute for Regenerative Medicine, L’Oreal, and Progenity. None of these entities had any bearing on this research or interpretation of the findings.

Figures

Figure 1: Flow diagram representing inclusion and exclusion criteria for breast cancer pathology report selection before data annotation. The numbers of patients and clinical notes are shown at each stage. The final annotated subset is a random sample of the final representative dataset obtained in this manner.
Figure 2: Sample of an annotated pathology report, along with the corresponding document-level annotation schema. The Irrelevant note label refers to reports that are not related to a breast cancer diagnosis (further details in the annotation guidelines). The Unknown labels refer to cases where a label could not be inferred from the information provided in the pathology report.
Figure 3: Class distribution for all tasks in the training data for supervised classification.
Figure 4: Classification performance, as measured by macro F1, for different models on each classification task. All models other than GPT-3.5 and GPT-4 are trained in a supervised setup on task-specific training data. The GPT-3.5 and GPT-4 models are evaluated zero-shot, i.e., without any task-specific training.
Figure 5: Confusion matrices for GPT-4 classification in (a) single-labeled tasks, (b) multi-labeled tasks.
