Nat Commun. 2024 Nov 21;15(1):10104. doi: 10.1038/s41467-024-51465-9.

In-context learning enables multimodal large language models to classify cancer pathology images



Dyke Ferber et al. Nat Commun. 2024.

Abstract

Medical image classification requires labeled, task-specific datasets, which are used to train deep learning networks de novo or to fine-tune foundation models. However, this process is computationally and technically demanding. In language processing, in-context learning provides an alternative: models learn from examples embedded directly in the prompt, bypassing the need for parameter updates. Yet in-context learning remains underexplored in medical image analysis. Here, we systematically evaluate Generative Pretrained Transformer 4 with Vision capabilities (GPT-4V) with in-context learning on three cancer histopathology tasks of high importance: classification of tissue subtypes in colorectal cancer, colon polyp subtyping, and breast tumor detection in lymph node sections. Our results show that in-context learning is sufficient to match or even outperform specialized neural networks trained for particular tasks, while requiring only a minimal number of samples. In summary, this study demonstrates that large vision-language models trained on non-domain-specific data can be applied out of the box to solve medical image-processing tasks in histopathology. This democratizes access to generalist AI models for medical experts without a technical background, especially in areas where annotated data are scarce.
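As a rough illustration of the in-context learning setup described in the abstract, the sketch below assembles a few-shot multimodal prompt, assuming the OpenAI Python SDK; the model identifier, file names, and two-class label set (TUM/NORM) are illustrative assumptions rather than the authors' exact configuration.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def to_data_url(path):
    # Encode a local image tile as a base64 data URL for the vision prompt.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

content = [{"type": "text", "text": "Classify the final image as TUM or NORM."}]
for path, label in [("example_tum.png", "TUM"), ("example_norm.png", "NORM")]:  # hypothetical few-shot examples
    content.append({"type": "image_url", "image_url": {"url": to_data_url(path)}})
    content.append({"type": "text", "text": f"Label: {label}"})
content.append({"type": "image_url", "image_url": {"url": to_data_url("query_tile.png")}})  # hypothetical query tile

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model identifier; the paper uses GPT-4V
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)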


Conflict of interest statement

Competing interests: The authors declare the following competing interests. O.S.M.E.N. holds shares in StratifAI GmbH. J.N.K. declares consulting services for Owkin, France; DoMore Diagnostics, Norway; Panakeia, UK; and Scailyte, Basel, Switzerland; furthermore, J.N.K. holds shares in Kather Consulting, Dresden, Germany, and StratifAI GmbH, Dresden, Germany, and has received honoraria for lectures and advisory board participation from AstraZeneca, Bayer, Eisai, MSD, BMS, Roche, Pfizer, and Fresenius. D.T. received honoraria for lectures from Bayer and holds shares in StratifAI GmbH, Germany. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Comprehensive schematic.
This figure presents a systematic overview of the three histopathology benchmarking datasets, detailing the number of samples incorporated in our study (Panel A). A selection of random test images was drawn from each of these datasets for evaluation using three distinct methodologies: zero-shot classification (Method 1), random few-shot sampling (Method 2), and kNN-based selection (Method 3). For the latter, feature extraction was performed using the Phikon ViT-B 40 M Pancancer model (*). Cosine similarity was used as the comparison metric between the target image and its closest k neighbors in embedding space. As a benchmark against GPT-4V ICL, we trained four image classifiers (indicated by +, namely ResNet-18, ResNet-50, ViT-Tiny, and ViT-Small) via transfer learning from ImageNet for each target image (Panel B). For an in-depth understanding of these methods, please refer to Algorithm 1 and the Experimental Design section. * The BACK (background) label was excluded from the analysis.
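A minimal sketch of the kNN-based selection step (Method 3), assuming tile embeddings (e.g., from the Phikon feature extractor) have already been computed; the array shapes, function name, and toy labels are illustrative only.

import numpy as np

def knn_select(query_emb, pool_embs, pool_labels, k=5):
    # Cosine similarity between the query tile and every candidate example tile.
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    top_k = np.argsort(sims)[::-1][:k]  # indices of the k most similar tiles
    return top_k, [pool_labels[i] for i in top_k]  # examples to place in the few-shot prompt

# Toy demonstration with random stand-ins for 768-dimensional embeddings.
rng = np.random.default_rng(0)
idx, labels = knn_select(rng.normal(size=768), rng.normal(size=(500, 768)),
                         ["TUM" if i % 2 else "NORM" for i in range(500)], k=5)
print(idx, labels)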
Fig. 2
Fig. 2. In-context learning for vision-language models.
Panel A shows that classification accuracy on a simple task, detecting tumor (TUM) versus non-tumor (NORM) tiles from the CRC100K dataset, can be drastically improved by leveraging ICL with randomly sampled few-shot image examples. Additionally, we compare random and kNN-based image sampling on two datasets and show that kNN-based image sampling improves model performance in classifying images from both MHIST (left) and PatchCamelyon (right), especially when scaling the number of few-shot samples (Panel B). Note that samples have been slightly shifted on the x-axis for visibility. The y-axis denotes the mean accuracy with lower and upper 2.5% confidence bounds (CIs) from 100,000 bootstrap iterations for both panels, respectively. Source data are provided as a Source Data file.
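A minimal sketch of how such bootstrapped mean accuracies and 2.5%/97.5% percentile bounds can be computed; the function name and toy predictions are illustrative assumptions.

import numpy as np

def bootstrap_accuracy(y_true, y_pred, n_boot=100_000, seed=0):
    rng = np.random.default_rng(seed)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))  # resample with replacement
    accs = correct[idx].mean(axis=1)                                  # accuracy of each bootstrap replicate
    return accs.mean(), np.percentile(accs, 2.5), np.percentile(accs, 97.5)

mean_acc, ci_low, ci_high = bootstrap_accuracy(["TUM", "NORM", "TUM", "TUM"],
                                               ["TUM", "NORM", "NORM", "TUM"])
print(mean_acc, ci_low, ci_high)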
Fig. 3
Fig. 3. Performance analysis of GPT-4V with kNN ICL on PatchCamelyon and MHIST datasets.
This figure is divided into two sections, with Panels A and B focusing on PatchCamelyon (left) and the MHIST dataset (right), respectively. In A, line graphs illustrate the average performance of GPT-4V with kNN-based in-context learning relative to several specialist image classification and histopathology foundation models: we first compare GPT-4V with ResNet-18, ResNet-50, and two Vision Transformers (ViT-Tiny and ViT-Small), where the number of ICL samples for GPT-4V equals the number of training samples for the image classification models (1, top left). Additionally, we compare the same vision classifiers trained on the full respective datasets (2, bottom left), and the performance of two histopathology foundation models, Phikon (3, top right) and UNI (4, bottom right). For the latter, we compare GPT-4V against training a linear layer on top of the pre-trained foundation model (for one, three, five, and ten epochs) and against kNN classification. Note that in these cases the models are trained on the full datasets, and the term '# Samples' denotes the number of few-shot ICL samples for GPT-4V only. The y-axis displays the average accuracy across all labels, derived from 100,000 bootstrapping steps. All relevant metrics (accuracy, lower and upper confidence intervals) are summarized in Supplementary Tables 1–3. Panel B presents a series of heatmaps highlighting the absolute and relative performance per label in zero-, three-, five-, and ten-shot kNN-based sampling scenarios, each with a sample size of n = 60. Lastly, the spider plot in Panel C highlights the superiority of 10-shot GPT-4V in classification performance for both datasets when compared under equitable conditions to two ResNet-style models and two vision transformers. Source data are provided as a Source Data file.
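A minimal sketch of the two foundation-model baselines named above, a linear probe and kNN classification on frozen features; the random arrays are placeholders for precomputed Phikon or UNI tile embeddings, and the logistic-regression probe is a stand-in for the paper's epoch-wise linear layer rather than the authors' exact training setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)  # placeholder embeddings
test_feats, test_labels = rng.normal(size=(60, 768)), rng.integers(0, 2, 60)

linear_probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)   # "linear layer" baseline
knn_clf = KNeighborsClassifier(n_neighbors=20, metric="cosine").fit(train_feats, train_labels)

print("linear probe accuracy:", linear_probe.score(test_feats, test_labels))
print("kNN accuracy:", knn_clf.score(test_feats, test_labels))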
Fig. 4
Fig. 4. Performance analysis of GPT-4V with kNN-based sampling on the CRC100K dataset.
The line graphs (Panel A) show the comparative average performance of GPT-4V with kNN-based in-context learning against the four image classification models (1) when these are trained on the same number of images as are provided as in-context examples to GPT-4V. Additionally, we show how in-context learning can reduce the performance gap between GPT-4V and the respective image classifiers when the latter are trained on the entire datasets (2), as well as in comparison to the state-of-the-art foundation models Phikon and UNI. '# Samples' refers to the count of few-shot ICL samples for GPT-4V and training samples for the other models in 1, while in all other settings the models are trained on the entire training data. The y-axis represents the mean accuracy across all labels, computed using 100,000 bootstrapping iterations. Detailed average accuracy values, including confidence intervals, are summarized in Supplementary Table 1. Panel B features confusion matrices for GPT-4V in both zero- and five-shot kNN-based sampling scenarios (n = 120 samples). The spider plot showcases the average classification accuracy per label per number of kNN-sampled shots, revealing a general trend towards increased classification accuracy across most labels as the number of few-shot image samples scales (Panel C). Source data are provided as a Source Data file.
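A minimal sketch of how confusion matrices (Panel B) and per-label accuracies (Panel C) can be derived from predictions; the label list follows the CRC100K classes minus BACK, and the toy predictions are illustrative only.

import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["ADI", "DEB", "LYM", "MUC", "MUS", "NORM", "STR", "TUM"]  # CRC100K classes without BACK
y_true = ["TUM", "STR", "TUM", "NORM", "LYM"]   # placeholder ground truth
y_pred = ["TUM", "TUM", "TUM", "NORM", "LYM"]   # placeholder GPT-4V answers

cm = confusion_matrix(y_true, y_pred, labels=labels)
with np.errstate(divide="ignore", invalid="ignore"):
    per_label_acc = np.diag(cm) / cm.sum(axis=1)  # per-label recall; NaN for labels absent from y_true
print(cm)
print(per_label_acc)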
Fig. 5
Fig. 5. Few-shot sampling improves text-based reasoning.
Panel A depicts the workflow, starting from GPT-4V’s initial prediction and its reasoning process (‘thoughts’), to the generation of text feature embeddings with Ada 002. The panel of t-SNEs demonstrates the evolution from a zero-shot framework on the far left, advancing through one-, three-, and five-shot kNN sampling to the right. All data are obtained from the CRC100K dataset. In the t-SNE plots, color coding distinguishes between the model’s final classifications (‘Answers’, top) and the ground truth (‘Labels’, bottom). The introduction of few-shot image sampling noticeably refines the model’s textual reasoning, as evidenced by the formation of more distinct clusters in alignment with the model’s own responses (top) and the underlying ground truth (bottom). S denotes silhouette scores, which are calculated for each t-SNE. Complementary to these visualizations, Supplementary Fig. 2 features word clouds that further illustrate the alignment of the model’s vocabulary with clinical diagnoses, highlighting key terms such as “lymph node” for normal tissue and “metastatic / breast cancer” for malignancies, thereby enhancing the interpretability of the model’s diagnostic reasoning process. In Panel B, we present two exemplary scenarios to demonstrate the potential superiority of integrated vision-language models over stand-alone image classification models. On the left, an image is displayed where the original annotation identified the sample as stroma (STR), yet GPT-4V categorizes it as tumor (TUM). The rationale provided by the model appears plausible, notably pointing out several abnormally shaped nuclei visible, for instance, in the lower right corner. This sample indeed appears to represent a borderline case. When comparing the top 500 closest patch embeddings to the reference image, a dominant fraction is classified as tumor (67%), with a lesser proportion labeled as stroma (32%) and a negligible percentage (<1%) as lymphocytes or regular colon epithelium. Exploring GPT-4V’s interpretive process can help identify and understand such complex edge cases, going beyond what is possible with conventional image classifiers alone. Right: chicken-wire patterns are described in the histology of liposarcoma, which arises from adipocyte precursor cells; the description stems from the tissue’s resemblance to chicken wire fences (shown to the right). GPT-4V effectively leverages this knowledge from another context to describe the morphology of the adipocytes shown in this image. This way of performing ‘transfer learning’ could have strong implications for teaching. * The image name in the CRC100K cohort is STR-TCGA-VEMARASN. + The image name in the CRC100K cohort is ADI-TCGA-QFVSMHDD.
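A minimal sketch of the Panel A analysis, assuming the model’s free-text reasoning has already been embedded (the paper uses Ada 002 embeddings): project the embeddings with t-SNE and score cluster separation with the silhouette coefficient. The random matrix and label vector below are placeholders for real text embeddings and tissue classes.

import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
text_embeddings = rng.normal(size=(120, 1536))  # placeholder for Ada 002 vectors of GPT-4V 'thoughts'
classes = rng.integers(0, 8, size=120)          # placeholder tissue labels or model answers

coords = TSNE(n_components=2, random_state=0).fit_transform(text_embeddings)
s = silhouette_score(coords, classes)           # S, the score reported per t-SNE in the figure
print("silhouette score:", round(float(s), 3))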

