Nat Commun. 2024 Nov 21;15(1):10104. doi: 10.1038/s41467-024-51465-9.

In-context learning enables multimodal large language models to classify cancer pathology images



Dyke Ferber et al. Nat Commun. 2024.

Abstract

Medical image classification requires labeled, task-specific datasets, which are used to train deep learning networks de novo or to fine-tune foundation models. However, this process is computationally and technically demanding. In language processing, in-context learning provides an alternative: models learn from examples embedded directly in the prompt, bypassing the need for parameter updates. Yet in-context learning remains underexplored in medical image analysis. Here, we systematically evaluate Generative Pretrained Transformer 4 with Vision capabilities (GPT-4V) with in-context learning on three cancer histopathology tasks of high importance: classification of tissue subtypes in colorectal cancer, colon polyp subtyping, and breast tumor detection in lymph node sections. Our results show that in-context learning is sufficient to match or even outperform specialized neural networks trained for particular tasks, while requiring only a minimal number of samples. In summary, this study demonstrates that large vision-language models trained on non-domain-specific data can be applied out of the box to solve medical image-processing tasks in histopathology. This democratizes access to generalist AI models for medical experts without a technical background, especially in areas where annotated data are scarce.
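As a rough illustration of the in-context learning setup described in the abstract, the sketch below assembles a few-shot multimodal prompt, assuming the OpenAI Python SDK; the model identifier, file names, and two-class label set (TUM/NORM) are illustrative assumptions rather than the authors' exact configuration.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def to_data_url(path):
    # Encode a local image tile as a base64 data URL for the vision prompt.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

content = [{"type": "text", "text": "Classify the final image as TUM or NORM."}]
for path, label in [("example_tum.png", "TUM"), ("example_norm.png", "NORM")]:  # hypothetical few-shot examples
    content.append({"type": "image_url", "image_url": {"url": to_data_url(path)}})
    content.append({"type": "text", "text": f"Label: {label}"})
content.append({"type": "image_url", "image_url": {"url": to_data_url("query_tile.png")}})  # hypothetical query tile

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model identifier; the paper uses GPT-4V
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)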


Conflict of interest statement

Competing interests: The authors declare the following competing interests. O.S.M.E.N. holds shares in StratifAI GmbH. J.N.K. declares consulting services for Owkin, France; DoMore Diagnostics, Norway; Panakeia, UK; and Scailyte, Basel, Switzerland; furthermore, J.N.K. holds shares in Kather Consulting, Dresden, Germany, and StratifAI GmbH, Dresden, Germany, and has received honoraria for lectures and advisory board participation from AstraZeneca, Bayer, Eisai, MSD, BMS, Roche, Pfizer, and Fresenius. D.T. received honoraria for lectures from Bayer and holds shares in StratifAI GmbH, Germany. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Comprehensive schematic.
This figure presents a systematic overview of the three histopathology benchmarking datasets, detailing the number of samples incorporated in our study (Panel A). A selection of random test images was drawn from each of these datasets for evaluation using three distinct methodologies: zero-shot classification (Method 1), random few-shot sampling (Method 2), and kNN-based selection (Method 3). For the latter, feature extraction was performed using the Phikon ViT-B 40 M Pancancer model (*). Cosine similarity was used as the comparison metric between the target image and its closest k neighbors in embedding space. As a benchmark against GPT-4V ICL, we trained four image classifiers (indicated by +, namely ResNet-18, ResNet-50, ViT-Tiny, and ViT-Small) via transfer learning from ImageNet for each target image (Panel B). For an in-depth understanding of these methods, please refer to Algorithm 1 and the Experimental Design section. * The BACK (background) label was excluded from the analysis.
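A minimal sketch of the kNN-based selection step (Method 3), assuming tile embeddings (e.g., from the Phikon feature extractor) have already been computed; the array shapes, function name, and toy labels are illustrative only.

import numpy as np

def knn_select(query_emb, pool_embs, pool_labels, k=5):
    # Cosine similarity between the query tile and every candidate example tile.
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    top_k = np.argsort(sims)[::-1][:k]  # indices of the k most similar tiles
    return top_k, [pool_labels[i] for i in top_k]  # examples to place in the few-shot prompt

# Toy demonstration with random stand-ins for 768-dimensional embeddings.
rng = np.random.default_rng(0)
idx, labels = knn_select(rng.normal(size=768), rng.normal(size=(500, 768)),
                         ["TUM" if i % 2 else "NORM" for i in range(500)], k=5)
print(idx, labels)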
Fig. 2
Fig. 2. In-context learning for vision-language models.
Panel A shows that classification accuracy on a simple task, detecting tumor (TUM) versus non-tumor (NORM) tiles from the CRC100K dataset, can be drastically improved by leveraging ICL with randomly sampled few-shot image examples. Additionally, we compare random and kNN-based image sampling on two datasets and show that kNN-based image sampling improves model performance in classifying images from both MHIST (left) and PatchCamelyon (right), especially when scaling the number of few-shot samples (Panel B). Note that samples have been slightly shifted on the x-axis for visibility. The y-axis denotes the mean accuracy with lower and upper 2.5% confidence bounds (CIs) from 100,000 bootstrap iterations for both panels, respectively. Source data are provided as a Source Data file.
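A minimal sketch of how such bootstrapped mean accuracies and 2.5%/97.5% percentile bounds can be computed; the function name and toy predictions are illustrative assumptions.

import numpy as np

def bootstrap_accuracy(y_true, y_pred, n_boot=100_000, seed=0):
    rng = np.random.default_rng(seed)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))  # resample with replacement
    accs = correct[idx].mean(axis=1)                                  # accuracy of each bootstrap replicate
    return accs.mean(), np.percentile(accs, 2.5), np.percentile(accs, 97.5)

mean_acc, ci_low, ci_high = bootstrap_accuracy(["TUM", "NORM", "TUM", "TUM"],
                                               ["TUM", "NORM", "NORM", "TUM"])
print(mean_acc, ci_low, ci_high)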
Fig. 3
Fig. 3. Performance analysis of GPT-4V with kNN ICL on PatchCamelyon and MHIST datasets.
This figure is divided into two sections, with Panels A and B focusing on PatchCamelyon (left) and the MHIST dataset (right), respectively. In A, line graphs illustrate the average performance of GPT-4V with kNN-based in-context learning relative to several specialist image classification and histopathology foundation models: we first compare GPT-4V with ResNet-18, ResNet-50, and two Vision Transformers (ViT-Tiny and ViT-Small), where the number of ICL samples for GPT-4V equals the number of training samples for the image classification models (1, top left). Additionally, we compare the same vision classifiers trained on the full respective datasets (2, bottom left), and the performance of two histopathology foundation models, Phikon (3, top right) and UNI (4, bottom right). For the latter, we compare GPT-4V against training a linear layer on top of the pre-trained foundation model (for one, three, five, and ten epochs) and against kNN classification. Note that in these cases the models are trained on the full datasets, and the term '# Samples' denotes the number of few-shot ICL samples for GPT-4V only. The y-axis displays the average accuracy across all labels, derived from 100,000 bootstrapping steps. All relevant metrics (accuracy, lower and upper confidence intervals) are summarized in Supplementary Tables 1–3. Panel B presents a series of heatmaps highlighting the absolute and relative performance per label in zero-, three-, five-, and ten-shot kNN-based sampling scenarios, each with a sample size of n = 60. Lastly, the spider plot in Panel C highlights the superiority of 10-shot GPT-4V in classification performance for both datasets when compared under equitable conditions to two ResNet-style models and two vision transformers. Source data are provided as a Source Data file.
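A minimal sketch of the two foundation-model baselines named above, a linear probe and kNN classification on frozen features; the random arrays are placeholders for precomputed Phikon or UNI tile embeddings, and the logistic-regression probe is a stand-in for the paper's epoch-wise linear layer rather than the authors' exact training setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)  # placeholder embeddings
test_feats, test_labels = rng.normal(size=(60, 768)), rng.integers(0, 2, 60)

linear_probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)   # "linear layer" baseline
knn_clf = KNeighborsClassifier(n_neighbors=20, metric="cosine").fit(train_feats, train_labels)

print("linear probe accuracy:", linear_probe.score(test_feats, test_labels))
print("kNN accuracy:", knn_clf.score(test_feats, test_labels))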
Fig. 4
Fig. 4. Performance analysis of GPT-4V with kNN-based sampling on the CRC100K dataset.
The line graphs (Panel A) show the comparative average performance of GPT-4V with kNN-based in-context learning against the four image classification models (1) when these are trained on the same number of images as are provided as in-context examples to GPT-4V. Additionally, we show how in-context learning can reduce the performance gap between GPT-4V and the respective image classifiers when the latter are trained on the entire datasets (2), as well as in comparison to the state-of-the-art foundation models Phikon and UNI. '# Samples' refers to the count of few-shot ICL samples for GPT-4V and training samples for the other models in 1, while in all other settings the models are trained on the entire training data. The y-axis represents the mean accuracy across all labels, computed using 100,000 bootstrapping iterations. Detailed average accuracy values, including confidence intervals, are summarized in Supplementary Table 1. Panel B features confusion matrices for GPT-4V in both zero- and five-shot kNN-based sampling scenarios (n = 120 samples). The spider plot showcases the average classification accuracy per label per number of kNN-sampled shots, revealing a general trend towards increased classification accuracy across most labels as the number of few-shot image samples scales (Panel C). Source data are provided as a Source Data file.
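A minimal sketch of how confusion matrices (Panel B) and per-label accuracies (Panel C) can be derived from predictions; the label list follows the CRC100K classes minus BACK, and the toy predictions are illustrative only.

import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["ADI", "DEB", "LYM", "MUC", "MUS", "NORM", "STR", "TUM"]  # CRC100K classes without BACK
y_true = ["TUM", "STR", "TUM", "NORM", "LYM"]   # placeholder ground truth
y_pred = ["TUM", "TUM", "TUM", "NORM", "LYM"]   # placeholder GPT-4V answers

cm = confusion_matrix(y_true, y_pred, labels=labels)
with np.errstate(divide="ignore", invalid="ignore"):
    per_label_acc = np.diag(cm) / cm.sum(axis=1)  # per-label recall; NaN for labels absent from y_true
print(cm)
print(per_label_acc)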
Fig. 5
Fig. 5. Few-shot sampling improves text-based reasoning.
Panel A depicts the workflow, starting from GPT-4V’s initial prediction and its reasoning process (‘thoughts’), to the generation of text feature embeddings with Ada 002. The panel of t-SNEs demonstrates the evolution from a zero-shot framework on the far left, advancing through one-, three-, and five-shot kNN sampling to the right. All data are obtained from the CRC100K dataset. In the t-SNE plots, color coding distinguishes between the model’s final classifications (‘Answers’, top) and the ground truth (‘Labels’, bottom). The introduction of few-shot image sampling noticeably refines the model’s textual reasoning, as evidenced by the formation of more distinct clusters in alignment with the model’s own responses (top) and the underlying ground truth (bottom). S denotes silhouette scores, which are calculated for each t-SNE. Complementary to these visualizations, Supplementary Fig. 2 features word clouds that further illustrate the alignment of the model’s vocabulary with clinical diagnoses, highlighting key terms such as “lymph node” for normal tissue and “metastatic / breast cancer” for malignancies, thereby enhancing the interpretability of the model’s diagnostic reasoning process. In Panel B, we present two exemplary scenarios to demonstrate the potential superiority of integrated vision-language models over stand-alone image classification models. On the left, an image is displayed where the original annotation identified the sample as stroma (STR), yet GPT-4V categorizes it as tumor (TUM). The rationale provided by the model appears plausible, notably pointing out several abnormally shaped nuclei visible, for instance, in the lower right corner. This sample indeed appears to represent a borderline case. When comparing the top 500 closest patch embeddings to the reference image, a dominant fraction is classified as tumor (67%), with a lesser proportion labeled as stroma (32%) and a negligible percentage (<1%) as lymphocytes or regular colon epithelium. Exploring GPT-4V’s interpretive process can help identify and understand such complex edge cases, going beyond what is possible with conventional image classifiers alone. Right: chicken-wire patterns are described in the histology of liposarcoma, which arises from adipocyte precursor cells; the description stems from the tissue’s resemblance to chicken wire fences (shown to the right). GPT-4V effectively leverages this knowledge from another context to describe the morphology of the adipocytes shown in this image. This way of performing ‘transfer learning’ could have strong implications for teaching. * The image name in the CRC100K cohort is STR-TCGA-VEMARASN. + The image name in the CRC100K cohort is ADI-TCGA-QFVSMHDD.
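A minimal sketch of the Panel A analysis, assuming the model’s free-text reasoning has already been embedded (the paper uses Ada 002 embeddings): project the embeddings with t-SNE and score cluster separation with the silhouette coefficient. The random matrix and label vector below are placeholders for real text embeddings and tissue classes.

import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
text_embeddings = rng.normal(size=(120, 1536))  # placeholder for Ada 002 vectors of GPT-4V 'thoughts'
classes = rng.integers(0, 8, size=120)          # placeholder tissue labels or model answers

coords = TSNE(n_components=2, random_state=0).fit_transform(text_embeddings)
s = silhouette_score(coords, classes)           # S, the score reported per t-SNE in the figure
print("silhouette score:", round(float(s), 3))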

