Nat Med. 2024 Oct;30(10):2924-2935.
doi: 10.1038/s41591-024-03141-0. Epub 2024 Jul 22.

A foundation model for clinical-grade computational pathology and rare cancers detection


Eugene Vorontsov et al. Nat Med. 2024 Oct.

Abstract

The analysis of histopathology images with artificial intelligence aims to enable clinical decision support systems and precision medicine. The success of such applications depends on the ability to model the diverse patterns observed in pathology images. To this end, we present Virchow, the largest foundation model for computational pathology to date. In addition to the evaluation of biomarker prediction and cell identification, we demonstrate that a large foundation model enables pan-cancer detection, achieving 0.95 specimen-level area under the (receiver operating characteristic) curve across nine common and seven rare cancers. Furthermore, we show that with less training data, the pan-cancer detector built on Virchow can achieve similar performance to tissue-specific clinical-grade models in production and outperform them on some rare variants of cancer. Virchow's performance gains highlight the value of a foundation model and open possibilities for many high-impact applications with limited amounts of labeled training data.


Conflict of interest statement

E.V., A.B., A.C., G.S., M.Z., P.M., A.v.E., D.L., J.V., E.R., Y.K.W., J.D.K., M.C.H.L., J.H.B., R.A.G., G.O., J.A.R., W.A.M., R.Y., D.K., S.L. and T.J.F. are employees and equity holders of Paige.AI. E.W., M.H., C.K. and B.R. served as consultants for Paige.AI. D.S.K. has received compensation for speaking and consulting from Merck. K.S., E.Z., J.H., N.T. and N.F. are employees of Microsoft. Memorial Sloan Kettering (MSK) maintains financial and intellectual property interests in Paige.AI that are pertinent to the research presented in this manuscript. S.L., E.V., A.B., G.S., M.Z., A.C., J.B., M.L., R.G., T.F. and B.R. are inventors on a provisional US patent (application no. 18/521903) filed corresponding to the methodological aspects of this work. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of the study.
The training dataset, training algorithm and application of Virchow, a foundation model for computational pathology. a, The training data can be described in terms of patients, cases, specimens, blocks or slides, as shown. b–d, The slide distribution as a function of cancer status (b), surgery (c) and tissue type (d). e, The dataflow during training requires processing the slide into tiles, which are then cropped into global and local views. f, Schematic of applications of the foundation model using an aggregator model to predict attributes at the slide level. GI, gastrointestinal.
Fig. 2
Fig. 2. Virchow enables training a robust pan-cancer detector.
Pan-cancer detection results. Detection is predicted at the specimen level using an aggregator network trained with Virchow, UNI, Phikon or CTransPath tile embeddings as input. a, Cancer detection performance (AUC) stratified by cancer type as determined by origin tissue. The incidence rate and proportion of metastasis of each cancer are shown. Virchow embeddings enable the best cancer detection performance across all cancer types, including rare cancers. For each cancer type, the AUC corresponding to the statistically significantly (P < 0.05) top-performing embeddings is highlighted in magenta. When more than one AUC is not gray, performance is ‘tied’ (no statistically significant difference). The foundation model used to produce tile embeddings for the aggregator is shown in the margin on the left, along with the number of cancer types for which the corresponding aggregator achieved (or tied for) the top AUC. All statistical significance (a–e) is computed using the pairwise DeLong’s test for AUC and Cochran’s Q test followed by McNemar’s test for specificity, both corrected for multiple comparisons with Holm’s method. b,c, Cancer detection performance summarized for all cancers (b) and for rare cancers (c). Error bars (b–e) show the two-sided 95% confidence interval computed with DeLong’s method for AUC and Wilson’s method for specificity; the • denotes the differences that are statistically significant from the rest (P < 0.0001). d, Sensitivity at 95% specificity for rare cancer detection (*P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001). e, Virchow-based cancer detection generalizes well to data from external institutions that were not represented in the training set; all aggregators and Virchow were trained only on data from MSKCC. Only half of the specimens in the pan-cancer testing set are from MSKCC. f, One-fifth of the specimens used for pan-cancer model evaluation contained tissues that were not observed in the training sets of Virchow or the pan-cancer aggregators. g, Cancer detection performance scales with the size of the underlying foundation model and the number of training samples (tiles) used to train it. H&N, head and neck.
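For reference, a minimal sketch of the two-sided 95% Wilson score interval used for the specificity error bars is shown below; the function name and inputs are illustrative and not taken from the paper.

```python
# A minimal sketch, assuming specificity is reported as a binomial proportion
# (true negatives / negatives). Only the Wilson interval itself is from the caption.
from math import sqrt

def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Two-sided 95% Wilson score interval for a binomial proportion."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - half, center + half)
```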
Fig. 3
Fig. 3. Pan-cancer detection approaches and sometimes surpasses clinical product performance, using less data.
a,b, Performance as measured by AUC of three clinical products compared to the pan-cancer model trained on Virchow embeddings, on the rare variant (a) and product testing datasets (b). The pan-cancer detector, trained on Virchow foundation model embeddings, achieves similar performance to clinical-grade products in general and outperforms them on rare variants of cancers. c, The pan-cancer detector was trained on fewer labeled specimens than the Prostate, Breast and BLN clinical models, including a small fraction of the prostate (teal), breast (blue) and BLN (yellow) tissue specimens that these clinical models were respectively trained on. d, A categorization of failure modes of the pan-cancer model and four canonical examples of the primary types of failures. In all panels, * is used to indicate pairwise statistical significance (*P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001; pairwise DeLong’s test). Error bars denote the two-sided 95% confidence interval, estimated with DeLong’s method. C., carcinoma. Inv., invasive.
Fig. 4
Fig. 4. Biomarker prediction results.
a, Virchow embeddings help predict biomarkers directly from H&E slides, reducing the need for targeted sequencing or IHC staining. b, The fraction of positive cases in each biomarker testing dataset. c, The number of biomarkers on which using Virchow, UNI, Phikon or CTransPath embeddings to train an aggregator produced an AUC in the top x. This ranking does not consider statistical significance across models for each biomarker due to low statistical power; instead, it relies on considering the ranking across many biomarkers. d, Biomarker detection performance as measured by AUC using aggregator networks trained on embeddings from Virchow, Phikon or CTransPath. For each prediction task, the top scoring embeddings are marked with a colored circle next to the biomarker label, below the plot (this corresponds to the top-1 ranking in c). The error bars denote the two-sided 95% confidence interval computed from 1,000 bootstrapping iterations. Endomet., endometrial.
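A minimal sketch of the bootstrap confidence interval described above (resampling cases with replacement over 1,000 iterations); the function name and defaults are illustrative, and roc_auc_score is the scikit-learn AUC implementation.

```python
# Illustrative bootstrap CI for AUC; assumes y_true/y_score are per-case labels and scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # skip one-class resamples
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```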
Fig. 5
Fig. 5. A summary of tile-level linear probing.
a, A description of each tile-level benchmark (top) along with the corresponding results for the embeddings of different foundation models (bottom). For each task, the top result is bolded and highlighted in magenta. Multiple results are highlighted when there is no statistically significant difference between them (P < 0.05; McNemar’s test). Error bars denote two-sided 95% confidence intervals computed using 1,000 bootstrapping iterations. b, The number of tasks in which each model scored in the top x. Models can tie for a rank depending on statistical significance (P < 0.05). c, Virchow embedding features learn meaningful structures. Cells in the CoNSeP dataset highlighted by embedding principal components: malignant epithelium (red), miscellaneous (yellow) and inflammatory (magenta).
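A minimal sketch of tile-level linear probing on frozen foundation-model embeddings, assuming embeddings have been precomputed for each tile; LogisticRegression is the scikit-learn linear classifier, and all names here are illustrative rather than the paper's exact protocol.

```python
# Linear probing: the foundation model is frozen; only a linear classifier is fit
# on its tile embeddings.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_embeddings, train_labels, test_embeddings, test_labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_embeddings, train_labels)   # only the linear head is trained
    preds = clf.predict(test_embeddings)
    return accuracy_score(test_labels, preds)
```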
Extended Data Fig. 1
Extended Data Fig. 1. Schematic of the Agata aggregator.
The Agata aggregator learns to attend to tiles that contribute toward the label decision using cross-attention. The operation is defined using query Q, key K and value matrix V as softmax(QKᵀ/√d_k)V, where d_k is the output dimension of the key matrix. In contrast to the typical self-attention mechanism, where Q, K and V are projected from the inputs, Q is parameterized directly by the model to reduce GPU memory consumption. When aggregating across the tens or hundreds of thousands of tiles in a specimen, full attention requires too much GPU memory. This simplified attention can be interpreted as a learned weighted sum of all tile-level features. Indeed, full self-attention is quadratic in memory with respect to the number of tiles, whereas the attention in Agata is linear. K and V are obtained with two consecutive Gaussian Error Linear Unit (GELU) projection layers as K = GELU(W₁ᵀx + b₁) and V = GELU(W₂ᵀK + b₂), where x is the tile embedding and Wₙ, bₙ are the weight and bias parameters of the projection layers. In our experiments, W₁ produces 256-dimensional keys, W₂ produces 512-dimensional values, and we omit the scaling by √d_k = 16. After the attention step, two linear layers with nonlinear activation (ReLU) are used, followed by a final linear layer with softmax activation.
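A minimal PyTorch sketch of the simplified cross-attention described above. Only the 256-dimensional keys, 512-dimensional values, learned query and GELU projections are taken from the caption; the class name, hidden sizes of the classifier head and number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgataAttentionSketch(nn.Module):
    """Illustrative sketch of the Agata-style aggregator, not the official implementation."""
    def __init__(self, embed_dim: int, key_dim: int = 256, value_dim: int = 512, num_classes: int = 2):
        super().__init__()
        # Q is a learned parameter rather than a projection of the inputs,
        # so memory stays linear in the number of tiles.
        self.query = nn.Parameter(torch.randn(1, key_dim))
        self.key_proj = nn.Linear(embed_dim, key_dim)     # K = GELU(W1ᵀx + b1)
        self.value_proj = nn.Linear(key_dim, value_dim)   # V = GELU(W2ᵀK + b2)
        self.head = nn.Sequential(                        # hidden sizes are assumed
            nn.Linear(value_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, tile_embeddings: torch.Tensor) -> torch.Tensor:
        # tile_embeddings: (num_tiles, embed_dim) for one specimen
        k = F.gelu(self.key_proj(tile_embeddings))        # (num_tiles, key_dim)
        v = F.gelu(self.value_proj(k))                    # (num_tiles, value_dim)
        attn = torch.softmax(self.query @ k.T, dim=-1)    # (1, num_tiles); √d_k scaling omitted
        pooled = attn @ v                                 # learned weighted sum of tile features
        return torch.softmax(self.head(pooled), dim=-1)
```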
Extended Data Fig. 2
Extended Data Fig. 2. Schematic of the DINOv2 training routine.
Virchow used a ViT-H architecture, trained with DINOv2. From a single tile, 2 global crops and 8 local crops all with random augmentations are created. The global crops are randomly masked and fed to the student model, and the unmasked versions are fed to the teacher model. The student tries to produce a global representation of the views (via the CLS token) that matches the teacher’s representation of the opposite view. The student also tries to produce representations of the masked image tokens that match the teacher’s representations of the same tokens but unmasked. The local crops are only fed to the student which tries to produce a representation that matches the teacher’s representations of the global crops. The teacher is an exponential moving average (EMA) copy of the student.
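A minimal sketch of the teacher EMA update described above; `student` and `teacher` are assumed to be two copies of the same torch.nn.Module, and the momentum value is illustrative (DINOv2 schedules it during training).

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.996):
    # teacher_param <- m * teacher_param + (1 - m) * student_param
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```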
Extended Data Fig. 3
Extended Data Fig. 3. Pan-cancer detection results for every cancer origin site.
a. Area under the receiver operating characteristic curve (AUC); b. specificity at 95% sensitivity. H&N is head and neck. In both plots, a pairwise comparison of statistical significance is computed using the pairwise DeLong’s test for AUC and Cochran’s Q test followed by McNemar’s test for specificity, both corrected for multiple comparisons with Holm’s method (*P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001). Error bars show the two-sided 95% confidence interval computed with DeLong’s method for AUC and Wilson’s method for specificity.
Extended Data Fig. 4
Extended Data Fig. 4. Pan-cancer dataset distribution.
a. Specimen counts per cancer origin site in the pan-cancer testing dataset (H&N is head and neck). b. Specimen counts per tissue type in the pan-cancer aggregator training dataset.

