Nat Med. 2025 Nov;31(11):3749-3761.
doi: 10.1038/s41591-025-03982-3. Epub 2025 Nov 5.

A multimodal whole-slide foundation model for pathology


Tong Ding et al. Nat Med. 2025 Nov.

Abstract

The field of computational pathology has been transformed with recent advances in foundation models that encode histopathology regions of interest (ROIs) into versatile and transferable feature representations via self-supervised learning. However, translating these advancements to address complex clinical challenges at the patient and slide level remains constrained by limited clinical data in disease-specific cohorts, especially for rare clinical conditions. We propose Transformer-based pathology Image and Text Alignment Network (TITAN), a multimodal whole-slide foundation model pretrained using 335,645 whole-slide images via visual self-supervised learning and vision-language alignment with corresponding pathology reports and 423,122 synthetic captions generated from a multimodal generative AI copilot for pathology. Without any fine-tuning or requiring clinical labels, TITAN can extract general-purpose slide representations and generate pathology reports that generalize to resource-limited clinical scenarios such as rare disease retrieval and cancer prognosis. We evaluate TITAN on diverse clinical tasks and find that it outperforms both ROI and slide foundation models across machine learning settings, including linear probing, few-shot and zero-shot classification, rare cancer retrieval, cross-modal retrieval and pathology report generation.


Conflict of interest statement

Competing interests: R.J.C., M.Y.L., D.F.K.W., B.C., L.P.L. and F.M. hold equity interests in ModellaAI. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of TITAN.
a, Tissue site distribution of Mass-340K used for TITANV pretraining (stage 1). Mass-340K includes 335,645 WSIs across 20 organs with a mix of tissue sections stained with H&E (89.7%), IHC (7.9%), special stains (2.3%) and others (0.1%), or a mix of neoplastic (70.0%), tissue damage response (8.4%), normal (4.7%), inflammatory (3.4%) and others (13.5%), scanned with diverse scanner types. TITAN pretraining (stages 2 and 3) uses a subset of Mass-340K with paired captions and medical reports. b–d, Block diagram of TITAN pretraining. b, TITAN uses a ViT to encode a WSI into a slide embedding. c, TITANV (stage 1) is pretrained using SSL with student–teacher knowledge distillation. d, TITAN (stages 2 and 3) is pretrained using vision-language modeling, first by aligning the slide embedding with synthetic captions (stage 2) and then with medical reports (stage 3). e, UMAP visualization of TCGA slide embeddings obtained with TITAN, color-coded by organ. UMAP, uniform manifold approximation and projection; px, pixel.
Fig. 2
Fig. 2. TITAN evaluation.
a, Impact of pretraining data size on TITANV and baselines across four challenging subtyping tasks. TITANV is pretrained with 12.5%, 25%, 50% and 100% of Mass-340K. b, The average performance of the four tasks against the number of parameters. c, Linear probe evaluation of TITAN and baselines on morphological classification, molecular status and survival prediction tasks. The mean pooling baseline uses the same patch encoder as TITAN (CONCHv1.5). Multiclass tasks are evaluated with balanced accuracy, binary tasks with AUROC and survival tasks with the concordance index. For external cohorts (DHMC, CPTAC), the classifier is trained on the corresponding TCGA cohort. All error bars represent s.d. based on bootstrapping (n = 1,000) or k-fold evaluation (k = 5). d, Ablation for positional encoding, number of transformer layers and inclusion of the vision-pretraining stage. The performance is averaged across the four subtyping tasks. e, Change in performance of slide encoders averaged across the four subtyping tasks for different learning paradigms. For mean pooling and ABMIL, the respective patch encoder for each framework is used. PRISM fine-tuning is not evaluated as the fine-tuning recipes are not provided. f, Linear probe few-shot performance using K shots, K ∈ {1, 2, 4, 8, 16}, comparing baselines and ABMIL with CONCHv1.5. For each setting, 50 runs were performed. The center of each box plot (horizontal line) represents the median, with whiskers extending to data points within 1.5× the interquartile range. Statistical significance was assessed by fitting a generalized linear mixed-effects model and performing a two-sided Wald z test on the fitted model. Significance shown with respect to TITAN. P values for nonsignificant results are shown. **P ≤ 0.01, ***P ≤ 0.001, ****P ≤ 0.0001. C, number of classes; Ft., fine-tune.
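As a concrete illustration of the linear probe protocol described above (panels c and f), the sketch below fits a logistic-regression probe on frozen slide embeddings with K labeled examples per class and scores it with balanced accuracy. The embeddings, their 768-dimensional size and the class structure are synthetic stand-ins, not actual TITAN outputs.

```python
# Sketch of a K-shot linear probe on frozen slide embeddings.
# Random vectors stand in for real slide-encoder outputs; the 768-dim
# embedding size is an assumption for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
n_classes, dim, k_shots = 3, 768, 4

# Give each class a distinct mean so the probe has signal to learn.
means = rng.normal(size=(n_classes, dim))

def sample(n_per_class):
    X = np.vstack([m + 0.5 * rng.normal(size=(n_per_class, dim)) for m in means])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

X_train, y_train = sample(k_shots)  # K labeled slides per class
X_test, y_test = sample(50)         # held-out evaluation slides

# The "probe" is a single linear layer; the encoder stays frozen.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
bal_acc = balanced_accuracy_score(y_test, probe.predict(X_test))
print(f"balanced accuracy: {bal_acc:.2f}")
```

Because only the linear head is trained, the probe's accuracy directly reflects how linearly separable the frozen embeddings are, which is what panels c and f compare across encoders.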
Fig. 3
Fig. 3. Visual-language evaluation of TITAN.
a, A schematic for zero-shot evaluation. The query slide is classified by identifying the closest text prompt embedding in the slide embedding space. b, Zero-shot performance of TITAN and PRISM. All multiclass tasks are evaluated with balanced accuracy and binary tasks are evaluated with AUROC. All error bars represent s.d. based on bootstrapping (n = 1,000). Dashed lines represent average performance for respective models (red, TITAN; teal, PRISM). c, Ablation study comparing different pretraining strategies, assessed with zero-shot performance averaged across TCGA-UT-8K, TCGA-OT, OT108 and EBRAINS. Evaluations are based on the percentage changes of balanced accuracy from the reference zero-shot performance of TITAN. d, Report-generation evaluation on TCGA-Slide-Reports, evaluated using METEOR, ROUGE and BLEU. All error bars represent s.d. based on bootstrapping (n = 1,000). e, TCGA examples of generated reports of TITAN and PRISM, with the corresponding clinical reports. Additional examples of generated reports are available in Extended Data Fig. 7. Statistical significance was assessed by fitting a generalized linear mixed-effects model and performing a two-sided Wald z test on the fitted model. Significance shown with respect to TITAN. ****P ≤ 0.0001.
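The zero-shot scheme in panel a reduces to a nearest-neighbor search in a shared embedding space: encode one text prompt per class, L2-normalize, and assign the query slide to the class whose prompt embedding has the highest cosine similarity. A minimal sketch with random stand-in embeddings (the 768-dimensional size is an assumption, not taken from the paper):

```python
# Minimal sketch of zero-shot classification: assign the query slide to
# the class whose text-prompt embedding is closest by cosine similarity.
# Embeddings here are random stand-ins, not real model outputs.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim = 768  # embedding size is an assumption

# One text-prompt embedding per class (e.g. one diagnosis name each).
class_prompts = l2_normalize(rng.normal(size=(4, dim)))

# A query slide embedding nudged toward class 2 so the example is decidable.
query = l2_normalize(class_prompts[2] + 0.05 * rng.normal(size=dim))

# After L2 normalization, cosine similarity is just a dot product.
scores = class_prompts @ query
predicted_class = int(np.argmax(scores))
print(predicted_class)
```

No labeled slides are used at any point: the class "weights" are the text-prompt embeddings themselves, which is why the evaluation requires no fine-tuning.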
Fig. 4
Fig. 4. Retrieval capabilities of TITAN.
a, Slide retrieval results on rare cancer retrieval tasks assessed with Accuracy@K, with K = {1, 3, 5}. Rare-Cancer (internal rare cancer cohort) consists of TCGA, EBRAINS and the MGB internal cohort, with 43 rare and 143 common cancer types for a total of 186 classes. Rare-Cancer-Public (public rare cancer cohort) consists of TCGA and EBRAINS only, with 29 rare and 98 common cancer types for a total of 127 classes. Rare-Cancer-External consists of 12 rare cancer types for the ovary and soft tissue, curated at Kanagawa Cancer Center Hospital, Japan. b, Example of rare cancer retrieval on Rare-Cancer with the query slide and four representative retrieved slides. The number indicates the cosine similarity between the query and the retrieved slide. Additional examples of rare cancer retrieval are available in Extended Data Fig. 8. c, Slide retrieval results on five subtyping tasks. Mean represents the average performance across three shots. d, Report-to-slide and slide-to-report cross-modal retrieval performance assessed with Recall@K, with K = {1, 3, 5, 10}, on the TCGA cohort of 10,108 pairs of WSIs and reports for TITAN and PRISM. Mean represents the average performance across four shots. All error bars represent s.d. based on bootstrapping (n = 1,000). Statistical significance was assessed by fitting a generalized linear mixed-effects model and performing a two-sided Wald z test on the fitted model. Significance shown with respect to TITAN. P values for nonsignificant results are shown. **P ≤ 0.01, ***P ≤ 0.001, ****P ≤ 0.0001.
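Accuracy@K, the retrieval metric used in panels a and c, counts a query as correct when any of its top-K most similar database slides carries the same class label. A minimal sketch on synthetic embeddings (all sizes and class counts are illustrative, not from the paper):

```python
# Sketch of slide retrieval scored with Accuracy@K: a query counts as a
# hit if any of the top-K most cosine-similar database slides shares its
# class label. Embeddings are random stand-ins for real slide embeddings.
import numpy as np

def accuracy_at_k(query_emb, db_emb, query_labels, db_labels, k):
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    sims = q @ d.T                            # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of top-K matches
    hits = (db_labels[topk] == query_labels[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
n_classes, dim = 5, 64
means = rng.normal(size=(n_classes, dim))

def make(n_per_class, noise=0.3):
    X = np.vstack([m + noise * rng.normal(size=(n_per_class, dim)) for m in means])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

db_emb, db_labels = make(20)       # retrieval database
query_emb, query_labels = make(5)  # held-out queries

acc = {k: accuracy_at_k(query_emb, db_emb, query_labels, db_labels, k)
       for k in (1, 3, 5)}
for k, v in acc.items():
    print(f"Accuracy@{k}: {v:.2f}")
```

By construction Accuracy@K is nondecreasing in K, since the top-1 candidate set is contained in the top-3 and top-5 sets.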
Extended Data Fig. 1
Extended Data Fig. 1. Examples of TCGA-UT-8K dataset.
Examples of TCGA-UT-8K, which are ROIs of 8,192 × 8,192 pixels selected by the pathologists. The green contours illustrate the cancer region annotations, with the red number indicating the ROI index within a given TCGA slide.
Extended Data Fig. 2
Extended Data Fig. 2. Linear probe results for molecular classification tasks.
(a) Linear models are fitted and evaluated on binary molecular status predictions for BCNB and MUT-HET. TITAN consistently performs best, exceeding the next-best model, PRISM, in averaged AUROC by +0.9% on BCNB and MUT-HET, +1.7% on TCGA, and +3.7% on internal molecular classification of BRCA and LUAD. (b) Linear models are fitted and evaluated on five-fold splits on TCGA. (c) The same models are evaluated on the corresponding external datasets from CPTAC and EBRAINS. (d) 6-level ER and PR prediction from Mass General Hospital (MGH) and 3-level PD-L1 prediction, all from immunohistochemistry (IHC) slides. (e) Molecular classification tasks for BRCA and LUAD from Mass General Brigham (MGB). All error bars represent standard deviations based on bootstrapping (n = 1,000) or k-fold evaluation (k = 5).
Extended Data Fig. 3
Extended Data Fig. 3. UMAP of slide embedding space for TCGA-OT.
UMAP visualization of slide embeddings in TCGA-OT cohort (n = 11,186) for all slide encoder baselines, including TITAN and TITANV, color-coded by different organs for visual decluttering.
Extended Data Fig. 4
Extended Data Fig. 4. UMAP of TCGA-OT slide representations (n = 11,186) from all slide encoders.
The first row is labeled by OncoTreeCode, the second row by OncoTreeSiteCode, and the third row by submission site. Clustering metrics, mean local diversity (mLD), adjusted Rand index (ARI) and normalized mutual information (NMI), are computed for all labels. Note that CHIEF includes TCGA in its pretraining dataset.
Extended Data Fig. 5
Extended Data Fig. 5. Attention heatmaps of TITAN.
Exemplar attention heatmaps for three Transformer attention heads of TITAN (head #4, #10, #11) are shown across three different TCGA WSIs. Out of the 12 attention heads, we find that most attention heads focus on dense tumor regions, with certain attention heads such as head #10 focusing on tumor-adjacent stroma and head #11 focusing on non-tumor areas. Across different cancer types, while head #11 attends to tissue-specific morphologies such as peritumoral stroma in the thymoma WSI and the tumor-adjacent stroma and ducts in the BRCA WSI, we do observe that general morphological patterns such as tumor/non-tumor are conserved across tissue types.
Extended Data Fig. 6
Extended Data Fig. 6. Ablation experiments on different learning paradigms.
Change in balanced accuracy performance for several learning paradigms on four subtyping tasks with respect to the linear probe. The baselines include mean pooling, ABMIL, linear probe, and fine-tuning from pretrained or randomly initialized weights. The number under each task name indicates the linear probe performance. TITAN-L represents the variation of TITAN without vision pretraining. For mean pooling and ABMIL, we use the respective patch encoder for each framework, as specified under each slide encoder name. Fine-tuning results are not provided for PRISM, as the fine-tuning recipes were not available.
Extended Data Fig. 7
Extended Data Fig. 7. Examples of generated reports.
TCGA examples of generated reports of TITAN and PRISM, with the corresponding clinical reports.
Extended Data Fig. 8
Extended Data Fig. 8. Rare cancer retrieval with TITAN.
(a)–(c) Examples of slide retrieval on Rare-Cancer. The number for each retrieved slide represents the cosine similarity between the query and the retrieved slide. The retrieved slides with high similarity are either of the same diagnostic label or from the same organ as the query slide. (a) Thyroid (THFO) query; (b) pleura (PLBMESO) query; (c) adrenal gland (ACC) query.

