Nat Med. 2025 Nov;31(11):3749-3761.
doi: 10.1038/s41591-025-03982-3. Epub 2025 Nov 5.

A multimodal whole-slide foundation model for pathology


Tong Ding et al. Nat Med. 2025 Nov.

Abstract

The field of computational pathology has been transformed with recent advances in foundation models that encode histopathology regions of interest (ROIs) into versatile and transferable feature representations via self-supervised learning. However, translating these advancements to address complex clinical challenges at the patient and slide level remains constrained by limited clinical data in disease-specific cohorts, especially for rare clinical conditions. We propose Transformer-based pathology Image and Text Alignment Network (TITAN), a multimodal whole-slide foundation model pretrained using 335,645 whole-slide images via visual self-supervised learning and vision-language alignment with corresponding pathology reports and 423,122 synthetic captions generated from a multimodal generative AI copilot for pathology. Without any fine-tuning or requiring clinical labels, TITAN can extract general-purpose slide representations and generate pathology reports that generalize to resource-limited clinical scenarios such as rare disease retrieval and cancer prognosis. We evaluate TITAN on diverse clinical tasks and find that it outperforms both ROI and slide foundation models across machine learning settings, including linear probing, few-shot and zero-shot classification, rare cancer retrieval, cross-modal retrieval and pathology report generation.


Conflict of interest statement

Competing interests: R.J.C., M.Y.L., D.F.K.W., B.C., L.P.L. and F.M. hold equity interests in ModellaAI. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of TITAN.
a, Tissue site distribution of Mass-340K used for TITANV pretraining (stage 1). Mass-340K includes 335,645 WSIs across 20 organs with a mix of tissue sections stained with H&E (89.7%), IHC (7.9%), special stains (2.3%) and others (0.1%), or a mix of neoplastic (70.0%), tissue damage response (8.4%), normal (4.7%), inflammatory (3.4%) and others (13.5%), scanned with diverse scanner types. TITAN pretraining (stages 2 and 3) uses a subset of Mass-340K with paired captions and medical reports. b–d, Block diagram of TITAN pretraining. b, TITAN uses a ViT to encode a WSI into a slide embedding. c, TITANV (stage 1) is pretrained using SSL with student–teacher knowledge distillation. d, TITAN (stages 2 and 3) is pretrained using vision-language modeling, first by aligning the slide embedding with synthetic captions (stage 2) and then with medical reports (stage 3). e, UMAP visualization of TCGA slide embeddings obtained with TITAN, color-coded by organ. UMAP, uniform manifold approximation and projection; px, pixel.
Fig. 2
Fig. 2. TITAN evaluation.
a, Impact of pretraining data size on TITANV and baselines across four challenging subtyping tasks. TITANV is pretrained with 12.5%, 25%, 50% and 100% of Mass-340K. b, The average performance of the four tasks against the number of parameters. c, Linear probe evaluation of TITAN and baselines on morphological classification, molecular status and survival prediction tasks. The mean pooling baseline uses the same patch encoder as TITAN (CONCHv1.5). Multiclass tasks are evaluated with balanced accuracy, binary tasks with AUROC and survival tasks with the concordance index. For external cohorts (DHMC, CPTAC), the classifier is trained on the corresponding TCGA cohort. All error bars represent s.d. based on bootstrapping (n = 1,000) or k-fold evaluation (k = 5). d, Ablation for positional encoding, number of transformer layers and inclusion of the vision-pretraining stage. The performance is averaged across the four subtyping tasks. e, Change in performance of slide encoders averaged across the four subtyping tasks for different learning paradigms. For mean pooling and ABMIL, the respective patch encoder for each framework is used. PRISM fine-tuning is not evaluated as the fine-tuning recipes are not provided. f, Linear probe few-shot performance using K shots, K ∈ {1, 2, 4, 8, 16}, comparing baselines and ABMIL with CONCHv1.5. For each setting, 50 runs were performed. The center of each box plot (horizontal line) represents the median, with whiskers extending to data points within 1.5× the interquartile range. Statistical significance was assessed by fitting a generalized linear mixed-effects model and performing a two-sided Wald z test on the fitted model. Significance shown with respect to TITAN. P values for nonsignificant results are shown. **P ≤ 0.01, ***P ≤ 0.001, ****P ≤ 0.0001. C, number of classes; Ft., fine-tune.
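As a concrete illustration of the linear probe protocol described above (panels c and f), the sketch below fits a logistic-regression probe on frozen slide embeddings with K labeled examples per class and scores it with balanced accuracy. The embeddings, their 768-dimensional size and the class structure are synthetic stand-ins, not actual TITAN outputs.

```python
# Sketch of a K-shot linear probe on frozen slide embeddings.
# Random vectors stand in for real slide-encoder outputs; the 768-dim
# embedding size is an assumption for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
n_classes, dim, k_shots = 3, 768, 4

# Give each class a distinct mean so the probe has signal to learn.
means = rng.normal(size=(n_classes, dim))

def sample(n_per_class):
    X = np.vstack([m + 0.5 * rng.normal(size=(n_per_class, dim)) for m in means])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

X_train, y_train = sample(k_shots)  # K labeled slides per class
X_test, y_test = sample(50)         # held-out evaluation slides

# The "probe" is a single linear layer; the encoder stays frozen.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
bal_acc = balanced_accuracy_score(y_test, probe.predict(X_test))
print(f"balanced accuracy: {bal_acc:.2f}")
```

Because only the linear head is trained, the probe's accuracy directly reflects how linearly separable the frozen embeddings are, which is what panels c and f compare across encoders.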
Fig. 3
Fig. 3. Visual-language evaluation of TITAN.
a, A schematic for zero-shot evaluation. The query slide is classified by identifying the closest text prompt embedding in the slide embedding space. b, Zero-shot performance of TITAN and PRISM. All multiclass tasks are evaluated with balanced accuracy and binary tasks are evaluated with AUROC. All error bars represent s.d. based on bootstrapping (n = 1,000). Dashed lines represent average performance for respective models (red, TITAN; teal, PRISM). c, Ablation study comparing different pretraining strategies, assessed with zero-shot performance averaged across TCGA-UT-8K, TCGA-OT, OT108 and EBRAINS. Evaluations are based on the percentage changes of balanced accuracy from the reference zero-shot performance of TITAN. d, Report-generation evaluation on TCGA-Slide-Reports, evaluated using METEOR, ROUGE and BLEU. All error bars represent s.d. based on bootstrapping (n = 1,000). e, TCGA examples of generated reports of TITAN and PRISM, with the corresponding clinical reports. Additional examples of generated reports are available in Extended Data Fig. 7. Statistical significance was assessed by fitting a generalized linear mixed-effects model and performing a two-sided Wald z test on the fitted model. Significance shown with respect to TITAN. ****P ≤ 0.0001.
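The zero-shot scheme in panel a reduces to a nearest-neighbor search in a shared embedding space: encode one text prompt per class, L2-normalize, and assign the query slide to the class whose prompt embedding has the highest cosine similarity. A minimal sketch with random stand-in embeddings (the 768-dimensional size is an assumption, not taken from the paper):

```python
# Minimal sketch of zero-shot classification: assign the query slide to
# the class whose text-prompt embedding is closest by cosine similarity.
# Embeddings here are random stand-ins, not real model outputs.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim = 768  # embedding size is an assumption

# One text-prompt embedding per class (e.g. one diagnosis name each).
class_prompts = l2_normalize(rng.normal(size=(4, dim)))

# A query slide embedding nudged toward class 2 so the example is decidable.
query = l2_normalize(class_prompts[2] + 0.05 * rng.normal(size=dim))

# After L2 normalization, cosine similarity is just a dot product.
scores = class_prompts @ query
predicted_class = int(np.argmax(scores))
print(predicted_class)
```

No labeled slides are used at any point: the class "weights" are the text-prompt embeddings themselves, which is why the evaluation requires no fine-tuning.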
Fig. 4
Fig. 4. Retrieval capabilities of TITAN.
a, Slide retrieval results on rare cancer retrieval tasks assessed with Accuracy@K, with K = {1, 3, 5}. Rare-Cancer (internal rare cancer cohort) consists of TCGA, EBRAINS and the MGB internal cohort, with 43 rare and 143 common cancer types for a total of 186 classes. Rare-Cancer-Public (public rare cancer cohort) consists of TCGA and EBRAINS only, with 29 rare and 98 common cancer types for a total of 127 classes. Rare-Cancer-External consists of 12 rare cancer types for the ovary and soft tissue, curated at Kanagawa Cancer Center Hospital, Japan. b, Example of rare cancer retrieval on Rare-Cancer with the query slide and four representative retrieved slides. The number indicates the cosine similarity between the query and the retrieved slide. Additional examples of rare cancer retrieval are available in Extended Data Fig. 8. c, Slide retrieval results on five subtyping tasks. Mean represents the average performance across three shots. d, Report-to-slide and slide-to-report cross-modal retrieval performance assessed with Recall@K, with K = {1, 3, 5, 10}, on the TCGA cohort of 10,108 pairs of WSIs and reports for TITAN and PRISM. Mean represents the average performance across four shots. All error bars represent s.d. based on bootstrapping (n = 1,000). Statistical significance was assessed by fitting a generalized linear mixed-effects model and performing a two-sided Wald z test on the fitted model. Significance shown with respect to TITAN. P values for nonsignificant results are shown. **P ≤ 0.01, ***P ≤ 0.001, ****P ≤ 0.0001.
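Accuracy@K, the retrieval metric used in panels a and c, counts a query as correct when any of its top-K most similar database slides carries the same class label. A minimal sketch on synthetic embeddings (all sizes and class counts are illustrative, not from the paper):

```python
# Sketch of slide retrieval scored with Accuracy@K: a query counts as a
# hit if any of the top-K most cosine-similar database slides shares its
# class label. Embeddings are random stand-ins for real slide embeddings.
import numpy as np

def accuracy_at_k(query_emb, db_emb, query_labels, db_labels, k):
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    sims = q @ d.T                            # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of top-K matches
    hits = (db_labels[topk] == query_labels[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
n_classes, dim = 5, 64
means = rng.normal(size=(n_classes, dim))

def make(n_per_class, noise=0.3):
    X = np.vstack([m + noise * rng.normal(size=(n_per_class, dim)) for m in means])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

db_emb, db_labels = make(20)       # retrieval database
query_emb, query_labels = make(5)  # held-out queries

acc = {k: accuracy_at_k(query_emb, db_emb, query_labels, db_labels, k)
       for k in (1, 3, 5)}
for k, v in acc.items():
    print(f"Accuracy@{k}: {v:.2f}")
```

By construction Accuracy@K is nondecreasing in K, since the top-1 candidate set is contained in the top-3 and top-5 sets.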
Extended Data Fig. 1
Extended Data Fig. 1. Examples of TCGA-UT-8K dataset.
Examples of TCGA-UT-8K, which are ROIs of 8,192 × 8,192 pixels selected by the pathologists. The green contours illustrate the cancer region annotations, with the red number indicating the ROI index within a given TCGA slide.
Extended Data Fig. 2
Extended Data Fig. 2. Linear probe results for molecular classification tasks.
(a) Linear models are fitted and evaluated on binary molecular status predictions for BCNB and MUT-HET. TITAN consistently performs best, exceeding the next-best model, PRISM, in averaged AUROC by +0.9% on BCNB and MUT-HET, +1.7% on TCGA, and +3.7% on internal molecular classification of BRCA and LUAD. (b) Linear models are fitted and evaluated on five-fold splits on TCGA. (c) The same models are evaluated on the corresponding external datasets from CPTAC and EBRAINS. (d) 6-level ER and PR prediction from Mass General Hospital (MGH) and 3-level PD-L1 prediction, all from immunohistochemistry (IHC) slides. (e) Molecular classification tasks for BRCA and LUAD from Mass General Brigham (MGB). All error bars represent standard deviations based on bootstrapping (n = 1,000) or k-fold evaluation (k = 5).
Extended Data Fig. 3
Extended Data Fig. 3. UMAP of slide embedding space for TCGA-OT.
UMAP visualization of slide embeddings in TCGA-OT cohort (n = 11,186) for all slide encoder baselines, including TITAN and TITANV, color-coded by different organs for visual decluttering.
Extended Data Fig. 4
Extended Data Fig. 4. UMAP of TCGA-OT slide representations (n = 11,186) from all slide encoders.
The first row is labeled by OncoTreeCode, the second row by OncoTreeSiteCode, and the third row by submission site. Clustering metrics, mean local diversity (mLD), adjusted Rand index (ARI) and normalized mutual information (NMI), are computed for all labels. Note that CHIEF includes TCGA in its pretraining dataset.
Extended Data Fig. 5
Extended Data Fig. 5. Attention heatmaps of TITAN.
Exemplar attention heatmaps for three Transformer attention heads of TITAN (head #4, #10, #11) are shown across three different TCGA WSIs. Out of the 12 attention heads, we find that most attention heads focus on dense tumor regions, with certain attention heads such as head #10 focusing on tumor-adjacent stroma and head #11 focusing on non-tumor areas. Across different cancer types, while head #11 attends to tissue-specific morphologies such as peritumoral stroma in the thymoma WSI and the tumor-adjacent stroma and ducts in the BRCA WSI, we do observe that general morphological patterns such as tumor/non-tumor are conserved across tissue types.
Extended Data Fig. 6
Extended Data Fig. 6. Ablation experiments on different learning paradigms.
Change in balanced accuracy performance for several learning paradigms on four subtyping tasks with respect to the linear probe. The baselines include mean pooling, ABMIL, linear probe, and fine-tuning from pretrained or randomly initialized weights. The number under each task name indicates the linear probe performance. TITAN-L represents the variation of TITAN without vision pretraining. For mean pooling and ABMIL, we use the respective patch encoder for each framework, as specified under each slide encoder name. Fine-tuning results are not provided for PRISM, as the fine-tuning recipes were not available.
Extended Data Fig. 7
Extended Data Fig. 7. Examples of generated reports.
TCGA examples of generated reports of TITAN and PRISM, with the corresponding clinical reports.
Extended Data Fig. 8
Extended Data Fig. 8. Rare cancer retrieval with TITAN.
(a)–(c) Examples of slide retrieval on Rare-Cancer. The number for each retrieved slide represents the cosine similarity between the query and the retrieved slide. The retrieved slides with high similarity are either of the same diagnostic label or from the same organ as the query slide. (a) Thyroid (THFO) query; (b) pleura (PLBMESO) query; (c) adrenal gland (ACC) query.

