Predicting microsatellite instability and key biomarkers in colorectal cancer from H&E-stained images: achieving state-of-the-art predictive performance with fewer data using Swin Transformer

doi:10.1002/cjp2.312

J Pathol Clin Res. 2023 May;9(3):223-235.

doi: 10.1002/cjp2.312. Epub 2023 Feb 1.

Bangwei Guo¹, Xingyu Li², Miaomiao Yang³, Jitendra Jonnagaddala⁴, Hong Zhang², Xu Steven Xu⁵

Affiliations

¹ School of Data Science, University of Science and Technology of China, Hefei, PR China.
² Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei, PR China.
³ Clinical Pathology Center, The First Affiliated Hospital of Anhui Medical University, Hefei, PR China.
⁴ School of Population Health, UNSW Sydney, Kensington, New South Wales, Australia.
⁵ Clinical Pharmacology and Quantitative Science, Genmab Inc., Princeton, NJ, USA.

PMID: 36723384
PMCID: PMC10073932
DOI: 10.1002/cjp2.312

Bangwei Guo et al. J Pathol Clin Res. 2023 May.

Abstract

Many artificial intelligence models have been developed to predict clinically relevant biomarkers for colorectal cancer (CRC), including microsatellite instability (MSI). However, existing deep learning networks require large training datasets, which are often hard to obtain. In this study, based on the latest Hierarchical Vision Transformer using Shifted Windows (Swin Transformer [Swin-T]), we developed an efficient workflow that requires relatively small datasets to predict biomarkers in CRC (MSI, hypermutation, chromosomal instability, CpG island methylator phenotype, and BRAF and TP53 mutation). Our Swin-T workflow achieved state-of-the-art (SOTA) predictive performance in an intra-study cross-validation experiment on the Cancer Genome Atlas colon and rectal cancer dataset (TCGA-CRC-DX). It also generalized well in cross-study external validation, delivering a SOTA area under the receiver operating characteristic curve (AUROC) of 0.90 for MSI when trained on the Molecular and Cellular Oncology (MCO) dataset (N = 1,065) and tested on TCGA-CRC-DX (N = 462). A recent study reported similar performance (AUROC = 0.91) on the same testing dataset using a ResNet18 model trained on ~8,000 samples. Swin-T was highly efficient with small training datasets and exhibited robust predictive performance with 200-500 training samples. Our findings indicate that Swin-T could be 5-10 times more efficient than existing algorithms for MSI prediction based on ResNet18 and ShuffleNet. Furthermore, the Swin-T models accurately predicted MSI and BRAF mutation status, so they could be used to exclude samples before subsequent standard testing in a cascading diagnostic workflow, reducing the number of samples tested and, in turn, turnaround time and costs.

Keywords: Swin Transformer; biomarkers; colorectal cancer; deep learning; digital pathology.

Figures

**Figure 1**
Workflow of data preprocessing and DL model training. (A) Tile images from NCT‐CRC‐HE‐100K are downloaded from a publicly available website (https://zenodo.org/record/1214456) to pre‐train a Swin‐T‐based tissue classifier. The classifier performs well at classifying tissues (overall accuracy = 96.3%) and detecting tumor tiles (accuracy = 98%) in an external dataset, CRC‐VAL‐HE‐7K. (B) WSIs in SVS format from the MCO and TCGA datasets are tessellated into nonoverlapping tiles of 512 × 512 pixels. These tiles are then resized to 224 × 224 pixels and color normalized, and the tumor tiles are selected. (C) For each patient, up to 500 tiles are randomly sampled for subsequent experiments. The pre‐trained tissue classifier from (A) is then fine‐tuned to predict the biomarker status of each tile. The tile‐level probabilities are averaged to derive the patient‐level prediction. Model performance is evaluated in two separate experiments: an intra‐cohort four‐fold cross‐validation and an inter‐cohort external validation.
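
Panels (B) and (C) describe three mechanical steps: cutting a slide into nonoverlapping 512 × 512 tiles, sampling up to 500 tumor tiles per patient, and averaging tile-level probabilities into a patient-level score. The Python sketch below illustrates those steps only. The function names, the fixed random seed, the choice to drop partial tiles at the slide edge, and the example probabilities are all assumptions made for illustration, not the authors' implementation.

```python
import random
from statistics import mean

TILE_SIZE = 512              # tile edge in pixels, from the caption
MAX_TILES_PER_PATIENT = 500  # sampling cap, from the caption


def tile_origins(width, height, tile=TILE_SIZE):
    """Top-left corners of nonoverlapping tiles covering a slide.

    Partial tiles at the right and bottom edges are dropped (an assumption).
    """
    return [
        (x, y)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]


def sample_tiles(tile_ids, max_tiles=MAX_TILES_PER_PATIENT, seed=0):
    """Randomly keep at most `max_tiles` tiles for one patient."""
    tile_ids = list(tile_ids)
    if len(tile_ids) <= max_tiles:
        return tile_ids
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    return rng.sample(tile_ids, max_tiles)


def patient_score(tile_probs):
    """Average tile-level probabilities into one patient-level score."""
    if not tile_probs:
        raise ValueError("patient has no tumor tiles")
    return mean(tile_probs)


# Hypothetical tile-level model outputs for two patients.
tile_probs_by_patient = {
    "patient_A": [0.9, 0.8, 0.7],
    "patient_B": [0.1, 0.3],
}
patient_scores = {
    pid: patient_score(probs) for pid, probs in tile_probs_by_patient.items()
}
origins = tile_origins(1024, 1100)   # a small made-up slide size
sampled = sample_tiles(range(1200))  # a patient with 1,200 candidate tiles
```

Averaging gives every sampled tile equal weight, so a patient with many tiles is not favored over one with few once the 500-tile cap is applied.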

**Figure 2**
Predictive performance of four‐fold cross‐validation of Swin‐T based prediction of colorectal cancer biomarkers in the TCGA‐CRC‐DX cohort. AUROC plots for prediction of hypermutation (HM), MSI, CING, CIMP, *BRAF* mutation status, and *TP53* mutation status. The true positive rate represents sensitivity and the false positive rate represents 1 − specificity. The red shaded areas represent the SD. The value in the lower right of each plot represents mean AUROC ± SD.
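
The mean AUROC ± SD reported in each plot comes from four cross-validation folds. A minimal Python sketch of a patient-level four-fold split and the fold summary is given below. The shuffling scheme, the seed, the use of the sample standard deviation, and the four AUROC values are assumptions chosen for illustration; the values are made up and are not results from the study.

```python
import random
from statistics import mean, stdev


def four_fold_splits(patient_ids, n_folds=4, seed=0):
    """Yield (train_ids, test_ids) pairs, splitting by patient.

    Splitting by patient keeps all tiles from one patient in the same fold.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_ids = folds[k]
        train_ids = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train_ids, test_ids


def summarize(fold_aurocs):
    """Return (mean, sample SD) of the per-fold AUROC values."""
    return mean(fold_aurocs), stdev(fold_aurocs)


splits = list(four_fold_splits(range(10)))
# Made-up per-fold AUROCs, for illustration only.
auroc_mean, auroc_sd = summarize([0.90, 0.92, 0.88, 0.94])
```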

**Figure 3**
Predictive performance of intra‐cohort four‐fold cross‐validation in the MCO cohort and inter‐cohort external validation in the TCGA‐CRC‐DX cohort: MSI, *BRAF* mutation status (*BRAF*), CIMP. (A) AUROC plots for four‐fold cross‐validation in MCO cohort. The red shaded areas represent the SD. The value in the lower right of each plot represents mean AUROC ± SD. (B) AUROC plots for inter‐cohort external validation in TCGA‐CRC‐DX cohort. The red shaded areas represent the 95% confidence interval (CI), calculated by 1,000× bootstrap. The values in the lower right of each plot represent mean AUROC (95% CI).
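
Panel (B) reports a 95% CI for AUROC from 1,000 bootstrap resamples. One common way to compute such an interval is the percentile bootstrap over patients, sketched below in Python. The caption does not say which bootstrap variant was used, so the percentile method, the handling of single-class resamples, and all names here are assumptions.

```python
import random


def auroc(labels, scores):
    """AUROC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class; draw again (an assumption)
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - lo_idx - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical patient-level labels and scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
point_estimate = auroc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores)
```

The pairwise AUROC above is quadratic in the number of patients, which is acceptable for cohorts of a few hundred; a rank-based implementation would scale better.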

**Figure 4**
Test statistics for the pre‐screening tool. Test performance for MSI status, *BRAF* mutation, and CIMP status in the TCGA‐CRC‐DX cohort, displayed as patients classified as true/false positive/negative by the Swin‐T model based on a 95% sensitivity threshold and fixed thresholds (0.25, 0.5, and 0.75).
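
The classification in Figure 4 depends on where the score cut-off is placed. The Python sketch below shows one way to pick a cut-off that reaches a target sensitivity and to count true/false positives and negatives at that cut-off or at a fixed one. The caption does not describe how the 95% sensitivity threshold was derived, so the selection rule, the function names, and the example data are assumptions.

```python
import math


def threshold_for_sensitivity(labels, scores, target=0.95):
    """Largest cut-off that still flags at least `target` of the positives."""
    pos = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    if not pos:
        raise ValueError("no positive cases")
    k = math.ceil(target * len(pos))  # number of positives that must be flagged
    return pos[k - 1]


def confusion_counts(labels, scores, threshold):
    """Count TP/FP/TN/FN when score >= threshold is called positive."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive:
            counts["TP" if y == 1 else "FP"] += 1
        else:
            counts["FN" if y == 1 else "TN"] += 1
    return counts


# Hypothetical patient-level labels (1 = biomarker positive) and scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

high_sens_threshold = threshold_for_sensitivity(labels, scores, target=0.95)
at_high_sens = confusion_counts(labels, scores, high_sens_threshold)
at_fixed = confusion_counts(labels, scores, 0.5)
```

In a pre-screening setting, patients scored below the cut-off (TN plus FN) are the ones who would skip subsequent standard testing, so a high-sensitivity cut-off keeps FN low at the cost of more FP.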

**Figure 5**
Visualization of the reader study of representative TP (MSI) and TN (MSS) cases. (A–D) Tissue slides for TP cases and signature pathological features identified by the pathologist. (E) Tissue slides for TN cases and signature pathological features identified by the pathologist.

**Figure 6**
Visualization of the reader study of representative misclassified cases. (A–C) Tissue slides for FP cases and potential confounding pathological features and misclassification reasons identified by the pathologist. (D–F) Tissue slides for FN cases and potential confounding pathological features and misclassification reasons identified by the pathologist.

Cited by

Revolutionizing gastroenterology and hepatology with artificial intelligence: From precision diagnosis to equitable healthcare through interdisciplinary practice.
Chen ZL, Wang C, Wang F. World J Gastroenterol. 2025 Jun 28;31(24):108021. doi: 10.3748/wjg.v31.i24.108021. PMID: 40599184. Review.

An interpretable deep learning model for detecting BRCA pathogenic variants of breast cancer from hematoxylin and eosin-stained pathological images.
Li Y, Xiong X, Liu X, Wu Y, Li X, Liu B, Lin B, Li Y, Xu B. PeerJ. 2024 Oct 28;12:e18098. doi: 10.7717/peerj.18098. PMID: 39484212.

Deep Gaussian process with uncertainty estimation for microsatellite instability and immunotherapy response prediction from histology.
Park S, Pettigrew MF, Cha YJ, Kim IH, Kim M, Banerjee I, Barnfather I, Clemenceau JR, Jang I, Kim H, Kim Y, Pai RK, Park JH, Samadder NJ, Song KY, Sung JY, Cheong JH, Kang J, Lee SH, Wang SC, Hwang TH. NPJ Digit Med. 2025 May 19;8(1):294. doi: 10.1038/s41746-025-01580-8. PMID: 40389599.

Development and deployment of a histopathology-based deep learning algorithm for patient prescreening in a clinical trial.
Juan Ramon A, Parmar C, Carrasco-Zevallos OM, Csiszer C, Yip SSF, Raciti P, Stone NL, Triantos S, Quiroz MM, Crowley P, Batavia AS, Greshock J, Mansi T, Standish KA. Nat Commun. 2024 Jun 1;15(1):4690. doi: 10.1038/s41467-024-49153-9. PMID: 38824132.

Pathomics in Gastrointestinal Tumors: Research Progress and Clinical Applications.
Lv C, Wu Y. Cureus. 2025 May 29;17(5):e85060. doi: 10.7759/cureus.85060. PMID: 40452669. Review.
