Nat Commun. 2025 Jan 2;16(1):136.
doi: 10.1038/s41467-024-54970-z.

A deep multiple instance learning framework improves microsatellite instability detection from tumor next generation sequencing


John Ziegler et al. Nat Commun. 2025.

Abstract

Microsatellite instability (MSI) is a critical phenotype of cancer genomes and an FDA-recognized biomarker that can guide treatment with immune checkpoint inhibitors. Previous work has demonstrated that next-generation sequencing (NGS) data can be used to identify samples with an MSI-high phenotype. However, low tumor purity, frequently observed in routine clinical samples, limits the sensitivity of existing algorithms. To overcome this issue, we developed MiMSI, an MSI classifier based on deep neural networks and trained in a multiple instance learning framework on a dataset that included low tumor purity MSI cases. On a challenging yet representative set of cases, MiMSI showed higher sensitivity (0.895) and auROC (0.971) than MSISensor (sensitivity: 0.67; auROC: 0.907), an open-source tool previously validated for clinical use at our institution using MSK-IMPACT large-panel targeted NGS data. In a separate, prospective cohort, MiMSI again outperformed MSISensor in low-purity cases (P = 8.244e-07).
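The two metrics quoted above can be illustrated with a small sketch. It assumes binary orthogonal labels (1 = MSI-H, 0 = MSS) and a per-sample classifier score; the function names, the 0.5 threshold, and the toy data are hypothetical and are not taken from the MiMSI or MSISensor code.

```python
def sensitivity(labels, scores, threshold):
    """Fraction of truly positive samples (label 1) scored at or above threshold."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    if not positives:
        raise ValueError("need at least one positive sample")
    return sum(s >= threshold for s in positives) / len(positives)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative samples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data only (not from the study): three MSI-H and three MSS samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1]
print(sensitivity(labels, scores, threshold=0.5))
print(auroc(labels, scores))
```

Sensitivity depends on the chosen threshold, whereas auROC summarizes ranking quality across all thresholds, which is why the two are reported side by side.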

Conflict of interest statement

Competing interests: John Ziegler is an employee of MongoDB, New York. Jaclyn F. Hechtman is an employee of Caris Life Sciences and has received consulting fees from Pfizer. Ryan N. Ptashkin is an employee of Natera. Gowtham Jayakumaran is an employee of Guardant Health. Sumit Middha is an employee of Adaptimmune. Shweta S. Chavan is an employee of Repertoire Immune Medicines, Cambridge, MA. Chad Vanderbilt has equity, Intellectual Property Rights, Professional Services and Activities (uncompensated) for Paige.AI. Deborah DeLair is an employee of Northwell Health, Greenvale, NY. Jinru Shia has been engaged in Professional Services and Activities (uncompensated) for Paige.AI. Nicole DeGroat is an employee of Regeneron Pharmaceuticals, Tarrytown, NY. Ryma Benayed is an employee of AstraZeneca, New York. Marc Ladanyi received advisory board compensation from Merck, Bristol-Myers Squibb, Takeda, Bayer, Lilly Oncology, and Paige.AI, and research support from LOXO Oncology and Helsinn Healthcare. Michael F. Berger received consulting fees from Eli Lilly, AstraZeneca, and Paige.AI, grant support from Boundless Bio, and has intellectual property rights in SOPHiA Genetics. Thomas J. Fuchs is the founder, chief scientist, and shareholder of Paige.AI and is an employee of Eli Lilly and Company. A. Rose Brannon has intellectual property rights in SOPHiA Genetics. Ahmet Zehir is an employee of Natera and received honoraria from Illumina. The remaining authors declare no competing interests.

Figures

Fig. 1. MiMSI model design and performance metrics.
a Schematic representation of converting sequencing reads in a genomic region into a vector representation. The reference sequence, along with the mapping quality and CIGAR string of each read, is used in the vectorization after downsampling. The set of vectors for a given sample is passed through the model (see eFigure 1). b Study cohort used for both training the model and testing its performance. c Distribution of MSISensor scores for samples with orthogonal testing performed. d Area under the receiver operating characteristic curve (auROC) analysis of the test cohort analyzed with MSISensor and MiMSI at 4 different downsampled coverage levels (100X, 200X, 300X, and 400X). e MSISensor scores and MiMSI probabilities for the test cohort. Colors indicate the orthogonal test status. Source data are provided as a Source Data file.
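As a rough illustration of panel a, the sketch below expands one aligned read into a per-position vector over a microsatellite region using its CIGAR string and mapping quality, after random downsampling of reads. The state codes (0 to 3), the mapping-quality cap of 60, and all function names are assumptions made for this example; the caption does not specify the encoding MiMSI actually uses.

```python
import random
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '3M2D2M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def read_to_vector(read_start, cigar, mapq, region_start, region_end):
    """Encode one read over [region_start, region_end) as (state, quality) pairs.

    Hypothetical state codes: 0 = not covered, 1 = aligned base,
    2 = deleted base, 3 = aligned base immediately followed by an insertion.
    """
    length = region_end - region_start
    states = [0] * length
    ref_pos = read_start
    for n, op in parse_cigar(cigar):
        if op in "M=X":
            code = 1
        elif op in "DN":
            code = 2
        elif op == "I":
            # Insertions consume no reference bases; flag the preceding base.
            idx = ref_pos - 1 - region_start
            if 0 <= idx < length:
                states[idx] = 3
            continue
        else:
            # Soft clips, hard clips, and padding consume no reference bases.
            continue
        for pos in range(ref_pos, ref_pos + n):
            idx = pos - region_start
            if 0 <= idx < length:
                states[idx] = code
        ref_pos += n
    quality = min(mapq, 60) / 60  # assumed scaling of mapping quality to [0, 1]
    return [(s, quality if s else 0.0) for s in states]


def downsample(reads, max_reads, seed=0):
    """Randomly keep at most max_reads reads (seeded for reproducibility)."""
    if len(reads) <= max_reads:
        return list(reads)
    return random.Random(seed).sample(reads, max_reads)


vector = read_to_vector(102, "3M2D2M1I2M", 60, region_start=100, region_end=112)
print([state for state, _ in vector])
```

In a real pipeline the read start, CIGAR string, and mapping quality would come from a BAM parser; they are passed in directly here to keep the sketch self-contained.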
Fig. 2. Mutational signature analysis for discrepant cases.
Bar chart showing the fraction of mutations explained by a given mutation signature: mismatch-repair deficiency (MMR: red); error-prone DNA polymerase ε (POLE: gray); and concurrent MMR and POLE (blue). All other signatures are shown in light gray. Samples shown had a minimum of 10 mutations for signature analysis and were discrepant between orthogonal testing (MMR IHC or MSI PCR) and MiMSI. Labels shown are the MiMSI categorization. Source data are provided as a Source Data file.
Fig. 3. Downsampling microsatellite loci used in classification.
MiMSI classification results after randomly downsampling the microsatellite loci used by the classifier. The average length of the confidence intervals (CI) across the 317 samples for a given number of sites used is shown above each figure. Data are presented as MiMSI score +/− 95% CI error bars; MSI-H in red, MSI-Indeterminate in teal, MSS in blue. Source data are provided as a Source Data file.
Fig. 4. Molecular and genomic features of 5037 prospective clinical cancer samples.
a Distribution plots of the tumor mutation burden (TMB) for MMR-proficient (MMR-P) versus MMR-deficient (MMR-D) tumors as determined by IHC; the MMR-D cases were further broken down by the specific MMR protein lost. Pairwise comparisons were performed using a two-sided Mann-Whitney Wilcoxon test, demonstrating a significant difference in TMB between MMR-P (n = 4195) and MMR-D (n = 842) cases (p < 2.2 × 10^-16), MLH1 loss (n = 580) vs MSH2 loss (n = 166) (p = 8.4 × 10^-7), and MSH2 loss vs MSH6 loss (n = 60) (p = 0.034). In contrast, there was no difference in TMB between MLH1 loss and MSH6 loss (p = 0.87), MLH1 loss and PMS2 loss (n = 36) (p = 0.96), and MSH6 loss and PMS2 loss (p = 0.62). In the box plots for a–c, the central line represents the median; the box corresponds to the 25–75% quartiles; the upper whisker extends from the 75% quartile to the largest value no farther than 1.5 × IQR; and the lower whisker extends from the 25% quartile to the smallest value no farther than 1.5 × IQR. b Distribution plots of indel-to-SNV ratios based on MMR and/or protein status (n same as in a). Higher indel-to-SNV ratios are common in tumors with inactivated MMR proteins. Pairwise two-sided Mann-Whitney Wilcoxon comparisons upheld significant differences between all groups (MMR-P (n = 2699) vs MMR-D (n = 789) p < 2.2 × 10^-16, MLH1 (n = 557) vs MSH2 (n = 152) p < 2.2 × 10^-16, MLH1 vs MSH6 (n = 46) p < 2.2 × 10^-16, MLH1 vs PMS2 (n = 36) p = 0.026, MSH2 vs MSH6 p = 3.6 × 10^-12, MSH6 vs PMS2 p = 8.1 × 10^-8). c Distribution plots of the fraction of the signature analysis driven by the MMR-related signatures in the subset of tumors with at least 15 mutations detected (MMR-P n = 167, MMR-D n = 719, MLH1 loss n = 506, MSH2 loss n = 143, MSH6 loss n = 37, PMS2 loss n = 35). Pairwise two-sided Mann-Whitney Wilcoxon comparisons of the MMR signature fractions were significant between MMR-P vs MMR-D (p < 2.2 × 10^-16), MLH1 vs MSH2 (p = 1 × 10^-11), MLH1 vs MSH6 (p = 4.8 × 10^-5), and MLH1 vs PMS2 (p = 0.0028), but not between either MSH2 or MSH6 and PMS2 (p = 0.44 and p = 0.2, respectively). d MSISensor scores for all tumors in each IHC category, grouped by algorithmic classification (MSI-H red, MSI-Indeterminate teal, MSS blue, Not reported gray). e MiMSI score for all tumors by IHC category, grouped by updated algorithmic score (see Supplementary Fig. 3 for MiMSI scores for each IHC category with 95% confidence intervals). f Sensitivity of MSISensor and MiMSI based on tumor purity as compared to the IHC standard. g Sensitivity of MSISensor (red triangle) and MiMSI (blue square) by cancer type. Source data are provided as a Source Data file.
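The box plot convention described for panels a to c (median line, quartile box, whiskers limited to 1.5 × IQR) can be sketched as follows. The function name, the choice of the "inclusive" quantile method, and the toy values are assumptions for illustration; the caption does not state which quantile definition the authors' plotting software used.

```python
from statistics import quantiles


def box_stats(values):
    """Return the median, quartiles, and 1.5 x IQR whisker ends for a sample."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Whiskers stop at the most extreme data points still inside the fences.
    lower_whisker = min(v for v in values if v >= q1 - 1.5 * iqr)
    upper_whisker = max(v for v in values if v <= q3 + 1.5 * iqr)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
    }


# Toy values only (not study data); 40 lies beyond the upper fence.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])
print(stats)
```

Values beyond the whiskers, such as 40 in the toy sample, would be drawn as individual outlier points under this convention.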
Fig. 5. Global comparison of MiMSI results with MSISensor.
Comparison of MSISensor scores and MiMSI scores +/− 95% CI error bars for a cohort of 45,112 tumor samples. Source data are provided as a Source Data file.
Fig. 6. Tumor-only analysis of test cohort.
A subset of the test cohort (142 orthogonally MSS tumors and 132 orthogonally MSI tumors) was analyzed using the MiMSI model with attention mechanism with an unrelated normal comparator (a) and a pooled normal comparator (b). c, d show the same comparison using the MiMSI model without an attention mechanism. Data are presented as MiMSI score +/− 95% CI error bars; MiMSI class MSI-H in red, MSI-Indeterminate in teal; MSS in blue. Source data are provided as a Source Data file.
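To make the distinction between models with and without an attention mechanism concrete, the sketch below contrasts generic attention-weighted pooling with plain mean pooling over a bag of instance vectors, as is common in multiple instance learning. This is a textbook-style illustration, not the MiMSI architecture: the scoring vector `w` is a hypothetical stand-in for learned parameters, and the instance vectors are toy values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, w):
    """Weight each instance by softmax(instance . w) and sum into one bag vector."""
    scores = [sum(a * b for a, b in zip(inst, w)) for inst in instances]
    weights = softmax(scores)
    dim = len(instances[0])
    pooled = [
        sum(wt * inst[d] for wt, inst in zip(weights, instances))
        for d in range(dim)
    ]
    return pooled, weights


def mean_pool(instances):
    """Pooling without attention: every instance contributes equally."""
    dim = len(instances[0])
    return [sum(inst[d] for inst in instances) / len(instances) for d in range(dim)]


# Toy bag of three instance embeddings (for example, one per microsatellite locus).
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(bag, w=[5.0, 0.0])
print(weights)
print(pooled)
print(mean_pool(bag))
```

With an all-zero scoring vector the attention weights become uniform and the result reduces to mean pooling, which is one way to see the two variants as points on the same spectrum.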
Fig. 7. Analysis of the WES dataset.
Scatterplot showing the MiMSI scores +/− 95% CI error bars for 581 samples with MSK-IMPACT and WES data captured from the same DNA library. Source data are provided as a Source Data file.
