Nat Med. 2025 Jan;31(1):315-322.
doi: 10.1038/s41591-024-03301-2. Epub 2024 Nov 4.

Clinical validation of an AI-based pathology tool for scoring of metabolic dysfunction-associated steatohepatitis


Hanna Pulaski et al. Nat Med. 2025 Jan.

Abstract

Metabolic dysfunction-associated steatohepatitis (MASH) is a major cause of liver-related morbidity and mortality, yet treatment options are limited. Manual scoring of liver biopsies, currently the gold standard for clinical trial enrollment and endpoint assessment, suffers from high reader variability. This study represents the most comprehensive multisite analytical and clinical validation of an artificial intelligence (AI)-based pathology system, AI-based measurement of metabolic dysfunction-associated steatohepatitis (AIM-MASH), to assist pathologists in MASH trial histology scoring. AIM-MASH demonstrated high repeatability and reproducibility compared to manual scoring. AIM-MASH-assisted reads by expert MASH pathologists were superior to unassisted reads in accurately assessing inflammation, ballooning, MAS ≥ 4 with ≥1 in each score category and MASH resolution, while maintaining non-inferiority in steatosis and fibrosis assessment. These findings suggest that AIM-MASH could mitigate reader variability, providing a more reliable assessment of therapeutics in MASH clinical trials.
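The aggregate criteria named in the abstract can be written as simple predicates over the four component scores. The sketch below follows the definitions given in the Fig. 3 caption (MAS ≥ 4 with at least 1 in each component, fibrosis score of 2 or 3, and resolution as ballooning 0 with lobular inflammation 0 or 1). The class and function names are hypothetical, and the score ranges in the comments are the standard NASH CRN ranges, which this page does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MashScores:
    """Component scores for one biopsy (hypothetical container)."""
    steatosis: int      # assumed range 0-3 (NASH CRN)
    inflammation: int   # lobular inflammation, assumed range 0-3
    ballooning: int     # assumed range 0-2
    fibrosis: int       # assumed range 0-4


def mas(s: MashScores) -> int:
    """Activity score: sum of steatosis, lobular inflammation and ballooning."""
    return s.steatosis + s.inflammation + s.ballooning


def meets_mas_enrollment(s: MashScores) -> bool:
    """MAS >= 4 with a score of at least 1 in each component."""
    return mas(s) >= 4 and min(s.steatosis, s.inflammation, s.ballooning) >= 1


def has_f2_or_f3(s: MashScores) -> bool:
    """Fibrosis score of 2 or 3."""
    return s.fibrosis in (2, 3)


def is_mash_resolution(s: MashScores) -> bool:
    """Ballooning 0, lobular inflammation 0 or 1, any steatosis score."""
    return s.ballooning == 0 and s.inflammation <= 1
```

For instance, a biopsy scored steatosis 3, inflammation 1, ballooning 0 has MAS 4 but fails the enrollment predicate because ballooning is 0, and it does satisfy the resolution predicate.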


Conflict of interest statement

Competing interests: H.P., H.H., A.S.-M., R.E., N.P., A.H.B. and N.P.A. are full-time, salaried employees of PathAI. K.E.W. was a full-time employee of PathAI during all phases of the study and is now a paid consultant of PathAI. S.A.H. is a paid consultant for Akero Therapeutics, Aligos Therapeutics, Altimmune, Boehringer Ingelheim, Bluejay Therapeutics, Echosens North America, Galecto, Gilead Sciences, GlaxoSmithKline, Hepion Pharmaceuticals, Hepta Bio, HistoIndex, Kriya Therapeutics, Madrigal Pharmaceuticals, Medpace, MGGM Therapeutics, NeuroBo Pharmaceuticals, Northsea Therapeutics, Novo Nordisk, Pfizer, Sagimet Biosciences, Terns and Viking Therapeutics and a shareholder of Akero, Cirius Therapeutics, Galectin Therapeutics, HistoIndex and Northsea Therapeutics. S.S.M., M.C.V., L.C.M., S.P.M.C., S.H.M., C.E.T. and M.C.M. were PathAI employees at the time of study conduct. J.G. and M.R. are paid contractors of PathAI. R.P.M. and G.M.S. are full-time, salaried employees of OrsoBio. C.C. is a full-time, salaried employee of Inipharm. S.D.P. is a full-time, salaried employee of Gilead Sciences. A.-S.S. is a full-time, salaried employee of Novo Nordisk. A.M. was a paid consultant for Bristol Myers Squibb. V.B. is a full-time, salaried employee of Bristol Myers Squibb. A.J.S. has stock options in Genfit, Akarna, Tiziana, Indalo, Durect, Inversago and Galmed; is a consultant to AstraZeneca, Nimbus, Takeda, Janssen, Gilead, Terns, Merck, Boehringer Ingelheim, Bristol Myers Squibb, Lilly, Novartis, Novo Nordisk, Pfizer and Genfit; and has been an unpaid consultant to Intercept, Echosens, Immuron, Galectin and Affimune Prosciento. His institution has received grant support from Gilead, Bristol Myers Squibb, Intercept, Merck, AstraZeneca and Novartis. He receives royalties from Elsevier and UpToDate. Q.M.A. is a coordinator of the EU IMI-2 LITMUS consortium, which is funded by the EU Horizon 2020 program and the EFPIA. This multistakeholder consortium includes industry partners. He has research grant funding from AstraZeneca, Boehringer Ingelheim and Intercept. He is a consultant on behalf of Newcastle University to Alimentiv, Akero, AstraZeneca, 89bio, Boehringer Ingelheim, Bristol Myers Squibb, Galmed, Genfit, Genentech, Gilead, GlaxoSmithKline, HistoIndex, Intercept, Inventiva, QVIA, Janssen, Madrigal, Merck, NGM Bio, Novartis, Novo Nordisk, PathAI, Pfizer, PharmaNest, Prosciento, Roche and Terns. He is a speaker for Novo Nordisk, Madrigal and Springer Healthcare and receives royalties from Elsevier. R.L. is a consultant to Aardvark Therapeutics, Altimmune, Alnylam–Regeneron, Amgen, Arrowhead Pharmaceuticals, AstraZeneca, Bluejay Therapeutics, Bristol Myers Squibb, Eli Lilly, Galmed, Gilead, Inipharma, Intercept, Inventiva, Ionis, Janssen, Madrigal, NGM Biopharmaceuticals, Novartis, Novo Nordisk, Merck, Pfizer, Sagimet, Theratechnologies, 89bio, Terns Pharmaceuticals and Viking Therapeutics. He is a cofounder of LipoNexus. V.R. is a paid consultant for Novo Nordisk, Northsea, Madrigal, Enyo, Poxel, Bristol Myers Squibb, Intercept, NGM Bio and Sagimet.

Figures

Fig. 1. AI-assisted workflow with representative AIM-MASH overlays and GT panel workflows.
a, In the AI-assisted workflow, the primary pathologist reviews the AIM-MASH output and does a quality control (QC) review of the Hematoxylin and Eosin (H&E) and Masson's Trichrome (MT) slides (determines whether restaining or rescanning of the slide is necessary, confirms that all trial-specific criteria are met and notes any additional findings). If the primary pathologist disagrees with any MASH component(s) by two points or more, the case goes to a review by a secondary pathologist, who independently reviews the discordant AIM-MASH score(s). If the secondary pathologist agrees with the primary pathologist’s modified score, this will be the final score; if they disagree with the primary pathologist or agree with AIM-MASH, the two pathologists will convene on a consensus call in which they agree on the final score. b, Consensus GT for each biopsy was determined by one of two panels of hepatopathologists. Each panel consisted of two main reader pathologists and an auxiliary tiebreaker pathologist. Discrepancies in scoring among the primary readers prompted the intervention of the tiebreaker pathologist, who was blind to initial assessments. When the tiebreaker’s scoring diverged from that of both primary readers, a panel discussion was convened for consensus, with the tiebreaker’s score being decisive in rare cases of continued disagreement. c, For the median GT score, when the tiebreaker’s scoring diverged from that of both primary readers, the median of the three scores was considered final. Overall, five distinct pathologists contributed to establishing the GT.
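The median GT rule in panel c can be sketched in a few lines of Python. This is a minimal illustration under one assumption that the caption implies but does not spell out: when the two primary readers agree, their shared score is final and no tiebreaker read is needed. The function name `median_gt` is hypothetical.

```python
from statistics import median
from typing import Optional


def median_gt(reader_a: int, reader_b: int, tiebreaker: Optional[int] = None) -> int:
    """Median ground-truth score for one component of one biopsy.

    If the two primary readers agree, their score is final (assumption).
    Otherwise the tiebreaker's score is required and the median of the
    three scores is returned, which equals the majority score whenever
    the tiebreaker matches one of the primary readers.
    """
    if reader_a == reader_b:
        return reader_a
    if tiebreaker is None:
        raise ValueError("primary readers disagree; a tiebreaker score is required")
    # With three integer scores, median() returns the middle one unchanged.
    return median([reader_a, reader_b, tiebreaker])
```

With primary scores 1 and 3, a tiebreaker score of 2 gives a median GT of 2, and a tiebreaker score of 3 gives 3.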
Fig. 2. Scanner repeatability and reproducibility of AIM-MASH.
a, For scanner repeatability, a subset of 150 cases from the clinical validation were scanned multiple times using the same Leica Aperio AT2 scanner at ×40 magnification on three nonconsecutive days (intrasite, interscan). b, For scanner reproducibility, the same slides were scanned once at three different laboratories by three different operators using three different Leica Aperio AT2 scanners at ×40 magnification (intersite). Bootstrap percentile P values showing statistical significance for the one-sided hypothesis that the mean agreement rate between algorithm scores for each scan is greater than 0.85 are as follows: ***P < 0.0001; **P < 0.01; *P < 0.05; not significant (NS), P ≥ 0.05. Whiskers show the 95% CIs for mean agreement rate estimated using 2,000 bootstraps. Dashed lines indicate 85% agreement.
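One possible reading of the bootstrap percentile test in this caption, sketched in Python with the standard library only. Several details are assumptions for illustration and not the study's code: agreement is reduced to a 0/1 indicator per case, cases are resampled with replacement, and the one-sided P value is taken as the share of bootstrap means at or below the 0.85 threshold. The function name and parameters are hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_agreement(
    agree: List[int],
    threshold: float = 0.85,
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float], float]:
    """Return (agreement rate, 95% percentile CI, one-sided P value).

    agree: one entry per case, 1 if the algorithm scores from the
           compared scans match, 0 otherwise (assumed encoding).
    """
    rng = random.Random(seed)
    n = len(agree)
    means = []
    for _ in range(n_boot):
        # Resample cases with replacement and record the mean agreement.
        sample = [agree[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_boot)]
    upper = means[int(0.975 * n_boot) - 1]
    # Assumed P-value definition: fraction of resamples that fail to exceed the threshold.
    p_value = sum(m <= threshold for m in means) / n_boot
    return sum(agree) / n, (lower, upper), p_value
```

Calling `bootstrap_agreement([1] * 140 + [0] * 10)` on 150 hypothetical cases returns the observed rate of 140/150 together with its interval and P value; the numbers are made up and do not come from the study.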
Fig. 3. Accuracy concordance comparison of MASH histologic components and comparisons for MASH aggregate component scores (F2 and F3 versus other and MAS ≥ 4 with ≥1 in each score category versus other) and MASH resolution.
a,b, Accuracy comparison, based on linearly weighted kappa (WK), between AIM-MASH (without pathology review) versus GT and IMR versus GT in a and between AI assisted (AIM-MASH with pathology review) versus GT and IMR versus GT in b for MASH components. c, Accuracy comparison, based on kappa, between AI assisted versus GT and IMR versus GT for aggregate components relevant to clinical trial enrollment and endpoint criteria, including the score-based enrollment requirement, MAS ≥ 4 with a score of at least one for each component, fibrosis score of 2 or 3, and the MASH resolution endpoint, defined as a ballooning score of 0, a lobular inflammation score of 0 or 1 and any score for steatosis. Point estimates are shown on top of each bar, with whiskers representing the 95% CIs estimated from 2,000 bootstrap samples. Non-inferiority (NI) was assessed using bootstrap percentile P values for testing the one-sided hypothesis that the lower bound (LB) of the 95% CIs of the difference in AIM-MASH versus GT or AI assisted versus GT and IMR versus GT is not smaller than −0.1. S (superiority) was assessed by testing the one-sided hypothesis that the LB of the difference is greater than 0. ***P < 0.0001; **P < 0.01; *P < 0.05; NS, P ≥ 0.05. ‘+’ in c indicates aggregate components where the LB of the 95% CIs for AI assisted versus GT kappa is greater than the upper bound of the IMR versus GT kappa.
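The comparison in this caption combines two pieces: a linearly weighted kappa against the reference, and a bootstrap interval on the difference between two kappas, checked against a −0.1 margin for non-inferiority and against 0 for superiority. The Python sketch below is a simplified illustration, not the study's analysis code. It assumes integer scores starting at 0, resamples cases with replacement, and decides by the 2.5th percentile of the bootstrap differences instead of reporting P values. All function and parameter names are hypothetical.

```python
import random
from typing import List, Tuple


def linear_weighted_kappa(a: List[int], b: List[int], n_levels: int) -> float:
    """Cohen's kappa with linear weights for scores in 0..n_levels-1."""
    n = len(a)
    # Mean observed disagreement, weighted by the distance between scores.
    observed = sum(abs(x - y) for x, y in zip(a, b)) / n
    pa = [a.count(i) / n for i in range(n_levels)]
    pb = [b.count(i) / n for i in range(n_levels)]
    # Disagreement expected by chance from the two marginal distributions.
    expected = sum(
        pa[i] * pb[j] * abs(i - j) for i in range(n_levels) for j in range(n_levels)
    )
    if expected == 0:
        # Both raters gave one identical constant score; treated as perfect agreement here.
        return 1.0
    return 1.0 - observed / expected


def compare_to_reference(
    test: List[int],
    control: List[int],
    reference: List[int],
    n_levels: int,
    margin: float = 0.1,
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, bool, bool]:
    """Return (lower bound of 95% CI of the kappa difference, non-inferior, superior)."""
    rng = random.Random(seed)
    n = len(reference)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ref = [reference[i] for i in idx]
        k_test = linear_weighted_kappa([test[i] for i in idx], ref, n_levels)
        k_ctrl = linear_weighted_kappa([control[i] for i in idx], ref, n_levels)
        diffs.append(k_test - k_ctrl)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    return lower, lower > -margin, lower > 0
```

Here `test` would hold the AI-assisted scores, `control` the IMR scores and `reference` the GT scores for the same cases, in the same order.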
Fig. 4. WK analysis for MASH components AI assisted and median panel comparisons.
The same cohort of 1,481 cases used in analytical and clinical validation was used to determine the accuracy of AI-assisted reads against two panels of readers. Two references were determined: median GT (panel 1), which used median scores as described in Fig. 1c instead of panel consensus calls, and median IMR (panel 2), derived from a minimum of three IMRs. AI-assisted scores for each component met the non-inferiority performance criteria described in Statistical analysis (Methods). Superiority was not observed for any of the components. Whiskers represent 95% CIs estimated using 2,000 bootstrap samples. ***P < 0.0001; **P < 0.01; *P < 0.05; NS, P ≥ 0.05.


