Nat Med. 2022 Jan;28(1):154-163. doi: 10.1038/s41591-021-01620-2. Epub 2022 Jan 13.

Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge

Wouter Bulten et al. Nat Med. 2022 Jan.

Abstract

Artificial intelligence (AI) has shown promise for diagnosing prostate cancer in biopsies. However, results have been limited to individual studies, lacking validation in multinational settings. Competitions have been shown to be accelerators for medical imaging innovations, but their impact is hindered by lack of reproducibility and independent validation. With this in mind, we organized the PANDA challenge (the largest histopathology competition to date, joined by 1,290 developers) to catalyze development of reproducible AI algorithms for Gleason grading using 10,616 digitized prostate biopsies. We validated that a diverse set of submitted algorithms reached pathologist-level performance on independent cross-continental cohorts, fully blinded to the algorithm developers. On United States and European external validation sets, the algorithms achieved agreements of 0.862 (quadratically weighted κ; 95% confidence interval (CI), 0.840-0.884) and 0.868 (95% CI, 0.835-0.900) with expert uropathologists. Successful generalization across different patient populations, laboratories and reference standards, achieved by a variety of algorithmic approaches, warrants evaluating AI-based Gleason grading in prospective clinical trials.
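The headline agreement metric, quadratically weighted Cohen's κ, penalizes each disagreement by the squared distance between the two grades, so mistaking a GG 5 cancer for benign costs far more than mistaking it for GG 4. A minimal sketch of computing it with scikit-learn, using hypothetical grade arrays (0 = benign, 1-5 = ISUP grade groups) rather than study data:

    # Quadratically weighted Cohen's kappa between algorithm-assigned and
    # reference ISUP grade groups. The arrays are hypothetical illustrations.
    from sklearn.metrics import cohen_kappa_score

    reference_gg = [0, 1, 2, 3, 4, 5, 2, 1]  # uropathologist consensus (hypothetical)
    algorithm_gg = [0, 1, 2, 3, 5, 5, 2, 0]  # algorithm output (hypothetical)

    kappa = cohen_kappa_score(reference_gg, algorithm_gg, weights="quadratic")
    print(f"quadratically weighted kappa: {kappa:.3f}")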


Conflict of interest statement

W.B. and H.P. report grants from the Dutch Cancer Society, during the conduct of the present study. J.v.d.L. reports consulting fees from Philips, ContextVision and AbbVie, and grants from Philips, ContextVision and Sectra, outside the submitted work. G.L. reports grants from the Dutch Cancer Society and the NWO, during the conduct of the present study, and grants from Philips Digital Pathology Solutions and personal fees from Novartis, outside the submitted work. M.E. reports grants from the Swedish Research Council, Swedish Cancer Society, Swedish eScience Research Center, EIT Health, Karolinska Institutet, Åke Wiberg Foundation and Prostatacancerförbundet. P. Ruusuvuori reports grants from the Academy of Finland, Cancer Foundation Finland and ERAPerMed. H.G. has five patents (WO2013EP74259 20131120, WO2013EP74270 20131120, WO2018EP52473 20180201, WO2015SE50272 20150311 and WO2013SE50554 20130516) related to prostate cancer diagnostics pending, and has patent applications licensed to A3P Biomedical. M.E. has four patents (WO2013EP74259 20131120, WO2013EP74270 20131120, WO2018EP52473 20180201 and WO2013SE50554 20130516) related to prostate cancer diagnostics pending, and has patent applications licensed to A3P Biomedical. P.-H.C.C., K.N., Y.C., D.F.S., M.D., S.D., F.T., G.S.C., L.P. and C.H.M. are employees of Google LLC and own Alphabet stock, and report several patents granted or pending on machine-learning models for medical images. M.B.A. reports receiving personal fees from Google LLC during the conduct of the present study and personal fees from Precipio Diagnostics, CellMax Life and IBEX outside the submitted work. A.E. is employed by Mackenzie Health, Toronto. T.v.d.K. is employed by University Health Network, Toronto; the time spent on the project was supported by a research agreement with financial support from Google LLC. M.Z., R.A. and P.A.H. were compensated by Google LLC for their consultation and annotations as expert uropathologists. H.Y. reports nonfinancial support from Aillis Inc. during the conduct of the present study. W.L., J.L., W.S. and C.A. have a patent (US 62/852,625) pending. K. Kim, B.B., Y.W.K., H.-S.L. and J.P. are employees of VUNO Inc. All other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of the PANDA challenge and study setup.
The global competition attracted participants from 65 countries (top: size of the circle for each country illustrates the number of participants). The study was split into two phases. First, in the development phase (bottom left), teams competed in building the best-performing Gleason grading algorithm, having full access to a development set for algorithm training and limited access to a tuning set for estimating algorithm performance. In the validation phase (bottom right), a selection of algorithms was independently evaluated on internal and external datasets against reference grading obtained through consensus across expert uropathologist panels, and compared with groups of international and US general pathologists on subsets of the data.
Fig. 2
Fig. 2. Progression of algorithms’ performances throughout the competition.
During the competition, teams could submit their algorithm for evaluation on the tuning set, after which they received their score. At the same time, algorithms were evaluated on the internal validation set, without disclosing these results to the participating teams. a,b, The top score obtained by any team (a) and the median score over all daily submissions (b) throughout the competition, showing the rapid improvement of the algorithms. c, A large fraction of teams reached high scores in the range 0.80–0.90 and retained their performance on the internal validation set.
Fig. 3
Fig. 3. Algorithm agreement with reference standards and comparison to pathologists.
Algorithms’ agreement (quadratically weighted κ) with reference standards established by uropathologists, shown for the internal and external validation sets (left). On subsets of the internal and US external validation sets, the agreement of general pathologists with the reference standards is additionally shown for comparison (right).
Fig. 4
Fig. 4. Algorithm performance in detecting prostate tumors on validation sets and comparison to pathologists.
a–c, The sensitivity and specificity of the algorithms relative to reference standards established by uropathologists, shown for the internal validation set (a), the US external validation set (b) and the EU external validation set (c). d,e, On subsets of the internal and US external validation sets, the sensitivity and specificity of general pathologists are also shown for comparison: international pathologists (d) and US pathologists (e).
Fig. 5
Fig. 5. ISUP GG assignment by algorithms and pathologists.
a,b, Algorithms compared with international general pathologists on a subset of the internal validation set (a) and with US general pathologists on a subset of the US external validation set (b). Cases are ordered primarily by the reference ISUP GG and secondarily by the average GG of the algorithms and pathologists (see the sketch below). Algorithms and pathologists are ordered by their agreement (quadratically weighted κ) with the reference standard on the respective sets. The comparison between pathologists and algorithms gives insight into the differences in their operating points and shows for which GGs most miscalls are made. The algorithms are less likely to miss a biopsy containing cancer, but at the same time more likely to overgrade benign cases.
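As a small illustration of the case ordering described in the caption, the sketch below assumes a hypothetical table with one row per biopsy, holding the reference GG and each rater's (algorithm's or pathologist's) assigned GG; the column names and values are illustrative only:

    # Order cases primarily by reference ISUP GG, secondarily by the mean
    # GG assigned by the raters (data and column names are hypothetical).
    import pandas as pd

    cases = pd.DataFrame({
        "reference_gg": [2, 0, 2, 5],
        "rater_a": [2, 1, 3, 5],
        "rater_b": [1, 0, 2, 4],
    })
    cases["mean_assigned_gg"] = cases[["rater_a", "rater_b"]].mean(axis=1)
    ordered = cases.sort_values(["reference_gg", "mean_assigned_gg"])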
Extended Data Fig. 1
Extended Data Fig. 1. Flow charts of inclusion and exclusion for the various datasets.
(a) Data originating from Radboud University Medical Center (development, tuning and internal validation sets, and international pathologist comparison), (b) Data originating from Karolinska Institutet (development, tuning and internal validation sets), (c) Data originating from the United States (US external validation set, and US pathologist comparison), (d) Data originating from Karolinska University Hospital (EU external validation set).
Extended Data Fig. 2
Extended Data Fig. 2. Individual algorithms’ agreement with the reference standard for the validation sets.
Concordance with ISUP GG of the reference standard (Cohen’s quadratically weighted kappa with 95% CI over cases) is shown for each algorithm on each validation set. The dashed line indicates the mean of all teams on the validation set in question.
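A 95% CI "over cases" can be obtained, for example, by a nonparametric bootstrap over biopsies; the sketch below assumes that approach (the caption does not specify the authors' exact CI procedure) and uses hypothetical inputs:

    # Case-level bootstrap CI for quadratically weighted kappa. The resampling
    # scheme is an assumption; inputs are hypothetical arrays of ISUP GGs.
    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    def bootstrap_kappa_ci(reference, predicted, n_boot=2000, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        reference = np.asarray(reference)
        predicted = np.asarray(predicted)
        n = len(reference)
        stats = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)  # resample cases with replacement
            stats.append(cohen_kappa_score(reference[idx], predicted[idx],
                                           weights="quadratic"))
        # With validation sets of this size, degenerate resamples (a single
        # grade present) are unlikely; guard against them in production code.
        lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return lower, upper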
Extended Data Fig. 3
Extended Data Fig. 3. Individual algorithms’ sensitivity and specificity for the validation sets.
Performance in detecting biopsies containing cancer (sensitivity and specificity with 95% CI over cases) is shown for each algorithm on each validation set.
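Sensitivity and specificity here refer to binary tumor detection. A minimal sketch, assuming the convention that a biopsy with ISUP GG ≥ 1 counts as cancer-positive and GG 0 as benign, on hypothetical labels:

    # Sensitivity and specificity of cancer detection, binarizing ISUP GGs
    # at >= 1 (an assumed convention); inputs are hypothetical arrays.
    import numpy as np

    def detection_sens_spec(reference_gg, predicted_gg):
        ref_pos = np.asarray(reference_gg) >= 1
        pred_pos = np.asarray(predicted_gg) >= 1
        sensitivity = (ref_pos & pred_pos).sum() / ref_pos.sum()
        specificity = (~ref_pos & ~pred_pos).sum() / (~ref_pos).sum()
        return sensitivity, specificity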
Extended Data Fig. 4
Extended Data Fig. 4. Visualization of grade assignment by algorithms for the internal validation set.
Cases are ordered by the reference ISUP grade group and average grade group of the AI cohort.
Extended Data Fig. 5
Extended Data Fig. 5. Visualization of grade assignment by algorithms for the US external validation set.
Cases are ordered by the reference ISUP grade group and average grade group of the AI cohort.
Extended Data Fig. 6
Extended Data Fig. 6. Visualization of grade assignment by algorithms for the EU external validation set.
Cases are ordered by the reference ISUP grade group and average grade group of the AI cohort.
Extended Data Fig. 7
Extended Data Fig. 7. Comparison of challenge algorithms to prior work.
The performance of the teams’ algorithms was computed on validation (sub)sets of earlier work. For each validation set, we additionally show the performance of the original algorithm.
