Review. Nat Methods. 2024 Feb;21(2):195-212. doi: 10.1038/s41592-023-02151-z. Epub 2024 Feb 12.

Metrics reloaded: recommendations for image analysis validation

Lena Maier-Hein#, Annika Reinke#, Patrick Godau, Minu D. Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, A. Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, Tim Rädsch, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew B. Blaschko, M. Jorge Cardoso, Veronika Cheplygina, Beth A. Cimini, Gary S. Collins, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, Robert Haase, Daniel A. Hashimoto, Michael M. Hoffman, Merel Huisman, Pierre Jannin, Charles E. Kahn, Dagmar Kainmueller, Bernhard Kainz, Alexandros Karargyris, Alan Karthikesalingam, Florian Kofler, Annette Kopp-Schneider, Anna Kreshuk, Tahsin Kurc, Bennett A. Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Anne L. Martel, Peter Mattson, Erik Meijering, Bjoern Menze, Karel G. M. Moons, Henning Müller, Brennan Nichyporuk, Felix Nickel, Jens Petersen, Nasir Rajpoot, Nicola Rieke, Julio Saez-Rodriguez, Clara I. Sánchez, Shravya Shetty, Maarten van Smeden, Ronald M. Summers, Abdel A. Taha, Aleksei Tiulpin, Sotirios A. Tsaftaris, Ben Van Calster, Gaël Varoquaux, Paul F. Jäger

# These authors contributed equally.
Lena Maier-Hein et al. Nat Methods. 2024 Feb.

Abstract

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint-a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.


Conflict of interest statement


The authors declare the following competing interests: Under his terms of employment, M.B.B. is entitled to stock options in Mona.health, a KU Leuven spinoff. F.B. is an employee of Siemens AG (Munich, Germany). F.B. reports funding from Merck (Darmstadt, Germany). B.v.G. is a shareholder of Thirona (Nijmegen, NL). B.G. was an employee of HeartFlow Inc. (California, USA) and Kheiron Medical Technologies Ltd (London, UK). M.M.H. received an Nvidia GPU Grant. B.K. is a consultant for ThinkSono Ltd (London, UK). G.L. is on the advisory board of Canon Healthcare IT (Minnetonka, USA) and is a shareholder of Aiosyn BV (Nijmegen, NL). N.R. is an employee of Nvidia GmbH (Munich, Germany). J.S.-R. reports funding from GSK (Heidelberg, Germany), Pfizer (New York, USA) and Sanofi (Paris, France) and fees from Travere Therapeutics (California, USA), Stadapharm (Bad Vilbel, Germany), Astex Therapeutics (Cambridge, UK), Pfizer (New York, USA) and Grunenthal (Aachen, Germany). R.M.S. receives patent royalties from iCAD (New Hampshire, USA), ScanMed (Nebraska, USA), Philips (Amsterdam, NL), Translation Holdings (Alabama, USA) and PingAn (Shenzhen, China); his lab received research support from PingAn through a Cooperative Research and Development Agreement. S.A.T. receives financial support from Canon Medical Research Europe (Edinburgh, Scotland). The remaining authors declare no competing interests.

Figures

Extended Data Fig. 1: Subprocess S1 for selecting a problem category. The Category Mapping maps a given research problem to the appropriate problem category with the goal of grouping problems by similarity of validation. The leaf nodes represent the categories: image-level classification, object detection, instance segmentation, or semantic segmentation. FP2.1 refers to fingerprint 2.1 (see Fig. SN 1.10). An overview of the symbols used in the process diagram is provided in Fig. SN 5.1.
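The Category Mapping above can be thought of as a small decision function. The sketch below is purely illustrative: the property names are hypothetical stand-ins, not the framework's actual fingerprint items (which are defined in Suppl. Note 2).

```python
# Hypothetical sketch of the Category Mapping logic in Extended Data Fig. 1.
# The argument names are illustrative, not the framework's fingerprint items.

def map_category(target_level: str, pixel_accurate_outlines: bool = False) -> str:
    """Map a research problem to one of the four Metrics Reloaded categories,
    based on the level at which the categorical decision is made."""
    if target_level == "image":
        # One label (or label set) per image.
        return "image-level classification"
    if target_level == "object":
        # Object-level decisions with pixel-accurate outlines form instance
        # segmentation; rough locations suffice for object detection.
        return "instance segmentation" if pixel_accurate_outlines else "object detection"
    if target_level == "pixel":
        # Per-pixel class labels without distinguishing individual objects.
        return "semantic segmentation"
    raise ValueError(f"unknown target level: {target_level!r}")
```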
Extended Data Fig. 2: Subprocess S2 for selecting multi-class metrics (if any). Applies to: image-level classification (ImLC). If class imbalance is present and no compensation for it is requested, follow the “No” branch. Decision guides are provided in Suppl. Note 2.7.1. A detailed description of the subprocess is given in Suppl. Note 2.2.
Extended Data Fig. 3: Subprocess S3 for selecting a per-class counting metric (if any). Applies to: image-level classification (ImLC), object detection (ObD), and instance segmentation (InS). Decision guides are provided in Suppl. Note 2.7.2. A detailed description of the subprocess is given in Suppl. Notes 2.2, 2.4, and 2.5.
Extended Data Fig. 4: Subprocess S4 for selecting a multi-threshold metric (if any). Applies to: image-level classification (ImLC), object detection (ObD), and instance segmentation (InS). Decision guides are provided in Suppl. Note 2.7.3. A detailed description of the subprocess is given in Suppl. Notes 2.2, 2.4, and 2.5.
Extended Data Fig. 5: Subprocess S5 for selecting a calibration metric (if any). Applies to: image-level classification (ImLC). Decision guides are provided in Suppl. Note 2.7.4. A detailed description of the subprocess is given in Suppl. Note 2.6. Further suggested calibration metrics include the calibration loss [8], calibration slope [76], Expected Calibration Index (ECI) [87] and Observed:Expected ratio (O:E ratio) [68].
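Of the calibration metrics listed above, the Observed:Expected (O:E) ratio is the simplest to state: the observed event rate divided by the mean predicted probability, with values near 1 indicating good calibration in the large. A minimal sketch (the function name is ours, not from the framework):

```python
import numpy as np

def oe_ratio(y_true, y_prob):
    """Observed:Expected (O:E) ratio for binary outcomes.

    Observed event rate divided by the mean predicted probability.
    Values near 1 indicate good calibration in the large; values > 1
    mean predicted risks are too low on average, values < 1 too high.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return y_true.mean() / y_prob.mean()
```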
Extended Data Fig. 6: Subprocess S6 for selecting overlap-based segmentation metrics (if any). Applies to: semantic segmentation (SemS) and instance segmentation (InS). Decision guides are provided in Suppl. Note 2.7.5. A detailed description of the subprocess is given in Suppl. Notes 2.3 and 2.5.
Extended Data Fig. 7: Subprocess S7 for selecting a boundary-based segmentation metric (if any). Applies to: semantic segmentation (SemS) and instance segmentation (InS). Decision guides are provided in Suppl. Note 2.7.6. A detailed description of the subprocess is given in Suppl. Notes 2.3 and 2.5.
Extended Data Fig. 8: Subprocess S8 for selecting the localization criterion. Applies to: object detection (ObD) and instance segmentation (InS). Definitions of the localization criteria can be found in [66]. Decision guides are provided in Suppl. Note 2.7.7. A detailed description of the subprocess is given in Suppl. Notes 2.4 and 2.5.
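One widely used localization criterion in object detection is the Intersection over Union (IoU) of axis-aligned bounding boxes: a prediction is accepted as localizing a reference object if their IoU exceeds a threshold. A minimal sketch of box IoU (illustrative only; the framework covers further criteria):

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of areas minus the intersection counted twice.
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```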
Extended Data Fig. 9: Subprocess S9 for selecting the assignment strategy. Applies to: object detection (ObD) and instance segmentation (InS). Assignment strategies are defined in [66]. Decision guides are provided in Suppl. Note 2.7.8. A detailed description of the subprocess is given in Suppl. Notes 2.4 and 2.5.
Figure 1: Contributions of the Metrics Reloaded framework.
a) Motivation: Common problems related to metrics typically arise from (top left) an inappropriate choice of problem category (here: object detection confused with semantic segmentation), (top right) poor metric selection (here: neglecting the small size of structures) and (bottom) poor metric application (here: an inappropriate aggregation scheme). Pitfalls are highlighted by lightning bolts; ∅ denotes the mean Dice Similarity Coefficient (DSC). Green metric values indicate good performance, red values poor performance. Green check marks indicate desirable metric behavior, red crosses undesirable behavior. b) Metrics Reloaded addresses these pitfalls. (1) To enable the selection of metrics that match the domain interest, the framework is based on the new concept of problem fingerprinting, i.e., the generation of a structured representation of the given biomedical problem that captures all properties relevant for metric selection. Based on the problem fingerprint, Metrics Reloaded guides the user through the process of metric selection and application while raising awareness of relevant pitfalls. (2) An instantiation of the framework for common biomedical use cases demonstrates its broad applicability. (3) A publicly available online tool facilitates application of the framework.
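The DSC referenced in the caption compares a predicted binary mask A with a reference mask B as 2|A∩B| / (|A| + |B|); the ∅ value is then the mean over cases. A minimal sketch (the empty-mask convention of returning 1.0 when both masks are empty is one common choice, not prescribed here):

```python
import numpy as np

def dice(pred, ref):
    """Dice Similarity Coefficient 2|A∩B| / (|A| + |B|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Both masks empty: define perfect agreement (one common convention).
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom
```

The per-case values would then typically be aggregated, e.g. `np.mean([dice(p, r) for p, r in cases])`; as panel (bottom) of the figure illustrates, the choice of aggregation scheme itself is a potential pitfall.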
Figure 2: Metrics Reloaded recommendation framework from a user perspective.
In step 1 - problem fingerprinting, the given biomedical image analysis problem is mapped to the appropriate image problem category, namely image-level classification (ImLC), semantic segmentation (SemS), object detection (ObD), or instance segmentation (InS) (Fig. 4). The problem category and further characteristics of the given biomedical problem relevant for metric selection are then captured in a problem fingerprint (Fig. 3). In step 2 - metric selection, the user follows the respective coloured path of the chosen problem category (ImLC →, SemS →, ObD →, or InS →) to select a suitable pool of metrics from the Metrics Reloaded pools shown in green. When a tree branches, the fingerprint items determine which exact path to take. Finally, in step 3 - metric application, the user is supported in applying the metrics to a given data set. During the traversal of the decision tree, the user goes through subprocesses, indicated by the ⊞-symbol, which are provided in Extended Data Figs. 1–9 and represent relevant steps in the metric selection process. Ambiguities related to metric selection are resolved via decision guides (Suppl. Note 2.7) that help users make an educated decision when multiple options are possible. A comprehensive textual description of the recommendations for all four problem categories as well as for the selection of corresponding calibration metrics (if any) is provided in Suppl. Notes 2.2–2.6. An overview of the symbols used in the process diagram is provided in Fig. SN 5.1. Condensed versions of the mappings for every category can be found in Suppl. Note 2.2 for image-level classification, Suppl. Note 2.3 for semantic segmentation, Suppl. Note 2.4 for object detection, and Suppl. Note 2.5 for instance segmentation.
Figure 3: Relevant properties of a driving biomedical image analysis problem are captured by the problem fingerprint (selection for semantic segmentation shown here).
The fingerprint comprises a set of items, each of which represents a specific property of the problem, is either binary or categorical, and must be instantiated by the user. Besides the problem category, the fingerprint comprises domain interest-related, target structure-related, data set-related and algorithm output-related properties. A comprehensive version of the fingerprints for all problem categories can be found in Figs. SN 2.7–SN 2.9 (image-level classification), Figs. SN 2.10/SN 2.11 (semantic segmentation), Figs. SN 2.12–SN 2.14 (object detection) and Figs. SN 2.15–SN 2.17 (instance segmentation). Abbreviations: Prediction (Pred), Reference (Ref).
Figure 4: Metrics Reloaded fosters the convergence of validation methodology across modalities, application domains and classification scales.
The framework considers problems in which categorical target variables are to be predicted at image, object and/or pixel level, resulting (from top to bottom) in image-level classification, object detection, instance segmentation or semantic segmentation problems. These problem categories are relevant across modalities (here computed tomography (CT), microscopy and endoscopy) and application domains. From left to right: annotation of (left) benign and malignant lesions in CT images [3], (middle) different cell types in microscopy images [46], and (right) medical instruments in laparoscopy images [48].
Figure 5: Instantiation of the framework with recommendations for concrete biomedical questions.
From top to bottom: (1) Image classification for the examples of sperm motility classification [25] and disease classification in dermoscopic images [12, 84]. (2) Semantic segmentation of large objects for the examples of embryo segmentation from microscopy [79] and liver segmentation in computed tomography (CT) images [2, 73]. (3) Detection of multiple and arbitrarily located objects for the examples of cell detection and tracking during the autophagy process [56, 90] and multiple sclerosis (MS) lesion detection in multimodal brain magnetic resonance imaging (MRI) [14, 36]. (4) Instance segmentation of tubular objects for the examples of instance segmentation of neurons from the fruit fly [50, 54, 81] and surgical instrument instance segmentation [48]. The corresponding traversals through the decision trees are shown in Suppl. Note 4. An overview of the recommended metrics can be found in Suppl. Note 3.1, including relevant information for each metric.

References

    1. Adamson AS, Smith A. Machine learning and health care disparities in dermatology. 2018.
    2. Antonelli M, Reinke A, Bakas S, Farahani K, Kopp-Schneider A, Landman BA, Litjens G, Menze B, Ronneberger O, Summers RM, et al. The medical segmentation decathlon. Nat Commun. 2022;13(1):1–13.
    3. Armato SG 3rd, McLennan G, Bidaut L, McNitt-Gray MF, Meyer CR, Reeves AP, Zhao B, Aberle DR, Henschke CI, Hoffman EA, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med Phys. 2011;38(2):915–931.
    4. Badgeley MA, Zech JR, Oakden-Rayner L, Glicksberg BS, Liu M, Gale W, McConnell MV, Percha B, Snyder TM, Dudley JT. Deep learning predicts hip fracture using confounding patient and healthcare variables. npj Digit Med. 2019;2:31.
    5. Birhane A, Kalluri P, Card D, Agnew W, Dotan R, Bao M. The values encoded in machine learning research. arXiv; June 2021.