Review
Nat Methods. 2024 Feb;21(2):182-194.
doi: 10.1038/s41592-023-02150-0. Epub 2024 Feb 12.

Understanding metric-related pitfalls in image analysis validation

Annika Reinke #  1   2   3 Minu D Tizabi #  4   5 Michael Baumgartner  6 Matthias Eisenmann  7 Doreen Heckmann-Nötzel  7   8 A Emre Kavur  7   6   9 Tim Rädsch  7   10 Carole H Sudre  11   12 Laura Acion  13 Michela Antonelli  12   14 Tal Arbel  15 Spyridon Bakas  16   17 Arriel Benis  18   19 Florian Buettner  20   21   22   23   24 M Jorge Cardoso  12 Veronika Cheplygina  25 Jianxu Chen  26 Evangelia Christodoulou  7 Beth A Cimini  27 Keyvan Farahani  28 Luciana Ferrer  29 Adrian Galdran  30   31 Bram van Ginneken  32   33 Ben Glocker  34 Patrick Godau  7   35   8 Daniel A Hashimoto  36   37 Michael M Hoffman  38   39   40   41 Merel Huisman  42 Fabian Isensee  6   9 Pierre Jannin  43   44 Charles E Kahn  45 Dagmar Kainmueller  46   47 Bernhard Kainz  48   49 Alexandros Karargyris  50 Jens Kleesiek  51 Florian Kofler  52 Thijs Kooi  53 Annette Kopp-Schneider  54 Michal Kozubek  55 Anna Kreshuk  56 Tahsin Kurc  57 Bennett A Landman  58 Geert Litjens  59 Amin Madani  60 Klaus Maier-Hein  6   61 Anne L Martel  39   62 Erik Meijering  63 Bjoern Menze  64 Karel G M Moons  65 Henning Müller  66   67 Brennan Nichyporuk  68 Felix Nickel  69 Jens Petersen  6 Susanne M Rafelski  70 Nasir Rajpoot  71 Mauricio Reyes  72   73 Michael A Riegler  74   75 Nicola Rieke  76 Julio Saez-Rodriguez  77   78 Clara I Sánchez  79 Shravya Shetty  80 Ronald M Summers  81 Abdel A Taha  82 Aleksei Tiulpin  83   84 Sotirios A Tsaftaris  85 Ben Van Calster  86   87 Gaël Varoquaux  88 Ziv R Yaniv  89 Paul F Jäger  90   91 Lena Maier-Hein  92   93   94   95   96

Abstract

Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.


Conflict of interest statement

The authors declare the following competing interests: F.B. is an employee of Siemens AG (Munich, Germany). B.v.G. is a shareholder of Thirona (Nijmegen, NL). B.G. is an employee of HeartFlow Inc (California, USA) and Kheiron Medical Technologies Ltd (London, UK). M.M.H. received an Nvidia GPU Grant. Th. K. is an employee of Lunit (Seoul, South Korea). G.L. is on the advisory board of Canon Healthcare IT (Minnetonka, USA) and is a shareholder of Aiosyn BV (Nijmegen, NL). Na.R. is the founder and CSO of Histofy (New York, USA). Ni.R. is an employee of Nvidia GmbH (Munich, Germany). J.S.-R. reports funding from GSK (Heidelberg, Germany), Pfizer (New York, USA) and Sanofi (Paris, France) and fees from Travere Therapeutics (California, USA), Stadapharm (Bad Vilbel, Germany), Astex Therapeutics (Cambridge, UK), Pfizer (New York, USA), and Grunenthal (Aachen, Germany). R.M.S. receives patent royalties from iCAD (New Hampshire, USA), ScanMed (Nebraska, USA), Philips (Amsterdam, NL), Translation Holdings (Alabama, USA) and PingAn (Shenzhen, China); his lab received research support from PingAn through a Cooperative Research and Development Agreement. S.A.T. receives financial support from Canon Medical Research Europe (Edinburgh, Scotland). The remaining authors declare no competing interests.

Figures

Extended Data Fig. 1. [P2.2] Disregard of the properties of the target structures.
(a) Small structure sizes. The predictions of two algorithms (Prediction 1/2) differ in only a single pixel. In the case of the small structure (bottom row), this has a substantial effect on the corresponding Dice Similarity Coefficient (DSC) metric value (similar for the Intersection over Union (IoU)). This pitfall is also relevant for other overlap-based metrics such as the centerline Dice Similarity Coefficient (clDice), and localization criteria such as Box/Approx/Mask IoU and Intersection over Reference (IoR). (b) Complex structure shapes. Common overlap-based metrics (here: DSC) are unaware of complex structure shapes and treat Predictions 1 and 2 equally. The clDice uncovers the fact that Prediction 1 misses the fine-granular branches of the reference and favors Prediction 2, which focuses on the center line of the object. This pitfall is also relevant for other overlap-based metrics such as IoU and the pixel-level Fβ Score, as well as localization criteria such as Box/Approx/Mask IoU, Center Distance, Mask IoU > 0, Point inside Mask/Box/Approx, and IoR.
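The size effect in (a) can be reproduced in a few lines of Python; the sketch below uses assumed toy masks rather than the figure's data:

```python
# Toy illustration (assumed masks): a one-pixel error barely affects the DSC of a
# large structure but substantially lowers it for a small structure.
import numpy as np

def dice(pred: np.ndarray, ref: np.ndarray) -> float:
    """DSC = 2 * |pred ∩ ref| / (|pred| + |ref|) for binary masks."""
    intersection = np.logical_and(pred, ref).sum()
    return 2.0 * intersection / (pred.sum() + ref.sum())

large_ref = np.zeros((20, 20), dtype=bool)
large_ref[5:15, 5:15] = True   # 100-pixel structure
small_ref = np.zeros((20, 20), dtype=bool)
small_ref[9:11, 9:11] = True   # 4-pixel structure

# Each prediction misses exactly one pixel of its reference.
large_pred = large_ref.copy(); large_pred[5, 5] = False
small_pred = small_ref.copy(); small_pred[9, 9] = False

print(f"DSC, large structure: {dice(large_pred, large_ref):.3f}")  # ~0.995
print(f"DSC, small structure: {dice(small_pred, small_ref):.3f}")  # ~0.857
```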
Extended Data Fig. 2. [P2.4] Disregard of the properties of the algorithm output.
(a) Possibility of overlapping predictions. If multiple structures of the same type can be seen within the same image (here: reference objects R1 and R2), it is generally advisable to phrase the problem as instance segmentation (InS; right) rather than semantic segmentation (SemS; left). This way, issues with boundary-based metrics resulting from comparing a given structure boundary to the boundary of the wrong instance in the reference can be avoided. In the provided example, the distance of the red boundary pixel to the reference, as measured by a boundary-based metric in SemS problems, would be zero, because different instances of the same structure cannot be distinguished. This problem is overcome by phrasing the problem as InS. In this case, (only) the boundary of the matched instance (here: R2) is considered for distance computation. (b) Possibility of empty prediction or reference. Each column represents a potential scenario for per-image validation of objects, categorized by whether True Positives (TPs), False Negatives (FNs), and False Positives (FPs) are present (n > 0) or not (n = 0) after matching/assignment. The sketches on the top showcase each scenario when setting “n > 0” to “n = 1”. For each scenario, Sensitivity, Positive Predictive Value (PPV), and the F1 Score are calculated. Some scenarios yield undefined values (Not a Number (NaN)).
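As a rough sketch of the undefined cases in (b), with assumed counts rather than the figure's scenarios, NaN values arise whenever a metric's denominator is zero:

```python
# Sketch (assumed counts): per-image Sensitivity, PPV and F1 Score become
# undefined (NaN) when their denominators are zero, e.g. for empty images.
import math

def safe_div(num: float, den: float) -> float:
    return num / den if den > 0 else math.nan

def per_image_metrics(tp: int, fp: int, fn: int) -> dict:
    return {
        "Sensitivity": safe_div(tp, tp + fn),      # NaN if the reference is empty
        "PPV": safe_div(tp, tp + fp),              # NaN if the prediction is empty
        "F1": safe_div(2 * tp, 2 * tp + fp + fn),  # NaN if both are empty
    }

print(per_image_metrics(tp=0, fp=0, fn=0))  # all NaN: empty reference and prediction
print(per_image_metrics(tp=0, fp=1, fn=0))  # Sensitivity NaN: empty reference only
```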
Figure 1:
Examples of metric-related pitfalls in image analysis validation. (A) Medical image analysis example: Voxel-based metrics are not appropriate for detection problems. Measuring the voxel-level performance of a prediction yields a near-perfect Sensitivity. However, the Sensitivity at the instance level reveals that lesions are actually missed by the algorithm. (B) Biological image analysis example: The task of predicting fibrillarin in the dense fibrillary component of the nucleolus should be phrased as a segmentation task, for which segmentation metrics reveal the low quality of the prediction. Phrasing the task as image reconstruction instead and validating it using metrics such as the Pearson Correlation Coefficient yields misleadingly high metric scores [4, 26, 29, 36].
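For panel (A), the gap between voxel-level and lesion-level Sensitivity can be illustrated with a minimal Python sketch; the masks below are assumed toy data, not the paper's example:

```python
# Sketch (assumed toy data): near-perfect voxel-level Sensitivity can hide
# missed lesions at the instance level.
import numpy as np
from scipy import ndimage  # assumes SciPy is available

ref = np.zeros((40, 40), dtype=bool)
ref[5:35, 5:35] = True      # large lesion (900 voxels)
ref[1, 1] = True            # tiny lesion, missed by the prediction
ref[38, 38] = True          # tiny lesion, missed by the prediction

pred = np.zeros_like(ref)
pred[5:35, 5:35] = True     # only the large lesion is detected

# Voxel-level Sensitivity: TP voxels / reference voxels.
voxel_sensitivity = np.logical_and(pred, ref).sum() / ref.sum()

# Instance-level Sensitivity: fraction of reference lesions overlapped by the prediction.
labels, n_lesions = ndimage.label(ref)
detected = sum(np.logical_and(labels == i, pred).any() for i in range(1, n_lesions + 1))
instance_sensitivity = detected / n_lesions

print(f"Voxel-level Sensitivity:    {voxel_sensitivity:.3f}")   # ~0.998
print(f"Instance-level Sensitivity: {instance_sensitivity:.3f}")  # 0.333
```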
Figure 2:
Overview of the taxonomy for metric-related pitfalls. Pitfalls can be grouped into three main categories: [P1] Pitfalls related to the inadequate choice of the problem category, [P2] pitfalls related to poor metric selection, and [P3] pitfalls related to poor metric application. [P2] and [P3] are further split into subcategories. For all categories, pitfall sources are presented (green), with references to corresponding illustrations of representative examples. Note that the order in which the pitfall sources are presented does not correlate with importance.
Figure 3:
[P1] Pitfalls related to the inadequate choice of the problem category. Wrong choice of problem category. Effect of using segmentation metrics for object detection problems. The pixel-level Dice Similarity Coefficient (DSC) of a prediction recognizing every structure (Prediction 2) is lower than that of a prediction that only recognizes one of the three structures (Prediction 1).
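A minimal sketch with assumed toy masks (not the figure's data) of how the pixel-level DSC can rank a prediction that misses two of three structures above one that finds them all:

```python
# Toy sketch (assumed masks): pixel-level DSC rewards getting the large structure
# exactly right more than detecting all structures with imprecise boundaries.
import numpy as np

def dice(pred, ref):
    inter = np.logical_and(pred, ref).sum()
    return 2.0 * inter / (pred.sum() + ref.sum())

ref = np.zeros((40, 40), dtype=bool)
ref[5:25, 5:25] = True    # large structure (400 px)
ref[30:33, 30:33] = True  # small structure (9 px)
ref[2:5, 35:38] = True    # small structure (9 px)

# Prediction 1: segments only the large structure, perfectly.
pred1 = np.zeros_like(ref); pred1[5:25, 5:25] = True

# Prediction 2: finds all three structures, but the large one is shifted by 2 px.
pred2 = np.zeros_like(ref)
pred2[7:27, 7:27] = True
pred2[30:33, 30:33] = True
pred2[2:5, 35:38] = True

print(f"DSC Prediction 1 (misses 2 of 3 structures): {dice(pred1, ref):.3f}")  # ~0.978
print(f"DSC Prediction 2 (detects all structures):   {dice(pred2, ref):.3f}")  # ~0.818
```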
Figure 4: [P2.1] Disregard of the domain interest.
(a) Importance of structure boundaries. The predictions of two algorithms (Prediction 1/2) capture the boundary of the given structure substantially differently, but lead to the exact same Dice Similarity Coefficient (DSC), owing to its lack of boundary awareness. This pitfall is also relevant for other overlap-based metrics such as the centerline Dice Similarity Coefficient (clDice), pixel-level Fβ Score, and Intersection over Union (IoU), as well as localization criteria such as Box/Approx/Mask IoU, Center Distance, Mask IoU > 0, Point inside Mask/Box/Approx, and Intersection over Reference (IoR). (b) Unequal severity of class confusions. When predicting the severity of a disease for three patients in an ordinal classification problem, Prediction 1 assumes a much lower severity for Patient 3 than actually observed. This critical issue is overlooked by common metrics (here: Accuracy), which show no difference between Prediction 1 and Prediction 2, although the latter assesses the severity much better. Metrics with predefined weights (here: Expected Cost (EC)) correctly penalize Prediction 1 much more than Prediction 2. This pitfall is also relevant for other counting metrics, such as Balanced Accuracy (BA), Fβ Score, Positive Likelihood Ratio (LR+), Matthews Correlation Coefficient (MCC), Net Benefit (NB), Negative Predictive Value (NPV), Positive Predictive Value (PPV), Sensitivity, and Specificity.
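For panel (b), the contrast between Accuracy and a cost-aware metric can be sketched as follows; the cost matrix and predictions are assumed for illustration, and a per-sample average cost is used as a simplified stand-in for the EC:

```python
# Hedged sketch (assumed values): Accuracy treats all class confusions equally,
# whereas an ordinal-aware cost penalizes severe confusions more strongly.
import numpy as np

# Severity classes 0 (mild) .. 3 (critical); cost grows with the ordinal distance.
cost = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))  # cost[true, pred] = |true - pred|

reference    = np.array([1, 2, 3])   # three patients
prediction_1 = np.array([1, 2, 0])   # grossly underestimates Patient 3
prediction_2 = np.array([1, 2, 2])   # only slightly underestimates Patient 3

def accuracy(pred, ref):
    return np.mean(pred == ref)

def mean_cost(pred, ref, cost):
    # Simplified per-sample average cost (stand-in for the prior-weighted EC).
    return np.mean(cost[ref, pred])

for name, pred in [("Prediction 1", prediction_1), ("Prediction 2", prediction_2)]:
    print(f"{name}: Accuracy = {accuracy(pred, reference):.2f}, "
          f"mean cost = {mean_cost(pred, reference, cost):.2f}")
# Both predictions reach Accuracy 0.67, but Prediction 1 incurs the higher (worse) cost.
```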
Figure 5: [P2.3] Disregard of the properties of the data set.
(a) High class imbalance. In the case of underrepresented classes, common metrics may yield misleading values. In the given example, Accuracy and Balanced Accuracy (BA) yield high scores despite the high number of False Positive (FP) samples. The class imbalance is only uncovered by metrics considering predictive values (here: Matthews Correlation Coefficient (MCC)). This pitfall is also relevant for other counting and multi-threshold metrics such as Area under the Receiver Operating Characteristic Curve (AUROC), Expected Cost (EC) (depending on the chosen costs), Positive Likelihood Ratio (LR+), Net Benefit (NB), Sensitivity, Specificity, and Weighted Cohen’s Kappa (WCK). (b) Small test set size. The values of the Expected Calibration Error (ECE) depend on the sample size. Even for a simulated perfectly calibrated model, the ECE will be substantially greater than zero for small sample sizes [14]. (c) Imperfect reference standard. A single erroneously annotated pixel may lead to a large decrease in performance, especially in the case of the Hausdorff Distance (HD) when applied to small structures. The Hausdorff Distance 95th Percentile (HD95), on the other hand, was designed to deal with spatial outliers. This pitfall is also relevant for localization criteria such as Box/Approx Intersection over Union (IoU) and Point inside Box/Approx. Further abbreviations: True Positive (TP), False Negative (FN), True Negative (TN).
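For panel (a), the effect can be sketched with assumed counts (9 TP, 1 FN, 90 FP, 900 TN, not the figure's numbers), using standard scikit-learn metrics:

```python
# Illustrative sketch with made-up counts: with a rare positive class, Accuracy
# and Balanced Accuracy can stay high despite many false positives; a metric
# using predictive values (here: MCC) exposes this.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

# 10 positives (9 TP, 1 FN) and 990 negatives (900 TN, 90 FP).
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.array([1] * 9 + [0] * 1 + [0] * 900 + [1] * 90)

print(f"Accuracy:          {accuracy_score(y_true, y_pred):.3f}")           # ~0.909
print(f"Balanced Accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")  # ~0.905
print(f"MCC:               {matthews_corrcoef(y_true, y_pred):.3f}")        # ~0.270
```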
Figure 6: [P3] Pitfalls related to poor metric application.
(a) Non-standardized metric implementation. In the case of the Average Precision (AP) metric and the construction of the Precision-Recall (PR) curve, the strategy for handling identical scores (here: a confidence score of 0.80 is present twice) has a substantial impact on the metric scores. Microsoft Common Objects in Context (COCO) [20] and CityScapes [7] are used as examples. (b) Non-independence of test cases. The number of images taken from Patient 1 is much higher than the number acquired from Patients 2–5. Averaging over all Dice Similarity Coefficient (DSC) values, denoted by ∅, results in a high aggregated score. Aggregating metric values per patient reveals much higher scores for Patient 1 compared to the others, which would have been hidden by simple aggregation. (c) Uninformative visualization. A single box plot (left) does not give sufficient information about the raw metric value distribution. Adding the raw metric values as jittered dots on top (right) adds important information (here: on clusters). In the case of non-independent validation data, color/shape-coding helps reveal data clusters.
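For panel (b), the difference between pooled and per-patient aggregation can be sketched with assumed DSC values (not the figure's data):

```python
# Minimal sketch (assumed values): pooling all per-image DSC scores hides the
# fact that most images come from a single, easy patient; aggregating per
# patient first reveals the difference.
import numpy as np

dsc_per_image = {
    "Patient 1": [0.95] * 20,  # 20 images with high scores
    "Patient 2": [0.40],
    "Patient 3": [0.45],
    "Patient 4": [0.50],
    "Patient 5": [0.35],
}

pooled_mean = np.mean([v for scores in dsc_per_image.values() for v in scores])
per_patient_means = {p: np.mean(s) for p, s in dsc_per_image.items()}
hierarchical_mean = np.mean(list(per_patient_means.values()))

print(f"Pooled mean over all images: {pooled_mean:.3f}")        # ~0.863
print(f"Mean of per-patient means:   {hierarchical_mean:.3f}")  # ~0.530
```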

Update of

  • Understanding metric-related pitfalls in image analysis validation.
    Reinke A, Tizabi MD, Baumgartner M, Eisenmann M, Heckmann-Nötzel D, Kavur AE, Rädsch T, Sudre CH, Acion L, Antonelli M, Arbel T, Bakas S, Benis A, Blaschko M, Buettner F, Cardoso MJ, Cheplygina V, Chen J, Christodoulou E, Cimini BA, Collins GS, Farahani K, Ferrer L, Galdran A, van Ginneken B, Glocker B, Godau P, Haase R, Hashimoto DA, Hoffman MM, Huisman M, Isensee F, Jannin P, Kahn CE, Kainmueller D, Kainz B, Karargyris A, Karthikesalingam A, Kenngott H, Kleesiek J, Kofler F, Kooi T, Kopp-Schneider A, Kozubek M, Kreshuk A, Kurc T, Landman BA, Litjens G, Madani A, Maier-Hein K, Martel AL, Mattson P, Meijering E, Menze B, Moons KGM, Müller H, Nichyporuk B, Nickel F, Petersen J, Rafelski SM, Rajpoot N, Reyes M, Riegler MA, Rieke N, Saez-Rodriguez J, Sánchez CI, Shetty S, van Smeden M, Summers RM, Taha AA, Tiulpin A, Tsaftaris SA, Calster BV, Varoquaux G, Wiesenfarth M, Yaniv ZR, Jäger PF, Maier-Hein L. Reinke A, et al. ArXiv [Preprint]. 2024 Feb 23:arXiv:2302.01790v4. ArXiv. 2024. Update in: Nat Methods. 2024 Feb;21(2):182-194. doi: 10.1038/s41592-023-02150-0. PMID: 36945687 Free PMC article. Updated. Preprint.

References

    1. Bilic Patrick, Christ Patrick, Li Hongwei Bran, Vorontsov Eugene, Ben-Cohen Avi, Kaissis Georgios, Szeskin Adi, Jacobs Colin, Mamani Gabriel Efrain Humpire, Chartrand Gabriel, et al. The Liver Tumor Segmentation Benchmark (LiTS). Medical Image Analysis, 84:102680, 2023. - PMC - PubMed
    2. Brown Bernice B. Delphi process: a methodology used for the elicitation of opinions of experts. Technical report, RAND Corp, Santa Monica, CA, 1968.
    3. Carbonell Alberto, De la Pena Marcos, Flores Ricardo, and Gago Selma. Effects of the trinucleotide preceding the self-cleavage site on eggplant latent viroid hammerheads: differences in co- and post-transcriptional self-cleavage may explain the lack of trinucleotide AUC in most natural hammerheads. Nucleic Acids Research, 34(19):5613–5622, 2006. - PMC - PubMed
    4. Chen Jianxu, Ding Liya, Viana Matheus P, Lee HyeonWoo, Sluezwski M Filip, Morris Benjamin, Hendershott Melissa C, Yang Ruian, Mueller Irina A, and Rafelski Susanne M. The Allen Cell and Structure Segmenter: a new open source toolkit for segmenting 3D intracellular structures in fluorescence microscopy images. bioRxiv, page 491035, 2020.
    5. Chicco Davide and Jurman Giuseppe. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1):1–13, 2020. - PMC - PubMed
