Nat Commun. 2018 Dec 6;9(1):5217. doi: 10.1038/s41467-018-07619-7.

Why rankings of biomedical image analysis competitions should be interpreted with care

Lena Maier-Hein et al. Nat Commun. 2018.

Erratum in

  • Author Correction: Why rankings of biomedical image analysis competitions should be interpreted with care.
    Maier-Hein L, Eisenmann M, Reinke A, Onogur S, Stankovic M, Scholz P, Arbel T, Bogunovic H, Bradley AP, Carass A, Feldmann C, Frangi AF, Full PM, van Ginneken B, Hanbury A, Honauer K, Kozubek M, Landman BA, März K, Maier O, Maier-Hein K, Menze BH, Müller H, Neher PF, Niessen W, Rajpoot N, Sharp GC, Sirinukunwattana K, Speidel S, Stock C, Stoyanov D, Taha AA, van der Sommen F, Wang CW, Weber MA, Zheng G, Jannin P, Kopp-Schneider A. Nat Commun. 2019 Jan 30;10(1):588. doi: 10.1038/s41467-019-08563-w. PMID: 30700735. Free PMC article.

Abstract

International challenges have become the standard for validation of biomedical image analysis methods. Given their scientific impact, it is surprising that a critical analysis of common practices related to the organization of challenges has not yet been performed. In this paper, we present a comprehensive analysis of biomedical image analysis challenges conducted up to now. We demonstrate the importance of challenges and show that the lack of quality control has critical consequences. First, reproducibility and interpretation of the results are often hampered, as only a fraction of relevant information is typically provided. Second, the rank of an algorithm is generally not robust to a number of variables, such as the test data used for validation, the ranking scheme applied, and the observers that make the reference annotations. To overcome these problems, we recommend best practice guidelines and define open research questions to be addressed in the future.

Conflict of interest statement

Henning Müller is on the advisory board of “Zebra Medical Vision”. Danail Stoyanov is a paid part-time member of Touch Surgery, Kinosis Ltd. The remaining authors declare no competing interests.

Figures

Fig. 1
Overview of biomedical image analysis challenges. a Number of competitions (challenges and tasks) organized per year, b fields of application, c algorithm categories assessed in the challenges, d imaging techniques applied, e number of training and test cases used, f most commonly applied metrics for performance assessment used in at least 5 tasks, and g platforms (e.g. conferences) used to organize the challenges for the years 2008, 2012, and 2016
Fig. 2
Robustness of rankings with respect to several challenge design choices. One data point corresponds to one segmentation task organized in 2015 (n = 56). The center line in the boxplots shows the median; the lower and upper borders of the box represent the first and third quartiles. The whiskers extend to the lowest value still within 1.5 interquartile range (IQR) of the first quartile, and the highest value still within 1.5 IQR of the third quartile. a Ranking (metric-based) with the standard Hausdorff Distance (HD) vs. its 95% variant (HD95). b Mean vs. median in metric-based ranking based on the HD. c Case-based (rank per case, then aggregate with mean) vs. metric-based (aggregate with mean, then rank) ranking in single-metric ranking based on the HD. d Metric values per algorithm and rankings for reference annotations performed by two different observers. In the box plots (a–c), descriptive statistics for Kendall's tau, which quantifies differences between rankings (1: identical ranking; −1: inverse ranking), are shown. Key examples (red circles) illustrate that slight changes in challenge design may lead to the worst algorithm (Ai: Algorithm i) becoming the winner (a) or to almost all teams changing their ranking position (d). Even for relatively high values of Kendall's tau (b: tau = 0.74; c: tau = 0.85), critical changes in the ranking may occur
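To make the contrast between the two single-metric ranking schemes in panel c concrete, the following Python sketch ranks three algorithms both ways and compares the resulting rankings with Kendall's tau. It is not the authors' evaluation code; the metric values are hypothetical and chosen so that the two schemes disagree.

```python
# Minimal sketch (hypothetical data) of case-based vs. metric-based ranking.
import numpy as np
from scipy.stats import kendalltau, rankdata

# scores[i, j]: Hausdorff Distance of algorithm i on test case j (lower is better).
scores = np.array([
    [10.0,  8.0, 40.0],   # A1
    [12.0,  9.0, 11.0],   # A2
    [11.0, 30.0, 12.0],   # A3
])

# Metric-based ranking: aggregate per algorithm first (mean over cases), then rank.
metric_based = rankdata(scores.mean(axis=1))           # rank 1 = smallest mean HD

# Case-based ranking: rank the algorithms per case first, then aggregate the ranks.
case_based = rankdata(rankdata(scores, axis=0).mean(axis=1))

# Kendall's tau quantifies how similar the two rankings are (1: identical, -1: inverse).
tau, _ = kendalltau(metric_based, case_based)
print("metric-based:", metric_based, "case-based:", case_based, "tau:", tau)
```

In this toy example the two schemes produce different winners, mirroring the kind of rank instability illustrated in the figure.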
Fig. 3
The ranking scheme is a deciding factor for the ranking robustness. The center line in the boxplots shows the median; the lower and upper borders of the box represent the first and third quartiles. The whiskers extend to the lowest value still within 1.5 interquartile range (IQR) of the first quartile, and the highest value still within 1.5 IQR of the third quartile. According to bootstrapping experiments with 2015 segmentation challenge data, single-metric rankings (those shown here are for the DSC) are significantly more robust when the mean rather than the median is used for aggregation (left) and when the ranking is performed after aggregation rather than before (right). One data point represents the robustness of one task, quantified by the percentage of simulations in bootstrapping experiments in which the winner remains the winner
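The robustness measure used here can be illustrated with a short bootstrap simulation: resample the test cases with replacement, recompute the ranking, and count how often the original winner stays the winner. The sketch below is only an approximation under stated assumptions (random DSC values, mean aggregation, 1000 bootstrap samples), not the challenge evaluation code.

```python
# Hedged sketch of the bootstrapping idea behind the robustness measure.
import numpy as np

rng = np.random.default_rng(0)

def winner(dsc_scores):
    """Index of the winning algorithm under mean aggregation (higher DSC is better)."""
    return int(np.argmax(dsc_scores.mean(axis=1)))

# dsc[i, j]: DSC of algorithm i on test case j (hypothetical values).
dsc = rng.uniform(0.6, 0.95, size=(5, 30))
original_winner = winner(dsc)

n_boot = 1000
stays_winner = 0
for _ in range(n_boot):
    # Resample test cases with replacement and recompute the winner.
    cases = rng.integers(0, dsc.shape[1], size=dsc.shape[1])
    if winner(dsc[:, cases]) == original_winner:
        stays_winner += 1

print(f"Winner remained the winner in {100 * stays_winner / n_boot:.1f}% of simulations")
```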
Fig. 4
Robustness of rankings with respect to the data used when a single-metric ranking scheme based on the Dice Similarity Coefficient (DSC) (left), the Hausdorff Distance (HD) (middle), or the 95% variant of the HD (right) is applied. One data point corresponds to one segmentation task organized in 2015 (n = 56). The center line in the boxplots shows the median; the lower and upper borders of the box represent the first and third quartiles. The whiskers extend to the lowest value still within 1.5 interquartile range (IQR) of the first quartile, and the highest value still within 1.5 IQR of the third quartile. Metric-based aggregation with the mean was performed in all experiments. Top: percentage of simulations in bootstrapping experiments in which the winner (according to the respective metric) remains the winner. Bottom: percentage of other participating teams that were ranked first in the simulations
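For reference, the three metrics compared here can be computed as in the following sketch, which uses common textbook definitions (binary masks for the DSC, boundary point sets for the HD/HD95); it is illustrative only and not the evaluation code used in the challenges.

```python
# Illustrative implementations of DSC, HD, and HD95 (hypothetical inputs).
import numpy as np
from scipy.spatial.distance import cdist

def dsc(seg, ref):
    """Dice Similarity Coefficient of two boolean masks (higher is better)."""
    inter = np.logical_and(seg, ref).sum()
    return 2.0 * inter / (seg.sum() + ref.sum())

def hausdorff(seg_pts, ref_pts, percentile=100):
    """Symmetric (percentile) Hausdorff Distance between two point sets.
    percentile=100 gives the standard HD; percentile=95 gives HD95."""
    d = cdist(seg_pts, ref_pts)
    d_seg_to_ref = d.min(axis=1)   # for each seg point, distance to nearest ref point
    d_ref_to_seg = d.min(axis=0)   # for each ref point, distance to nearest seg point
    return max(np.percentile(d_seg_to_ref, percentile),
               np.percentile(d_ref_to_seg, percentile))

# Toy 2D example: one outlier point inflates the HD far more than the HD95.
seg = np.array([[0, 0], [1, 0], [2, 0], [10, 0]])
ref = np.array([[0, 0], [1, 0], [2, 0]])
print(hausdorff(seg, ref), hausdorff(seg, ref, percentile=95))

mask_a = np.zeros((4, 4), dtype=bool)
mask_a[1:3, 1:3] = True
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[1:3, 1:4] = True
print(dsc(mask_a, mask_b))  # 0.8
```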
Fig. 5
Main results of the international questionnaire on biomedical challenges. Issues raised by the participants were related to the challenge data, the data annotation, the evaluation (including choice of metrics and ranking schemes) and the documentation of challenge results

