Acta Orthop. 2021 Oct;92(5):513-525. doi: 10.1080/17453674.2021.1918389. Epub 2021 May 14.

Presenting artificial intelligence, deep learning, and machine learning studies to clinicians and healthcare stakeholders: an introductory reference with a guideline and a Clinical AI Research (CAIR) checklist proposal


Jakub Olczak et al. Acta Orthop. 2021 Oct.

Abstract

Background and purpose - Artificial intelligence (AI), deep learning (DL), and machine learning (ML) have become common research fields in orthopedics and medicine in general. Engineers perform much of the work. While they gear the results towards healthcare professionals, the difference in competencies and goals creates challenges for collaboration and knowledge exchange. We aim to provide clinicians with context for and an understanding of AI research by facilitating communication between creators, researchers, clinicians, and readers of medical AI and ML research.

Methods and results - We present the common tasks, considerations, and pitfalls (both methodological and ethical) that clinicians will encounter in AI research. We discuss the following topics: labeling, missing data, training, testing, and overfitting. Common performance and outcome measures for various AI and ML tasks are presented, including accuracy, precision, recall, F1 score, Dice score, the area under the curve (AUC), and ROC curves. We also discuss ethical considerations in terms of privacy, fairness, autonomy, safety, responsibility, and liability regarding data collection and sharing.

Interpretation - We have developed guidelines for reporting medical AI research to clinicians in the run-up to a broader consensus process. The proposed guidelines consist of a Clinical Artificial Intelligence Research (CAIR) checklist and specific performance metrics guidelines for presenting and evaluating research that uses AI components. Researchers, engineers, clinicians, and other stakeholders can use these proposed guidelines and the CAIR checklist to read, present, and evaluate AI research geared towards a healthcare setting.


Figures

Figure 1. Confusion matrix for an ankle fracture classification experiment, according to the Danis-Weber (AO Foundation/Orthopaedic Trauma Association, AO/OTA) classification. There are 26 type A fractures, 137 type B fractures, and 47 type C fractures. Data reproduced from Olczak et al. (2020).
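For orientation, a confusion matrix such as the one in Figure 1 is a tally of predicted versus reference labels, with overall accuracy read off the diagonal. A minimal Python sketch, using hypothetical labels rather than the study data:

    # Hypothetical example, not the data behind Figure 1
    import numpy as np

    classes = ["A", "B", "C"]                     # Danis-Weber fracture types
    y_true = ["A", "B", "B", "C", "A", "B"]       # reference labels
    y_pred = ["A", "B", "C", "C", "B", "B"]       # model predictions

    cm = np.zeros((len(classes), len(classes)), dtype=int)  # rows = true, columns = predicted
    for t, p in zip(y_true, y_pred):
        cm[classes.index(t), classes.index(p)] += 1

    accuracy = np.trace(cm) / cm.sum()            # fraction of correct predictions
    print(cm)
    print(f"accuracy = {accuracy:.2f}")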
Figure 2. Graphical illustration of precision and sensitivity (or recall). Open circles, "○," represent cases without the disease/class; filled circles (bullets), "●," represent cases with the disease/class.
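Precision and sensitivity (recall) follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). A minimal sketch with hypothetical counts, not data from the study:

    # Hypothetical counts, for illustration only
    TP, FP, FN = 40, 10, 20

    precision = TP / (TP + FP)   # of the cases flagged positive, the share that truly are
    recall = TP / (TP + FN)      # of the truly positive cases, the share that were flagged
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

    print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")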
Figure 3. ROC and PR curves for malleolar class predictions. The ROC curves (left) are monotonically increasing functions of sensitivity (y-axis) against the false positive rate (FPR, x-axis). The area under the ROC curve (AUC) summarizes overall model performance. The PR curves (right) have precision on the y-axis and sensitivity on the x-axis. Unlike the ROC curve, the PR curve can oscillate and tends towards zero. The differences between the outcomes are also greater.
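An ROC curve is traced by sweeping a decision threshold over the model's scores and recording sensitivity against the false positive rate at each step; the AUC is the area under that curve. A minimal sketch with hypothetical labels and scores (not the malleolar data):

    # Hypothetical labels and predicted scores
    import numpy as np

    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                    # 1 = has the class
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])  # predicted probabilities

    tpr, fpr = [0.0], [0.0]                                         # start at the (0, 0) corner
    for thr in np.sort(np.unique(y_score))[::-1]:                   # sweep the threshold downwards
        pred = y_score >= thr
        tpr.append((pred & (y_true == 1)).sum() / (y_true == 1).sum())  # sensitivity
        fpr.append((pred & (y_true == 0)).sum() / (y_true == 0).sum())  # 1 - specificity

    auc = np.trapz(tpr, fpr)                                        # area by the trapezoidal rule
    print(f"AUC = {auc:.2f}")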
Figure 4. Comparing the IoU and the F1 score in terms of data overlap. The overlapping sets illustrate why both are commonly used performance measures in object detection and image segmentation. The IoU is the area of overlap between the detection and the reference region as a fraction of their combined area. The F1 score is the harmonic mean of precision and recall, in which the TPs are given additional weight. One can be transformed into the other (see supplement). See Table 1 for how to compute the IoU and F1 score.
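Both measures reduce to simple set arithmetic on the predicted and reference regions, and the Dice/F1 score can be recovered from the IoU as 2·IoU/(1 + IoU). A minimal sketch on two hypothetical binary masks:

    # Hypothetical 3x3 masks, for illustration only
    import numpy as np

    pred = np.array([[1, 1, 0],
                     [0, 1, 0],
                     [0, 0, 0]], dtype=bool)      # predicted region
    target = np.array([[1, 0, 0],
                       [1, 1, 0],
                       [0, 0, 0]], dtype=bool)    # reference region

    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()

    iou = intersection / union                              # Intersection over Union
    dice = 2 * intersection / (pred.sum() + target.sum())   # Dice score (F1 on pixels)

    assert abs(dice - 2 * iou / (1 + iou)) < 1e-9           # one transforms into the other
    print(f"IoU = {iou:.2f}, Dice/F1 = {dice:.2f}")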
Figure 5. Recommendations for choosing outcome metrics suitable for clinicians. The measures are selected for (1) their suitability and (2) their interpretability to a clinician. Deviations from these are possible; however, they need to be motivated, and we recommend also reporting these metrics. IoU (Intersection over Union); ROI (Region of Interest); MAE (Mean Absolute Error); RMSE (Root Mean Squared Error); AUC (Area Under the Receiver Operating Characteristic curve); AUPR (Area Under the Precision-Recall curve).


References

    1. Adamson A S, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatol 2018; 154(11): 1247. - PubMed
    2. Anderson P, Fernando B, Johnson M, Gould S. SPICE: Semantic Propositional Image Caption Evaluation. arXiv:1607.08822 [cs] [Internet] 2016 Jul 29 [cited 2020 Nov 30]. Available from: http://arxiv.org/abs/1607.08822
    3. Badgeley M A, Zech J R, Oakden-Rayner L, Glicksberg B S, Liu M, Gale W, et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digit Med 2019; 2(1): 1–10. - PMC - PubMed
    4. Bandos A I, Obuchowski N A. Evaluation of diagnostic accuracy in free-response detection-localization tasks using ROC tools. Stat Methods Med Res 2019; 28(6): 1808–25. - PubMed
    5. Banerjee S, Lavie A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization [Internet]. Ann Arbor, MI: Association for Computational Linguistics; 2005 [cited 2020 Nov 30]. p. 65–72. Available from: https://www.aclweb.org/anthology/W05-0909