Eur Heart J Digit Health. 2022 Apr 12;3(2):125-140. doi: 10.1093/ehjdh/ztac016. eCollection 2022 Jun.

A clinician's guide to understanding and critically appraising machine learning studies: a checklist for Ruling Out Bias Using Standard Tools in Machine Learning (ROBUST-ML)

Salah S Al-Zaiti et al. Eur Heart J Digit Health.

Abstract

Developing functional machine learning (ML)-based models to address unmet clinical needs requires unique considerations for optimal clinical utility. Recent debates about the rigour, transparency, explainability, and reproducibility of ML models, terms which are defined in this article, have raised concerns about their clinical utility and suitability for integration in current evidence-based practice paradigms. This featured article focuses on increasing the literacy of ML among clinicians by providing them with the knowledge and tools needed to understand and critically appraise clinical studies focused on ML. A checklist is provided for evaluating the rigour and reproducibility of the four ML building blocks: data curation, feature engineering, model development, and clinical deployment. Checklists like this are important for quality assurance and to ensure that ML studies are rigorously and confidently reviewed by clinicians and are guided by domain knowledge of the setting in which the findings will be applied. Bridging the gap between clinicians, healthcare scientists, and ML engineers can address many shortcomings and pitfalls of ML-based solutions and their potential deployment at the bedside.

Keywords: Bias; Critical appraisal; Guidelines; Healthcare; Machine learning; Quality.

Figures

Figure 1
Temporal trends in machine learning-centered articles published in PubMed between 2000 and 2021. This figure shows the results of a simple search on PubMed for clinical diagnostic studies focused on ‘diagnosis’ between 2000 and 2021 (line with square markers) and the sub-portion of those studies that reference ‘machine learning’, ‘artificial intelligence’, or ‘deep learning’ in the title or abstract (line with diamond markers).
Figure 2
The complexity of decisions considered when designing a machine learning model. This figure showcases the complexity and subjectiveness in the questions/decisions the data scientist may consider when building a machine learning algorithm. Ideally, these important decisions need to be made in conjunction with collaborating clinicians to enhance clinical relevance and utility. EDA, exploratory data analysis.
Figure 3
Relationship between artificial intelligence, machine learning, and deep learning. Artificial intelligence is the general umbrella that encompasses machine learning with other domains, whereas deep learning is a subclass of machine learning algorithms (Credit: Salah Al-Zaiti. Created with BioRender.com).
Figure 4
Basic architecture of a deep neural network. The figure shows the basic architecture of a deep neural network, which is composed of an input layer (features), hidden layers (function nodes), and an output layer (prediction). The functional unit at each ‘synaptic connection’ is called a neuron and includes a summation function and an activation function.
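To make the neuron described in this caption concrete, the following is a minimal illustrative sketch (not taken from the article) of a single neuron and a tiny network in Python; the weights and bias values are arbitrary, and a sigmoid is assumed as the activation function:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One neuron: a weighted summation followed by a non-linear activation."""
    z = np.dot(weights, inputs) + bias      # summation function
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid activation function

# A tiny network: input layer (3 features) -> hidden layer (2 neurons) -> output.
features = np.array([0.5, -1.2, 3.0])
hidden = np.array([
    neuron(features, np.array([0.1, 0.4, -0.2]), 0.05),
    neuron(features, np.array([-0.3, 0.2, 0.1]), -0.1),
])
prediction = neuron(hidden, np.array([0.7, -0.5]), 0.0)  # value in (0, 1)
```

In practice, deep learning frameworks evaluate whole layers of such neurons at once as matrix multiplications, but the summation-plus-activation unit is the same.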
Figure 5
Basic architecture of a convolutional neural network. This figure illustrates how features can be extracted from a raw image (e.g. single photon emission computed tomography myocardial perfusion scan) for use in a neural network. The pixel values in each colour channel are multiplied by a kernel filter to extract low-level image features (convolution layer). Next, adjacent pixel values are grouped together using mean or max value to reduce spatial dimensionality (pooling layer). After repeated iterations of these two layers, the final byte array matrix is flattened and fed into a classical neural network to make predictions.
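The convolution, pooling, and flattening steps described in this caption can be sketched directly on a small array. The following is an illustrative toy example (not from the article) using a 6×6 stand-in for one colour channel and an arbitrary 2×2 kernel filter:

```python
import numpy as np

def convolve2d(image, kernel):
    """Convolution layer: slide a kernel filter over the pixel values."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Pooling layer: group adjacent values (here, max) to reduce dimensionality."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)    # stand-in for one colour channel
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # simple vertical-edge filter
feature_map = convolve2d(image, edge_kernel)        # convolution layer -> 5x5 map
pooled = max_pool(feature_map)                      # pooling layer -> 2x2 map
flattened = pooled.flatten()                        # fed into a classical network
```

A real convolutional neural network repeats these two layers many times with learned kernels, but the mechanics of each pass are as above.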
Figure 6
The iterative steps for developing a functional machine learning model. This is a simplified depiction of the actual iterative steps in building a machine learning pipeline. The first step in model development is data preprocessing followed by either supervised machine learning or unsupervised machine learning based on availability of labelled outcome data. Then, input features are pre-computed (i.e. handcrafted) or raw non-tabular data (i.e. image, waveform, etc.) is used for model development. Next, the dataset is partitioned into a training set and a testing set (usually 2:1). The training subset is further partitioned into k-folds to iteratively derive and update model hyperparameters, and the other testing subset is used to fine-tune and select the outperforming classifier or regressor. The best-performing model is then externally validated on new unseen data to determine generalizability before integration in the clinical workflow. Reproduced with permission from Helman et al.
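The partition-then-tune workflow in this caption can be sketched in a few lines. The example below is illustrative only: the article does not prescribe a library or classifier, so scikit-learn, a synthetic dataset, and a logistic regression with a hypothetical hyperparameter grid are assumed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a curated clinical dataset.
X, y = make_classification(n_samples=900, n_features=10, random_state=0)

# Partition the dataset into training and testing sets (roughly 2:1).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

# K-fold cross-validation on the training subset to tune hyperparameters.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# The held-out testing subset estimates performance of the selected model;
# true external validation still requires new, unseen data.
test_accuracy = search.score(X_test, y_test)
```

Note that the final step in the caption, external validation, cannot be simulated by re-splitting the same dataset; it requires data from a different site or time period.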
Figure 7
Overfitting and bias-variance tradeoff in machine learning model development. (A) The simple classification case of a binary outcome (denoted by diamonds and circles) using two variables X1 and X2. Unlike the first two classifiers, which capture a real association between X1, X2, and the outcome, the last classifier captures patterns in the data irrelevant to the outcome of interest (e.g. confounding, redundancy, missingness, outliers, etc.), thus ‘overfitting’ the model to the training data. In (B), the plot to the left demonstrates three dynamic phases of the tradeoff between bias (training error) and variance (testing error): low bias – high variance (overfitting), low bias – low variance (optimal fitting), and high bias – high variance (underfitting). The two plots to the right show the area under the receiver operating characteristic curve of three classifiers (C1, C2, and C3) fitted on a training cohort of n = 745 and a testing cohort of n = 499. C1 shows the lowest bias (high area under the curve) during training but high variance (low area under the curve) during testing (i.e. overfitting), whereas C3 shows the highest bias (low area under the curve) during training and the highest variance (low area under the curve) during testing (i.e. underfitting).
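The overfitting pattern in this caption, low training error paired with much higher testing error, is easy to reproduce. The sketch below uses synthetic noisy data (not the article's cohorts) and an unconstrained decision tree, chosen here only because it memorizes its training set:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y injects label noise the model may memorize.
X, y = make_classification(n_samples=400, n_features=20, n_informative=2,
                           flip_y=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3, random_state=1)

# An unconstrained decision tree fits the training set (noise included) perfectly.
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
auc_train = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
# Low bias (high training AUC) but high variance (lower testing AUC): overfitting.
```

Constraining the model (e.g. limiting tree depth) trades some training performance for better generalization, moving toward the optimal-fitting regime in panel (B).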
Figure 8
Hypothetical tradeoff between model accuracy and model explainability. This figure shows a hypothetical relationship between model accuracy (computational cost) and model explainability. It is worth noting that this is an over-simplified view of the relationship between these two constructs, and that the relationship between the selected classifiers is not linear. Yet, this figure emphasizes that predictive modelling is ‘mission critical’; explainable (simple) models are preferred because they will be trusted more, and thus used more. These models might also be more accurate than complex ones (note the horizontal error bars for accuracy). DL, deep learning; RF, random forest; SVM, support vector machine; K-NN, K-nearest neighbours; LR, logistic regression.
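One reason logistic regression sits at the explainable end of this spectrum is that each fitted coefficient has a direct clinical reading as an odds ratio. The sketch below illustrates this on synthetic data with hypothetical feature names; it is not from the article:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for four clinical predictors.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Explainability: exp(coefficient) is the odds ratio per one-unit
# increase in each feature, holding the others constant.
odds_ratios = np.exp(model.coef_[0])
for name, orr in zip(["feat_1", "feat_2", "feat_3", "feat_4"], odds_ratios):
    print(f"{name}: odds ratio {orr:.2f}")
```

No comparably direct reading exists for the millions of weights in a deep network, which is why post hoc explanation tools are usually needed for models at the other end of the curve.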
