JAMA Netw Open. 2023 Sep 5;6(9):e2335377. doi: 10.1001/jamanetworkopen.2023.35377.

APPRAISE-AI Tool for Quantitative Evaluation of AI Studies for Clinical Decision Support


Jethro C C Kwong et al. JAMA Netw Open. 2023.

Abstract

Importance: Artificial intelligence (AI) has gained considerable attention in health care, yet concerns have been raised about appropriate methods and fairness. Current AI reporting guidelines do not provide a means of quantifying the overall quality of AI research, limiting the ability to compare models addressing the same clinical question.

Objective: To develop a tool (APPRAISE-AI) to evaluate the methodological and reporting quality of AI prediction models for clinical decision support.

Design, setting, and participants: This quality improvement study evaluated AI studies in the model development, silent, and clinical trial phases using the APPRAISE-AI tool, a quantitative method for evaluating the quality of AI studies across 6 domains: clinical relevance, data quality, methodological conduct, robustness of results, reporting quality, and reproducibility. These domains comprised 24 items with a maximum overall score of 100 points. Points were assigned to each item, with higher points indicating stronger methodological or reporting quality. The tool was applied to a systematic review on machine learning to predict sepsis that included articles published until September 13, 2019. Data analysis was performed from September to December 2022.
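The scoring scheme described above (24 items grouped into 6 weighted domains, summing to a 100-point overall score) can be sketched as a simple aggregation. Note that only the domain names and the 100-point maximum come from the abstract; the per-domain maximums and example scores below are invented for illustration and do not reflect the actual APPRAISE-AI weights.

```python
# Hedged sketch of an APPRAISE-AI-style score aggregation.
# The 6 domain names are from the abstract; the per-domain maximums are
# hypothetical (the real tool assigns specific points per item, with an
# overall maximum of 100).

DOMAIN_MAX = {  # hypothetical weighting that sums to 100
    "clinical relevance": 10,
    "data quality": 20,
    "methodological conduct": 25,
    "robustness of results": 20,
    "reporting quality": 15,
    "reproducibility": 10,
}

def overall_score(domain_scores: dict) -> int:
    """Sum awarded domain points, checking none exceeds its maximum."""
    total = 0
    for domain, awarded in domain_scores.items():
        maximum = DOMAIN_MAX[domain]
        if not 0 <= awarded <= maximum:
            raise ValueError(f"{domain}: {awarded} outside 0-{maximum}")
        total += awarded
    return total

# Invented example: a hypothetical moderate-quality study
scores = {
    "clinical relevance": 8,
    "data quality": 12,
    "methodological conduct": 14,
    "robustness of results": 10,
    "reporting quality": 9,
    "reproducibility": 4,
}
print(overall_score(scores))  # 57
```

Reporting each domain as a percentage of its maximum (as in the article's figure) would then allow comparison across domains irrespective of the assigned weighting.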

Main outcomes and measures: The primary outcomes were interrater and intrarater reliability and the correlation between APPRAISE-AI scores and expert scores, 3-year citation rate, number of Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) low risk-of-bias domains, and overall adherence to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement.

Results: A total of 28 studies were included. Overall APPRAISE-AI scores ranged from 33 (low quality) to 67 (high quality). Most studies were moderate quality. The 5 lowest scoring items included source of data, sample size calculation, bias assessment, error analysis, and transparency. Overall APPRAISE-AI scores were associated with expert scores (Spearman ρ, 0.82; 95% CI, 0.64-0.91; P < .001), 3-year citation rate (Spearman ρ, 0.69; 95% CI, 0.43-0.85; P < .001), number of QUADAS-2 low risk-of-bias domains (Spearman ρ, 0.56; 95% CI, 0.24-0.77; P = .002), and adherence to the TRIPOD statement (Spearman ρ, 0.87; 95% CI, 0.73-0.94; P < .001). Intraclass correlation coefficient ranges for interrater and intrarater reliability were 0.74 to 1.00 for individual items, 0.81 to 0.99 for individual domains, and 0.91 to 0.98 for overall scores.
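The associations above are Spearman rank correlations between APPRAISE-AI scores and each comparator measure. As a minimal sketch of that statistic, the pure-Python implementation below computes ρ as the Pearson correlation of rank vectors; the score pairs are invented for illustration (in practice a library routine such as scipy.stats.spearmanr would typically be used, which also supplies the P value).

```python
# Hedged sketch: Spearman rank correlation of the kind reported in the
# abstract (e.g., APPRAISE-AI scores vs expert scores). Data are invented.

def ranks(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented example: APPRAISE-AI vs expert scores for 6 studies
appraise = [33, 41, 48, 52, 60, 67]
expert = [30, 45, 44, 55, 62, 70]
print(round(spearman_rho(appraise, expert), 2))  # 0.94
```

The interrater and intrarater reliability figures in the results are a different statistic (the intraclass correlation coefficient), which measures agreement in absolute scores rather than agreement in rank order.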

Conclusions and relevance: In this quality improvement study, APPRAISE-AI demonstrated strong interrater and intrarater reliability and correlated well with several study quality measures. This tool may provide a quantitative approach for investigators, reviewers, editors, and funding organizations to compare the research quality across AI studies for clinical decision support.


Conflict of interest statement

Conflict of Interest Disclosures: Dr McDermott reported receiving personal fees from FL84 for consulting work performed for machine learning over health data, outside the submitted work. Dr Kulkarni reported receiving personal fees from Janssen, Theralase Inc, Merck Sharp & Dohme, Bristol-Myers Squibb, Emmanuel Merck Darmstadt Serono, Photocure, Advanced Accelerators Applications Novartis, Verity Pharmaceuticals, Ferring, TerSera, Knight Therapeutics, Abbvie, and Tolmar outside the submitted work. No other disclosures were reported.

Figures

Figure. Mean APPRAISE-AI Item, Domain, and Overall Scores for the 28 Studies Using Artificial Intelligence to Predict Sepsis
Each field is presented as a percentage of the maximum possible score for that field (ie, mean score / maximum possible score × 100%) to compare scores between fields, irrespective of the assigned weighting.
