Review
J Exp Orthop. 2024 May 31;11(3):e12039.
doi: 10.1002/jeo2.12039. eCollection 2024 Jul.

A practical guide to the implementation of AI in orthopaedic research, Part 6: How to evaluate the performance of AI research?

Felix C Oettl et al. J Exp Orthop.

Abstract

Artificial intelligence's (AI) accelerating progress demands rigorous evaluation standards to ensure safe, effective integration into healthcare's high-stakes decisions. As AI increasingly enables prediction, analysis and judgement capabilities relevant to medicine, proper evaluation and interpretation are indispensable. Erroneous AI could endanger patients; thus, developing, validating and deploying medical AI demands adhering to strict, transparent standards centred on safety, ethics and responsible oversight. Core considerations include assessing performance on diverse real-world data, collaborating with domain experts, confirming model reliability and limitations, and advancing interpretability. Thoughtful selection of evaluation metrics suited to the clinical context, along with testing on diverse data sets representing different populations, improves generalisability. Partnering software engineers, data scientists and medical practitioners grounds assessment in real needs. Journals must uphold reporting standards matching AI's societal impacts. With rigorous, holistic evaluation frameworks, AI can progress towards expanding healthcare access and quality.

Level of evidence: Level V.

Keywords: AI; ML; digitalization; healthcare; performance metrics.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Receiver operating characteristic (ROC) curve with area under the curve (AUC). The orange line shows the model's true positive rate against its false positive rate at various thresholds; the dashed blue line represents an AUC of 0.5, no better than chance.
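
The quantities in the Figure 1 caption can be illustrated with a minimal sketch. It is not taken from the article: the function names `roc_points` and `auc` and the toy labels and scores are hypothetical, and the area is computed with the trapezoidal rule, one common convention. It sweeps the decision threshold from high to low, records the false positive rate and true positive rate at each step, and integrates the resulting curve.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points for a threshold sweep.

    labels: 1 for a positive case, 0 for a negative case (hypothetical toy input).
    scores: model outputs, where a higher score means "more likely positive".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Sort by descending score so that each step lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every case tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data, invented purely for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(round(auc(roc_points(labels, scores)), 3))  # 0.889

# A model that gives every case the same score cannot rank cases at all,
# which corresponds to the dashed chance line in the figure.
print(auc(roc_points([1, 0], [0.5, 0.5])))  # 0.5
```

In this toy example the model ranks 8 of the 9 positive-negative pairs correctly, which is why the AUC comes out at 8/9. An AUC of 0.5 means the ranking is no better than chance, matching the dashed line described in the caption.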
