Eur Radiol. 2025 May 16. doi: 10.1007/s00330-025-11669-z. Online ahead of print.

Impact of test set composition on AI performance in pediatric wrist fracture detection in X-rays


Tristan Till et al. Eur Radiol.

Abstract

Objectives: To evaluate how different test set sampling strategies (random selection and balanced sampling) affect the performance of artificial intelligence (AI) models in pediatric wrist fracture detection using radiographs, aiming to highlight the need for standardization in test set design.

Materials and methods: This retrospective study used the open-source GRAZPEDWRI-DX dataset of pediatric wrist radiographs from 6091 patients. Two test sets, each containing 4588 images, were constructed: one by balanced sampling based on case difficulty, projection type, and fracture presence, and the other by random selection. EfficientNet and YOLOv11 models were trained and validated on 18,762 radiographs and tested on both sets. Binary classification and object detection tasks were evaluated using metrics such as precision, recall, F1 score, AP50, and AP50-95. Statistical comparisons between the test sets were performed using nonparametric tests.
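To make the two sampling strategies concrete, the sketch below contrasts simple random selection with stratified balanced sampling. This is a minimal illustration, not the authors' code: the column names `difficulty`, `projection`, and `fracture` and the equal-per-stratum allocation are assumptions for demonstration.

```python
# Minimal sketch of the two test set sampling strategies (not the study's
# actual code). Column names and the per-stratum allocation are assumptions.
import pandas as pd

def random_test_set(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Simple random selection of n images."""
    return df.sample(n=n, random_state=seed)

def balanced_test_set(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Balanced sampling: draw roughly equally from each stratum defined
    by case difficulty, projection type, and fracture presence."""
    groups = df.groupby(["difficulty", "projection", "fracture"])
    per_stratum = max(1, n // groups.ngroups)
    parts = [
        g.sample(n=min(per_stratum, len(g)), random_state=seed)
        for _, g in groups
    ]
    # Concatenate the strata and shuffle so the set carries no group ordering.
    return pd.concat(parts).sample(frac=1.0, random_state=seed)
```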

Results: Performance metrics decreased significantly on the balanced test set, which contained more challenging cases. For example, precision for the YOLOv11 models dropped from 0.95 on the random set to 0.83 on the balanced set. Similar trends were observed for recall, accuracy, and F1 score, indicating that models trained on easy-to-recognize cases performed poorly on more complex ones. These results were consistent across all model variants tested.
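The abstract names only "nonparametric tests" for the between-set comparison; the snippet below shows one common choice, a Mann-Whitney U test on per-image scores, purely as a hedged example. The score arrays are synthetic placeholders, not the study's data.

```python
# Hedged example of a nonparametric comparison between the two test sets.
# The abstract does not name the specific test; Mann-Whitney U is one
# common option. The scores below are synthetic placeholders.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
scores_random = rng.beta(9, 1, size=500)    # per-image scores, random set
scores_balanced = rng.beta(7, 3, size=500)  # per-image scores, balanced set

# One-sided test: are scores on the random set stochastically greater?
stat, p = mannwhitneyu(scores_random, scores_balanced, alternative="greater")
print(f"U = {stat:.1f}, p = {p:.3g}")
```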

Conclusion: AI models for pediatric wrist fracture detection show reduced performance when tested on balanced datasets containing more difficult cases than when tested on randomly selected cases. This highlights the importance of constructing representative, standardized test sets that account for clinical complexity to ensure robust AI performance in real-world settings.

Key points:
Question: Do sampling strategies based on sample complexity influence deep learning models' performance in fracture detection?
Findings: AI performance in pediatric wrist fracture detection drops significantly when models are tested on balanced datasets with more challenging cases, compared to randomly selected cases.
Clinical relevance: Without standardized, validated test datasets that reflect clinical complexity, AI performance metrics may be overestimated, limiting the utility of AI in real-world settings.

Keywords: Artificial intelligence, Pediatric radiology, Fracture detection, Radiographs, Test sets.


Conflict of interest statement

Compliance with ethical standards.
Guarantor: The scientific guarantor of this publication is Nikolaus Stranger.
Conflict of interest: The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.
Statistics and biometry: One of the authors has significant statistical expertise.
Informed consent: Written informed consent was waived by the Institutional Review Board.
Ethical approval: Institutional Review Board approval was obtained.
Study subjects or cohorts overlap: This study used the open-source GRAZPEDWRI-DX dataset of pediatric wrist radiographs from 6091 patients.
Methodology: Retrospective, diagnostic or prognostic study, performed at one institution.
