Multicenter Study
Nat Med. 2025 Jan;31(1):189-196.
doi: 10.1038/s41591-024-03329-4. Epub 2025 Jan 2.

International multicenter validation of AI-driven ultrasound detection of ovarian cancer

Filip Christiansen et al. Nat Med. 2025 Jan.

Abstract

Ovarian lesions are common and often incidentally detected. A critical shortage of expert ultrasound examiners has raised concerns of unnecessary interventions and delayed cancer diagnoses. Deep learning has shown promising results in the detection of ovarian cancer in ultrasound images; however, external validation is lacking. In this international multicenter retrospective study, we developed and validated transformer-based neural network models using a comprehensive dataset of 17,119 ultrasound images from 3,652 patients across 20 centers in eight countries. Using a leave-one-center-out cross-validation scheme, for each center in turn, we trained a model using data from the remaining centers. The models demonstrated robust performance across centers, ultrasound systems, histological diagnoses and patient age groups, significantly outperforming both expert and non-expert examiners on all evaluated metrics, namely F1 score, sensitivity, specificity, accuracy, Cohen's kappa, Matthews correlation coefficient, diagnostic odds ratio and Youden's J statistic. Furthermore, in a retrospective triage simulation, artificial intelligence (AI)-driven diagnostic support reduced referrals to experts by 63% while significantly surpassing the diagnostic performance of the current practice. These results show that transformer-based models exhibit strong generalization and above human expert-level diagnostic accuracy, with the potential to alleviate the shortage of expert ultrasound examiners and improve patient outcomes.
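The leave-one-center-out scheme and the eight metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the `cases` structure, the `"center"` key and both function names are assumptions made for illustration, and the metric formulas are the standard textbook definitions computed from a 2x2 confusion matrix.

```python
import math
from collections import defaultdict


def leave_one_center_out(cases):
    """Yield (held_out_center, train_cases, test_cases) for each center in turn.

    `cases` is a list of dicts with a hypothetical "center" key.
    """
    by_center = defaultdict(list)
    for case in cases:
        by_center[case["center"]].append(case)
    for held_out in sorted(by_center):
        # Train on every other center, test on the held-out one.
        train = [c for c in cases if c["center"] != held_out]
        yield held_out, train, by_center[held_out]


def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion counts (malignant = positive).

    Assumes all four counts are non-zero; no zero-division guards.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "f1": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "cohen_kappa": (accuracy - p_chance) / (1 - p_chance),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "diagnostic_odds_ratio": (tp * tn) / (fp * fn),
        "youden_j": sensitivity + specificity - 1,
    }
```

In a full pipeline, each held-out center's predictions would be turned into confusion counts and passed to `binary_metrics`; the model training step is deliberately omitted here.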

Conflict of interest statement

Competing interests: E.E., K.S., F.C., E.K. and P.H. have applied for a patent (European patent application 23220765.4) that is pending to a company named Intelligyn. The patent covers methods for a computer-aided diagnostic system to improve generalization and protect against bias. E.E., K.S. and F.C. hold stock in Intelligyn, where E.E. also has an unpaid leadership role. N.C.P.’s institution has received payments for activities not related to this article, including lectures, presentations, expert testimonies, and service on speakers’ bureaus, as well as for travel support. N.C.P. has been an advisory board member of Mindray and GE Healthcare and has held unpaid leadership roles in the POGS Organization of Government Institutions and the Rizal Medical Service Delivery Network, which are Philippine governmental institutions that aim to facilitate smooth referral of patients. The other authors declare no competing interests.

Figures

Fig. 1. Paired F1 scores between human examiners and AI models.
a, Paired F1 scores between individual examiners (n = 66) and the AI models on matched case sets, that is, each examiner is compared against the AI models on the set of cases he or she assessed. A dot above the dashed line corresponds to an individual examiner that was outperformed by the AI models on the same set of cases. b,c, Paired F1 scores between (b) expert examiners (n = 33; orange) and AI models (blue), and (c) non-expert examiners (n = 33; green) and AI models (blue), with gray lines indicating matched case sets. The box plots show the median and the 25th and 75th percentiles, and the whiskers span the range of non-outlier values. The density plots show the distributions of the overall F1 scores (made with kernel smoothing).
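The matched-case comparison in this caption can be sketched as follows: the AI is scored only on the cases a given examiner assessed, so both F1 scores refer to the same case set. The dictionary layout (case ID mapped to a boolean "malignant" call) and the function names are assumptions for illustration, not the study's data format.

```python
def f1_score(truth, predicted):
    """F1 score for boolean labels, with malignant (True) as the positive class."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def paired_f1(truth, ai_calls, examiner_calls):
    """Return (examiner_f1, ai_f1) on the cases this examiner assessed.

    `truth` and `ai_calls` map every case ID to a boolean;
    `examiner_calls` covers only the examiner's subset of cases.
    """
    case_ids = sorted(examiner_calls)
    y = [truth[c] for c in case_ids]
    examiner_f1 = f1_score(y, [examiner_calls[c] for c in case_ids])
    ai_f1 = f1_score(y, [ai_calls[c] for c in case_ids])
    return examiner_f1, ai_f1
```

A point above the dashed line in panel a corresponds to `ai_f1 > examiner_f1` for that examiner.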
Fig. 2. AI model ROC curve and human examiner performance.
The model performance is given as an ROC curve in blue, with shaded 95% confidence bands constructed from the 2.5th and 97.5th percentiles of sensitivity values, at each level of specificity, from bootstrapped ROC curves. Each dot represents a human examiner, with non-experts in green and experts in orange. The performance of the AI models at the default cutoff point of 0.5, and the mean performance for expert and non-expert examiners, are each marked by a black cross. The mean performances of expert and non-expert examiners are each surrounded by a shaded 95% confidence region, estimated by a bivariate random-effects model. Note that the models were evaluated on all 2,660 reviewed cases, but each individual examiner assessed only a subset of these cases. Hence, although several individual expert examiners appear to outperform or match the models, being positioned above or to the left of the models' ROC curve, no examiner outperformed the models on the same case set, as shown in Fig. 1 and Supplementary Table 2.
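A percentile bootstrap band of the kind described in this caption could be computed roughly as below. This is a sketch under stated assumptions rather than the authors' procedure: it resamples malignant and benign scores separately (stratified resampling), reads sensitivity at the lowest threshold that reaches each target specificity, and uses linear-interpolation percentiles. All function names and parameters are hypothetical.

```python
import math
import random


def sensitivity_at_specificity(pos_scores, neg_scores, specificity):
    """Sensitivity at the lowest threshold whose specificity is >= the target."""
    neg_sorted = sorted(neg_scores)
    # Number of benign cases that must score at or below the threshold.
    k = math.ceil(specificity * len(neg_sorted) - 1e-9)
    threshold = neg_sorted[k - 1] if k > 0 else float("-inf")
    # A case is called malignant when its score exceeds the threshold.
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bootstrap_roc_band(pos_scores, neg_scores, specificities, n_boot=1000, seed=0):
    """Map each specificity to (2.5th, 97.5th) percentile of bootstrapped sensitivity."""
    rng = random.Random(seed)
    samples = {sp: [] for sp in specificities}
    for _ in range(n_boot):
        # Resample with replacement within each class.
        pos = rng.choices(pos_scores, k=len(pos_scores))
        neg = rng.choices(neg_scores, k=len(neg_scores))
        for sp in specificities:
            samples[sp].append(sensitivity_at_specificity(pos, neg, sp))
    return {sp: (percentile(v, 2.5), percentile(v, 97.5)) for sp, v in samples.items()}
```

Whether the study resampled at the image, case or patient level is not stated in this record, so the resampling unit here should be treated as a placeholder.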
Fig. 3. Subgroup analysis.
a–c, Comparison of the AI models and expert and non-expert examiners, for different (a) medical centers, (b) ultrasound systems (limited to the eight most common systems), and (c) histological diagnoses. The box plots show the median and the 25th and 75th percentiles, and the whiskers indicate 95% confidence intervals through bootstrapping.
Fig. 4. Current practice and proposed AI-assisted strategy for triage workflow.
a, In the current practice, a non-expert examiner makes an initial assessment, and patients with an uncertain diagnosis or presumed malignancy are referred to an expert. Additionally, with gynecologists in training (residents), most newly detected lesions are referred to an expert examiner, independently of the finding. b, In our proposed AI-assisted triage strategy, the AI model and a non-expert examiner each make an initial assessment, and then an expert examiner makes the final decision in cases of disagreement. *The proposed AI-assisted strategy can also be used with an expert as the initial examiner.
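The decision rule of the proposed AI-assisted strategy in panel b can be written as a short simulation. The sketch below assumes boolean "malignant" calls aligned by case across three lists; the function name and input layout are illustrative assumptions, and it only covers the rule described in the caption (agreement is accepted, disagreement is referred to an expert).

```python
def ai_assisted_triage(ai_calls, non_expert_calls, expert_calls):
    """Simulate the AI-assisted triage rule.

    Returns (final_calls, referral_rate), where a case is referred to the
    expert only when the AI model and the non-expert examiner disagree.
    """
    final_calls = []
    referred = 0
    for ai, non_expert, expert in zip(ai_calls, non_expert_calls, expert_calls):
        if ai == non_expert:
            # Agreement: accept the shared assessment without referral.
            final_calls.append(ai)
        else:
            # Disagreement: the expert makes the final decision.
            referred += 1
            final_calls.append(expert)
    return final_calls, referred / len(final_calls)
```

Comparing `referral_rate` with the referral rate of the current practice in panel a gives the kind of relative reduction reported in the abstract; the final calls can be scored against histology with any of the metrics listed there.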
Extended Data Fig. 1. Study flow diagram.
*These cases were excluded from the main analysis as they had not been included in compliance with our criterion on the temporal distribution of examination dates. The Olbia center was excluded from the human review due to its limited sample size (n = 57) and its small number of malignant cases (n = 8). These cases were excluded in order to have a test set of comparable size (n = 300) to those of the other centers and to utilize our reviewer resources efficiently.
Extended Data Fig. 2. Performance of AI models and human examiners by level of confidence in assessment.
F1 scores for (a) expert examiners and AI models and (b) non-expert examiners and AI models, partitioned by the examiner’s confidence in their assessment. For each level of confidence (certain, probable, uncertain), all assessments with the corresponding level of confidence were pooled. The box plots show the median and the 25th and 75th percentiles, and the whiskers indicate 95% confidence intervals through bootstrapping.
Extended Data Fig. 3. Subgroup analysis.
Comparison of the AI models and expert and non-expert examiners, for different (a) age groups and (b) years of examination. The box plots show the median and the 25th and 75th percentiles, and the whiskers indicate 95% confidence intervals through bootstrapping. Information on patient age was missing for 125 patients.
Extended Data Fig. 4. Calibration curve of AI models.
A calibration curve of the AI models is shown in solid black with 95% confidence bands in gray, depicting the relationship between the predicted risk of malignancy and the actual observed proportion of malignancy. The dotted line represents the ideal scenario of perfect calibration, where the predicted risks precisely match the observed outcomes. The histograms at the bottom depict the distributions of predicted risks of malignancy, for malignant and benign tumors, above and below the horizontal line, respectively. The calibration curve and confidence bands are based on local regression (loess; ref. 24) and on 12,673 image-level predictions. While not depicted in this figure, a linear logistic calibration curve was also fitted, yielding an intercept of −0.19 (95% CI, −0.24 to −0.14) and a slope of 1.00 (95% CI, 0.96 to 1.03), also indicating well-calibrated risk predictions.
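A linear logistic calibration fit of the kind mentioned in this caption regresses the observed outcome on the logit of the predicted risk; an intercept near 0 and a slope near 1 indicate good calibration. The sketch below fits the two parameters with Newton's method in plain Python. It is an illustration under assumptions (function names, fixed iteration count, no confidence intervals), not the authors' implementation, and predicted risks must lie strictly between 0 and 1.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def fit_logistic_calibration(predicted_risks, outcomes, n_iter=25):
    """Fit outcome ~ sigmoid(a + b * logit(predicted_risk)); return (a, b)."""
    xs = [logit(p) for p in predicted_risks]
    a, b = 0.0, 1.0  # Start from perfect calibration.
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, outcomes):
            mu = 1 / (1 + math.exp(-(a + b * x)))
            w = mu * (1 - mu)
            # Gradient of the log-likelihood.
            g_a += y - mu
            g_b += (y - mu) * x
            # Entries of the (negated) Hessian.
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b
```

A slope below 1 would suggest predictions that are too extreme, and a negative intercept suggests risks that are overestimated on average, which is the direction of the small deviation reported above.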
Extended Data Fig. 5. Image cropping and annotation.
(a) An uncropped image, as provided by a participating center, and (b) the corresponding cropped image used for training and evaluation. Images were coarsely cropped, mainly by removing the outer borders and burnt-in scanner settings, and occasionally also excluding surrounding structures. Within the cropped images, artifacts such as text were blacked out by setting the pixel values to zero.
Extended Data Fig. 6. Saliency maps.
Attention-based saliency maps from the AI models for a few uncropped images of a (a) serous cystadenoma, (b) tubal cancer, (c) urothelial cancer metastasis, (d) colorectal cancer metastasis and (e–f) serous borderline tumors. The attention maps demonstrate that the models focus on areas of clear diagnostic relevance, such as (b) vascularized and (e) irregular (a–c) solid components, (d) with densely packed locules and (f) papillary projection, while ignoring image artifacts, such as text, (e) calipers or (f) thumbnails.

References

    1. Yazbek, J. et al. Effect of quality of gynaecological ultrasonography on management of patients with suspected ovarian cancer: a randomised controlled trial. Lancet Oncol. 9, 124–131 (2008).
    2. Froyman, W. et al. Risk of complications in patients with conservatively managed ovarian tumours (IOTA5): a 2-year interim analysis of a multicentre, prospective, cohort study. Lancet Oncol. 20, 448–458 (2019).
    3. Vergote, I. et al. Prognostic importance of degree of differentiation and cyst rupture in stage I invasive epithelial ovarian carcinoma. Lancet 357, 176–182 (2001).
    4. Bristow, R. E., Tomacruz, R. S., Armstrong, D. K., Trimble, E. L. & Montz, F. J. Survival effect of maximal cytoreductive surgery for advanced ovarian carcinoma during the platinum era: a meta-analysis. J. Clin. Oncol. 41, 4065–4076 (2023).
    5. Timmerman, D. et al. ESGO/ISUOG/IOTA/ESGE Consensus Statement on pre-operative diagnosis of ovarian tumors. Int. J. Gynecol. Cancer 31, 961–982 (2021).
