Multicenter Study

Nat Med. 2025 Jan;31(1):189-196.

doi: 10.1038/s41591-024-03329-4. Epub 2025 Jan 2.

International multicenter validation of AI-driven ultrasound detection of ovarian cancer

Filip Christiansen^#^{1,2,3,4}, Emir Konuk^#^{3,4}, Adithya Raju Ganeshan^{1,3,4}, Robert Welch^{1,3,4}, Joana Palés Huix^{3,4}, Artur Czekierdowski⁵, Francesco Paolo Giuseppe Leone⁶, Lucia Anna Haak^{7,8}, Robert Fruscio^{9,10}, Adrius Gaurilcikas¹¹, Dorella Franchi¹², Daniela Fischerova¹³, Elisa Mor¹⁴, Luca Savelli¹⁵, Maria Àngela Pascual¹⁶, Marek Jerzy Kudla¹⁷, Stefano Guerriero¹⁸, Francesca Buonomo¹⁹, Karina Liuba²⁰, Nina Montik²¹, Juan Luis Alcázar²², Ekaterini Domali²³, Nelinda Catherine P Pangilinan²⁴, Chiara Carella⁶, Maria Munaretto²⁵, Petra Saskova¹³, Debora Verri²⁶, Chiara Visenzi¹⁴, Pawel Herman^{3,27}, Kevin Smith^{3,4}, Elisabeth Epstein^{28,29}

Affiliations

¹ Department of Clinical Science and Education, Södersjukhuset, Karolinska Institutet, Stockholm, Sweden.
² Department of Obstetrics and Gynecology, Södersjukhuset, Stockholm, Sweden.
³ School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden.
⁴ Science for Life Laboratory, Stockholm, Sweden.
⁵ Department of Gynecological Oncology and Gynecology, Medical University of Lublin, Lublin, Poland.
⁶ Unit of Obstetrics & Gynecology, Department of Biomedical and Clinical Sciences, Luigi Sacco University Hospital, University of Milan, Milan, Italy.
⁷ Institute for the Care of Mother and Child, Prague, Czech Republic.
⁸ Third Faculty of Medicine, Charles University, Prague, Czech Republic.
⁹ Department of Medicine and Surgery, University of Milan-Bicocca, Milan, Italy.
¹⁰ UO Gynecology, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.
¹¹ Department of Obstetrics and Gynaecology, Lithuanian University of Health Sciences, Kaunas, Lithuania.
¹² Unit of Preventive Gynecology, European Institute of Oncology IRCCS, Milan, Italy.
¹³ Gynecologic Oncology Centre, Department of Gynecology, Obstetrics and Neonatology, First Faculty of Medicine, Charles University and General University Hospital in Prague, Prague, Czech Republic.
¹⁴ Fondazione Poliambulanza Istituto Ospedaliero, Brescia, Italy.
¹⁵ Obstetrics and Gynecology Unit, Forlì and Faenza Hospitals, AUSL Romagna, Forlì, Italy.
¹⁶ Department of Obstetrics, Gynecology, and Reproduction, Dexeus University Hospital, Barcelona, Spain.
¹⁷ Department of Perinatology and Oncological Gynecology, Faculty of Medical Sciences, Medical University of Silesia, Katowice, Poland.
¹⁸ Centro Integrato di Procreazione Medicalmente Assistita e Diagnostica Ostetrico-Ginecologica, Azienda Ospedaliero Universitaria-Policlinico Duilio Casula, Monserrato, University of Cagliari, Cagliari, Italy.
¹⁹ Institute for Maternal and Child Health, IRCCS 'Burlo Garofolo', Trieste, Italy.
²⁰ Department of Obstetrics and Gynecology, Skåne University Hospital, Lund, Sweden.
²¹ Section of Obstetrics and Gynecology, Department of Clinical Sciences, Università Politecnica delle Marche, Azienda Ospedaliero-Universitaria delle Marche, Ancona, Italy.
²² Department of Obstetrics and Gynecology, Clínica Universidad de Navarra, Pamplona, Spain.
²³ First Department of Obstetrics and Gynecology, Alexandra Hospital, Medical School, National and Kapodistrian University of Athens, Athens, Greece.
²⁴ Department of Obstetrics and Gynecology, Rizal Medical Center, Manila, Philippines.
²⁵ Gynecologic and Obstetric Unit, Women's and Children's Department, Forlì Hospital, Forlì, Italy.
²⁶ Gynecology and Breast Care Center, Mater Olbia Hospital, Olbia, Italy.
²⁷ Digital Futures, KTH Royal Institute of Technology, Stockholm, Sweden.
²⁸ Department of Clinical Science and Education, Södersjukhuset, Karolinska Institutet, Stockholm, Sweden. elisabeth.epstein@ki.se.
²⁹ Department of Obstetrics and Gynecology, Södersjukhuset, Stockholm, Sweden. elisabeth.epstein@ki.se.

^# Contributed equally.

PMID: 39747679
PMCID: PMC11750711
DOI: 10.1038/s41591-024-03329-4

Filip Christiansen et al. Nat Med. 2025 Jan.

Abstract

Ovarian lesions are common and often incidentally detected. A critical shortage of expert ultrasound examiners has raised concerns of unnecessary interventions and delayed cancer diagnoses. Deep learning has shown promising results in the detection of ovarian cancer in ultrasound images; however, external validation is lacking. In this international multicenter retrospective study, we developed and validated transformer-based neural network models using a comprehensive dataset of 17,119 ultrasound images from 3,652 patients across 20 centers in eight countries. Using a leave-one-center-out cross-validation scheme, for each center in turn, we trained a model using data from the remaining centers. The models demonstrated robust performance across centers, ultrasound systems, histological diagnoses and patient age groups, significantly outperforming both expert and non-expert examiners on all evaluated metrics, namely F1 score, sensitivity, specificity, accuracy, Cohen's kappa, Matthews correlation coefficient, diagnostic odds ratio and Youden's J statistic. Furthermore, in a retrospective triage simulation, artificial intelligence (AI)-driven diagnostic support reduced referrals to experts by 63% while significantly surpassing the diagnostic performance of the current practice. These results show that transformer-based models exhibit strong generalization and above human expert-level diagnostic accuracy, with the potential to alleviate the shortage of expert ultrasound examiners and improve patient outcomes.
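The leave-one-center-out scheme and the evaluated metrics can be illustrated with a minimal sketch. This is not the authors' code: the `center` field, the function names and the use of plain confusion-matrix counts are assumptions made for illustration, and the metric formulas are the standard textbook definitions.

```python
from collections import defaultdict
from math import sqrt


def leave_one_center_out(cases):
    """Yield (held_out_center, train_cases, test_cases) for each center in turn.

    `cases` is a list of dicts with a hypothetical "center" key.
    """
    by_center = defaultdict(list)
    for case in cases:
        by_center[case["center"]].append(case)
    for held_out in sorted(by_center):
        # Train on every center except the held-out one.
        train = [c for name, group in by_center.items()
                 if name != held_out for c in group]
        yield held_out, train, by_center[held_out]


def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts (malignant = positive).

    Assumes every denominator is non-zero; no zero-division handling.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - chance) / (1 - chance)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "cohen_kappa": kappa,
        "mcc": mcc,
        "diagnostic_odds_ratio": (tp * tn) / (fp * fn),
        "youden_j": sensitivity + specificity - 1,
    }
```

In such a setup, each fold would train one model on `train` and score it only on the held-out center, so every reported prediction comes from a model that never saw data from that center.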

Conflict of interest statement

Competing interests: E.E., K.S., F.C., E.K. and P.H. have applied for a patent (European patent application 23220765.4) that is pending to a company named Intelligyn. The patent covers methods for a computer-aided diagnostic system to improve generalization and protect against bias. E.E., K.S. and F.C. hold stock in Intelligyn, where E.E. also has an unpaid leadership role. N.C.P.’s institution has received payments for activities not related to this article, including lectures, presentations, expert testimonies, and service on speakers’ bureaus, as well as for travel support. N.C.P. has been an advisory board member of Mindray and GE Healthcare and has held unpaid leadership roles in the POGS Organization of Government Institutions and the Rizal Medical Service Delivery Network, which are Philippine governmental institutions that aim to facilitate smooth referral of patients. The other authors declare no competing interests.

Figures

**Fig. 1. Paired F1 scores between human examiners and AI models.**
a, Paired F1 scores between individual examiners (n = 66) and the AI models on matched case sets, that is, each examiner is compared against the AI models on the set of cases he or she assessed. A dot above the dashed line corresponds to an individual examiner that was outperformed by the AI models on the same set of cases. b,c, Paired F1 scores between (b) expert examiners (n = 33; orange) and AI models (blue), and (c) non-expert examiners (n = 33; green) and AI models (blue), with gray lines indicating matched case sets. The box plots show the median and the 25th and 75th percentiles, and the whiskers span the range of non-outlier values. The density plots show the distributions of the overall F1 scores (made with kernel smoothing).

**Fig. 2. AI model ROC curve and human examiner performance.**
The model performance is given as an ROC curve in blue, with shaded 95% confidence bands constructed from the 2.5th and 97.5th percentiles of sensitivity values, at each level of specificity, from bootstrapped ROC curves. Each dot represents a human examiner, with non-experts in green and experts in orange. The performance of the AI models at the default cutoff point of 0.5, and the mean performance for expert and non-expert examiners, are each marked by a black cross. The mean performances of expert and non-expert examiners are each surrounded by a shaded 95% confidence region, estimated by a bivariate random-effects model. Note that the models were evaluated on all 2,660 reviewed cases, but each individual examiner assessed only a subset of these cases. Hence, although multiple individual expert examiners seem to outperform, or perform on par with, the models, by being positioned above or to the left of the ROC curve of the models, no examiner outperformed the models on the same case set, which can be seen in Fig. 1 and Supplementary Table 2.
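The percentile-based confidence band described in this caption can be sketched as follows. This is a hedged illustration, not the study's implementation: resampling at the case level, the function names, the specificity grid and the linear-interpolation percentile are all assumptions.

```python
import math
import random


def sensitivity_at_specificity(scores, labels, specificity):
    """Sensitivity at the lowest threshold that reaches the target specificity.

    A case is called malignant when its score is strictly above the threshold.
    """
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    # Number of benign cases that must fall at or below the threshold
    # (small epsilon guards against floating-point overshoot in ceil).
    k = math.ceil(specificity * len(neg) - 1e-9)
    threshold = neg[k - 1] if k > 0 else -math.inf
    return sum(s > threshold for s in pos) / len(pos)


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bootstrap_roc_band(scores, labels, spec_grid, n_boot=1000, seed=0):
    """Return {specificity: (2.5th, 97.5th percentile of sensitivity)}."""
    rng = random.Random(seed)
    n = len(scores)
    sens = {spec: [] for spec in spec_grid}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        b_labels = [labels[i] for i in idx]
        if 0 not in b_labels or 1 not in b_labels:
            continue  # a ROC curve needs both classes
        b_scores = [scores[i] for i in idx]
        for spec in spec_grid:
            sens[spec].append(
                sensitivity_at_specificity(b_scores, b_labels, spec))
    return {spec: (percentile(v, 2.5), percentile(v, 97.5))
            for spec, v in sens.items()}
```

Fixing the specificity grid and taking percentiles of sensitivity yields a vertical band around the ROC curve, which matches the "at each level of specificity" wording of the caption.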

**Fig. 3. Subgroup analysis.**
a–c, Comparison of the AI models and expert and non-expert examiners, for different (a) medical centers, (b) ultrasound systems (limited to the eight most common systems), and (c) histological diagnoses. The box plots show the median and the 25th and 75th percentiles, and the whiskers indicate 95% confidence intervals through bootstrapping.

**Fig. 4. Current practice and proposed AI-assisted strategy for triage workflow.**
a, In the current practice, a non-expert examiner makes an initial assessment, and patients with an uncertain diagnosis or presumed malignancy are referred to an expert. Additionally, with gynecologists in training (residents), most newly detected lesions are referred to an expert examiner, independently of the finding. b, In our proposed AI-assisted triage strategy, the AI model and a non-expert examiner each make an initial assessment, and then an expert examiner makes the final decision in cases of disagreement. *The proposed AI-assisted strategy can also be used with an expert as the initial examiner.
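The AI-assisted strategy in panel b can be sketched as a simple decision rule. This is an illustrative simplification under stated assumptions: labels are binary (1 = malignant, 0 = benign), the function name is hypothetical, and uncertain assessments and the comparison against current practice are left out.

```python
def simulate_ai_triage(ai_pred, nonexpert_pred, expert_pred):
    """Apply the disagreement-based triage rule to parallel lists of labels.

    Returns the final decisions and the fraction of cases referred to an expert.
    """
    if not (len(ai_pred) == len(nonexpert_pred) == len(expert_pred)):
        raise ValueError("all prediction lists must have the same length")
    final, referred = [], 0
    for ai, nonexpert, expert in zip(ai_pred, nonexpert_pred, expert_pred):
        if ai == nonexpert:
            # Agreement: accept the shared assessment without referral.
            final.append(ai)
        else:
            # Disagreement: the expert examiner makes the final decision.
            final.append(expert)
            referred += 1
    referral_rate = referred / len(final) if final else 0.0
    return final, referral_rate
```

The referral rate from such a simulation could then be compared with the referral rate under current practice to estimate the reduction in expert workload.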

**Extended Data Fig. 1. Study flow diagram.**
*These cases were excluded from the main analysis as they had not been included in compliance with our criterion on the temporal distribution of examination dates. †The Olbia center was excluded from the human review due to its limited sample size (n = 57) and its small number of malignant cases (n = 8). ‡These cases were excluded in order to have a test set of comparable size (n = 300) to those of the other centers and to utilize our reviewer resources efficiently.

**Extended Data Fig. 2. Performance of AI models and human examiners by level of confidence in assessment.**
F1 scores for (a) expert examiners and AI models and (b) non-expert examiners and AI models, partitioned by the examiner’s confidence in their assessment. For each level of confidence (certain, probable, uncertain), all assessments with the corresponding level of confidence were pooled. The box plots show the median and the 25th and 75th percentiles, and the whiskers indicate 95% confidence intervals through bootstrapping.

**Extended Data Fig. 3. Subgroup analysis.**
Comparison of the AI models and expert and non-expert examiners, for different (a) age groups and (b) years of examination. The box plots show the median and the 25th and 75th percentiles, and the whiskers indicate 95% confidence intervals through bootstrapping. Information on patient age was missing for 125 patients.

**Extended Data Fig. 4. Calibration curve of AI models.**
A calibration curve of the AI models is shown in solid black with 95% confidence bands in gray, depicting the relationship between the predicted risk of malignancy and the actual observed proportion of malignancy. The dotted line represents the ideal scenario of perfect calibration, where the predicted risks precisely match the observed outcomes. The histograms at the bottom depict the distributions of predicted risks of malignancy, for malignant and benign tumors, above and below the horizontal line, respectively. The calibration curve and confidence bands are based on local regression (loess; ref. 24) and on 12,673 image-level predictions. While not depicted in this figure, a linear logistic calibration curve was also fitted, yielding an intercept of −0.19 (95% CI, −0.24 to −0.14) and a slope of 1.00 (95% CI, 0.96 to 1.03), also indicating well-calibrated risk predictions.
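A linear logistic calibration fit of the kind mentioned in this caption regresses the observed outcome on the logit of the predicted risk; an intercept near 0 and a slope near 1 indicate good calibration. The sketch below fits both parameters jointly by Newton's method. This is an assumption for illustration: the caption does not state the exact fitting procedure, and some analyses estimate the intercept separately with the slope fixed at 1.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def fit_logistic_calibration(pred_probs, outcomes, n_iter=25):
    """Fit logit(P(y = 1)) = a + b * logit(p) by Newton's method.

    Returns (intercept a, slope b). Requires predicted risks in (0, 1)
    and at least two distinct predicted values.
    """
    x = [logit(p) for p in pred_probs]
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            # Gradient of the log-likelihood.
            g0 += yi - mu
            g1 += xi * (yi - mu)
            # Negative Hessian (Fisher information) entries.
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < 1e-12:
            break
    return a, b
```

A negative intercept with a slope near 1, as reported in the caption, would correspond to predicted risks that are on average slightly too high while still ranking and spreading cases appropriately.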

**Extended Data Fig. 5. Image cropping and annotation.**
(a) An uncropped image, as provided by a participating center, and (b) the corresponding cropped image used for training and evaluation. Images were coarsely cropped, mainly by removing the outer borders and burnt-in scanner settings, and occasionally also excluding surrounding structures. Within the cropped images, artifacts such as text were blacked out by setting the pixel values to zero.

**Extended Data Fig. 6. Saliency maps.**
Attention-based saliency maps from the AI models for a few uncropped images of a (a) serous cystadenoma, (b) tubal cancer, (c) urothelial cancer metastasis, (d) colorectal cancer metastasis and (e,f) serous borderline tumors. The attention maps demonstrate that the models focus on areas of clear diagnostic relevance, such as (b) vascularized and (e) irregular (a–c) solid components, (d) with densely packed locules and (f) papillary projection, while ignoring image artifacts, such as text, (e) calipers or (f) thumbnails.
