Assessing generalizability of an AI-based visual test for cervical cancer screening

Syed Rakin Ahmed et al. PLOS Digit Health. 2024 Oct 2;3(10):e0000364. doi: 10.1371/journal.pdig.0000364. eCollection 2024 Oct.

Abstract

A number of challenges hinder artificial intelligence (AI) models from effective clinical translation. Foremost among these challenges is the lack of generalizability, which is defined as the ability of a model to perform well on datasets that have different characteristics from the training data. We recently investigated the development of an AI pipeline on digital images of the cervix, utilizing a multi-heterogeneous dataset of 9,462 women (17,013 images) and a multi-stage model selection and optimization approach, to generate a diagnostic classifier able to classify images of the cervix into "normal", "indeterminate" and "precancer/cancer" (denoted as "precancer+") categories. In this work, we investigate the performance of this multiclass classifier on external data not utilized in training and internal validation, to assess the generalizability of the classifier when moving to new settings. We assessed both the classification performance and repeatability of our classifier model across the two axes of heterogeneity present in our dataset: image capture device and geography, utilizing both out-of-the-box inference and retraining with external data. Our results demonstrate that device-level heterogeneity affects our model performance more than geography-level heterogeneity. Classification performance of our model is strong on images from a new geography without retraining, while incremental retraining with inclusion of images from a new device progressively improves classification performance on that device up to a point of saturation. Repeatability of our model is relatively unaffected by data heterogeneity and remains strong throughout. Our work supports the need for optimized retraining approaches that address data heterogeneity (e.g., when moving to a new device) to facilitate effective use of AI models in new settings.
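The two evaluation regimes described above, out-of-the-box inference on external data versus retraining with a portion of it included, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the logistic-regression model, random feature matrices and split sizes are hypothetical stand-ins for their deep-learning system.

```python
# Minimal sketch of the two evaluation regimes: (1) out-of-the-box inference
# on external data, and (2) retraining with part of the external data included.
# All data and the model here are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

CLASSES = ["normal", "indeterminate", "precancer+"]

def per_class_auc(model, X, y):
    """One-vs-rest AUC for each of the three diagnostic classes."""
    probs = model.predict_proba(X)
    return {c: round(roc_auc_score((y == i).astype(int), probs[:, i]), 3)
            for i, c in enumerate(CLASSES)}

# Hypothetical feature matrices: "SEED" = training data, "EXT" = external data.
X_seed, y_seed = np.random.rand(500, 64), np.random.randint(0, 3, 500)
X_ext, y_ext = np.random.rand(200, 64), np.random.randint(0, 3, 200)
X_hold, y_hold = X_ext[100:], y_ext[100:]   # held-aside EXT test set

# Regime 1: train on SEED only, test out-of-the-box on held-aside EXT.
base = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)
print("out-of-the-box:", per_class_auc(base, X_hold, y_hold))

# Regime 2: retrain with the remaining EXT samples added to the training set.
X_aug = np.vstack([X_seed, X_ext[:100]])
y_aug = np.concatenate([y_seed, y_ext[:100]])
retrained = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
print("retrained:", per_class_auc(retrained, X_hold, y_hold))
```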


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Uniform Manifold Approximation and Projection (UMAP) plots highlighting the relative distributions of the datasets, devices and geographies investigated in this work.
Each subplot shows a different coloring of the same UMAP, with the color coding (given in the legend at the top of each subplot) at the (a) dataset level, (b) device level and (c) geography level. The datasets and devices occupy distinct clusters in (a) and (b), while in (c) the geographies, all captured with the same device, cluster together. The x- and y-axes are in arbitrary units, representing the two UMAP components onto which the higher-dimensional data was projected.
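For readers who want to reproduce this kind of projection, below is a minimal sketch using the umap-learn package; the 128-dimensional embeddings and the device labels are hypothetical stand-ins for features extracted from the cervix images.

```python
# Sketch of a Fig 1-style UMAP projection of image embeddings, colored by
# group (here, a hypothetical device label). Embeddings are random stand-ins.
import numpy as np
import umap
import matplotlib.pyplot as plt

embeddings = np.random.rand(1000, 128)                    # hypothetical features
labels = np.random.choice(["device_A", "device_B"], 1000)  # hypothetical groups

# Project the high-dimensional features onto two UMAP components.
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)

for group in np.unique(labels):
    m = labels == group
    plt.scatter(coords[m, 0], coords[m, 1], s=4, label=group)
plt.legend()
plt.xlabel("UMAP 1 (arbitrary units)")
plt.ylabel("UMAP 2 (arbitrary units)")
plt.show()
```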
Fig 2
(a) Bounding boxes (in white) generated by running the cervix detector on 50 randomly selected images from the external (“EXT”) dataset. The cervix detector uses a YOLOv5 architecture trained on the “SEED” dataset images. (b) The bound and cropped cervix images, which are passed to the diagnostic classifier (AVE).
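The detect-then-crop step can be sketched with the public ultralytics/yolov5 torch.hub interface; the weights file and image paths are hypothetical, and the authors' exact inference code may differ.

```python
# Sketch of the Fig 2 detect-then-crop step: a YOLOv5 detector localizes the
# cervix, and the bounded region is cropped before classification.
import torch
from PIL import Image

# Custom YOLOv5 cervix detector (weights assumed trained on "SEED" images).
detector = torch.hub.load("ultralytics/yolov5", "custom",
                          path="cervix_detector.pt")

img = Image.open("example_cervix.jpg")
results = detector(img)

# Each row of results.xyxy[0] is (x1, y1, x2, y2, confidence, class),
# sorted by confidence in descending order.
boxes = results.xyxy[0]
if len(boxes):
    x1, y1, x2, y2 = map(int, boxes[0, :4].tolist())  # top detection
    crop = img.crop((x1, y1, x2, y2))   # region passed to the classifier (AVE)
    crop.save("example_cervix_cropped.jpg")
```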
Fig 3
Fig 3. Results from the first set of generalizability analyses, highlighting that device-level heterogeneity affects our model performance more than geography-level heterogeneity.
The classification performance and repeatability plots depicted here include (a) receiver operating characteristic (ROC) curves; (b) confusion matrices; and (c) Bland-Altman plots, for models that were (i) trained on “SEED” and tested on a held-aside set from “SEED”; (ii) trained on “SEED” and tested on “EXT”; and (iii) trained on a dataset comprising “SEED” plus all “EXT” images except those from Bolivia, and tested on the Bolivia images from “EXT”. “Gray Zone” = “Indeterminate”.
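A minimal sketch of these three evaluation views is given below, assuming the model's softmax outputs and ground-truth labels are available; the random inputs and the pairing of "repeat" scores are placeholders, with scikit-learn providing the ROC curve and confusion matrix and the Bland-Altman plot hand-rolled.

```python
# Sketch of the Fig 3 evaluation views: (a) ROC curve, (b) confusion matrix,
# (c) Bland-Altman repeatability plot. All inputs are random stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, 300)              # 0=normal, 1=gray zone, 2=precancer+
probs = rng.dirichlet(np.ones(3), size=300)   # stand-in softmax outputs

# (a) ROC curve: precancer+ (class 2) vs. rest.
fpr, tpr, _ = roc_curve((y_true == 2).astype(int), probs[:, 2])
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.legend()

# (b) Confusion matrix of argmax predictions.
print(confusion_matrix(y_true, probs.argmax(axis=1)))

# (c) Bland-Altman plot on paired scores from repeat images of the same woman.
score_1, score_2 = probs[:150, 2], probs[150:, 2]   # stand-in repeat pairs
mean, diff = (score_1 + score_2) / 2, score_1 - score_2
plt.figure()
plt.scatter(mean, diff, s=6)
plt.axhline(diff.mean(), linestyle="--")            # mean difference
for k in (-1.96, 1.96):                             # 95% limits of agreement
    plt.axhline(diff.mean() + k * diff.std(), linestyle=":")
plt.show()
```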
Fig 4
Fig 4. Results from the second set of generalizability analyses, highlighting that retraining can improve performance on a new device not previously present in “SEED”.
(a) Model-level comparison across models representing incremental, woman-level additions of “EXT” (J8) images to the “SEED” training set, with the “EXT” images added in (i) a 1n normal (N) : 1n indeterminate (I) : 1n precancer+ (P) ratio and (ii) a 2n N : 2n I : 1n P ratio of ground-truth classes at the woman level, where n = the number of precancer+ women added (y-axes). (b) Plots of the area under the receiver operating characteristic curve (AUC) versus the number of women added to the training set per ground-truth class, in the same ratios as in (a). For example, in (ii) the x-axis represents the number of precancer+ (P) women added (n) in the ratio 2n N : 2n I : 1n P. The top row plots the Normal (class 0) vs. rest AUC and the bottom row the Precancer+ (class 2) vs. rest AUC on the y-axis. In panel (a), “normal” = green, “indeterminate” / “gray zone” = gray and “precancer+” = red.
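The woman-level sampling in a fixed 2n N : 2n I : 1n P ratio can be sketched as below; the EXT pool, labels and loop sizes are hypothetical, and the actual retraining and AUC-evaluation calls are omitted because they depend on the authors' pipeline.

```python
# Sketch of the Fig 4 incremental-retraining experiment: EXT women are added
# to the SEED training set in a fixed ratio of ground-truth classes, sampled
# at the woman level, as n grows. The EXT pool here is a random stand-in.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical EXT pool: each woman has a ground-truth class label
# (0 = normal, 1 = indeterminate, 2 = precancer+).
ext_women = [{"id": i, "label": int(rng.integers(0, 3))} for i in range(600)]

def sample_women(pool, n, ratio=(2, 2, 1)):
    """Pick ratio[c] * n women of each class c, sampling at the woman level."""
    picked = []
    for cls, mult in enumerate(ratio):
        candidates = [w for w in pool if w["label"] == cls]
        idx = rng.choice(len(candidates), size=mult * n, replace=False)
        picked += [candidates[i] for i in idx]
    return picked

for n in (5, 10, 20, 40):                 # number of precancer+ women added
    added = sample_women(ext_women, n)    # 2n N : 2n I : 1n P, as in Fig 4(ii)
    # Retraining on SEED + added, then evaluating AUC on the held-aside EXT
    # test set, would go here; both steps are omitted in this sketch.
    print(f"n={n}: added {len(added)} EXT women to the SEED training set")
```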
