Review
. 2022 Dec;35(12):1759-1769.
doi: 10.1038/s41379-022-01147-y. Epub 2022 Sep 10.

Recommendations on compiling test datasets for evaluating artificial intelligence solutions in pathology


André Homeyer et al. Mod Pathol. 2022 Dec.

Erratum in

Abstract

Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations on compiling test datasets. We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help pathologists and regulatory agencies verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.
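The abstract's question "Which and how many images are needed?" can be illustrated with a standard binomial sample-size calculation. The sketch below is not from the paper; it is a minimal, hedged example assuming one estimates a proportion-type metric (e.g., sensitivity) with a normal-approximation confidence interval, and assuming a low-prevalence subset is filled by random sampling rather than deliberate enrichment. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def min_cases(expected_metric: float, margin: float, confidence: float = 0.95) -> int:
    """Minimum number of cases so that a binomial performance estimate
    (e.g., sensitivity) has a confidence-interval half-width of at most
    `margin`, using the normal approximation n = z^2 * p(1-p) / d^2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z quantile
    return ceil(z**2 * expected_metric * (1 - expected_metric) / margin**2)

def total_for_subset(cases_needed: int, prevalence: float) -> int:
    """Total dataset size so that a subset with the given prevalence is
    expected to contain `cases_needed` cases under random sampling;
    enriched or stratified sampling would reduce this number."""
    return ceil(cases_needed / prevalence)

# Example: estimating ~90% sensitivity to within ±5 percentage points
n = min_cases(0.90, 0.05)          # -> 139 positive cases
total = total_for_subset(n, 0.05)  # -> 2780 images if positives have 5% prevalence
```

This also illustrates why low-prevalence subsets dominate dataset size: the total grows inversely with prevalence, which is one reason the paper's question about such subsets matters in practice.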


Conflict of interest statement

F.Z. is a shareholder of asgen GmbH. P.S. is a member of the supervisory board of asgen GmbH. All other authors declare that they have no conflict of interest.

Figures

Fig. 1
Fig. 1. Schematic overview of sampling regimes for performance assessment in the entire target population of images or in specific subsets.
Overall performance assessment requires a representative sample along all dimensions of variability, whereas relevant subsets are typically limited along one dimension (e.g., age range or scanner type).
Fig. 2
Fig. 2. Examples of variability between biopsy images, illustrating a combination of inter- and intra-individual biological variability (tissue structure) and inter-individual technical variability (staining).
The images show H&E-stained breast tissue of female patients with invasive carcinomas of no special type, scanned at 40× objective magnification.
Fig. 3
Fig. 3. Examples of different severity levels of imaging artifacts.
The leftmost images are clearly within the intended use of algorithms for analyzing breast cancer histologies, whereas the rightmost images are clearly unsuitable; however, it is not obvious where to draw the line between these two regimes. The top row shows simulated foreign objects and the bottom row shows simulated focal blur. The original tissue images show H&E-stained breast tissue of female patients with invasive carcinomas of no special type, scanned at 40× objective magnification (same as in Fig. 2).
Fig. 4
Fig. 4. Overview of recommendations on compiling test datasets.
Prior to data acquisition, the acquisition process must be thoroughly planned. In particular, the intended use of the AI solution must be precisely understood in order to derive the requirements for test datasets.
