J Pathol Inform. 2021 Dec 24:12:54.
doi: 10.4103/jpi.jpi_6_21. eCollection 2021.

Stress Testing Pathology Models with Generated Artifacts

Nicholas Chandler Wang et al. J Pathol Inform. 2021.

Abstract

Background: Machine learning models provide significant opportunities for improvement in health care, but their "black-box" nature poses many risks.

Methods: We built a custom Python module as part of a framework for generating artifacts that are tunable and describable, so that the framework can accommodate future testing needs. We analyzed a previously published digital pathology classification model and an internally developed kidney tissue segmentation model, applying a variety of generated artifacts and testing their effects. The artifacts simulated were bubbles, tissue folds, uneven illumination, marker lines, uneven sectioning, altered staining, and tissue tears.
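As an illustration of what "tunable" artifact generation could look like, the following is a minimal sketch and not the authors' module. It assumes tiles are NumPy float arrays of shape (H, W, 3) with values in [0, 1]; the function names `add_uneven_illumination` and `add_marker_line` and all of their parameters are hypothetical.

```python
import numpy as np


def add_uneven_illumination(tile, strength=0.3):
    """Darken a tile along a left-to-right gradient (hypothetical helper).

    tile: float array of shape (H, W, 3) with values in [0, 1].
    strength: 0 leaves the tile unchanged; 1 fades the right edge to black.
    """
    width = tile.shape[1]
    # Gain falls linearly from 1.0 at the left edge to (1 - strength) at the right.
    gain = np.linspace(1.0, 1.0 - strength, width).reshape(1, width, 1)
    return np.clip(tile * gain, 0.0, 1.0)


def add_marker_line(tile, row, thickness=4, color=(0.0, 0.0, 0.5), opacity=0.8):
    """Blend a horizontal pen stroke into a copy of the tile (hypothetical helper)."""
    out = tile.copy()
    top = max(row - thickness // 2, 0)
    bottom = min(top + thickness, tile.shape[0])
    ink = np.asarray(color, dtype=tile.dtype)
    # Alpha-blend the ink colour over the affected rows only.
    out[top:bottom] = (1.0 - opacity) * out[top:bottom] + opacity * ink
    return out


# Synthetic stand-in for a tissue tile; a real tile would be loaded from a slide.
rng = np.random.default_rng(0)
tile = rng.uniform(0.6, 1.0, size=(64, 64, 3))
shaded = add_uneven_illumination(tile, strength=0.5)
marked = add_marker_line(tile, row=32, thickness=6)
```

Exposing each artifact's severity as an explicit parameter (here `strength`, `thickness`, and `opacity`) is one way to make an experiment describable: the parameter values can be recorded alongside the model output for each manipulated tile.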

Results: We found some performance degradation on the tiles with artifacts, particularly with altered staining but also with marker lines, tissue folds, and uneven sectioning. We also found that the response of deep learning models to artifacts could be nonlinear.
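One way to probe whether a model's response to an artifact is nonlinear is to sweep the artifact's severity and record a performance metric at each level. The sketch below is a toy setup under stated assumptions, not the study's pipeline: `toy_predict` is a hypothetical stand-in for a model, `occlude` is a hypothetical artifact that blacks out a random square, the tiles are synthetic grayscale arrays, and the area under the receiver operating characteristic curve is computed directly from pairwise score comparisons.

```python
import numpy as np


def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


def stress_sweep(predict, tiles, labels, artifact, strengths):
    """Return {strength: AUC} after applying the artifact to every tile."""
    results = {}
    for strength in strengths:
        scores = [predict(artifact(tile, strength)) for tile in tiles]
        results[strength] = auc_score(labels, scores)
    return results


rng = np.random.default_rng(0)


def occlude(tile, strength):
    """Hypothetical artifact: black out a random square whose side is strength * height."""
    side = int(round(strength * tile.shape[0]))
    out = tile.copy()
    if side > 0:
        y = rng.integers(0, tile.shape[0] - side + 1)
        x = rng.integers(0, tile.shape[1] - side + 1)
        out[y:y + side, x:x + side] = 0.0
    return out


def toy_predict(tile):
    """Hypothetical model: scores a tile by the mean brightness of its central patch."""
    return float(tile[12:20, 12:20].mean())


# Synthetic 32x32 tiles: positives carry a brighter central patch, negatives do not.
labels = [1] * 20 + [0] * 20
tiles = []
for label in labels:
    tile = rng.uniform(0.4, 0.6, size=(32, 32))
    if label:
        tile[12:20, 12:20] += 0.3
    tiles.append(tile)

strengths = [0.0, 0.25, 0.5, 0.75, 1.0]
results = stress_sweep(toy_predict, tiles, labels, occlude, strengths)
for strength, auc in results.items():
    print(f"strength={strength:.2f}  AUC={auc:.3f}")
```

Plotting the metric against severity, rather than comparing a single artifact level with the control, is what would make a nonlinear response visible in a sweep of this kind.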

Conclusions: Generated artifacts can provide a useful tool for testing and building trust in machine learning models by revealing where these models might fail.

Keywords: Artifact; digital pathology; failure mode; machine learning; neural network; robustness.

Conflict of interest statement

There are no conflicts of interest.

Figures

Figure 1
An example of the seven types of simulated artifact: bubbles, tissue folds, uneven illumination, pen marks, sectioning artifacts, altered staining, and tissue tears. These artifacts are applied to lung tissue tiles, as in this example.
Figure 2
Schematic of the processing steps undertaken by our Snakemake workflow. The “manipulate” tiles step, shown in red, was applied only in the experimental studies.
Figure 3
Tile level average predicted probabilities after artifacts were added. The three tissue types were split into separate categories, and the average probability is shown relative to the control set of images. In some cases, the probabilities had more spread, indicating some tiles had higher uncertainty.
Figure 4
Tile level area under the receiver operating characteristic curve, with confidence intervals, after artifacts were added. The three tissue types were split into separate categories, and the subclass area under the receiver operating characteristic curve is shown relative to the control set of images. On a tile level, predictive performance was somewhat decreased by the artifacts introduced. Note that the y-axis (area under the receiver operating characteristic curve) ranges from 0.50 to 1.
Figure 5
Area under the receiver operating characteristic curve and accuracy of the kidney segmentation model after addition of artifacts, broken down by tissue component type (subclass). Area under the receiver operating characteristic curve focuses on the change in probability score produced by the model and its effect, whereas accuracy shows the change in predicted class. The baseline performance in the control experiment is described at the top of each chart. While tissue fold, marker, and stain alterations show the biggest changes, there is considerable variability between subclasses. The prevalence of each subclass in the ground truth varies by several orders of magnitude across the six subclasses.
Figure 6
A selected example of the effects of artifacts on a single kidney tissue sample tile. This tile is notable for the presence of four tissue component types (subclasses) in the ground truth. Notably, the baseline model had difficulty properly identifying interstitium in this tile. Tissue folds and marker line artifacts both resulted in the “miscellaneous (Misc)” label originally designated for stain deposit artifacts.
Figure 7
A second selected example of the effects of artifacts. This example had only two tissue component types (subclasses) represented in the ground truth label: tubules and interstitium. Of the ten tiles selected for further inspection, eight had only these two most common labels. While tissue fold and marker artifacts had the most effect on the labels, most of the artifacts applied had limited impact on this tile.
