J Pathol Inform. 2021 Dec 24:12:54.

doi: 10.4103/jpi.jpi_6_21. eCollection 2021.

Stress Testing Pathology Models with Generated Artifacts

Nicholas Chandler Wang¹, Jeremy Kaplan¹, Joonsang Lee¹, Jeffrey Hodgin², Aaron Udager², Arvind Rao¹

Affiliations

¹ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
² Department of Pathology, University of Michigan Medical School, Ann Arbor, MI, USA.

PMID: 35070483
PMCID: PMC8721870
DOI: 10.4103/jpi.jpi_6_21

Nicholas Chandler Wang et al. J Pathol Inform. 2021.

Abstract

Background: Machine learning models provide significant opportunities for improvement in health care, but their "black-box" nature poses many risks.

Methods: We built a custom Python module as part of a framework for generating artifacts that are tunable and describable, so that future testing needs can be accommodated. We analyzed a previously published digital pathology classification model and an internally developed kidney tissue segmentation model, applying a variety of generated artifacts and testing their effects on model output. The artifacts simulated were bubbles, tissue folds, uneven illumination, marker lines, uneven sectioning, altered staining, and tissue tears.
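As a rough illustration of what a tunable artifact generator might look like, the sketch below applies two of the listed artifacts (uneven illumination and a marker line) to an RGB tile using NumPy. The function names, parameters, and default values are hypothetical and are not taken from the authors' module; each artifact is controlled by a small set of numeric parameters so that its severity can be described and reproduced.

```python
import numpy as np


def add_uneven_illumination(tile: np.ndarray, strength: float = 0.3) -> np.ndarray:
    """Darken an RGB tile along a left-to-right gradient (hypothetical helper).

    strength=0 leaves the tile unchanged; strength=1 fades the right edge to black.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    width = tile.shape[1]
    # Multiplicative gain falls linearly from 1.0 (left edge) to 1 - strength (right edge).
    gain = np.linspace(1.0, 1.0 - strength, width, dtype=np.float32)
    out = tile.astype(np.float32) * gain[None, :, None]
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def add_marker_line(tile: np.ndarray, row: int, thickness: int = 5,
                    color=(20, 20, 120), opacity: float = 0.8) -> np.ndarray:
    """Alpha-blend a horizontal pen-like stripe over the tile (hypothetical helper)."""
    out = tile.astype(np.float32)
    top = max(row - thickness // 2, 0)
    bottom = min(top + thickness, tile.shape[0])
    ink = np.asarray(color, dtype=np.float32)
    # Blend the ink color into the selected rows; opacity=1 fully covers the tissue.
    out[top:bottom] = (1.0 - opacity) * out[top:bottom] + opacity * ink
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


# Usage on a synthetic uniform pink tile (a stand-in for a stained tissue tile).
tile = np.full((64, 64, 3), (230, 180, 200), dtype=np.uint8)
lit = add_uneven_illumination(tile, strength=0.5)
marked = add_marker_line(tile, row=32, thickness=4, opacity=1.0)
```

Exposing severity as explicit parameters (here `strength`, `thickness`, and `opacity`) is one way to make an artifact both tunable and describable: the same settings can be logged alongside model output and swept over a range.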

Results: We found some performance degradation on tiles with artifacts, particularly with altered stains but also with marker lines, tissue folds, and uneven sectioning. We also found that the response of deep learning models to artifacts could be nonlinear.
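Altered staining can be approximated in several ways. The sketch below is one simple option, assuming a Beer-Lambert model in which stain concentration is proportional to optical density; it is not the authors' implementation, and the function name and `factor` parameter are hypothetical. Scaling optical density by a factor above 1 darkens the tile (overstaining), and a factor below 1 lightens it.

```python
import numpy as np


def alter_stain(tile: np.ndarray, factor: float = 1.5) -> np.ndarray:
    """Scale stain intensity in optical-density space (hypothetical helper).

    factor > 1 mimics overstaining (darker); factor < 1 mimics understaining (paler).
    """
    if factor < 0:
        raise ValueError("factor must be non-negative")
    # Map 0..255 intensity to transmittance in (0, 1]; the +1 offset avoids log(0).
    transmittance = (tile.astype(np.float64) + 1.0) / 256.0
    optical_density = -np.log(transmittance)
    # Scaling optical density is equivalent to raising transmittance to a power.
    altered = np.exp(-factor * optical_density)
    return np.clip(np.rint(altered * 256.0 - 1.0), 0, 255).astype(np.uint8)


# Usage on a synthetic uniform tile; sweep the factor to vary severity.
tile = np.full((8, 8, 3), (230, 180, 200), dtype=np.uint8)
overstained = alter_stain(tile, factor=2.0)
understained = alter_stain(tile, factor=0.5)
```

Because the mapping from `factor` to pixel intensity is exponential, equal steps in the factor do not produce equal steps in brightness, which is worth keeping in mind when sweeping severity.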

Conclusions: Generated artifacts can provide a useful tool for testing and building trust in machine learning models by understanding where these models might fail.

Keywords: Artifact; digital pathology; failure mode; machine learning; neural network; robustness.

Conflict of interest statement

There are no conflicts of interest.

Figures

**Figure 1**
An example of the seven types of simulated artifact: bubbles, tissue folds, uneven illumination, pen marks, sectioning artifacts, altered staining, and tissue tears. The artifacts are applied to lung tissue tiles, as in this example

**Figure 2**
Schematic of processing steps undertaken by our Snakemake workflow. The “manipulate” tiles step in red was only applied to experimental studies

**Figure 3**
Tile-level average predicted probabilities after artifacts were added. The three tissue types were split into separate categories, and the average probability is shown relative to the control set of images. In some cases, the probabilities were more spread out, indicating that some tiles had higher uncertainty

**Figure 4**
Tile-level area under the receiver operating characteristic, with confidence intervals, after artifacts were added. The three tissue types were split into separate categories, and the subclass area under the receiver operating characteristic is shown relative to the control set of images. On a tile level, predictive performance was somewhat decreased by the artifacts introduced. Note that the area under the receiver operating characteristic y-axis ranges from 0.50 to 1
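The caption does not state how the confidence intervals were computed. As a minimal sketch, assuming a percentile bootstrap over tiles, a tile-level area under the receiver operating characteristic (AUROC) and its interval could be obtained as follows, using only the Python standard library. The function names and the toy labels and scores are invented for illustration and are not data from the study.

```python
import random


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative tile")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling tiles with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # this resample lacked one class; draw again
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[max(int((alpha / 2) * n_boot), 0)]
    upper = stats[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lower, upper


# Toy one-vs-rest example for a single subclass (illustrative values only).
labels = [1, 1, 1, 0, 0, 0]
control_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
artifact_scores = [0.9, 0.4, 0.7, 0.3, 0.6, 0.1]

control_auc = auroc(labels, control_scores)
artifact_auc = auroc(labels, artifact_scores)
artifact_ci = bootstrap_ci(labels, artifact_scores, n_boot=200)
```

Comparing `artifact_auc` and its interval against `control_auc` for each subclass gives the kind of relative view the figure describes.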

**Figure 5**
Area under the receiver operating characteristic and accuracy of kidney segmentation model after addition of artifacts, broken down by tissue component type (subclass). Area under the receiver operating characteristic focuses on the change in probability score produced by the model and its effect, whereas accuracy shows the change in predicted class. The baseline performance in the control experiment is described at the top of each chart. While tissue fold, marker, and stain alterations show the biggest changes, there is a lot of variability between subclasses. The prevalence of each subclass in the ground truth varies by several orders of magnitude across the six subclasses

**Figure 6**
A selected example of the effects of artifacts on a single kidney tissue sample tile. This tile is notable for the presence of four tissue component types (subclasses) in the ground truth. Notably, the baseline model had difficulty properly identifying interstitium in this tile. Tissue folds and marker line artifacts both resulted in the “miscellaneous (Misc)” label originally designated for stain deposit artifacts

**Figure 7**
A second selected example of the effects of artifacts. This example had only two tissue component types (subclasses) represented in the ground truth label: tubules and interstitium. Of the ten tiles selected for further inspection, eight had only these two most common labels. While tissue fold and marker artifacts had the largest effect on the labels, most of the artifacts applied had limited impact on this tile.

Grants and funding

R37 CA214955/CA/NCI NIH HHS/United States
