Sci Rep. 2024 Dec 28;14(1):31150.
doi: 10.1038/s41598-024-82501-9.

Explainable AI improves task performance in human-AI collaboration



Julian Senoner et al. Sci Rep.

Abstract

Artificial intelligence (AI) provides considerable opportunities to assist human work. However, one crucial challenge of human-AI collaboration is that many AI algorithms operate as black boxes: how the AI arrives at its predictions remains opaque. This makes it difficult for humans to validate an AI prediction against their own domain knowledge. For this reason, we hypothesize that augmenting humans with explainable AI improves task performance in human-AI collaboration. To test this hypothesis, we implement explainable AI in the form of visual heatmaps in inspection tasks conducted by domain experts. Visual heatmaps have the advantage that they are easy to understand and help to localize relevant parts of an image. We then compare participants who were supported by either (a) black-box AI or (b) explainable AI, where the latter helps them follow AI predictions when these are accurate and overrule the AI when its predictions are wrong. We conducted two preregistered experiments with representative, real-world visual inspection tasks from manufacturing and medicine. The first experiment was conducted with factory workers from an electronics factory, who performed [Formula: see text] assessments of whether electronic products have defects. The second experiment was conducted with radiologists, who performed [Formula: see text] assessments of chest X-ray images to identify lung lesions. The results of our experiments with domain experts performing real-world tasks show that task performance improves when participants are supported by explainable AI with heatmaps instead of black-box AI. We find that explainable AI as a decision aid improved task performance by 7.7 percentage points (95% confidence interval [CI]: 3.3% to 12.0%, [Formula: see text]) in the manufacturing experiment and by 4.7 percentage points (95% CI: 1.1% to 8.3%, [Formula: see text]) in the medical experiment compared to black-box AI.
In both domains, the confidence intervals exclude zero, so explainable AI yielded statistically significant gains in task performance over black-box AI.
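The heatmaps in the study come from the authors' own AI models, which are not reproduced here. As a rough, model-agnostic illustration of how a visual heatmap of this kind can be produced, the sketch below uses occlusion: it masks each image patch in turn and records how much the model's score drops. The `toy_score` function is a stand-in assumption, not the study's detector.

```python
import numpy as np

def occlusion_heatmap(image, score_fn, patch=4):
    """Model-agnostic saliency: mask each patch and record the score drop.

    A large positive entry means the masked region mattered to the prediction,
    which is the kind of localization cue a heatmap gives an inspector.
    """
    base = score_fn(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = image.copy()
            masked[i:i + patch, j:j + patch] = 0.0
            heat[i // patch, j // patch] = base - score_fn(masked)
    return heat

# Hypothetical "defect detector": scores by mean intensity of the
# top-left quadrant, so only that region should light up in the heatmap.
def toy_score(img):
    return float(img[:8, :8].mean())

rng = np.random.default_rng(0)
img = rng.random((16, 16))
heat = occlusion_heatmap(img, toy_score)
```

With this toy detector, patches inside the top-left quadrant produce positive heat values and all other patches produce zero, mirroring how a real heatmap highlights the image regions that drove the AI's prediction.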

Keywords: Decision-making; Explainable AI; Human-centered AI; Human–AI collaboration; Task performance.


Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Figure 1
Overview of the experiments for assessing the effect of explainable AI on task performance. (A) Experimental design of the manufacturing experiment, where factory workers were asked to "approve" images of faultless products and to "reject" images of defective products through a computer interface. (B) Experimental design of the medical experiment, where radiologists were asked to decide whether lung lesions are visible in the chest X-ray image. In both experiments, participants were randomly assigned to one of two treatments: (a) black-box AI or (b) explainable AI.
Figure 2
Results of the manufacturing experiment. The boxplots compare task performance between the two treatments: black-box AI and explainable AI. Task performance is measured by the balanced accuracy (A) and the defect detection rate (B), based on the quality assessments of workers and the ground-truth labels of the product images. A balanced accuracy of 50% provides a naïve baseline corresponding to a random guess (black dotted line). The standalone AI algorithm attains a balanced accuracy of 95.6% and a defect detection rate of 92.9% (orange dashed lines). Statistical significance is based on a one-sided Welch's t-test (***[Formula: see text], **[Formula: see text], *[Formula: see text]). In the boxplots, the center line denotes the median; box limits are the upper and lower quartiles; whiskers extend to 1.5× the interquartile range.
Figure 3
Results of the medical experiment. The boxplots compare task performance between the two treatments: black-box AI and explainable AI. Task performance is measured by the balanced accuracy (A) and the disease detection rate (B), based on the quality assessments of radiologists and the ground-truth labels of the chest X-ray images. A balanced accuracy of 50% provides a naïve baseline corresponding to a random guess (black dotted line). The standalone AI algorithm attains a balanced accuracy of 82.2% and a disease detection rate of 71.4% (orange dashed lines). Statistical significance is based on a one-sided Welch's t-test (***[Formula: see text], **[Formula: see text], *[Formula: see text]). In the boxplots, the center line denotes the median; box limits are the upper and lower quartiles; whiskers extend to 1.5× the interquartile range.
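The figures rely on two measures, balanced accuracy (the mean of sensitivity and specificity, so class imbalance does not inflate the score) and a one-sided Welch's t-test (which does not assume equal variances across treatment groups). A minimal sketch of both, using only the standard library; the example labels and group scores are illustrative, not the study's data:

```python
import math

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (detection rate) and specificity.

    Labels: 1 = defective/diseased, 0 = faultless/healthy.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    sensitivity = tp / pos   # the "detection rate" in panels (B)
    specificity = tn / neg
    return (sensitivity + specificity) / 2

def welch_t(a, b):
    """Welch's t statistic for testing mean(a) > mean(b) (unequal variances)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Illustrative use: per-participant balanced accuracies by treatment.
ba = balanced_accuracy([1, 1, 0, 0], [1, 0, 0, 0])        # -> 0.75
t = welch_t([0.90, 0.80, 0.85], [0.70, 0.75, 0.72])       # positive favors group a
```

In the study, each participant's balanced accuracy would be one observation, and the two treatment groups (explainable AI vs. black-box AI) would be the two samples passed to the test; the one-sided p-value then comes from the t distribution with Welch's degrees of freedom.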


