Front Artif Intell. 2024 Oct 17;7:1456486. doi: 10.3389/frai.2024.1456486. eCollection 2024.

Human-centered evaluation of explainable AI applications: a systematic review

Jenia Kim et al. Front Artif Intell.

Abstract

Explainable Artificial Intelligence (XAI) aims to provide insights into the inner workings and the outputs of AI systems. Recently, there has been growing recognition that explainability is inherently human-centric, tied to how people perceive explanations. Despite this, there is no consensus in the research community on whether user evaluation is crucial in XAI and, if so, what exactly needs to be evaluated and how. This systematic literature review addresses this gap by providing a detailed overview of the current state of human-centered XAI evaluation. We reviewed 73 papers across various domains in which XAI was evaluated with users. These studies assessed what makes an explanation "good" from a user's perspective, i.e., what makes an explanation meaningful to a user of an AI system. We identified 30 components of meaningful explanations that were evaluated in the reviewed papers and categorized them into a taxonomy of human-centered XAI evaluation, based on: (a) the contextualized quality of the explanation, (b) the contribution of the explanation to human-AI interaction, and (c) the contribution of the explanation to human-AI performance. Our analysis also revealed a lack of standardization in the methodologies applied in XAI user studies: only 19 of the 73 papers applied an evaluation framework used by at least one other study in the sample. These inconsistencies hinder cross-study comparisons and broader insights. Our findings contribute to understanding what makes explanations meaningful to users and how to measure this, guiding the XAI community toward a more unified approach to human-centered explainability.

Keywords: XAI; XAI evaluation; explainable AI; human-AI interaction; human-AI performance; human-centered evaluation; meaningful explanations; systematic review.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1. The high-level categories in existing taxonomies for XAI evaluation. Doshi-Velez and Kim (2018), Hoffman et al. (2018, 2023), and Lopes et al. (2022) discuss both evaluation with users (green) and without users (orange); Nauta et al. (2023) focus on evaluation without users (orange); Zhou et al. (2021), Vilone and Longo (2021), and Mohseni et al. (2021) focus on evaluation with users (green).

Figure 2. Search query for the systematic literature review.

Figure 3. PRISMA flow diagram.

Figure 4. The labeling scheme used for data extraction. (A) Labeling scheme of AI systems. (B) Labeling scheme of XAI methods. (C) Labeling scheme of user studies.

Figure 5. Distribution of papers by application domain and type of user study.

Figure 6. The proposed taxonomy of human-centered evaluation of XAI. The blue, orange, and red boxes contain the 30 evaluation measures identified in the reviewed papers. They are grouped based on the aspect of human-centered explanation quality that they evaluate: (A) in-context quality of the explanation, (B) contribution of the explanation to human-AI interaction, and (C) contribution of the explanation to human-AI performance. An additional aspect, the a priori explanation quality, is not covered by our review since it is not evaluated with users (expl = explanation; sys = system).

Figure 7. The number of studies in which each meaningfulness component is evaluated (total: 77 studies in 73 papers).

Figure 8. Our taxonomy, compared to other XAI evaluation frameworks. The main differences are the level of detail of the evaluated properties and the novel categorization into the three dimensions of human-centered evaluation.
