[Preprint]. 2025 Jun 24:2025.06.24.25330212.
doi: 10.1101/2025.06.24.25330212.

Auditor Models to Suppress Poor AI Predictions Can Improve Human-AI Collaborative Performance


Katherine E Brown et al. medRxiv.

Abstract

Objective: Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML systems are known to exhibit unfairness, i.e., inconsistent outcomes across subpopulations, and clinicians who over-rely on these systems can perpetuate that unfairness. Recent work on ML suppression (silencing predictions after auditing the ML) shows promise in mitigating performance issues that originate from overreliance. This study evaluates the impact of suppression on the fairness of human-AI collaboration and assesses ML uncertainty as a criterion for auditing the ML.
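The suppression workflow described above (an auditor decides when to silence the ML prediction and defer to the clinician) can be sketched minimally. This is an illustrative reading of the abstract, not the paper's implementation; the function name and the `auditor_ok` flag are assumptions:

```python
import numpy as np

def collaborate_with_suppression(ml_pred, human_pred, auditor_ok):
    """Where the auditor flags an ML prediction as unreliable,
    suppress it and fall back to the human (clinician) decision."""
    ml_pred = np.asarray(ml_pred)
    human_pred = np.asarray(human_pred)
    auditor_ok = np.asarray(auditor_ok, dtype=bool)
    return np.where(auditor_ok, ml_pred, human_pred)

# Toy example: the auditor suppresses the 2nd and 4th ML predictions,
# so those positions take the human decision instead.
print(collaborate_with_suppression([1, 1, 0, 0], [0, 0, 1, 1],
                                   [True, False, True, False]))
# → [1 0 0 1]
```

In this framing the auditor is itself a model (e.g. trained on ML errors or uncertainty), so the collaborative output is ML where audited as reliable and human everywhere else.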

Materials and methods: We used data from the Vanderbilt University Medical Center electronic health record (n = 58,817) and the MIMIC-IV-ED dataset (n = 363,145) to predict likelihood of death or ICU transfer and likelihood of 30-day readmission. Our simulation study used gradient-boosted trees as well as an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of CDS alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.
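The fairness metric named in the methods, absolute averaged odds difference, is commonly defined as the mean of the absolute true-positive-rate and false-positive-rate gaps between two groups (0 indicates equalized odds). The sketch below assumes that standard definition and binary predictions; it is not taken from the paper's code:

```python
import numpy as np

def avg_abs_odds_difference(y_true, y_pred, group):
    """Mean of |TPR gap| and |FPR gap| between two groups (0 and 1).
    0.0 means the two groups have identical error-rate profiles."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in (0, 1):
        m = group == g
        tpr = np.mean(y_pred[m & (y_true == 1)])  # true positive rate in group g
        fpr = np.mean(y_pred[m & (y_true == 0)])  # false positive rate in group g
        rates[g] = (tpr, fpr)
    return 0.5 * (abs(rates[0][0] - rates[1][0]) +
                  abs(rates[0][1] - rates[1][1]))

# Group 0: TPR = 0.5, FPR = 0.5; group 1: TPR = 1.0, FPR = 0.0
print(avg_abs_odds_difference([1, 1, 0, 0, 1, 1, 0, 0],
                              [1, 0, 1, 0, 1, 1, 0, 0],
                              [0, 0, 0, 0, 1, 1, 1, 1]))
# → 0.5
```

Performance (AUROC) can be computed with `sklearn.metrics.roc_auc_score`; pairing the two metrics per simulation run yields the fairness-utility tradeoff plots described in the figures.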

Results: When the ML outperforms humans, suppression outperforms the human alone (p < 0.034) and at least does not degrade fairness. When the human outperforms the ML, suppression outperforms the human (p < 5.2 × 10⁻⁵), but the human is fairer than suppression (p < 0.0019). Finally, incorporating uncertainty quantification into suppression approaches can improve performance.

Conclusion: Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.

Keywords: artificial intelligence; human-AI collaboration; machine learning.


Conflict of interest statement

The authors have no conflicts of interest to disclose.

Figures

Figure 1.
Schematic indicating the collaboration scenario with and without suppression.
Figure 2.
Fairness-utility tradeoff plots depicting the average absolute odds difference on the y-axis and the performance in area under the ROC curve on the x-axis. Error bars depicting 95% CI are included. Prediction task: ED Triage.
Figure 3.
Fairness-utility tradeoff plots depicting the average absolute odds difference on the y-axis and the performance in area under the ROC curve on the x-axis. Error bars depicting 95% CI are included. Prediction task: ED Discharge.
Figure 4.
Heatmap of p-values resulting from the Mann-Whitney U test for statistical significance. The p-value is for the test that the model given by the row is higher performing or fairer than the model given by the column. Task: ED Triage.
Figure 5.
Heatmap of p-values resulting from the Mann-Whitney U test for statistical significance. The p-value is for the test that the model given by the row is higher performing or fairer than the model given by the column. Task: ED Discharge.

