2023 Mar 30;3(1):42. doi: 10.1038/s43856-023-00263-3.

A multi-institutional study using artificial intelligence to provide reliable and fair feedback to surgeons

Dani Kiyasseh et al. Commun Med (Lond).

Abstract

Background: Surgeons who receive reliable feedback on their performance quickly master the skills necessary for surgery. Such performance-based feedback can be provided by a recently-developed artificial intelligence (AI) system that assesses a surgeon's skills based on a surgical video while simultaneously highlighting aspects of the video most pertinent to the assessment. However, it remains an open question whether these highlights, or explanations, are equally reliable for all surgeons.

Methods: Here, we systematically quantify the reliability of AI-based explanations on surgical videos from three hospitals across two continents by comparing them to explanations generated by human experts. To improve the reliability of AI-based explanations, we propose the strategy of training with explanations (TWIX), which uses human explanations as supervision to explicitly teach an AI system to highlight important video frames.
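The core idea of TWIX can be sketched as a joint objective: a per-frame importance head is supervised with binary human explanation labels alongside the usual skill-classification loss. The sketch below is a hypothetical NumPy illustration, not the authors' implementation; the function names, the linear frame-importance model, and the loss weight `alpha` are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def twix_forward(frame_feats, W_imp, W_cls):
    """Toy TWIX-style forward pass (hypothetical).

    frame_feats: (frames, d) per-frame features from a video encoder.
    Returns skill logits (2,) and per-frame importance logits (frames,).
    """
    imp_logits = frame_feats @ W_imp          # frame-importance scores
    weights = softmax(imp_logits)             # attention-like weighting
    pooled = weights @ frame_feats            # importance-weighted video feature
    return pooled @ W_cls, imp_logits

def twix_loss(skill_logits, imp_logits, skill_label, frame_labels, alpha=1.0):
    # Joint objective: skill classification (cross-entropy) plus
    # explanation supervision (per-frame binary cross-entropy against
    # human annotations of important frames).
    p = softmax(skill_logits)
    cls = -np.log(p[skill_label] + 1e-12)
    q = 1.0 / (1.0 + np.exp(-imp_logits))
    exp = -np.mean(frame_labels * np.log(q + 1e-12)
                   + (1 - frame_labels) * np.log(1 - q + 1e-12))
    return cls + alpha * exp
```

The explanation term pushes the model's frame-importance estimates toward the human annotations, which is the mechanism by which TWIX is meant to improve explanation reliability on unseen videos.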

Results: We show that while AI-based explanations often align with human explanations, they are not equally reliable for different sub-cohorts of surgeons (e.g., novices vs. experts), a phenomenon we refer to as an explanation bias. We also show that TWIX enhances the reliability of AI-based explanations, mitigates the explanation bias, and improves the performance of AI systems across hospitals. These findings extend to a training environment where medical students can be provided with feedback today.

Conclusions: Our study informs the impending implementation of AI-augmented surgical training and surgeon credentialing programs, and contributes to the safe and fair democratization of surgery.

Plain language summary

Surgeons aim to master the skills necessary for surgery. One such skill is suturing, which involves connecting objects together through a series of stitches. Mastery of these surgical skills can be improved by providing surgeons with feedback on the quality of their performance. However, such feedback is often absent from surgical practice. Although performance-based feedback can be provided, in theory, by recently-developed artificial intelligence (AI) systems that use a computational model to assess a surgeon’s skill, the reliability of this feedback remains unknown. Here, we compare AI-based feedback to that provided by human experts and demonstrate that the two often overlap. We also show that explicitly teaching an AI system to align with human feedback further improves the reliability of AI-based feedback on new videos of surgery. Our findings outline the potential of AI systems to support the training of surgeons by providing reliable, skill-specific feedback, and to guide programs that grant surgeons qualifications by complementing skill assessments with explanations that increase the trustworthiness of such assessments.


Conflict of interest statement

The authors declare the following competing interests: D.K. is a paid consultant of Flatiron Health and an employee of Vicarious Surgical. C.W. is a paid consultant of Intuitive Surgical. A.A. is an employee of Nvidia. A.J.H is a consultant of Intuitive Surgical. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Heatmap of the ground-truth explanation annotations across hospitals.
We average the explanation annotations for the a, needle handling and b, needle driving video samples in the test set of the Monte Carlo folds (see Supplementary Table 2 for total number of samples), and present them over a normalized time index, where 0 and 1 reflect the beginning and end of a video sample, respectively. A darker shade (which ranges from 0 to 1 as per the colour bars) implies that a segment of time is of greater importance.
Fig. 2
Fig. 2. Quantifying the alignment of AI-based explanations with human explanations.
A surgical artificial intelligence system (SAIS) can assess the skill of a surgeon based on a surgical video and generate an explanation for such an assessment by highlighting the relative importance of video frames (e.g., via an attention mechanism). Human experts annotate video frames most important for the skill assessment. TWIX is a module which uses human explanations as supervision to explicitly teach an AI system to predict the importance of video frames. We show the alignment of attention and TWIX with human explanations.
Fig. 3
Fig. 3. TWIX can improve the reliability of AI-based explanations across hospitals.
Precision-recall curves reflecting the alignment of different AI-based explanations with those provided by humans when assessing the skill-level of a, needle handling and b, needle driving. Note that SAIS is trained exclusively on data from USC and then deployed on data from USC, SAH, and HMH. The solid lines and shaded areas represent the mean and standard deviation, respectively, across 10 Monte Carlo cross-validation folds.
Fig. 4
Fig. 4. TWIX effectively mitigates explanation bias exhibited by SAIS against surgeons.
Reliability of attention-based explanations stratified across surgeon sub-cohorts when assessing the skill-level of a, needle handling and b, needle driving (see Supplementary Tables 3-6 for number of samples in each sub-cohort). We do not report caseload for SAH due to insufficient samples from one sub-cohort. Effect of TWIX on the reliability of AI-based explanations for the disadvantaged surgeon sub-cohort (worst-case AUPRC) when assessing the skill-level of c, needle handling and d, needle driving. AI-based explanations come in the form of attention placed on frames by SAIS or through the direct estimate of frame importance by TWIX (see Methods). We do not report caseload for SAH due to insufficient samples from one sub-cohort. Note that SAIS is trained exclusively on data from USC and then deployed on data from USC, SAH, and HMH. Results are an average across 10 Monte Carlo cross-validation folds, and errors bars reflect the 95% confidence interval.
Fig. 5
Fig. 5. TWIX’s benefits persist across different experimental settings.
We present the effect of TWIX, in different experimental settings (ablation studies), on a, the reliability of explanations generated by SAIS, quantified via the AUPRC, and b, the explanation bias, quantified via improvements in the worst-case AUPRC (see Supplementary Tables 3-6 for number of samples in each sub-cohort). The default experimental setting is RGB + Flow and was used throughout this study. Other settings include withholding optical flow from SAIS (RGB) and formulating a multi-class skill assessment task (Multi-Skill). c–f SAIS can be used today to provide feedback to surgical trainees. c AI-based explanations often align with those provided by human experts. d SAIS exhibits an explanation bias against male surgical trainees. e TWIX mitigates the explanation bias by improving the reliability of explanations provided to male surgical trainees and f improves SAIS’ performance in assessing the skill-level of needle handling. Note that SAIS is trained exclusively on live data from USC and then deployed on data from the training environment. Results are shown for all 10 Monte Carlo cross-validation folds.
