Sci Rep. 2024 Nov 6;14(1):26912. doi: 10.1038/s41598-024-77176-1.

Spatial-temporal attention for video-based assessment of intraoperative surgical skill

Bohua Wan et al. Sci Rep. 2024.

Abstract

Accurate, unbiased, and reproducible assessment of skill is a vital resource for surgeons throughout their careers. The objective of this research is to develop and validate algorithms for video-based assessment of intraoperative surgical skill. Algorithms that classify surgical video into expert or novice categories provide a summative assessment of skill, which is useful for evaluating surgeons at discrete time points in their training or for certification. Using a spatial-temporal neural network architecture, we tested the hypothesis that explicit supervision of spatial attention by instrument tip locations improves the algorithm's generalizability to an unseen dataset. The best-performing model had an area under the receiver operating characteristic curve (AUC) of 0.88. Augmenting the network with supervision of spatial attention improved the specificity of its predictions (with small changes in sensitivity and AUC) and led to improved measures of discrimination when tested on an unseen dataset. Our findings show that explicit supervision of attention learned from images using instrument tip locations can improve the performance of algorithms for objective video-based assessment of surgical skill.
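The core idea of the abstract, supervising a spatial attention map with instrument tip locations, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the Gaussian target around the tip, the `sigma` value, the uniform target for frames without an instrument, and the cross-entropy supervision loss are all assumptions.

```python
import numpy as np

def gaussian_attention_target(h, w, tip_yx, sigma=2.0):
    """Build a target attention map: a 2-D Gaussian centered at the
    instrument tip, normalized to sum to 1. When no instrument is
    visible (tip_yx is None), fall back to a uniform map."""
    if tip_yx is None:
        return np.full((h, w), 1.0 / (h * w))
    ys, xs = np.mgrid[0:h, 0:w]
    ty, tx = tip_yx
    g = np.exp(-((ys - ty) ** 2 + (xs - tx) ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def attention_supervision_loss(pred_attn, target_attn, eps=1e-8):
    """Cross-entropy between a predicted attention map (nonnegative,
    sums to 1) and the tip-centered target map."""
    return float(-(target_attn * np.log(pred_attn + eps)).sum())
```

During training, this loss would be added to the skill-classification loss, pushing the network's attention toward the instrument tip in every annotated frame.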

PubMed Disclaimer

Conflict of interest statement

Drs. Sikder, Hager, and Vedula are listed on a patent application under review (PCT/US2022/021258, Systems and methods for assessing surgical skill, with a priority date of March 25, 2021). Bohua Wan and Michael Peven declare no competing interests.

Figures

Fig. 1
Comparison between supervised and unsupervised spatial attention maps. The first row shows the supervised spatial attention maps; the second row shows the unsupervised spatial attention maps. The attention maps are colored with the jet color map. The bottom of each image shows the temporal attention map: the height of each blue bar denotes the attention weight of the corresponding frame, and the red vertical line marks the time location of the current frame. Note that no instrument is visible in the fifth column; correspondingly, supervised attention (first row) has low values across all image pixels, while unsupervised attention (second row) has high values around the bottom-left corner. The temporal maps in the top and bottom rows differ because they are generated from two separate models: the top row from the model trained with supervised spatial attention, and the bottom row from the model trained with unsupervised spatial attention.
Fig. 2
An illustration of the spatial attention module, outlined by the pink dashed box. The selection and aggregation schemes are denoted by the upper and lower streams, respectively. The two streams are mutually exclusive, and only one is used in practice. The temporal feature, hi, is an optional input. The SAMG box denotes the process that computes the spatial attention map. In the figure, one operator denotes a dot product and the other a summation along the height and width dimensions of the attended feature map. The pathway of the multi-task learning model is denoted by the dashed arrow, where the stacked green cuboids represent five transposed convolutional layers. The area with the largest attention score in Aispatial is used to localize image features for the downstream temporal model; this is represented by the blue dotted line.
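The aggregation stream described in this caption (weighting a feature map by spatial attention, then summing over height and width) can be sketched in NumPy. This is a minimal illustration, not the paper's architecture: the mean-pooled score function stands in for the learned SAMG, and all names are hypothetical.

```python
import numpy as np

def spatial_attention_aggregate(feat):
    """Aggregate a C x H x W feature map into a C-dim vector.
    A softmax over all H*W positions yields the attention map;
    the attended features are then summed over height and width."""
    c, h, w = feat.shape
    scores = feat.mean(axis=0).reshape(-1)        # one score per location
    scores = scores - scores.max()                # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over H*W
    attn = attn.reshape(h, w)
    attended = feat * attn[None, :, :]            # weight each location
    return attended.sum(axis=(1, 2)), attn        # sum over H and W
```

The returned attention map is nonnegative and sums to 1, matching how a spatial attention map is typically visualized as a heat map over the frame.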
Fig. 3
An overview of the CNN-LSTM network. The “Spatial attention” box is further illustrated in Fig. 2 and discussed in section 2.1. The “Temporal attention” box is discussed in section 2.3. Ce and Ch are the lengths of the feature dimensions of the image feature and the hidden state, respectively.
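The temporal attention visualized as blue bars in Fig. 1 can be sketched as a softmax over per-frame scores. This is an assumed form: scoring frames by a dot product against a query vector (such as an LSTM hidden state) is a common choice, not necessarily the paper's exact formulation.

```python
import numpy as np

def temporal_attention(frame_feats, query):
    """Weight T per-frame feature vectors by similarity to a query
    vector (e.g. the current LSTM hidden state) and return the
    attended summary plus the per-frame attention weights."""
    scores = frame_feats @ query                     # one score per frame
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    summary = weights @ frame_feats                  # weighted average
    return summary, weights
```

The weights sum to 1 over the clip, so each weight can be drawn directly as the height of a bar under the corresponding frame.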
Fig. 4
The network architecture of the no attention model.
Fig. 5
Images sampled from the source dataset and the target dataset.
