Comput Biol Med. 2022 May:144:105339.

doi: 10.1016/j.compbiomed.2022.105339. Epub 2022 Feb 28.

LARNet-STC: Spatio-temporal orthogonal region selection network for laryngeal closure detection in endoscopy videos

Yang Yang Wang¹, Ali S Hamad¹, Kannappan Palaniappan¹, Teresa E Lever², Filiz Bunyak³

Affiliations

¹ Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, 65211, Missouri, USA.
² Department of Otolaryngology - Head and Neck Surgery, University of Missouri, Columbia, 65211, Missouri, USA.
³ Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, 65211, Missouri, USA. Electronic address: bunyak@missouri.edu.

PMID: 35263687
PMCID: PMC8995389
DOI: 10.1016/j.compbiomed.2022.105339

Yang Yang Wang et al. Comput Biol Med. 2022 May.

Abstract

The vocal folds (VFs) are a pair of muscles in the larynx that play a critical role in breathing, swallowing, and speaking. VF function can be adversely affected by various medical conditions, including head or neck injuries, stroke, tumors, and neurological disorders. In this paper, we propose a deep learning system for automated detection of laryngeal adductor reflex (LAR) events in laryngeal endoscopy videos to enable objective, quantitative analysis of VF function. The system combines our novel orthogonal region selection network with temporal context. The network learns to map its input directly to a VF open/closed state without first segmenting or tracking the VF region. This one-step approach drastically reduces manual annotation needs, from labor-intensive segmentation masks or VF motion tracks to frame-level class labels. The proposed spatio-temporal network, with its orthogonal region selection subnetwork, integrates local image features, global image features, and VF state information over time for robust LAR event detection. Evaluated against several network variations that incorporate temporal context, the proposed network performs better. The experimental results show promising performance for automated, objective, and quantitative analysis of LAR events in laryngeal endoscopy videos, with F1 scores above 90% for LAR frames and above 99% for non-LAR frames.

Keywords: Deep learning; Laryngeal adductor reflex; Laryngeal closure detection; Laryngeal endoscopy; Vocal folds.
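
The per-class F1 scores quoted in the abstract are computed from frame-level labels. The following is a minimal sketch of that computation, assuming binary labels (1 = LAR, 0 = non-LAR); the function name `f1_for_class` and the ten sample labels are hypothetical and are not taken from the paper.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
    return 2 * tp / denom if denom else 0.0


# Made-up labels for ten consecutive frames (illustration only).
y_true = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]

f1_lar = f1_for_class(y_true, y_pred, cls=1)      # 0.8
f1_non_lar = f1_for_class(y_true, y_pred, cls=0)  # 14/15, about 0.933
print(f"LAR F1: {f1_lar:.3f}, non-LAR F1: {f1_non_lar:.3f}")
```

In this toy example a single missed LAR frame lowers the LAR score much more than the non-LAR score, which illustrates why the two classes are worth reporting separately when one of them is rarer.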

Figures

**Figure 1:**
Sample images for the three vocal fold state classes.

**Figure 2:**
Sample laryngoscopy video frames illustrating different processing challenges. (a-c) Left images show the original video frames; right images show the corresponding histogram-equalized images.
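
Histogram equalization, as mentioned in the Figure 2 caption, spreads a frame's intensity values across the full range to improve contrast. The sketch below shows plain global equalization on an 8-bit grayscale array with NumPy. Which variant the paper applies, and whether it works on color or grayscale frames, is not stated here, so the function `equalize_histogram` and the tiny sample image are assumptions for illustration.

```python
import numpy as np


def equalize_histogram(gray):
    """Globally histogram-equalize a 2-D uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)  # pixel count per intensity
    cdf = hist.cumsum()                              # cumulative pixel counts
    cdf_min = cdf[cdf > 0][0]                        # CDF at the darkest present value
    total = gray.size
    if total == cdf_min:
        return gray.copy()                           # constant image: nothing to stretch
    # Map each intensity so that the CDF becomes approximately linear on [0, 255].
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]


# A low-contrast 2x2 "frame" whose values occupy only 100..103.
frame = np.array([[100, 101], [102, 103]], dtype=np.uint8)
equalized = equalize_histogram(frame)
print(equalized)
```

After equalization the four intensities are stretched to cover the whole 0 to 255 range.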

**Figure 3:**
Architecture of the proposed VF state estimation network.

**Figure 4:**
Subregion cropping and the Orthogonal Region Selection (ORS) subnetwork. Inputs to the network are five cropped subregions (marked with yellow squares) from the preprocessed image. Output of the network is a 1-D feature vector corresponding to the selected subregion. This vector is selected from F by the index j* of the minimum value in O. “FC” denotes a fully connected layer.
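
The selection step in the Figure 4 caption (take the feature vector in F whose index j* minimizes O) can be sketched as follows. This assumes F is an array with one feature vector per subregion and O is an array with one score per subregion. How the subnetwork computes F and O is not described in the caption, so the arrays below are placeholders and `select_region_feature` is a hypothetical name.

```python
import numpy as np


def select_region_feature(F, O):
    """Return (j_star, F[j_star]) where j_star is the index of the minimum of O."""
    F = np.asarray(F)
    O = np.asarray(O)
    if F.shape[0] != O.shape[0]:
        raise ValueError("F and O must describe the same number of subregions")
    j_star = int(np.argmin(O))  # index of the smallest score in O
    return j_star, F[j_star]    # 1-D feature vector of the selected subregion


# Placeholder values: five subregions, each with a 4-dimensional feature vector.
rng = np.random.default_rng(0)
F = rng.normal(size=(5, 4))
O = np.array([0.9, 0.4, 0.1, 0.7, 0.3])

j_star, feature = select_region_feature(F, O)
print(j_star, feature.shape)
```

Note that a hard argmin like this is not differentiable with respect to O. How the actual network handles selection during training is not covered by the caption and is left out of this sketch.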

**Figure 5:**
Architecture of the proposed spatio-temporal context-based orthogonal region selection network. On top of the VF state estimation networks, a set of fully convolutional layers is inserted into the network to incorporate temporal context. “Conv” denotes a convolution operation.

**Figure 6:**
Four different architectures of spatio-temporal context-based networks.

**Figure 7:**
Boxplot of the F1 scores for the five-fold cross-validation of three proposed networks. The green triangle is the mean across five folds.

**Figure 8:**
Quantitative evaluation of LAR event durations (number of frames). (a) Histogram of the distribution of ground truth LAR event durations. (b) Cumulative distribution of the frame error of LAR event prediction. (c) Comparison of the ground truth and prediction of VF states for a single video. (d) Sample original video frames at timestamps A, B, C, and D in (c).
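
LAR event durations in frames, as plotted in Figure 8, can be recovered from frame-level labels by measuring runs of consecutive LAR frames. The sketch below assumes binary per-frame labels (1 = LAR); `event_durations` and the sample sequences are hypothetical, and the paper's exact definition of frame error is not reproduced here.

```python
def event_durations(labels, event=1):
    """Lengths, in frames, of each maximal run of `event` in a label sequence."""
    durations = []
    run = 0
    for label in labels:
        if label == event:
            run += 1               # still inside an event
        elif run:
            durations.append(run)  # event just ended
            run = 0
    if run:
        durations.append(run)      # event reaching the last frame
    return durations


# Made-up ground truth and prediction for ten frames (illustration only).
truth = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
pred = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]

true_durations = event_durations(truth)  # [3, 2]
pred_durations = event_durations(pred)   # [2, 3]
print(true_durations, pred_durations)
```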

**Figure 9:**
Segmentation-derived LAR/non-LAR classification results: average F1 scores for non-LAR frames, comparing VF segmentation algorithms (U-LSTM [23], FCRN [21], and FCRN [21] + histogram equalization + ORS) with the proposed LARNet-STC.

**Figure 10:**
Segmentation-derived LAR/non-LAR classification results: average F1 scores for LAR frames, comparing VF segmentation algorithms (U-LSTM [23], FCRN [21], and FCRN [21] + histogram equalization + ORS) with the proposed LARNet-STC.

**Figure 11:**
Visual explanation of the LARNet-STC network output using Grad-CAM visualization [66]. Top row: subregions automatically selected by the Orthogonal Region Selection (ORS) subnetwork. Bottom row: regions contributing to a high score for the predicted class, highlighted on a color scale from red (higher impact) to blue (lower impact).

**Figure 12:**
The confusion matrix of the results from the proposed context-based orthogonal region selection network (LARNet-STC).

**Figure 13:**
Sample outputs from the proposed system. Red labels show the ground truth; green labels show the prediction.

**Figure 14:**
Sequential non-LAR video frames (frames 6–10) sampled from laryngoscopy videos, showing visual occlusion.

References

1. Sasaki CT, Weaver EM, Physiology of the larynx, The American Journal of Medicine 103 (5) (1997) 9S–18S.
2. Dankbaar J, Pameijer F, Vocal cord paralysis: anatomy, imaging and pathology, Insights into Imaging 5 (6) (2014) 743–751.
3. Weinberger M, Doshi D, Vocal cord dysfunction: a functional cause of respiratory distress, Breathe 13 (1) (2017) 15–21.
4. Rajaei A, Barzegar B. E, Mojiri F, Nilforoush MH, The occurrence of laryngeal penetration and aspiration in patients with glottal closure insufficiency, ISRN Otolaryngology 2014.
5. Toutounchi SJS, Eydi M, Golzari SE, Ghaffari MR, Parvizian N, Vocal cord paralysis and its etiologies: a prospective study, J. Cardiovascular and Thoracic Research 6 (1) (2014) 47.

Grants and funding

R01 NS110915/NS/NINDS NIH HHS/United States
