J Voice. 2024 Jul;38(4):951-962.

doi: 10.1016/j.jvoice.2022.01.028. Epub 2022 Mar 16.

Detection of Vocal Fold Image Obstructions in High-Speed Videoendoscopy During Connected Speech in Adductor Spasmodic Dysphonia: A Convolutional Neural Networks Approach

Ahmed M Yousef¹, Dimitar D Deliyski¹, Stephanie R C Zacharias², Maryam Naghibolhosseini³

Affiliations

¹ Department of Communicative Sciences and Disorders, Michigan State University, East Lansing, Michigan.
² Head and Neck Regenerative Medicine Program, Mayo Clinic, Scottsdale, Arizona; Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Phoenix, Arizona.
³ Department of Communicative Sciences and Disorders, Michigan State University, East Lansing, Michigan. Electronic address: naghib@msu.edu.

PMID: 35304042
PMCID: PMC9474736
DOI: 10.1016/j.jvoice.2022.01.028

Ahmed M Yousef et al. J Voice. 2024 Jul.

Abstract

Objective: Adductor spasmodic dysphonia (AdSD) is a neurogenic voice disorder that affects intrinsic laryngeal muscle control. AdSD causes involuntary laryngeal spasms and manifests only during connected speech. Laryngeal high-speed videoendoscopy (HSV) coupled with a flexible fiberoptic endoscope provides a unique opportunity to study voice production and visualize vocal fold vibrations in AdSD during speech. The goal of this study is to automatically detect instances during which the image of the vocal folds is optically obstructed in HSV recordings obtained during connected speech.

Methods: HSV data were recorded from vocally normal adults and patients with AdSD during reading of the "Rainbow Passage", six CAPE-V sentences, and production of the vowel /i/. A convolutional neural network was developed and trained as a classifier to detect obstructed/unobstructed vocal folds in HSV frames. Manually labelled data were used for training, validating, and testing the network. Moreover, a comprehensive robustness evaluation was conducted to compare the performance of the developed classifier with visual analysis of HSV data.

Results: The developed convolutional neural network automatically detected vocal fold obstructions in HSV data from both vocally normal participants and AdSD patients. The trained network achieved an overall classification accuracy of 94.18% on the testing dataset. The robustness evaluation showed an average overall accuracy of 94.81% on a large number of HSV frames, indicating that the technique remains robust while maintaining high accuracy.

Conclusions: The proposed approach can be used for efficient analysis of HSV data to study laryngeal maneuvers in patients with AdSD during connected speech. Additionally, this method will facilitate the development of vocal fold vibratory measures for HSV frames with an unobstructed view of the vocal folds. Identifying the parts of connected speech that provide an unobstructed view of the vocal folds can support the development of optimal passages for precise HSV examination during connected speech and of subject-specific clinical voice assessment protocols.

Keywords: Laryngeal imaging—Connected speech—High-speed videoendoscopy—Adductor spasmodic dysphonia—Vocal fold obstruction—Convolutional neural network.

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no conflict of interest.

Figures

**Figure 1:**
A schematic diagram of the automated deep learning approach developed in this work. The HSV video frames serve as the input to the automated classifier. The detailed structure of the convolutional neural network is illustrated. The input frames are processed through several layers of 3×3 convolutions combined with rectified linear unit (ReLU) layers (in dark blue), followed by multiple 2×2 max pooling layers (in orange). The last layer is a sigmoid layer (in green). The dimensions of the feature maps corresponding to the different convolutional layers are also included in the figure. The neural network assigns each frame to one of two classes: a frame with unobstructed vocal fold(s) or a frame with obstructed vocal fold(s).
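
The layer types named in this caption can be illustrated with a minimal forward-pass sketch in plain Python. It is not the authors' network: the single-channel feature maps, the number of stages, the weights, the final weighted sum, the 0.5 threshold, and the choice of which class a high sigmoid score denotes are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def conv3x3_relu(image, kernel, bias=0.0):
    """Valid 3x3 convolution (cross-correlation form) followed by ReLU."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h - 2):
        row = []
        for c in range(w - 2):
            s = bias
            for i in range(3):
                for j in range(3):
                    s += image[r + i][c + j] * kernel[i][j]
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out


def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling; an odd trailing row/column is dropped."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [max(fmap[2 * r][2 * c], fmap[2 * r][2 * c + 1],
             fmap[2 * r + 1][2 * c], fmap[2 * r + 1][2 * c + 1])
         for c in range(w)]
        for r in range(h)
    ]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def classify_frame(frame, kernels, dense_weights, dense_bias=0.0, threshold=0.5):
    """Run conv+ReLU -> max-pool stages, then a sigmoid output unit.

    Assumption: a score >= threshold is read as "unobstructed".
    """
    fmap = frame
    for kernel in kernels:
        fmap = max_pool2x2(conv3x3_relu(fmap, kernel))
    flat = [v for row in fmap for v in row]
    if len(flat) != len(dense_weights):
        raise ValueError("dense_weights must match the flattened feature map")
    score = sigmoid(sum(w * v for w, v in zip(dense_weights, flat)) + dense_bias)
    label = "unobstructed" if score >= threshold else "obstructed"
    return score, label


# Toy usage: a random 16x16 "frame" and untrained random weights.
rng = random.Random(0)
frame = [[rng.random() for _ in range(16)] for _ in range(16)]
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)] for _ in range(2)]
weights = [rng.uniform(-1, 1) for _ in range(4)]  # 16 -> 14 -> 7 -> 5 -> 2, so 2x2 = 4 values
print(classify_frame(frame, kernels, weights))
```

With untrained weights the printed label is meaningless; the sketch only shows how a frame shrinks through the convolution and pooling stages before the sigmoid produces a score between 0 and 1.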

**Figure 2:**
A sample of HSV images recorded during connected speech and classified using manual analysis (visual classification). The two sets of three columns display the two different groups of frames: “Unobstructed Vocal Fold” showing the presence of the true vocal folds and “Obstructed Vocal Fold” demonstrating an obstructed view of the vocal fold(s).

**Figure 3:**
The classification results using the automated deep learning approach on the testing dataset. The two sets of three columns display the correctly classified frames of the testing dataset as “Unobstructed Vocal Fold” (left side panels) and “Obstructed Vocal Fold” (right side panels).

**Figure 4:**
Confusion matrices of the deep learning network, showing its performance on classification of the validation dataset (panel A) and the testing dataset (panel B). Blue and orange cells refer to the number of frames/images in each category, and the green cells represent the associated accuracy of each row and column – noting that the overall classifier’s accuracy is highlighted in the dark green cells. The horizontal labels represent the predicted outcome of the classifier on the “Unobstructed Vocal Fold” class (VF) and “Obstructed Vocal Fold” class (No VF). The vertical labels refer to the ground-truth labels observed by the rater for each class.
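
As a rough illustration of how the quantities in such a confusion matrix relate to one another, the sketch below computes per-row accuracies (recall for each ground-truth class), per-column accuracies (precision for each predicted class), and overall accuracy for a two-class problem. The labels and the toy data are invented for the example and are not taken from the study; the function names are hypothetical.

```python
def confusion_counts(truth, predicted, positive="VF"):
    """Count the four cells of a 2x2 confusion matrix.

    Rows are ground-truth labels and columns are predicted labels,
    matching the layout described in the caption.
    """
    tp = fp = fn = tn = 0
    for t, p in zip(truth, predicted):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Row, column, and overall accuracies of the 2x2 matrix."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "recall_vf": ratio(tp, tp + fn),        # row accuracy, "VF" ground truth
        "recall_no_vf": ratio(tn, tn + fp),     # row accuracy, "No VF" ground truth
        "precision_vf": ratio(tp, tp + fp),     # column accuracy, predicted "VF"
        "precision_no_vf": ratio(tn, tn + fn),  # column accuracy, predicted "No VF"
        "overall_accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Toy labels for eight frames (invented for illustration).
truth = ["VF", "VF", "VF", "No VF", "No VF", "VF", "No VF", "No VF"]
predicted = ["VF", "VF", "No VF", "No VF", "VF", "VF", "No VF", "No VF"]
print(summarize(*confusion_counts(truth, predicted)))
```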

**Figure 5:**
The sensitivity-specificity curve (receiver operating characteristic curve), in blue, for the validation dataset (panel A) and the testing dataset (panel B). AUC refers to the area under the sensitivity-specificity curve. The diagonal red line represents points where Sensitivity = 1 - Specificity.
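
A curve of this kind can be traced by sweeping a decision threshold over the classifier's output scores, and the AUC then follows from the trapezoidal rule. The sketch below is one common way to do this in plain Python and is not necessarily how the authors computed their curves; the scores and labels are invented, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) points for a score threshold sweep.

    labels: 1 for the positive class, 0 for the negative class.
    Tied scores are processed together so that they yield a single point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores for eight frames (invented for illustration).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

A classifier whose curve lies on the diagonal (Sensitivity = 1 - Specificity) has an AUC of 0.5 and performs no better than chance, which is why the diagonal is drawn as a reference.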

**Figure 6:**
Confusion matrix of the developed deep learning network for classification of HSV recordings of a vocally normal participant (panel A) and a patient with AdSD (panel B). The blue and orange cells refer to the number of frames/images in each category, and the green cells represent the associated accuracy of each row and column – noting that the overall classifier accuracy is highlighted in the dark green cell. The horizontal labels represent the predicted outcome of the classifier on the “Unobstructed Vocal Fold” class (VF) and “Obstructed Vocal Fold” class (No VF). The vertical labels refer to the ground-truth labels, which are visually/manually observed for each class.

**Figure 7:**
The sensitivity-specificity curve (receiver operating characteristic curve), in blue, showing the performance of the developed deep learning network on binary classification of two entire HSV videos: one of a vocally normal participant (panel A) and one of a patient with AdSD (panel B). AUC refers to the area under the sensitivity-specificity curve.

**Figure 8:**
Comparison between automated (in blue) and manual (in red) analysis of the instances during which the vocal fold(s) are obstructed. The comparison is shown for two entire HSV videos: one of a vocally normal participant (panel A) and one of a patient with AdSD (panel B). The accumulated overall accuracy (solid black line), precision of detecting an obstructed view of the vocal fold(s) (dotted brown line), and precision of detecting an unobstructed view of the vocal fold(s) (dashed green line) are also illustrated.

Grants and funding

K01 DC017751/DC/NIDCD NIH HHS/United States
