medRxiv [Preprint]. 2024 Mar 20. doi: 10.1101/2020.11.23.20235945

Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings



Daniel M Low et al. medRxiv.


Abstract

Introduction: Detecting voice disorders from voice recordings could enable frequent, remote, and low-cost screening before costly clinical visits and more invasive laryngoscopy examinations. Our goals were to detect unilateral vocal fold paralysis (UVFP) from voice recordings using machine learning, to identify which acoustic variables were important for prediction in order to increase trust in the models, and to determine model performance relative to clinician performance.

Methods: Patients with UVFP confirmed through endoscopic examination (N=77) and controls with normal voices matched for age and sex (N=77) were included. Voice samples were elicited by reading the Rainbow Passage and sustaining phonation of the vowel "a". Four machine learning models of differing complexity were used. The SHapley Additive exPlanations (SHAP) method was used to identify important features.
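A minimal sketch of such a pipeline, assuming a scikit-learn random forest on synthetic features (the paper's exact models and feature set are not reproduced here):

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: 154 recordings (77 controls, 77 UVFP), 20 acoustic features
X = pd.DataFrame(np.random.rand(154, 20),
                 columns=[f"feature_{i}" for i in range(20)])
y = np.repeat([0, 1], 77)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# SHAP attributes each prediction to the input features
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Return shapes vary across shap versions; normalize to the positive (UVFP) class
sv = shap_values[1] if isinstance(shap_values, list) else shap_values
if sv.ndim == 3:
    sv = sv[..., 1]

# Rank features by mean absolute SHAP value
importance = pd.Series(np.abs(sv).mean(axis=0), index=X.columns).sort_values()
print(importance.tail(5))  # five most important features
```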

Results: The highest median bootstrapped ROC AUC score was 0.87 and beat clinician's performance (range: 0.74 - 0.81) based on the recordings. Recording durations were different between UVFP recordings and controls due to how that data was originally processed when storing, which we can show can classify both groups. And counterintuitively, many UVFP recordings had higher intensity than controls, when UVFP patients tend to have weaker voices, revealing a dataset-specific bias which we mitigate in an additional analysis.
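A bootstrapped ROC AUC can be estimated along these lines; the resampling scheme and replicate count here are assumptions, not the paper's exact protocol:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    """Median and 95% CI of ROC AUC over bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined if a resample contains only one class
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.median(aucs), np.percentile(aucs, [2.5, 97.5])

# The duration bias can be quantified the same way, scoring each sample
# by its raw audio duration instead of a model output, e.g.:
# median_auc, ci = bootstrap_auc(labels, audio_durations_seconds)
```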

Conclusion: We demonstrate that recording biases in audio duration and intensity created dataset-specific differences between patients and controls, which the models exploited to improve classification. Clinicians' ratings provide further evidence that patients were over-projecting their voices and were recorded at a higher signal amplitude than controls. Notably, after matching audio duration and removing intensity-related variables to mitigate these biases, the models still achieved similarly high performance. We provide a set of recommendations for avoiding bias when building and evaluating machine learning models for screening in laryngology.
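A sketch of the two mitigation steps, under assumed data structures (waveforms as a list of sample arrays; features in a pandas DataFrame with hypothetical column names):

```python
import pandas as pd

def match_duration(waveforms):
    """Trim every recording to the shortest duration so that audio length
    can no longer leak group membership."""
    min_len = min(len(w) for w in waveforms)
    return [w[:min_len] for w in waveforms]

def drop_intensity_features(features: pd.DataFrame) -> pd.DataFrame:
    """Drop columns whose names suggest intensity/loudness/energy content."""
    intensity_like = [c for c in features.columns
                      if any(k in c.lower() for k in ("intensity", "loudness", "energy"))]
    return features.drop(columns=intensity_like)
```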

Keywords: acoustic analysis; bias; explainability; interpretability; machine learning; speech; vocal fold paralysis; voice.


Figures

Figure 1. Schematic of speech production and the process of extracting certain acoustic features from an audio signal.
(A) Speech production, (B) recording characteristics, (C) waveform of audio signal with fundamental frequency (f0), (D) spectrogram with formants F1-F3 and intensity, (E) mel-frequency cepstral coefficients (MFCCs). Full description in the main text.
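For readers who want to reproduce features like those in panels C and E, a minimal sketch using librosa (one of several possible toolkits, not necessarily the authors'; the filename is a placeholder, and formants would require an LPC/Praat-based tool such as praat-parselmouth):

```python
import librosa

y, sr = librosa.load("recording.wav", sr=None)  # placeholder filename

# Fundamental frequency (f0) track, bounded to a plausible voice range (panel C)
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)

# 13 mel-frequency cepstral coefficients per frame (panel E)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```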
Figure 2. Distribution of audio duration for reading and vowel tasks split by group reveals a dataset bias.
The mode of the audio durations for the controls is 3.5 s for reading samples and 4.11 s for vowel samples.
Figure 3. Model performance comparison using non-redundant features, evaluated with a permutation test.
Scores from models trained on true labels (blue) and trained on permuted labels (orange) over bootstrapping splits.
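scikit-learn ships a helper that implements this kind of comparison; the estimator and permutation count below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X = np.random.rand(154, 20)   # acoustic features (illustrative)
y = np.repeat([0, 1], 77)     # controls vs. UVFP

# Fits on true labels and on label permutations; the permuted scores form
# the empirical null distribution (orange), the true score the blue one
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    scoring="roc_auc", n_permutations=1000, random_state=0)
```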
Figure 4. Feature importance parallel coordinate plot.
Rank reads from bottom (most important) to top (least important). The mean rank is weighted by each model's performance so that a lower-performing model does not bias the mean rank.
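One plausible reading of this weighting (an assumption; the paper may define it differently) is a mean of per-model ranks weighted by each model's ROC AUC:

```python
import numpy as np

ranks = np.array([[1, 2, 3],     # model A's rank of features f1..f3
                  [2, 1, 3],     # model B
                  [1, 3, 2]])    # model C
aucs = np.array([0.87, 0.80, 0.75])   # illustrative model performances

# Better-performing models contribute more to each feature's mean rank
weighted_mean_rank = (ranks * aucs[:, None]).sum(axis=0) / aucs.sum()
```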
Figure 5. Distributions for top 5 features and corresponding performance for single features.
Logistic Regression with L1 penalty was used. No single feature is enough to dissociate groups with high performance. Null models’ median performance was 0.5.
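A sketch of this single-feature analysis with scikit-learn (synthetic data; the paper's cross-validation scheme is not reproduced):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(154, 5)   # top-5 features (illustrative)
y = np.repeat([0, 1], 77)    # controls vs. UVFP

# Fit an L1-penalized logistic regression on one feature at a time
for j in range(X.shape[1]):
    clf = LogisticRegression(penalty="l1", solver="liblinear")
    aucs = cross_val_score(clf, X[:, [j]], y, scoring="roc_auc", cv=5)
    print(f"feature {j}: median AUC = {np.median(aucs):.2f}")
```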
Figure 6. Feature redundancy with top 5 features highlighted.
Top 5 features are highlighted in bold and their rank is displayed. Squares are clusters of redundant features. Computed with all participants on the reading task.
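One common way to build such redundancy clusters (the caption does not specify the paper's exact method) is hierarchical clustering on a Spearman-correlation distance:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

X = np.random.rand(154, 20)            # acoustic features (illustrative)

corr = spearmanr(X).correlation        # feature-by-feature rank correlations
dist = 1 - np.abs(corr)                # highly correlated -> small distance
Z = linkage(squareform(dist, checks=False), method="average")
clusters = fcluster(Z, t=0.2, criterion="distance")  # redundant-feature groups
```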
Figure 7. Descriptive statistics and inter-rater reliability of clinician ratings for unilateral vocal fold paralysis (UVFP), background noise, and recording loudness indicating likely bias.
Controls and UVFP are ground-truth diagnoses from the full clinical interview. Ratings are on brief reading samples. Bars indicate the maximum and minimum count across the three raters. The disproportionate number of UVFP samples rated as having high background noise and high loudness indicates likely bias: the gain might have been raised for some UVFP patients, and they may have phonated more intensely. kappa: Light's kappa; ICC: intra-class correlation coefficient.
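Both reliability statistics can be computed along these lines; Light's kappa is the mean of pairwise Cohen's kappas across raters, and the ICC is available in the pingouin package (the ratings below are placeholder data):

```python
from itertools import combinations
import numpy as np
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# One column per rater, one row per recording (placeholder ratings)
ratings = pd.DataFrame({"rater1": [1, 0, 1, 1],
                        "rater2": [1, 0, 0, 1],
                        "rater3": [1, 1, 1, 1]})

# Light's kappa: average Cohen's kappa over all rater pairs
lights_kappa = np.mean([cohen_kappa_score(ratings[a], ratings[b])
                        for a, b in combinations(ratings.columns, 2)])

# ICC expects long format: one row per (recording, rater) pair
long = ratings.reset_index().melt(id_vars="index",
                                  var_name="rater", value_name="score")
icc = pg.intraclass_corr(data=long, targets="index",
                         raters="rater", ratings="score")
```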
Figure 8. How clinicians rate the audio recordings of read speech: descriptive statistics and inter-rater reliability of average clinician ratings.
The average across raters was taken for each recording. ICC: intra-class correlation coefficient.


