Interspeech. 2024 Sep;2024:937-941.
doi: 10.21437/interspeech.2024-1855.

YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection

Xuanru Zhou et al. Interspeech. 2024 Sep.

Abstract

Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems [1, 2], which lack efficiency and robustness and are sensitive to template design. In this paper, we propose YOLO-Stutter: the first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes an imperfect speech-text alignment as input, followed by a spatial feature aggregator and a temporal dependency extractor, to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimal number of trainable parameters on both simulated data and real aphasia speech. Code and datasets are open-sourced at https://github.com/rorizzz/YOLO-Stutter.
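To make the abstract's pipeline concrete, below is a minimal, hypothetical PyTorch sketch of how a spatial feature aggregator and a temporal dependency extractor could turn a soft speech-text alignment matrix into per-frame dysfluency class and boundary predictions. All module names, layer sizes, and the number of dysfluency classes are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

# Hypothetical sketch of the detection pipeline described in the abstract.
# Shapes and modules are assumptions made for illustration only.
import torch
import torch.nn as nn

class DysfluencyDetector(nn.Module):
    def __init__(self, num_classes: int = 6, hidden: int = 128):
        super().__init__()
        # Spatial feature aggregator: convolutions over the (text x time)
        # soft-alignment matrix, then pooling away the text axis.
        self.spatial = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the text dimension
        )
        # Temporal dependency extractor: bidirectional LSTM over time frames.
        self.temporal = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        # Region-wise heads: dysfluency class and (start, end) bounds per frame.
        self.cls_head = nn.Linear(2 * hidden, num_classes)
        self.bound_head = nn.Linear(2 * hidden, 2)

    def forward(self, alignment: torch.Tensor):
        # alignment: (batch, text_len, time_frames) soft-alignment matrix
        x = self.spatial(alignment.unsqueeze(1))     # (B, 64, 1, T)
        x = x.squeeze(2).transpose(1, 2)             # (B, T, 64)
        x, _ = self.temporal(x)                      # (B, T, 2*hidden)
        return self.cls_head(x), self.bound_head(x)  # targets and bounds

# Usage with random data standing in for a real alignment matrix
model = DysfluencyDetector()
align = torch.rand(2, 50, 200)      # 2 utterances, 50 text tokens, 200 frames
classes, bounds = model(align)
print(classes.shape, bounds.shape)  # (2, 200, 6) and (2, 200, 2)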

Keywords: clinical; dysfluency; end-to-end; simulation.

Figures

Figure 1:
YOLO-Stutter method’s workflow: starting from the reference text and its IPA sequence, we apply TTS rules and the VITS model to generate dysfluent speech along with its dysfluent alignment. Using pretrained VITS speech and text encoders, we produce a soft-alignment matrix from the reference text and the mel spectrogram of the dysfluent speech. The matrix is then processed through a spatial feature aggregator and a temporal dependency extractor, yielding the predicted targets and bounds. The architecture of our spatial feature aggregator block is illustrated in the left green block. Examples of our simulated speech are available at https://bit.ly/3PkKE8W
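As a hedged illustration of the soft-alignment step in the caption, the snippet below forms an alignment matrix as a softmax-normalized similarity between text-encoder and speech-encoder outputs. The paper uses pretrained VITS encoders; the random tensors here merely stand in for their outputs, and the function name, scaling, and normalization axis are assumptions for illustration.

# Hypothetical construction of a soft-alignment matrix between text tokens
# and speech frames; encoder outputs are simulated with random tensors.
import torch
import torch.nn.functional as F

def soft_alignment(text_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
    # text_emb: (text_len, d), speech_emb: (time_frames, d)
    # returns: (text_len, time_frames) soft-alignment matrix
    scores = text_emb @ speech_emb.T / text_emb.shape[-1] ** 0.5  # scaled dot product
    return F.softmax(scores, dim=0)  # each frame distributes mass over text tokens

text_emb = torch.randn(50, 192)     # stand-in for phoneme-level text encodings
speech_emb = torch.randn(200, 192)  # stand-in for frame-level speech encodings
A = soft_alignment(text_emb, speech_emb)
print(A.shape)  # torch.Size([50, 200])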

References

    1. Lian J, Feng C, Farooqi N, Li S, Kashyap A, Cho CJ, Wu P, Netzorg R, Li T, and Anumanchipalli GK, “Unconstrained dysfluency modeling for dysfluent speech transcription and detection,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.
    2. Lian J and Anumanchipalli G, “Towards hierarchical spoken language dysfluency modeling,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, 2024.
    3. Pálfy J and Pospíchal J, “Pattern search in dysfluent speech,” in 2012 IEEE International Workshop on Machine Learning for Signal Processing, 2012, pp. 1–6.
    4. Kouzelis T, Paraskevopoulos G, Katsamanis A, and Katsouros V, “Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling,” in Interspeech, 2023.
    5. Brady MC, Kelly H, Godwin J, Enderby P, and Campbell P, “Speech and language therapy for aphasia following stroke,” Cochrane Database of Systematic Reviews, no. 6, 2016.