YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection
- PMID: 40620641
- PMCID: PMC12226351
- DOI: 10.21437/Interspeech.2024-1855
Abstract
Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems [1, 2], which lack efficiency and robustness and are sensitive to template design. In this paper, we propose YOLO-Stutter: the first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect speech-text alignment as input, followed by a spatial feature aggregator and a temporal dependency extractor, to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, which simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimal number of trainable parameters on both simulated data and real aphasia speech. Code and datasets are open-sourced at https://github.com/rorizzz/YOLO-Stutter.
Keywords: clinical; dysfluency; end-to-end; simulation.
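
The pipeline described in the abstract (alignment features passed through a spatial feature aggregator and a temporal dependency extractor, followed by region-wise boundary and class predictions) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the module designs (1-D convolutions for spatial aggregation, a BiLSTM for temporal modeling), the feature dimensions, the fixed region count, and the per-region output head (normalized start/end offsets plus class logits over the five dysfluency types) are all assumptions for the sketch.

```python
import torch
import torch.nn as nn

class RegionWiseDysfluencyDetector(nn.Module):
    """Minimal sketch of a YOLO-style region-wise dysfluency detector.

    Assumptions (not from the paper): input is a (batch, time, feat_dim)
    speech-text alignment feature map; the model emits one prediction per
    temporal region, consisting of boundary offsets and class logits.
    """

    def __init__(self, feat_dim=128, hidden=256, num_classes=5, num_regions=16):
        super().__init__()
        # Spatial feature aggregator: 1-D convolutions over the
        # alignment features (hypothetical design choice).
        self.spatial = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Temporal dependency extractor: a bidirectional LSTM
        # (assumed; the paper only names the component).
        self.temporal = nn.LSTM(hidden, hidden // 2, batch_first=True,
                                bidirectional=True)
        self.num_regions = num_regions
        # Per-region head: 2 boundary offsets (start, end) + class logits.
        self.head = nn.Linear(hidden, 2 + num_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim) imperfect speech-text alignment features
        h = self.spatial(x.transpose(1, 2)).transpose(1, 2)   # (B, T, hidden)
        h, _ = self.temporal(h)                               # (B, T, hidden)
        # Pool the time axis into a fixed number of regions.
        h = nn.functional.adaptive_avg_pool1d(
            h.transpose(1, 2), self.num_regions).transpose(1, 2)
        out = self.head(h)                                    # (B, R, 2 + C)
        boundaries = torch.sigmoid(out[..., :2])  # normalized start/end times
        class_logits = out[..., 2:]                # dysfluency class scores
        return boundaries, class_logits

# Usage sketch with dummy alignment features.
model = RegionWiseDysfluencyDetector()
feats = torch.randn(2, 200, 128)
bounds, logits = model(feats)
print(bounds.shape, logits.shape)  # torch.Size([2, 16, 2]) torch.Size([2, 16, 5])
```

The five-way class head mirrors the five dysfluency types the abstract names (repetition, block, missing, replacement, prolongation); everything else in the sketch, including the region pooling, is illustrative.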
References
1. Lian J, Anumanchipalli G. Towards hierarchical spoken language dysfluency modeling. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2024.
2. Pálfy J, Pospíchal J. Pattern search in dysfluent speech. 2012 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2012, pp. 1–6.
3. Kouzelis T, Paraskevopoulos G, Katsamanis A, Katsouros V. Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling. Interspeech, 2023.