Proc IEEE Workshop Autom Speech Recognit Underst. 2023 Dec. doi: 10.1109/asru57964.2023.10389771.

UNCONSTRAINED DYSFLUENCY MODELING FOR DYSFLUENT SPEECH TRANSCRIPTION AND DETECTION

Jiachen Lian et al. Proc IEEE Workshop Autom Speech Recognit Underst. 2023 Dec.

Abstract

Dysfluent speech modeling requires time-accurate and silence-aware transcription at both the word level and the phonetic level. However, current research in dysfluency modeling focuses primarily on either transcription or detection, and performance on each remains limited. In this work, we present an unconstrained dysfluency modeling (UDM) approach that addresses both transcription and detection in an automatic and hierarchical manner. UDM eliminates the need for extensive manual annotation by providing a comprehensive solution. Furthermore, we introduce a simulated dysfluent dataset called VCTK++ to enhance the capabilities of UDM in phonetic transcription. Our experimental results demonstrate the effectiveness and robustness of the proposed methods in both transcription and detection tasks.

Keywords: detection; dysfluent speech; transcription.
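
To make the abstract's phrase "time-accurate and silence-aware transcription" concrete, the following is a minimal illustrative sketch in Python. It is not the UDM method: the `Segment` type, the `<sil>` label, the 0.5-second silence threshold, and the toy transcript are all assumptions introduced here. The sketch only shows why timestamps and explicit silence segments matter, since simple rules over such a transcript can already flag candidate blocks and repetitions.

```python
from dataclasses import dataclass
from typing import List, Tuple

SIL = "<sil>"  # hypothetical label for a silence segment (an assumption)


@dataclass(frozen=True)
class Segment:
    """One time-aligned unit (word or phoneme) of a transcript."""
    label: str
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def find_blocks(segments: List[Segment], min_silence: float = 0.5) -> List[Segment]:
    """Return silence segments at least `min_silence` seconds long.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [s for s in segments if s.label == SIL and s.duration >= min_silence]


def find_repetitions(segments: List[Segment]) -> List[Tuple[str, float, float]]:
    """Return (label, first_start, last_end) for runs of repeated labels.

    Silences are skipped, so "you <sil> you" still counts as a repetition.
    """
    spoken = [s for s in segments if s.label != SIL]
    events = []
    i = 0
    while i < len(spoken):
        j = i
        # Extend the run while the next spoken unit has the same label.
        while j + 1 < len(spoken) and spoken[j + 1].label == spoken[i].label:
            j += 1
        if j > i:
            events.append((spoken[i].label, spoken[i].start, spoken[j].end))
        i = j + 1
    return events


# Toy word-level transcript with made-up timings (not data from the paper).
transcript = [
    Segment("you", 0.00, 0.20),
    Segment("you", 0.25, 0.45),
    Segment(SIL, 0.45, 1.20),
    Segment("wish", 1.20, 1.50),
    Segment("to", 1.50, 1.60),
    Segment("know", 1.60, 1.95),
]

if __name__ == "__main__":
    print("blocks:", find_blocks(transcript))
    print("repetitions:", find_repetitions(transcript))
```

A transcript without timestamps or silence segments would collapse this utterance to "you wish to know" and lose both events, which is the limitation the abstract points to.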

Figures

Fig. 1:
Unconstrained Dysfluency Modeling (transcription and detection) for an instance of aphasic speech. The reference text is “You wish to know all about my grandfather,” but the human transcription (the ground truth) differs significantly from it. Whisper recognizes it as perfect speech, while UFA captures most of the dysfluency patterns. An audio sample accompanies the original figure.
Fig. 2:
UFA Module
Fig. 3:
Phonetic-Level Dysfluency Detection. Audio samples accompany the original figure.