Behav Res Methods. 2021 Apr;53(2):744-756.
doi: 10.3758/s13428-020-01449-6.

A tool for efficient and accurate segmentation of speech data: announcing POnSS

Joe Rodd et al. Behav Res Methods. 2021 Apr.

Abstract

Despite advances in automatic speech recognition (ASR), human input is still essential for producing research-grade segmentations of speech data. Conventional approaches to manual segmentation are very labor-intensive. We introduce POnSS, a browser-based system specialized for segmenting the onsets and offsets of words, which combines aspects of ASR with limited human input. In developing POnSS, we identified several sub-tasks of segmentation and implemented each as a separate interface for annotators, streamlining their task as much as possible. We evaluated segmentations made with POnSS against a baseline of segmentations of the same data made conventionally in Praat. We observed that POnSS achieved reliability comparable to segmentation using Praat but required 23% less annotator time. Because it is more efficient without sacrificing reliability, POnSS represents a distinct methodological advance for the segmentation of speech data.

Keywords: Segmentation; Speech data.

Figures

Fig. 1
A diagrammatic representation of the annotation process. See the text for full details
Fig. 2
Screenshots of the browser interfaces for the orthographic transcription (left), triage (middle), and retrimming tasks (right) in POnSS
Fig. 3
Panel a: the observed distributions of the difference between segmented times and the median segmentation for each word, for POnSS and manual annotation modalities (colors). Panel b: an example of the optimized mixture-model fit (orange) to the observed distribution of one of the samples (black line). Panel c: Solid violins show the posteriors of Model 1 (see text) for the effect of modality on the sigma, with median (points), 95% HDIs (highest density intervals, thin black lines) and 66% HDIs (thick black lines)
Fig. 4
Distributions of bootstrap-resampled estimates of how many annotator hours it would take to yield 5000 well-segmented words by the two modalities (translucent violins). Overlaid are solid violins showing the posteriors of Model 2 for the effect of modality, with medians (points). The 95% and 66% HDIs are too narrow to see in the figure
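
The bootstrap estimate described in the Fig. 4 caption can be illustrated with a minimal sketch. This is not the authors' analysis code: the per-session record format (annotator seconds, words segmented), the function names, and the toy numbers are all assumptions made for illustration, and the interval shown is a plain percentile interval rather than an HDI.

```python
import random


def bootstrap_hours_for_target(sessions, target_words=5000, n_boot=2000, seed=0):
    """Bootstrap the annotator hours needed to reach `target_words`.

    `sessions` is a list of (annotator_seconds, words_segmented) pairs.
    This record format is hypothetical, chosen only for the sketch.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample sessions with replacement, keeping the sample size fixed.
        sample = [rng.choice(sessions) for _ in sessions]
        total_seconds = sum(seconds for seconds, _ in sample)
        total_words = sum(words for _, words in sample)
        if total_words == 0:
            continue  # no words in this resample, so no rate can be computed
        seconds_per_word = total_seconds / total_words
        estimates.append(seconds_per_word * target_words / 3600.0)
    return estimates


def percentile_interval(values, level=0.95):
    """Percentile interval (not an HDI) over a list of bootstrap estimates."""
    ordered = sorted(values)
    lo_idx = int((1 - level) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Made-up toy data: (annotator seconds, words segmented) per session.
toy_sessions = [(600, 50), (720, 55), (540, 48), (660, 52), (630, 51)]
estimates = bootstrap_hours_for_target(toy_sessions)
low, high = percentile_interval(estimates)
print(f"hours for 5000 words: {low:.1f} to {high:.1f}")
```

Running the same function on session records from each modality would give two distributions of estimated hours that can be compared, which is the kind of comparison the translucent violins in Fig. 4 depict.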
