Lang Cogn Neurosci. 2018;34(1):43-68.
doi: 10.1080/23273798.2018.1500698. Epub 2018 Jul 30.

Structure in talker variability: How much is there and how much can it help?

Dave F Kleinschmidt. Lang Cogn Neurosci. 2018.

Abstract

One of the persistent puzzles in understanding human speech perception is how listeners cope with talker variability. One thing that might help listeners is structure in talker variability: rather than varying randomly, talkers of the same gender, dialect, age, etc. tend to produce language in similar ways. Listeners are sensitive to this covariation between linguistic variation and socio-indexical variables. In this paper I present new techniques based on ideal observer models to quantify (1) the amount and type of structure in talker variation (informativity of a grouping variable), and (2) how useful such structure can be for robust speech recognition in the face of talker variability (the utility of a grouping variable). I demonstrate these techniques in two phonetic domains (word-initial stop voicing and vowel identity) and show that these domains have different amounts and types of talker variability, consistent with previous, impressionistic findings. An R package (phondisttools) accompanies this paper, and the source and data are available from osf.io/zv6e3.

Keywords: Speech perception; computational modelling; variability.

Conflict of interest statement

Disclosure statement: No potential conflict of interest was reported by the authors.

Figures

Figure 1.
How well a listener can recognise the phonetic category (A, e.g. /s/ vs. /ʃ/; loosely based on Newman et al., 2001) a talker is producing depends on what the listener knows about the underlying cue distributions (B). These distributions vary across talkers, which results in variability in the best category boundary. Each talker’s cue distributions can be characterised by their parameters (C; e.g. the mean of /s/, mean of /ʃ/, variance of /s/, etc.; together denoted θ). Each point in C corresponds to a pair of distributions in B and one category boundary in A. Groups of talkers are thus distributions in this high-dimensional space (C, ellipses); marginalising (averaging) over a group smears out the category-specific distributions (thick lines in B) and thus the category boundary (A). Thus, Jose’s /s/ and /ʃ/ are best classified using his own distributions (purple), in the sense that this leads to a steeper boundary at a different cue value compared to the boundary from the marginal distributions over all talkers (gray) or other males (light blue).
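The relationship between cue distributions and the category boundary described in this caption can be illustrated with a minimal sketch, assuming each category is a univariate Gaussian over a single cue. The parameter values are hypothetical and are not taken from the paper, and the marginal (group-level) distributions are modelled simply as wider Gaussians with shifted means rather than as a true average over talkers.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_s(x, params, prior_s=0.5):
    """Posterior probability of /s/ given cue value x (Bayes' rule, two categories)."""
    like_s = gaussian_pdf(x, *params["s"])
    like_sh = gaussian_pdf(x, *params["sh"])
    return prior_s * like_s / (prior_s * like_s + (1.0 - prior_s) * like_sh)


# Hypothetical (mean, sd) pairs in arbitrary cue units; assumptions for illustration only.
talker = {"s": (6.0, 0.5), "sh": (4.0, 0.5)}
# Stand-in for distributions marginalised over a group of talkers: wider and shifted.
marginal = {"s": (6.5, 1.5), "sh": (4.5, 1.5)}

for x in (4.5, 5.0, 5.5, 6.0):
    print(x, round(posterior_s(x, talker), 3), round(posterior_s(x, marginal), 3))
```

Under these assumed values, the talker-specific posterior crosses 0.5 at a different cue value and changes more steeply around its boundary than the posterior computed from the wider marginal distributions.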
Figure 2.
Gender-specific distributions of vowel formants for /i/ appear to diverge from the overall (marginal) distributions (A), whereas for VOT the gender-specific distributions are essentially indistinguishable from the marginal distributions. Intuitively, this makes gender informative for vowel formants, but not for VOT (see also vowels in Perry et al., 2001; vs. VOT in Morris, McCrea, & Herring, 2008). The proposed approach formalises this intuition in a quantitative measure that can be applied to directly compare talker variability across different cues, phonetic contrasts, and socio-indexical grouping variables. Vowel data is drawn from the Nationwide Speech Project, and VOT from the Buckeye corpus (see below for more details).
Figure 3.
Socio-indexical variables are more informative about cue distributions for vowel formants (HN15, Heald & Nusbaum, 2015; NSP, Clopper & Pisoni, 2006b) than for stop voicing (VOT), even after Lobanov normalisation. On top of this, more specific groupings (like Talker and Dialect+Gender) are more informative than broader groupings (Gender). Each open point shows one group (e.g. male for Gender), while shaded points show the average over groups. Gray violins show the null distribution of average informativity (KL estimated from 1000 datasets with randomly permuted group labels), and stars show significance of the variable’s average KL with respect to this null distribution (*p<0.05,**p<0.01,***p<0.001).
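The informativity measure and its permutation null can be sketched as follows. This is a simplified illustration, not the phondisttools implementation: it assumes a single cue with univariate Gaussian distributions, takes the KL divergence of each group-specific distribution from the marginal (the direction is an assumption), and permutes labels over individual observations rather than over talkers. The data are synthetic.

```python
import math
import random
import statistics


def gaussian_kl(mean_p, sd_p, mean_q, sd_q):
    """KL divergence KL(P || Q) in nats between two univariate normal distributions."""
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * sd_q ** 2)
            - 0.5)


def average_informativity(values, labels):
    """Mean over groups of KL(group-specific Gaussian || marginal Gaussian)."""
    marg_mean = statistics.fmean(values)
    marg_sd = statistics.stdev(values)
    kls = []
    for group in sorted(set(labels)):
        member = [v for v, g in zip(values, labels) if g == group]
        kls.append(gaussian_kl(statistics.fmean(member), statistics.stdev(member),
                               marg_mean, marg_sd))
    return statistics.fmean(kls)


def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Observed average KL and the proportion of permutations with the same or larger KL."""
    rng = random.Random(seed)
    observed = average_informativity(values, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomly reassign group labels
        if average_informativity(values, shuffled) >= observed:
            count += 1
    return observed, count / n_perm


# Synthetic cue values for two hypothetical groups (not data from the paper).
data_rng = random.Random(1)
values = ([data_rng.gauss(300.0, 30.0) for _ in range(50)]
          + [data_rng.gauss(400.0, 30.0) for _ in range(50)])
labels = ["group_a"] * 50 + ["group_b"] * 50

observed, p_value = permutation_p_value(values, labels)
print("average KL (nats):", round(observed, 3), "permutation p:", p_value)
```

A version closer to the analysis described in the caption would fit distributions per phonetic category and permute group labels at the level of talkers, so that each permuted dataset keeps every talker's tokens together.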
Figure 4.
Individual vowels vary substantially in the informativity of grouping variables about their cue distributions. Only normalised F1×F2 is shown to emphasise dialect effects. Large dots show the average over dialects (+genders), while the small dots show individual dialects (+genders) (see Figure 5 for detailed breakdown of individual dialect effects). The grey violins show the vowel-specific null distributions of the averages, estimated based on 1000 datasets with randomly permuted group labels, and stars show permutation test p value (proportion of random permutations with the same or larger KL divergence), with false discovery rate correction for multiple comparisons (Benjamini & Hochberg, 1995).
Figure 5.
Breaking down the overall informativity of dialect by individual dialects (left) and dialect-vowel combinations (right). Some dialects are more informative about Lobanov-normalised vowel distributions than random groupings of the same number of talkers (grey violins), but some are not (at least in the current sample of talkers). Likewise for individual vowels within dialects. Moreover, dialects can be informative on average but not have any individual vowels that are informative alone (e.g. South), and vice versa (e.g. Midland). Stars show p values from permutation test (*p<0.05,**p<0.01,***p<0.001) corrected for false-discovery rate across all dialects/dialect-vowel combinations (Benjamini & Hochberg, 1995).
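The false discovery rate correction cited in these captions can be sketched as a standard Benjamini-Hochberg step-up adjustment. The function name and the example p-values are hypothetical, and this is not code from the paper's accompanying package.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw permutation p-values, one per dialect-vowel combination.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each adjusted value is then compared against the chosen thresholds in place of the raw p-value.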
Figure 6.
Average information-gain in log-odds relative to chance (top) measures the utility of each grouping variable. Bottom shows posterior probability of correct category for comparison. Small points show individual talkers. Large points and lines show mean and bootstrapped 95% CIs over talkers (see text for details).
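One way to read the utility measure in this caption is sketched below, assuming that information gain is the log-odds of the posterior probability of the correct category minus the log-odds of chance (one over the number of categories), and that the confidence interval is a percentile bootstrap over talkers. Both the exact definition and the per-talker values are assumptions for illustration; the paper's text gives the actual details.

```python
import math
import random
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def log_odds_gain(p_correct, n_categories):
    """Log-odds of correct classification relative to chance (1 / n_categories)."""
    return logit(p_correct) - logit(1.0 / n_categories)


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of the values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample talkers with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2.0) * n_boot)]
    upper = means[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return lower, upper


# Hypothetical per-talker posterior probabilities of the correct category.
per_talker_accuracy = [0.62, 0.70, 0.55, 0.81, 0.66, 0.74]
n_categories = 11  # assumed number of categories, for illustration only

gains = [log_odds_gain(p, n_categories) for p in per_talker_accuracy]
print("mean gain:", round(statistics.fmean(gains), 3))
print("95% CI:", tuple(round(b, 3) for b in bootstrap_ci(gains)))
```

Working in log-odds rather than raw probability keeps gains comparable across tasks with different numbers of categories, since chance performance maps to zero in every case.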
Figure 7.
The advantage of knowing a talker’s dialect varies by dialect. Knowing that a talker comes from the North region provides a consistent benefit, regardless of cues (Hz or Lobanov-normalised) or baseline (marginal or gender). Otherwise, dialect does not provide consistent information gain except when using Lobanov-normalised cue values, and even then it varies by dialect. Each point shows one talker, the error bars show bootstrapped 95% CIs by talker, and the stars show bootstrapped p-values adjusted for false discovery rate (Benjamini & Hochberg, 1995).
Figure 8.
The information gained from knowing a talker’s dialect also varies by the particular vowel. Vowels undergoing active sound change in multiple dialects of American English (like /æ/, /ɛ/, /ɑ/, and /u/) tend to benefit more from knowing dialect. (Single talker estimates of information gain are not shown because the small sample size n ≤ 5 for individual talkers makes them numerically unstable, while the overall log-odds ratios calculated from the mean accuracies are more stable.) CIs are 95% bootstrapped CIs for the mean over talkers. All p>0.01 (corrected for false discovery rate), and whether an individual p value is less or greater than p=0.05 is sensitive to the bootstrap and subsampling randomisation, so stars are not shown.

References

    1. Adank P, Smits R, & van Hout R (2004). A comparison of vowel normalization procedures for language variation research. The Journal of the Acoustical Society of America, 116(5), 3099–3107. doi:10.1121/1.1795335
    2. Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, & Chang W (2017). rmarkdown: Dynamic documents for R. R package version 1.7. Retrieved from https://CRAN.R-project.org/package=rmarkdown.
    3. Allen JS, Miller JL, & DeSteno D (2003). Individual talker differences in voice-onset-time. The Journal of the Acoustical Society of America, 113(1), 544–552. doi:10.1121/1.1528172
    4. Bejjanki VR, Clayards M, Knill DC, & Aslin RN (2011). Cue integration in categorical tasks: Insights from audio-visual speech perception. PLoS ONE, 6(5), e19812. doi:10.1371/journal.pone.0019812
    5. Benjamini Y, & Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289–300. JSTOR: 2346101