Lang Cogn Neurosci. 2018;34(1):43-68.

doi: 10.1080/23273798.2018.1500698. Epub 2018 Jul 30.

Structure in talker variability: How much is there and how much can it help?

Dave F Kleinschmidt¹ ²

Affiliations

¹ Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA.
² Department of Brain and Cognitive Sciences, University of Rochester, New York, NY, USA.

PMID: 30619905
PMCID: PMC6320234
DOI: 10.1080/23273798.2018.1500698

Dave F Kleinschmidt. Lang Cogn Neurosci. 2018.

Abstract

One of the persistent puzzles in understanding human speech perception is how listeners cope with talker variability. One thing that might help listeners is structure in talker variability: rather than varying randomly, talkers of the same gender, dialect, age, etc. tend to produce language in similar ways. Listeners are sensitive to this covariation between linguistic variation and socio-indexical variables. In this paper, I present new techniques based on ideal observer models to quantify (1) the amount and type of structure in talker variation (the informativity of a grouping variable), and (2) how useful such structure can be for robust speech recognition in the face of talker variability (the utility of a grouping variable). I demonstrate these techniques in two phonetic domains (word-initial stop voicing and vowel identity) and show that these domains have different amounts and types of talker variability, consistent with previous, impressionistic findings. An R package (phondisttools) accompanies this paper, and the source and data are available from osf.io/zv6e3.

Keywords: Speech perception; computational modelling; variability.

Conflict of interest statement

Disclosure statement: No potential conflict of interest was reported by the authors.

Figures

**Figure 1.**
How well a listener can recognise the phonetic category (A, e.g. /s/ vs. /ʃ/; loosely based on Newman et al., 2001) a talker is producing depends on what the listener knows about the underlying cue distributions (B). These distributions *vary* across talkers, which results in variability in the best category boundary. Each talker’s cue distributions can be characterised by their *parameters* (C; e.g. the mean of /s/, mean of /ʃ/, variance of /s/, etc.; together denoted θ). Each point in C corresponds to a pair of distributions in B and one category boundary in A. *Groups* of talkers are thus distributions in this high-dimensional space (C, ellipses); marginalising (averaging) over a group smears out the category-specific distributions (thick lines in B) and thus the category boundary (A). Thus, Jose’s /s/ and /ʃ/ are best classified using his own distributions (purple), in the sense that this leads to a steeper boundary at a different cue value compared to the boundary from the *marginal* distributions over all talkers (gray) or other males (light blue).
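
The relationship between cue distributions and category boundaries described in this caption can be illustrated with a two-category ideal observer. The sketch below is a minimal Python illustration, not code from the paper or from phondisttools. The cue values, the Gaussian form of the marginal distributions, and all names are assumptions made for the example.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_s(x, s_params, sh_params, prior_s=0.5):
    """Posterior probability of /s/ given cue value x, by Bayes' rule."""
    joint_s = normal_pdf(x, *s_params) * prior_s
    joint_sh = normal_pdf(x, *sh_params) * (1.0 - prior_s)
    return joint_s / (joint_s + joint_sh)

# Hypothetical (mean, sd) of a spectral cue, arbitrary units.
talker = {"s": (6.0, 0.5), "sh": (4.0, 0.5)}
# Assumed marginal distributions: broader and shifted after averaging
# over talkers (treated as Gaussian here purely for simplicity).
marginal = {"s": (6.5, 1.5), "sh": (4.5, 1.5)}

# With equal priors and equal variances, the boundary is the midpoint of the means.
talker_boundary = (talker["s"][0] + talker["sh"][0]) / 2.0
marginal_boundary = (marginal["s"][0] + marginal["sh"][0]) / 2.0

# Posterior half a unit above each boundary: a steeper boundary is further from 0.5.
talker_step = posterior_s(talker_boundary + 0.5, talker["s"], talker["sh"])
marginal_step = posterior_s(marginal_boundary + 0.5, marginal["s"], marginal["sh"])
print(talker_boundary, marginal_boundary, talker_step, marginal_step)
```

Under these assumed parameters, the talker-specific boundary sits at a different cue value and is steeper than the boundary derived from the marginal distributions, which is the pattern the caption describes.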

**Figure 2.**
Gender-specific distributions of vowel formants for /i/ appear to diverge from the overall (marginal) distributions (A), whereas for VOT the gender-specific distributions are essentially indistinguishable from the marginal distributions. Intuitively, this makes gender informative for vowel formants, but not for VOT (see also vowels in Perry et al., 2001; vs. VOT in Morris, McCrea, & Herring, 2008). The proposed approach formalises this intuition in a quantitative measure that can be applied to directly compare talker variability across different cues, phonetic contrasts, and socio-indexical grouping variables. Vowel data is drawn from the Nationwide Speech Project, and VOT from the Buckeye corpus (see below for more details).

**Figure 3.**
Socio-indexical variables are more informative about cue distributions for vowel formants (HN15, Heald & Nusbaum, 2015; NSP, Clopper & Pisoni, 2006b) than for stop voicing (VOT), even after Lobanov normalisation. On top of this, more specific groupings (like Talker and Dialect+Gender) are more informative than broader groupings (Gender). Each open point shows one group (e.g. *male* for *Gender*), while shaded points show the average over groups. Gray violins show the null distribution of average informativity (KL estimated from 1000 datasets with randomly permuted group labels), and stars show significance of the variable’s average KL with respect to this null distribution (\* p < 0.05, \*\* p < 0.01, \*\*\* p < 0.001).
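
The informativity measure and its permutation null can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the paper's implementation: it uses a single cue with univariate normal fits, takes the KL divergence of each group's distribution from the marginal distribution (the direction is an assumption), and permutes labels over individual synthetic observations. All data and function names are hypothetical.

```python
import math
import random
import statistics

def gaussian_kl(mean_p, sd_p, mean_q, sd_q):
    """KL(P || Q) in nats for two univariate normal distributions."""
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * sd_q ** 2)
            - 0.5)

def average_informativity(values, labels):
    """Mean KL of each group's fitted normal from the marginal fitted normal."""
    m_mean, m_sd = statistics.fmean(values), statistics.pstdev(values)
    kls = []
    for group in sorted(set(labels)):
        xs = [v for v, g in zip(values, labels) if g == group]
        kls.append(gaussian_kl(statistics.fmean(xs), statistics.pstdev(xs),
                               m_mean, m_sd))
    return statistics.fmean(kls)

def permutation_p_value(values, labels, n_perm=1000, seed=1):
    """Proportion of label permutations with the same or larger average KL."""
    rng = random.Random(seed)
    observed = average_informativity(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if average_informativity(values, shuffled) >= observed:
            hits += 1
    return observed, hits / n_perm

# Synthetic cue values for two hypothetical groups (not real corpus data).
data_rng = random.Random(0)
values = ([data_rng.gauss(300.0, 30.0) for _ in range(40)]
          + [data_rng.gauss(400.0, 30.0) for _ in range(40)])
labels = ["a"] * 40 + ["b"] * 40

observed_kl, p_value = permutation_p_value(values, labels)
print(observed_kl, p_value)
```

A grouping variable whose groups differ from the marginal distribution yields an average KL well above the permuted values, so few or no permutations reach the observed value.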

**Figure 4.**
Individual vowels vary substantially in the informativity of grouping variables about their cue distributions. Only normalised F1×F2 is shown to emphasise dialect effects. Large dots show the average over dialects (+genders), while the small dots show individual dialects (+genders) (see Figure 5 for detailed breakdown of individual dialect effects). The grey violins show the vowel-specific null distributions of the averages, estimated based on 1000 datasets with randomly permuted group labels, and stars show permutation test p value (proportion of random permutations with the same or larger KL divergence), with false discovery rate correction for multiple comparisons (Benjamini & Hochberg, 1995).
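
The false discovery rate correction cited here (Benjamini & Hochberg, 1995) can be sketched in a few lines. The following Python function is an illustrative version of the standard step-up adjustment; it is intended to give the same adjusted values as R's `p.adjust(p, method = "BH")`. The input p values are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical permutation-test p values for four vowels.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```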

**Figure 5.**
Breaking down the overall informativity of dialect by individual dialects (left) and dialect-vowel combinations (right). Some dialects are more informative about Lobanov-normalised vowel distributions than random groupings of the same number of talkers (grey violins), but some are not (at least in the current sample of talkers). Likewise for individual vowels within dialects. Moreover, dialects can be informative on average but not have any individual vowels that are informative alone (e.g. South), and vice-versa (e.g. Midland). Stars show p values from permutation test (\* p < 0.05, \*\* p < 0.01, \*\*\* p < 0.001) corrected for false-discovery rate across all dialects/dialect-vowel combinations (Benjamini & Hochberg, 1995).
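
Lobanov normalisation, referred to in this and other captions, z-scores each formant within a talker. A minimal Python sketch follows; the formant values and talker labels are hypothetical, and the use of the sample standard deviation is an assumption rather than a detail taken from the paper.

```python
import statistics

def lobanov_normalise(formants_by_talker):
    """Z-score each formant within each talker.

    formants_by_talker maps a talker label to a list of (F1, F2) tuples in Hz.
    """
    normalised = {}
    for talker, tokens in formants_by_talker.items():
        columns = list(zip(*tokens))  # one tuple of values per formant
        means = [statistics.fmean(col) for col in columns]
        sds = [statistics.stdev(col) for col in columns]  # sample SD (assumed)
        normalised[talker] = [
            tuple((value - mean) / sd
                  for value, mean, sd in zip(token, means, sds))
            for token in tokens
        ]
    return normalised

# Hypothetical (F1, F2) measurements in Hz for two talkers.
raw_formants = {
    "talker_1": [(300.0, 2300.0), (650.0, 1700.0), (750.0, 1200.0), (350.0, 900.0)],
    "talker_2": [(250.0, 2100.0), (550.0, 1500.0), (650.0, 1100.0), (300.0, 800.0)],
}
normalised_formants = lobanov_normalise(raw_formants)
print(normalised_formants["talker_1"][0])
```

After normalisation, each talker's formants have mean 0 and standard deviation 1, which removes overall talker differences in formant range while preserving the relative positions of vowels.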

**Figure 6.**
Average information-gain in log-odds relative to chance (top) measures the utility of each grouping variable. Bottom shows posterior probability of correct category for comparison. Small points show individual talkers. Large points and lines show mean and bootstrapped 95% CIs over talkers (see text for details).
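
One way to read "information gain in log-odds relative to chance" is as the log-odds of the correct category minus the log-odds of guessing uniformly among the categories. The Python sketch below adopts that reading as an assumption; the exact definition is given in the paper's text. It also shows a percentile bootstrap over talkers. The per-talker probabilities are invented for illustration.

```python
import math
import random
import statistics

def log_odds(p):
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def gain_over_chance(p_correct, n_categories):
    """Log-odds of the correct category relative to uniform guessing (assumed definition)."""
    return log_odds(p_correct) - log_odds(1.0 / n_categories)

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lower = means[int((1.0 - level) / 2.0 * n_boot)]
    upper = means[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper

# Hypothetical posterior probabilities of the correct category, one per talker,
# for a two-category contrast such as stop voicing.
p_correct_by_talker = [0.55, 0.62, 0.70, 0.58, 0.66, 0.59, 0.73, 0.64]
gains = [gain_over_chance(p, n_categories=2) for p in p_correct_by_talker]

mean_gain = statistics.fmean(gains)
ci_lower, ci_upper = bootstrap_ci(gains)
print(mean_gain, ci_lower, ci_upper)
```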

**Figure 7.**
The advantage of knowing a talker’s dialect varies by dialect. Knowing that a talker comes from the North region provides a consistent benefit, regardless of cues (Hz or Lobanov-normalised) or baseline (marginal or gender). Otherwise, dialect does not provide consistent information gain except when using Lobanov-normalised cue values, and even then it varies by dialect. Each point shows one talker, the error bars show bootstrapped 95% CIs by talker, and the stars show bootstrapped p-values adjusted for false discovery rate (Benjamini & Hochberg, 1995).

**Figure 8.**
The information gained from knowing a talker’s dialect also varies by the particular vowel. Vowels undergoing active sound change in multiple dialects of American English (like /æ/, /ɛ/, /ɑ/, and /u/) tend to benefit more from knowing dialect. (Single talker estimates of information gain are not shown because the small sample size n ≤ 5 for individual talkers makes them numerically unstable, while the overall log-odds ratios calculated from the mean accuracies are more stable.) CIs are 95% bootstrapped CIs for the mean over talkers. All p>0.01 (corrected for false discovery rate), and whether an individual p value is less or greater than p=0.05 is sensitive to the bootstrap and subsampling randomisation, so stars are not shown.

Cited by

Computational Modeling of an Auditory Lexical Decision Experiment Using DIANA.
Nenadić F, Tucker BV, Ten Bosch L. Lang Speech. 2023 Sep;66(3):564-605. doi: 10.1177/00238309221111752. Epub 2022 Aug 24. PMID: 36000386. Free PMC article.
Reliability and validity for perceptual flexibility in speech.
Heffner CC, Fuhrmeister P, Luthra S, Mechtenberg H, Saltzman D, Myers EB. Brain Lang. 2022 Mar;226:105070. doi: 10.1016/j.bandl.2021.105070. Epub 2022 Jan 10. PMID: 35026449. Free PMC article.
Time and information in perceptual adaptation to speech.
Choi JY, Perrachione TK. Cognition. 2019 Nov;192:103982. doi: 10.1016/j.cognition.2019.05.019. Epub 2019 Jun 21. PMID: 31229740. Free PMC article.
Toward "English" Phonetics: Variability in the Pre-consonantal Voicing Effect Across English Dialects and Speakers.
Tanner J, Sonderegger M, Stuart-Smith J, Fruehwald J. Front Artif Intell. 2020 May 29;3:38. doi: 10.3389/frai.2020.00038. eCollection 2020. PMID: 33733155. Free PMC article.
Gender stereotypes and social perception of vocal confidence is mitigated by salience of socio-indexical cues to gender.
Roche JM, Asaro K, Morris BJ, Morgan SD. Front Psychol. 2023 Dec 14;14:1125164. doi: 10.3389/fpsyg.2023.1125164. eCollection 2023. PMID: 38155698. Free PMC article.

References

1. Adank P, Smits R, & van Hout R (2004). A comparison of vowel normalization procedures for language variation research. The Journal of the Acoustical Society of America, 116(5), 3099–3107. doi:10.1121/1.1795335
2. Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, & Chang W (2017). rmarkdown: Dynamic documents for R. R package version 1.7. Retrieved from https://CRAN.R-project.org/package=rmarkdown.
3. Allen JS, Miller JL, & DeSteno D (2003). Individual talker differences in voice-onset-time. The Journal of the Acoustical Society of America, 113(1), 544–552. doi:10.1121/1.1528172
4. Bejjanki VR, Clayards M, Knill DC, & Aslin RN (2011). Cue integration in categorical tasks: Insights from audio-visual speech perception. PLoS ONE, 6(5), e19812. doi:10.1371/journal.pone.0019812
5. Benjamini Y, & Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289–300. JSTOR: 2346101
