Beyond speech: Exploring diversity in the human voice

Andrey Anikin¹ ², Valentina Canessa-Pollard² ³, Katarzyna Pisanski² ⁴ ⁵, Mathilde Massenet², David Reby²

Affiliations

¹ Division of Cognitive Science, Lund University, Lund, Sweden.
² ENES Bioacoustics Research Lab, CRNL, University of Saint-Etienne, CNRS, Inserm, 23 rue Michelon, 42023 Saint-Etienne, France.
³ Psychology, Institute of Psychology, Business and Human Sciences, University of Chichester, Chichester, West Sussex PO19 6PE, UK.
⁴ CNRS French National Centre for Scientific Research, DDL Dynamics of Language Lab, University of Lyon 2, 69007 Lyon, France.
⁵ Institute of Psychology, University of Wrocław, Dawida 1, 50-527 Wrocław, Poland.

PMID: 37908309
PMCID: PMC10613903
DOI: 10.1016/j.isci.2023.108204

Andrey Anikin et al. iScience. 2023 Oct 14;26(11):108204. doi: 10.1016/j.isci.2023.108204. eCollection 2023 Nov 17.

Abstract

Humans have evolved voluntary control over vocal production for speaking and singing, while preserving the phylogenetically older system of spontaneous nonverbal vocalizations such as laughs and screams. To test for systematic acoustic differences between these vocal domains, we analyzed a broad, cross-cultural corpus representing over 2 h of speech, singing, and nonverbal vocalizations. We show that, while speech is relatively low-pitched and tonal with mostly regular phonation, singing and especially nonverbal vocalizations vary enormously in pitch and often display harsh-sounding, irregular phonation owing to nonlinear phenomena. The evolution of complex supralaryngeal articulatory spectro-temporal modulation has been critical for speech, yet has not significantly constrained laryngeal source modulation. In contrast, articulation is very limited in nonverbal vocalizations, which predominantly contain minimally articulated open vowels and rapid temporal modulation in the roughness range. We infer that vocal source modulation works best for conveying affect, while vocal filter modulation mainly facilitates semantic communication.

Keywords: Biological sciences; Evolutionary biology; Natural sciences.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1**
Vocal source properties in speech, singing, and nonverbal vocalizations. (A) Neutral speech occupies a small subregion of anatomically possible pitch modulation, shown here as scatterplots of minimum by maximum f_o values per recording, separately for male and female speakers. Contours enclose the entire observed range within each category and sex. (B) Typical values of voice pitch descriptives vary among speech, singing, and nonverbal vocalizations: fitted values from mixed models (medians of posterior distributions and 95% CI). Median = median f_o in octaves above C0 (16 Hz); range = f_o range, octaves; slope = mean absolute slope of f_o, octaves/s; inflections = number of f_o inflections per second. (C) Typical proportions of voiced frames affected by various nonlinear phenomena are nearly ten times higher in nonverbal vocalizations than in neutral speech (medians of posterior distributions and 95% CI shown for the most common types).
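The octave scale used for the pitch descriptives in panel B can be illustrated with a short conversion. This is only a minimal sketch, assuming the 16 Hz reference for C0 given in the caption; the function names `hz_to_octaves` and `range_octaves` are hypothetical and are not taken from the paper or its analysis code.

```python
import math

C0_HZ = 16.0  # reference frequency stated in the caption (C0 rounded to 16 Hz)


def hz_to_octaves(f_hz: float, ref_hz: float = C0_HZ) -> float:
    """Express a frequency as octaves above a reference (each octave doubles the Hz value)."""
    if f_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(f_hz / ref_hz)


def range_octaves(f_min_hz: float, f_max_hz: float) -> float:
    """Width of an f_o range in octaves; independent of the reference frequency."""
    if f_min_hz <= 0 or f_max_hz <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(f_max_hz / f_min_hz)


# 128 Hz is three doublings above 16 Hz (16 -> 32 -> 64 -> 128).
print(hz_to_octaves(128.0))         # 3.0
# A rise from 100 Hz to 300 Hz spans log2(3), roughly 1.58 octaves.
print(range_octaves(100.0, 300.0))
```

Working in octaves rather than Hz makes equal musical intervals equal in size, so a given range means the same thing for low and high voices.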

**Figure 2**
Filter properties in speech and nonverbal vocalizations. (A) Nonverbal vocalizations mostly contain open vowels, especially [a]. Color gradients show distribution densities for vowels in speech (gray, data taken from Hillenbrand’s corpus¹⁰) and in nonverbal vocalizations (blue). The text labels correspond to vowel centroids, while contour lines show the areas containing different proportions of observations. Formants F1 and F2 are normalized to an apparent vocal tract length of 17 cm to make the formant space sex- and speaker-size-invariant. (B) Formants can be thought of as bar codes capable of encoding more information than voice pitch does. Spectrograms of a nonverbal vocalization (above) and of two vowels by male speakers, computed with Gaussian windows of 40 ms and 25 ms, respectively. Harmonics of f_o are redundant in the sense that a single number (f_o) encodes the location of all spectral peaks (note the parallel harmonic tracks in the first vocalization). In contrast, formant frequencies can encode more information than f_o does because they can vary relatively independently (note the non-parallel formant tracks in the vowels), and such variation is meaningful.
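The contrast between redundant harmonics and independently varying formants can be made concrete with a small numerical sketch. Everything here is an illustrative assumption rather than the authors' procedure: harmonics are taken as integer multiples of f_o, the neutral vocal tract is approximated as a uniform tube closed at one end (the textbook quarter-wavelength model), the speed of sound is set to 340 m/s, and `rescale_formants` is a hypothetical helper showing one simple way formants might be rescaled to a 17 cm tract. The paper's actual normalization may differ.

```python
SPEED_OF_SOUND_M_S = 340.0  # assumed round value, not taken from the paper


def harmonics(f_o_hz: float, n: int) -> list[float]:
    """All harmonic frequencies follow from one number: they are integer multiples of f_o."""
    return [k * f_o_hz for k in range(1, n + 1)]


def neutral_tube_formants(vtl_m: float, n: int, c: float = SPEED_OF_SOUND_M_S) -> list[float]:
    """Resonances of a uniform tube closed at one end: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c / (4.0 * vtl_m) for k in range(1, n + 1)]


def rescale_formants(formants_hz: list[float], apparent_vtl_cm: float,
                     target_vtl_cm: float = 17.0) -> list[float]:
    """Hypothetical normalization: scale formants as if the tract were target_vtl_cm long."""
    factor = apparent_vtl_cm / target_vtl_cm
    return [f * factor for f in formants_hz]


print(harmonics(200.0, 4))                                   # [200.0, 400.0, 600.0, 800.0]
print([round(f) for f in neutral_tube_formants(0.17, 3)])    # [500, 1500, 2500]
# A shorter tract (13.6 cm) has proportionally higher formants; rescaling maps them back.
print([round(f) for f in rescale_formants([625.0, 1875.0], 13.6)])
```

Under these assumptions, a 17 cm neutral tube yields evenly spaced formants near 500, 1500, and 2500 Hz. Real vowels depart from this spacing because articulation moves each formant somewhat independently, which is the extra information the caption refers to.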

**Figure 3**
The spectro-temporal modulation spectrum. (A) A synthetic laugh-like sound created with *soundgen*, which has upward f_o contours (100–300 Hz) in every syllable, static equidistant formants (500 Hz, 1500 Hz, …), and amplitude modulation at 40 Hz in the last few syllables. This same sound is shown as a spectrogram with its corresponding waveform below, and then as a modulation spectrum created with *soundgen::modulationSpectrum* using a window length of 15 ms and a step of 5 ms. (B) A conceptual illustration of the nature of a modulation spectrum. Treating the spectrogram as an image, the modulation spectrum represents it as a combination of horizontal, vertical, and slanting ripples or grids with different spacings, which acoustically correspond to regularly repeated spectro-temporal patterns. AM = amplitude modulation, FM = frequency modulation, FT = Fourier transform, STFT = short-time Fourier transform.
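To show how a regular amplitude modulation becomes a peak on the temporal axis of a modulation spectrum, the sketch below takes the Fourier transform of a synthetic amplitude envelope. It is a deliberately simplified, one-dimensional illustration and not a reimplementation of *soundgen::modulationSpectrum*, which operates on the full two-dimensional spectrogram. The 5 ms step and the 40 Hz modulation rate come from the caption; the envelope itself, its modulation depth, and the function name `temporal_modulation_spectrum` are made up for the example.

```python
import cmath
import math


def temporal_modulation_spectrum(envelope, frame_rate_hz):
    """Return (modulation frequency in Hz, magnitude) pairs from a plain DFT of the envelope."""
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [x - mean for x in envelope]  # remove the constant (0 Hz) component
    spectrum = []
    for k in range(1, n // 2 + 1):
        coef = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                   for t, x in enumerate(centered))
        spectrum.append((k * frame_rate_hz / n, abs(coef) / n))
    return spectrum


frame_rate_hz = 200.0  # a 5 ms step between frames gives 200 frames per second
n_frames = 200         # one second of frames, so the resolution is 1 Hz

# Synthetic envelope: steady level with a 40 Hz amplitude modulation on top.
envelope = [1.0 + 0.5 * math.sin(2 * math.pi * 40.0 * t / frame_rate_hz)
            for t in range(n_frames)]

spectrum = temporal_modulation_spectrum(envelope, frame_rate_hz)
peak_hz, peak_mag = max(spectrum, key=lambda pair: pair[1])
print(peak_hz)  # 40.0
```

With a 5 ms step, the highest temporal modulation frequency that can be represented is 100 Hz, which is enough to capture modulation in the roughness range.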

**Figure 4**
Differences between spectro-temporal modulation spectra of speech and nonverbal vocalizations. Log-ratios of normalized modulation spectra averaged within each category (see STAR Methods). For instance, speech has pronounced articulation-related amplitude modulation under 20 Hz compared to nonverbal vocalizations (yellow and red), while nonverbal vocalizations have strong modulation in the roughness zone above 50 Hz (blue).
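The idea of comparing two categories through a log-ratio of normalized spectra can be sketched as follows. The details are assumptions made for illustration only: each spectrum is normalized to sum to 1, the ratio is taken as speech over nonverbal so that positive values mark bands where speech is relatively stronger, and the three-band input values are invented. The normalization actually used is described in the paper's STAR Methods and may differ.

```python
import math


def normalize(spectrum):
    """Scale a non-negative spectrum so that its values sum to 1."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have positive total energy")
    return [v / total for v in spectrum]


def log_ratio(spec_a, spec_b, eps=1e-12):
    """Element-wise natural log of normalized spec_a over normalized spec_b."""
    norm_a, norm_b = normalize(spec_a), normalize(spec_b)
    # eps guards against division by zero and log(0) in empty bands
    return [math.log((a + eps) / (b + eps)) for a, b in zip(norm_a, norm_b)]


# Invented three-band spectra: slow (< 20 Hz), middle, and fast (> 50 Hz) modulation.
speech_like = [8.0, 1.0, 1.0]
nonverbal_like = [2.0, 2.0, 6.0]

ratios = log_ratio(speech_like, nonverbal_like)
print([round(r, 2) for r in ratios])
```

In this toy input the first value is positive (relatively more slow modulation in the speech-like spectrum) and the last is negative (relatively more fast modulation in the nonverbal-like spectrum), mirroring the pattern described in the caption.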

References

1. Fitch T. The Evolution of Language. Cambridge University Press; 2010.
2. Anikin A., Bååth R., Persson T. Human non-linguistic vocal repertoire: call types and their meaning. J. Nonverbal Behav. 2018;42:53–80.
3. Grawunder S. On the Physiology of Voice Production in South-Siberian Throat Singing: Analysis of Acoustic and Electrophysiological Evidences. Frank & Timme GmbH; 2009.
4. Meyer J. Typology and acoustic strategies of whistled languages: Phonetic comparison and perceptual cues of whistled vowels. J. Int. Phon. Assoc. 2008;38:69–94.
5. Anikin A. Soundgen: an open-source tool for synthesizing nonverbal vocalizations. Behav. Res. Methods. 2019;51:778–792.
