iScience. 2023 Oct 14;26(11):108204.
doi: 10.1016/j.isci.2023.108204. eCollection 2023 Nov 17.

Beyond speech: Exploring diversity in the human voice

Andrey Anikin et al. iScience.

Abstract

Humans have evolved voluntary control over vocal production for speaking and singing, while preserving the phylogenetically older system of spontaneous nonverbal vocalizations such as laughs and screams. To test for systematic acoustic differences between these vocal domains, we analyzed a broad, cross-cultural corpus representing over 2 h of speech, singing, and nonverbal vocalizations. We show that, while speech is relatively low-pitched and tonal with mostly regular phonation, singing and especially nonverbal vocalizations vary enormously in pitch and often display harsh-sounding, irregular phonation owing to nonlinear phenomena. The evolution of complex supralaryngeal articulatory spectro-temporal modulation has been critical for speech, yet has not significantly constrained laryngeal source modulation. In contrast, articulation is very limited in nonverbal vocalizations, which predominantly contain minimally articulated open vowels and rapid temporal modulation in the roughness range. We infer that vocal source modulation works best for conveying affect, while vocal filter modulation mainly facilitates semantic communication.

Keywords: Biological sciences; Evolutionary biology; Natural sciences.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Vocal source properties in speech, singing, and nonverbal vocalizations.
(A) Neutral speech occupies a small subregion of anatomically possible pitch modulation, shown here as scatterplots of minimum by maximum fo values per recording, separately for male and female speakers. Contours enclose the entire observed range within each category and sex.
(B) Typical values of voice pitch descriptives vary among speech, singing, and nonverbal vocalizations: fitted values from mixed models (medians of posterior distributions and 95% CI). Median = median fo in octaves above C0 (16 Hz); range = fo range, octaves; slope = mean absolute slope of fo, octaves/s; inflections = number of fo inflections per second.
(C) Typical proportions of voiced frames affected by various nonlinear phenomena are nearly ten times higher in nonverbal vocalizations than in neutral speech (medians of posterior distributions and 95% CI shown for the most common types).
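The octave scale used in panel B can be illustrated with a short sketch. It assumes only what the caption states (pitch is expressed in octaves above C0, taken as 16 Hz); the function names `hz_to_octaves` and `fo_range_octaves` are hypothetical and are not taken from the paper or from soundgen.

```python
import math

C0_HZ = 16.0  # reference frequency stated in the caption (C0 rounded to 16 Hz)


def hz_to_octaves(f_hz, ref_hz=C0_HZ):
    """Convert a frequency in Hz to octaves above the reference frequency."""
    if f_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(f_hz / ref_hz)


def fo_range_octaves(fo_min_hz, fo_max_hz):
    """Pitch range of a recording in octaves (independent of the reference)."""
    if fo_min_hz <= 0 or fo_max_hz <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(fo_max_hz / fo_min_hz)


# A median fo of 128 Hz lies exactly three octaves above 16 Hz.
median_octaves = hz_to_octaves(128.0)        # 3.0
# A recording spanning 100-400 Hz covers two octaves.
range_octaves = fo_range_octaves(100.0, 400.0)  # 2.0
```

Working on a logarithmic scale makes pitch ranges comparable across speakers with different average fo, which is presumably why the caption reports octaves rather than Hz.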
Figure 2
Filter properties in speech and nonverbal vocalizations.
(A) Nonverbal vocalizations mostly contain open vowels, especially [a]. Color gradients show distribution densities for vowels in speech (gray, data taken from Hillenbrand’s corpus [10]) and in nonverbal vocalizations (blue). The text labels correspond to vowel centroids, while contour lines show the areas containing different proportions of observations. Formants F1 and F2 are normalized to an apparent vocal tract length of 17 cm to make the formant space sex- and speaker-size-invariant.
(B) Formants can be thought of as bar codes capable of encoding more information than does voice pitch. Spectrograms of a nonverbal vocalization (above) and two vowels by male speakers produced with Gaussian windows of 40 ms and 25 ms, respectively. Harmonics of fo are redundant in the sense that a single number (fo) encodes the location of all spectral peaks (note the parallel harmonic tracks in the first vocalization). In contrast, formant frequencies can encode more information than fo does because they can vary relatively independently (note the non-parallel formant tracks in the vowels), and such variation is meaningful.
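One way to picture the normalization in panel A is the following sketch. It is not necessarily the authors' procedure: it assumes a uniform tube closed at one end, where the n-th formant is (2n - 1) * c / (4 * L), a speed of sound of 35400 cm/s, and a least-squares estimate of the apparent vocal tract length (VTL). All function names are hypothetical.

```python
SPEED_OF_SOUND_CM_S = 35400.0  # assumed value for warm, humid air
TARGET_VTL_CM = 17.0           # reference length named in the caption


def tube_formants(vtl_cm, n_formants, c=SPEED_OF_SOUND_CM_S):
    """Formants (Hz) of a uniform tube closed at one end: (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4.0 * vtl_cm) for n in range(1, n_formants + 1)]


def apparent_vtl_cm(formants_hz, c=SPEED_OF_SOUND_CM_S):
    """Estimate apparent VTL from measured formants.

    Fits F_n = dF * (2n - 1) / 2 by least squares through the origin,
    where dF is the mean formant spacing, then returns L = c / (2 * dF).
    """
    xs = [(2 * n - 1) / 2.0 for n in range(1, len(formants_hz) + 1)]
    d_f = sum(x * f for x, f in zip(xs, formants_hz)) / sum(x * x for x in xs)
    return c / (2.0 * d_f)


def normalize_formants(formants_hz, vtl_cm, target_cm=TARGET_VTL_CM):
    """Rescale formants as if they had been produced by a tract of target length."""
    scale = vtl_cm / target_cm
    return [f * scale for f in formants_hz]


# A neutral 14 cm tract has higher formants than a 17 cm one;
# after normalization the two coincide.
short_tract = tube_formants(14.0, 3)
normalized = normalize_formants(short_tract, apparent_vtl_cm(short_tract))
reference = tube_formants(17.0, 3)
```

In practice the apparent VTL would be estimated per speaker from many frames rather than from the vowel being normalized, otherwise vowel-specific formant shifts would be partly absorbed into the length estimate.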
Figure 3
The spectro-temporal modulation spectrum.
(A) A synthetic laugh-like sound created with soundgen, which has upward fo contours (100–300 Hz) in every syllable, static equidistant formants (500 Hz, 1500 Hz, …), and amplitude modulation at 40 Hz in the last few syllables. This same sound is shown as a spectrogram with its corresponding waveform below, and then as a modulation spectrum created with soundgen::modulationSpectrum using a window length of 15 ms and a step of 5 ms.
(B) A conceptual illustration of the nature of a modulation spectrum. Treating the spectrogram as an image, the modulation spectrum represents it as a combination of horizontal, vertical, and slanting ripples or grids with different spacings, which acoustically correspond to regularly repeated spectro-temporal patterns. AM = amplitude modulation, FM = frequency modulation, FT = Fourier transform, STFT = short-time Fourier transform.
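The idea in panel B (a two-dimensional Fourier transform of the spectrogram) can be sketched in a few lines of NumPy. This is a simplified illustration, not a reimplementation of soundgen::modulationSpectrum: the Gaussian window shape, the mean-centering, and the test signal (a 1 kHz tone with 40 Hz amplitude modulation instead of a synthetic laugh) are assumptions, and the function names are hypothetical. Only the 15 ms window and 5 ms step come from the caption.

```python
import numpy as np


def spectrogram(signal, sr, win_ms=15.0, step_ms=5.0):
    """Magnitude STFT; returns (freq x time array, frame rate in Hz, bin width in Hz)."""
    win_len = int(round(sr * win_ms / 1000))
    step = int(round(sr * step_ms / 1000))
    # Gaussian window with sd = win_len / 6 (assumed shape)
    n = np.arange(win_len) - (win_len - 1) / 2
    window = np.exp(-0.5 * (n / (win_len / 6)) ** 2)
    starts = range(0, len(signal) - win_len + 1, step)
    frames = np.stack([signal[s:s + win_len] * window for s in starts])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T  # rows = frequency, cols = time
    return spec, sr / step, sr / win_len


def modulation_spectrum(spec, frame_rate, bin_hz):
    """2D Fourier transform of the spectrogram treated as an image."""
    centered = spec - spec.mean()
    ms = np.abs(np.fft.fftshift(np.fft.fft2(centered)))
    # Temporal modulation axis (Hz) runs along spectrogram columns (time).
    temporal_hz = np.fft.fftshift(np.fft.fftfreq(spec.shape[1], d=1.0 / frame_rate))
    # Spectral modulation axis (cycles/kHz) runs along spectrogram rows (frequency).
    spectral_cyc_per_khz = np.fft.fftshift(
        np.fft.fftfreq(spec.shape[0], d=bin_hz / 1000.0))
    return ms, temporal_hz, spectral_cyc_per_khz


# Test signal: 1 s of a 1 kHz tone with amplitude modulation at 40 Hz.
sr = 16000
t = np.arange(sr) / sr
tone = (1 + 0.8 * np.cos(2 * np.pi * 40 * t)) * np.sin(2 * np.pi * 1000 * t)

spec, frame_rate, bin_hz = spectrogram(tone, sr)
ms, temporal_hz, spectral_cyc_per_khz = modulation_spectrum(spec, frame_rate, bin_hz)

# Collapse across spectral modulation and find the dominant temporal
# modulation above 5 Hz (the 0 Hz column only reflects the static spectrum).
profile = ms.sum(axis=0)
mask = temporal_hz > 5
peak_hz = temporal_hz[mask][np.argmax(profile[mask])]
```

With a 5 ms step the frame rate is 200 Hz, so temporal modulation can be resolved up to 100 Hz, which covers the roughness range discussed in the abstract. The amplitude modulation in the test signal should appear as a peak near 40 Hz on the temporal axis.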
Figure 4
Differences between spectro-temporal modulation spectra of speech and nonverbal vocalizations.
Log-ratios of normalized modulation spectra averaged within each category (see STAR Methods). For instance, speech has pronounced articulation-related amplitude modulation under 20 Hz compared to nonverbal vocalizations (yellow and red), while nonverbal vocalizations have strong modulation in the roughness zone above 50 Hz (blue).
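The comparison shown here can be sketched as a cell-by-cell log-ratio of two averaged modulation spectra. The normalization details are in the STAR Methods, which are not reproduced on this page, so the sum-to-one normalization, the small `eps` guard, and the sign convention (positive where the first category has relatively more energy) are assumptions; `log_ratio` is a hypothetical helper name.

```python
import numpy as np


def log_ratio(ms_a, ms_b, eps=1e-12):
    """Log-ratio of two modulation spectra after normalizing each to sum to 1.

    Positive cells: category A has a larger share of its energy there.
    Negative cells: category B does.
    """
    a = np.asarray(ms_a, dtype=float)
    b = np.asarray(ms_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("spectra must have the same shape")
    a = a / a.sum()
    b = b / b.sum()
    return np.log((a + eps) / (b + eps))


# Toy 2 x 2 "spectra": A concentrates energy in the top-left cell, B is flat.
speech_like = np.array([[2.0, 1.0], [1.0, 1.0]])
nonverbal_like = np.ones((2, 2))
diff = log_ratio(speech_like, nonverbal_like)
```

Normalizing before taking the ratio removes differences in overall loudness or duration between categories, so that only the distribution of modulation energy is compared.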

References

    1. Fitch T. The Evolution of Language. Cambridge University Press; 2010.
    2. Anikin A., Bååth R., Persson T. Human non-linguistic vocal repertoire: call types and their meaning. J. Nonverbal Behav. 2018;42:53–80.
    3. Grawunder S. On the Physiology of Voice Production in South-Siberian Throat Singing: Analysis of Acoustic and Electrophysiological Evidences. Frank & Timme GmbH; 2009.
    4. Meyer J. Typology and acoustic strategies of whistled languages: Phonetic comparison and perceptual cues of whistled vowels. J. Int. Phon. Assoc. 2008;38:69–94.
    5. Anikin A. Soundgen: an open-source tool for synthesizing nonverbal vocalizations. Behav. Res. Methods. 2019;51:778–792.
