Comput Speech Lang. 2016 Mar 1;36:330-346. doi: 10.1016/j.csl.2015.03.004. Epub 2015 Mar 21.

Directly data-derived articulatory gesture-like representations retain discriminatory information about phone categories

Vikram Ramanarayanan et al.

Abstract

How the speech production and perception systems evolved in humans remains a mystery. Previous research suggests that human auditory systems are able, and have possibly evolved, to preserve maximal information about the speaker's articulatory gestures. This paper attempts an initial step towards answering the complementary question of whether speakers' articulatory mechanisms have also evolved to produce sounds that can be optimally discriminated by the listener's auditory system. To this end, we explicitly model, using computational methods, the extent to which derived representations of "primitive movements" of speech articulation can be used to discriminate between broad phone categories. We extract interpretable spatio-temporal primitive movements as recurring patterns in a data matrix of human speech articulation, i.e., a matrix representing the trajectories of vocal tract articulators over time. To do so, we propose a weakly-supervised learning method that attempts to find a part-based representation of the data in terms of recurring basis trajectory units (or primitives) and their corresponding activations over time. For each phone interval, we then derive a feature representation that captures the co-occurrences between the activations of the various bases over different time-lags. We show that this feature, derived entirely from activations of these primitive movements, achieves greater discrimination than conventional features on an interval-based phone classification task. We discuss the implications of these findings for furthering our understanding of speech signal representations and the links between the speech production and perception systems.
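
The decomposition described above can be stated compactly. The following is a sketch of the convolutive, sparsity-constrained factorization implied by the abstract and by Figures 1 and 2; the symbols V, W, H, K, T and the sparseness level Sh follow the figure captions, while the exact form of the objective and constraint is our assumption rather than a quotation from the paper.

    % Sketch (assumed form): convolutive factorization of the articulatory data
    % matrix V (M articulator channels x N samples) into K time-varying
    % primitives W(tau), tau = 0..T-1, and a sparse activation matrix H (K x N).
    \[
      V \;\approx\; \hat{V} \;=\; \sum_{\tau=0}^{T-1} W(\tau)\,\overset{\tau\rightarrow}{H},
      \qquad
      \min_{W,\,H}\ \lVert V - \hat{V} \rVert_F^2
      \quad \text{s.t.} \quad \operatorname{sparseness}(h_k) = S_h \ \ \forall k,
    \]
    % where the arrow denotes H with its columns shifted tau samples to the
    % right (zero-filled on the left), and h_k is the k-th row of H.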

Keywords: information transfer; motor theory; movement primitives; phone classification; speech communication.

Figures

Figure 1
Schematic of the experimental setup. The input matrix V is constructed from real (EMA) articulatory data. In this example, we assume that there are M = 7 articulator fleshpoint trajectories. We would like to find K = 5 basis functions or articulatory primitives, collectively depicted as the big red cuboid (representing a three-dimensional matrix W). Each vertical slab of the cuboid is one primitive (numbered 1 to 5). For instance, the white tube represents a single component of the 3rd primitive that corresponds to the first articulator (T samples long). The activation of each of these 5 time-varying primitives/basis functions is given by the rows of the activation matrix H in the bottom right-hand corner. For instance, the 5 values in the tth column of H are the weights that multiply each of the 5 primitives at the tth time sample. The activation matrix is used as input to the classification module, which consists of 3 steps: (i) dimensionality reduction using agglomerative information bottleneck (AIB) clustering, (ii) conversion to a histogram-of-cooccurrences (HAC) representation to capture dependence information across time series, and (iii) a final support vector machine (SVM) classifier.
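
As a concrete reading of the three-step classification module, here is a minimal Python sketch, assuming the activation matrix H has already been estimated. Since agglomerative information bottleneck is not a stock library routine, k-means quantization stands in for the AIB step, and the helper names, number of labels, and lag values are our own illustrative choices; only the overall AIB-to-HAC-to-SVM chain comes from the caption.

    # Sketch of Figure 1's classification module: (i) quantize activation
    # columns (k-means standing in for agglomerative information bottleneck),
    # (ii) a histogram-of-cooccurrences feature per phone interval, (iii) SVM.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def hac_feature(labels, n_labels, lags=(1, 2, 5)):
        """Counts of label pairs observed tau steps apart, stacked over lags."""
        feats = []
        for tau in lags:
            counts = np.zeros((n_labels, n_labels))
            np.add.at(counts, (labels[:-tau], labels[tau:]), 1.0)
            feats.append(counts.ravel())
        return np.concatenate(feats)

    def train_phone_classifier(H, intervals, n_labels=16):
        """H: K x N activation matrix; intervals: list of (start, end, phone)."""
        quantizer = KMeans(n_clusters=n_labels, n_init=10).fit(H.T)  # step (i)
        labels = quantizer.predict(H.T)
        X = np.vstack([hac_feature(labels[s:e], n_labels)            # step (ii)
                       for s, e, _ in intervals])
        y = [phone for _, _, phone in intervals]
        return quantizer, SVC(kernel="linear").fit(X, y)             # step (iii)
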
Figure 2
Schematic illustrating how shifted and scaled primitives can additively reconstruct the original input data sequence. Each gold square in the topmost row represents one column vector of the input data matrix, V, corresponding to a single sampling instant in time. Recall that our basis functions/primitives are time-varying. Hence, at any given time instant t, we plot only the basis functions/primitives that have non-zero activation (i.e., the corresponding rows of the activation matrix at time t have non-zero entries). Notice that any given basis function extends T = 4 samples in time, represented by a sequence of 4 silver/gray squares each. Thus, in order to reconstruct, say, the 4th column of V, we need to consider the contributions of all basis functions that are "active" starting anywhere between time instants 1 and 4, as shown.
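
A short sketch of the additive reconstruction the figure illustrates, assuming W is stored as an M x K x T array of primitive slices and H as a K x N activation matrix (the array layout is our reading of Figures 1 and 2, not a prescription from the paper):

    # Reconstruct V_hat from time-varying primitives W (M x K x T) and
    # activations H (K x N): a basis activated at sample t contributes to
    # columns t .. t+T-1 of the reconstruction.
    import numpy as np

    def reconstruct(W, H):
        M, K, T = W.shape
        _, N = H.shape
        V_hat = np.zeros((M, N))
        for tau in range(T):
            # shift the activations tau samples to the right, then add the
            # contribution of slice tau of every primitive
            H_shift = np.zeros_like(H)
            H_shift[:, tau:] = H[:, :N - tau]
            V_hat += W[:, :, tau] @ H_shift
        return V_hat

With T = 4, column 4 of V_hat collects exactly the contributions of bases activated at samples 1 through 4, matching the figure.
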
Figure 3
Root mean squared error (RMSE) for each articulator and broad phone class obtained as a result of running the algorithm on all 460 sentences spoken by male speaker msak0.
Figure 4
Root mean squared error (RMSE) for each articulator and broad phone class obtained as a result of running the algorithm on all 460 sentences spoken by female speaker fsew0.
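
The per-articulator errors reported in Figures 3 and 4 can be computed directly from the input matrix and its reconstruction; a minimal sketch follows, where the selection of samples belonging to one broad phone class is assumed to be given as column indices (our convention, for illustration only).

    # Per-articulator RMSE between the input trajectories V and the model
    # reconstruction V_hat, optionally restricted to one broad phone class.
    import numpy as np

    def articulator_rmse(V, V_hat, columns=None):
        """V, V_hat: M x N arrays; columns: sample indices belonging to one
        broad phone class (use all samples if None). Returns an M-vector."""
        if columns is not None:
            V, V_hat = V[:, columns], V_hat[:, columns]
        return np.sqrt(np.mean((V - V_hat) ** 2, axis=1))
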
Figure 5
Histograms of the fraction of variance unexplained (FVU) by the proposed cNMFsc model for MOCHA-TIMIT speakers msak0 (left) and fsew0 (right). The samples of the distribution were obtained by computing the FVU for each of the 460 sentences. (The algorithm parameters used in the model were Sh = 0.65, K = 40 and T = 10).
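
Fraction of variance unexplained has a standard definition; written out here per sentence (this is the textbook form, assumed rather than quoted from the paper):

    % FVU for one sentence: residual energy of the cNMFsc reconstruction
    % \hat{V} relative to the energy of V about its per-articulator mean \bar{V}.
    \[
      \mathrm{FVU} \;=\; \frac{\lVert V - \hat{V} \rVert_F^2}
                              {\lVert V - \bar{V} \rVert_F^2}
    \]
    % FVU = 0 means perfect reconstruction; FVU = 1 means the model explains
    % no more variance than each articulator's mean position.
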
Figure 6
(Color online) Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0 corresponding to different English monophthong (first and third columns) and diphthong (second column) vowels. Each panel is denoted by its ARPABET phone symbol. The algorithm parameters used were Sh = 0.65, K = 40 and T = 10. The front of the mouth is located toward the left-hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step), starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend (see Table 1 for the list of EMA trajectory variables).
Figure 7
(Color online) Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0 corresponding to stop (first two rows), nasal (third row) and approximant (last row) consonants. All rows except the last are arranged in order of labial, coronal and dorsal consonants, respectively. Each panel is denoted by its ARPABET phone symbol. The algorithm parameters used were Sh = 0.65, K = 40 and T = 10. The front of the mouth is located toward the left-hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step), starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend.
Figure 8
(Color online) Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0 corresponding to fricative and affricate consonants. Each panel is denoted by its ARPABET phone symbol. The algorithm parameters used were Sh = 0.65, K = 8 and T = 10. The front of the mouth is located toward the left-hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step), starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend.
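
For readers who want to render primitives in the style of Figures 6 through 8, here is a rough matplotlib sketch; the array layout (one primitive as an n_articulators x 2 x T block of x/y positions), the grayscale shading, and the axis flip are illustrative assumptions, with only the light-to-dark, T = 10 marker convention and the "front of the mouth on the left" orientation taken from the captions.

    # Rough sketch of the rendering convention in Figures 6-8: each articulator's
    # T positions are drawn as markers that darken over time; the legend maps
    # markers to articulators. The array layout is assumed, not from the paper.
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_primitive(primitive, names, markers):
        """primitive: (n_articulators, 2, T) array of x/y positions over time."""
        n_art, _, T = primitive.shape
        shades = np.linspace(0.8, 0.1, T)          # lighter -> darker over time
        for a in range(n_art):
            for t in range(T):
                plt.plot(primitive[a, 0, t], primitive[a, 1, t], markers[a],
                         color=str(shades[t]))
            plt.plot([], [], markers[a], color="k", label=names[a])
        # flip the x-axis if needed so the front of the mouth sits on the left
        plt.gca().invert_xaxis()
        plt.axis("equal")
        plt.legend()
        plt.show()
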
Figure 9
Mutual information I(𝒜;ℋ̂) between the quantized activation space ℋ̂ and the space of acoustic features 𝒜, as a function of the cardinality of ℋ̂ (in other words, the number of quantization levels).
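
One common way to estimate such a quantity is from joint counts over paired frames; a minimal sketch follows, assuming both the acoustic frames and the activation frames have already been vector-quantized into integer labels (the quantization step itself is not shown, and the plug-in histogram estimator below is our choice, not necessarily the paper's).

    # Plug-in estimate of I(A; H_hat) in bits from paired sequences of discrete
    # labels: a_labels (quantized acoustic frames) and h_labels (quantized
    # activation frames), aligned frame by frame.
    import numpy as np

    def mutual_information(a_labels, h_labels):
        a_labels = np.asarray(a_labels)
        h_labels = np.asarray(h_labels)
        joint = np.zeros((a_labels.max() + 1, h_labels.max() + 1))
        np.add.at(joint, (a_labels, h_labels), 1.0)
        p = joint / joint.sum()
        pa = p.sum(axis=1, keepdims=True)      # marginal over acoustic labels
        ph = p.sum(axis=0, keepdims=True)      # marginal over activation labels
        nz = p > 0
        return float(np.sum(p[nz] * np.log2(p[nz] / (pa @ ph)[nz])))
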
Figure 10
Schematic depiction of the computation of the histogram of articulatory cooccurrences (HAC) representation. For a chosen lag value, τ, and a time-step t, if we find labels m and n occurring τ time-steps apart (marked in gold), we mark the entry of the lag-τ cooccurrence matrix corresponding to row (m, n) and the tth column with a 1 (the corresponding entry is also marked in gold). We sum across the columns of this matrix (i.e., across time) to obtain the lag-τ HAC representation.
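
To make the construction concrete, here is a tiny worked example for a single lag; the three-symbol label alphabet and the short label sequence are invented purely for illustration.

    # Toy example of Figure 10's construction for lag tau = 2, three labels
    # {0, 1, 2}, and a six-sample label sequence.
    import numpy as np

    labels = np.array([0, 2, 0, 1, 0, 2])       # quantized activation labels
    n_labels, tau = 3, 2

    # one row per ordered label pair (m, n), one column per time step t
    C = np.zeros((n_labels * n_labels, len(labels) - tau))
    for t in range(len(labels) - tau):
        m, n = labels[t], labels[t + tau]
        C[m * n_labels + n, t] = 1.0            # mark row (m, n), column t

    hac_tau = C.sum(axis=1)                     # sum across the columns (time)
    # hac_tau[0] == 2: pair (0, 0) is seen twice (t = 1 and t = 3, 1-indexed);
    # hac_tau[5] == 1: pair (1, 2) once; hac_tau[7] == 1: pair (2, 1) once.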
