Comput Speech Lang. 2016 Mar 1;36:330-346. doi: 10.1016/j.csl.2015.03.004. Epub 2015 Mar 21.

Directly data-derived articulatory gesture-like representations retain discriminatory information about phone categories

Vikram Ramanarayanan et al.

Abstract

How the speech production and perception systems evolved in humans remains a mystery. Previous research suggests that human auditory systems are able, and have possibly evolved, to preserve maximal information about the speaker's articulatory gestures. This paper attempts an initial step towards answering the complementary question of whether speakers' articulatory mechanisms have also evolved to produce sounds that can be optimally discriminated by the listener's auditory system. To this end, we explicitly model, using computational methods, the extent to which derived representations of "primitive movements" of speech articulation can be used to discriminate between broad phone categories. We extract interpretable spatio-temporal primitive movements as recurring patterns in a data matrix of human speech articulation, i.e., a matrix representing the trajectories of vocal tract articulators over time. To do so, we propose a weakly-supervised learning method that attempts to find a part-based representation of the data in terms of recurring basis trajectory units (or primitives) and their corresponding activations over time. For each phone interval, we then derive a feature representation that captures the co-occurrences between the activations of the various bases over different time-lags. We show that this feature, derived entirely from activations of these primitive movements, achieves greater discrimination than conventional features on an interval-based phone classification task. We discuss the implications of these findings for furthering our understanding of speech signal representations and the links between the speech production and perception systems.
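
The decomposition described above can be stated compactly. The following is a sketch of the convolutive, sparsity-constrained factorization implied by the abstract and by Figures 1 and 2; the symbols V, W, H, K, T and the sparseness level Sh follow the figure captions, while the exact form of the objective and constraint is our assumption rather than a quotation from the paper.

    % Sketch (assumed form): convolutive factorization of the articulatory data
    % matrix V (M articulator channels x N samples) into K time-varying
    % primitives W(tau), tau = 0..T-1, and a sparse activation matrix H (K x N).
    \[
      V \;\approx\; \hat{V} \;=\; \sum_{\tau=0}^{T-1} W(\tau)\,\overset{\tau\rightarrow}{H},
      \qquad
      \min_{W,\,H}\ \lVert V - \hat{V} \rVert_F^2
      \quad \text{s.t.} \quad \operatorname{sparseness}(h_k) = S_h \ \ \forall k,
    \]
    % where the arrow denotes H with its columns shifted tau samples to the
    % right (zero-filled on the left), and h_k is the k-th row of H.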

Keywords: information transfer; motor theory; movement primitives; phone classification; speech communication.

Figures

Figure 1
Schematic of the experimental setup. The input matrix V is constructed from real (EMA) articulatory data. In this example, we assume that there are M = 7 articulator fleshpoint trajectories. We would like to find K = 5 basis functions or articulatory primitives, collectively depicted as the big red cuboid (representing a three-dimensional matrix W). Each vertical slab of the cuboid is one primitive (numbered 1 to 5). For instance, the white tube represents a single component of the 3rd primitive that corresponds to the first articulator (T samples long). The activation of each of these 5 time-varying primitives/basis functions is given by the rows of the activation matrix H in the bottom right-hand corner. For instance, the 5 values in the tth column of H are the weights that multiply each of the 5 primitives at the tth time sample. The activation matrix is used as input to the classification module, which consists of 3 steps: (i) dimensionality reduction using agglomerative information bottleneck (AIB) clustering, (ii) conversion to a histogram-of-cooccurrences (HAC) representation to capture dependence information across time series, and (iii) a final support vector machine (SVM) classifier.
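
As a concrete reading of the three-step classification module, here is a minimal Python sketch, assuming the activation matrix H has already been estimated. Since agglomerative information bottleneck is not a stock library routine, k-means quantization stands in for the AIB step, and the helper names, number of labels, and lag values are our own illustrative choices; only the overall AIB-to-HAC-to-SVM chain comes from the caption.

    # Sketch of Figure 1's classification module: (i) quantize activation
    # columns (k-means standing in for agglomerative information bottleneck),
    # (ii) a histogram-of-cooccurrences feature per phone interval, (iii) SVM.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def hac_feature(labels, n_labels, lags=(1, 2, 5)):
        """Counts of label pairs observed tau steps apart, stacked over lags."""
        feats = []
        for tau in lags:
            counts = np.zeros((n_labels, n_labels))
            np.add.at(counts, (labels[:-tau], labels[tau:]), 1.0)
            feats.append(counts.ravel())
        return np.concatenate(feats)

    def train_phone_classifier(H, intervals, n_labels=16):
        """H: K x N activation matrix; intervals: list of (start, end, phone)."""
        quantizer = KMeans(n_clusters=n_labels, n_init=10).fit(H.T)  # step (i)
        labels = quantizer.predict(H.T)
        X = np.vstack([hac_feature(labels[s:e], n_labels)            # step (ii)
                       for s, e, _ in intervals])
        y = [phone for _, _, phone in intervals]
        return quantizer, SVC(kernel="linear").fit(X, y)             # step (iii)
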
Figure 2
Schematic illustrating how shifted and scaled primitives can additively reconstruct the original input data sequence. Each gold square in the topmost row represents one column vector of the input data matrix, V, corresponding to a single sampling instant in time. Recall that our basis functions/primitives are time-varying. Hence, at any given time instant t, we plot only the basis functions/primitives that have non-zero activation (i.e., the corresponding rows of the activation matrix at time t have non-zero entries). Notice that any given basis function extends T = 4 samples in time, represented by a sequence of 4 silver/gray squares each. Thus, in order to reconstruct, say, the 4th column of V, we need to consider the contributions of all basis functions that are "active" starting anywhere between time instants 1 and 4, as shown.
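
A short sketch of the additive reconstruction the figure illustrates, assuming W is stored as an M x K x T array of primitive slices and H as a K x N activation matrix (the array layout is our reading of Figures 1 and 2, not a prescription from the paper):

    # Reconstruct V_hat from time-varying primitives W (M x K x T) and
    # activations H (K x N): a basis activated at sample t contributes to
    # columns t .. t+T-1 of the reconstruction.
    import numpy as np

    def reconstruct(W, H):
        M, K, T = W.shape
        _, N = H.shape
        V_hat = np.zeros((M, N))
        for tau in range(T):
            # shift the activations tau samples to the right, then add the
            # contribution of slice tau of every primitive
            H_shift = np.zeros_like(H)
            H_shift[:, tau:] = H[:, :N - tau]
            V_hat += W[:, :, tau] @ H_shift
        return V_hat

With T = 4, column 4 of V_hat collects exactly the contributions of bases activated at samples 1 through 4, matching the figure.
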
Figure 3
Root mean squared error (RMSE) for each articulator and broad phone class obtained as a result of running the algorithm on all 460 sentences spoken by male speaker msak0.
Figure 4
Root mean squared error (RMSE) for each articulator and broad phone class obtained as a result of running the algorithm on all 460 sentences spoken by female speaker fsew0.
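
The per-articulator errors reported in Figures 3 and 4 can be computed directly from the input matrix and its reconstruction; a minimal sketch follows, where the selection of samples belonging to one broad phone class is assumed to be given as column indices (our convention, for illustration only).

    # Per-articulator RMSE between the input trajectories V and the model
    # reconstruction V_hat, optionally restricted to one broad phone class.
    import numpy as np

    def articulator_rmse(V, V_hat, columns=None):
        """V, V_hat: M x N arrays; columns: sample indices belonging to one
        broad phone class (use all samples if None). Returns an M-vector."""
        if columns is not None:
            V, V_hat = V[:, columns], V_hat[:, columns]
        return np.sqrt(np.mean((V - V_hat) ** 2, axis=1))
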
Figure 5
Histograms of the fraction of variance unexplained (FVU) by the proposed cNMFsc model for MOCHA-TIMIT speakers msak0 (left) and fsew0 (right). The samples of the distribution were obtained by computing the FVU for each of the 460 sentences. (The algorithm parameters used in the model were Sh = 0.65, K = 40 and T = 10).
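
Fraction of variance unexplained has a standard definition; written out here per sentence (this is the textbook form, assumed rather than quoted from the paper):

    % FVU for one sentence: residual energy of the cNMFsc reconstruction
    % \hat{V} relative to the energy of V about its per-articulator mean \bar{V}.
    \[
      \mathrm{FVU} \;=\; \frac{\lVert V - \hat{V} \rVert_F^2}
                              {\lVert V - \bar{V} \rVert_F^2}
    \]
    % FVU = 0 means perfect reconstruction; FVU = 1 means the model explains
    % no more variance than each articulator's mean position.
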
Figure 6
(Color online) Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0 corresponding to different English monophthong (first and third columns) and diphthong (second column) vowels. Each panel is denoted by its ARPABET phone symbol. The algorithm parameters used were Sh = 0.65, K = 40 and T = 10. The front of the mouth is located toward the left-hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step), starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend (see Table 1 for the list of EMA trajectory variables).
Figure 7
(Color online) Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0 corresponding to stop (first two rows), nasal (third row) and approximant (last row) consonants. All rows except the last are arranged in order of labial, coronal and dorsal consonants, respectively. Each panel is denoted by its ARPABET phone symbol. The algorithm parameters used were Sh = 0.65, K = 40 and T = 10. The front of the mouth is located toward the left-hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step), starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend.
Figure 8
(Color online) Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0 corresponding to fricative and affricate consonants. Each panel is denoted by its ARPABET phone symbol. The algorithm parameters used were Sh = 0.65, K = 8 and T = 10. The front of the mouth is located toward the left-hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step), starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend.
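
For readers who want to render primitives in the style of Figures 6 through 8, here is a rough matplotlib sketch; the array layout (one primitive as an n_articulators x 2 x T block of x/y positions), the grayscale shading, and the axis flip are illustrative assumptions, with only the light-to-dark, T = 10 marker convention and the "front of the mouth on the left" orientation taken from the captions.

    # Rough sketch of the rendering convention in Figures 6-8: each articulator's
    # T positions are drawn as markers that darken over time; the legend maps
    # markers to articulators. The array layout is assumed, not from the paper.
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_primitive(primitive, names, markers):
        """primitive: (n_articulators, 2, T) array of x/y positions over time."""
        n_art, _, T = primitive.shape
        shades = np.linspace(0.8, 0.1, T)          # lighter -> darker over time
        for a in range(n_art):
            for t in range(T):
                plt.plot(primitive[a, 0, t], primitive[a, 1, t], markers[a],
                         color=str(shades[t]))
            plt.plot([], [], markers[a], color="k", label=names[a])
        # flip the x-axis if needed so the front of the mouth sits on the left
        plt.gca().invert_xaxis()
        plt.axis("equal")
        plt.legend()
        plt.show()
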
Figure 9
Mutual information I(𝒜;ℋ̂) between the quantized activation space ℋ̂ and the space of acoustic features 𝒜, as a function of the cardinality of ℋ̂ (in other words, the number of quantization levels).
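
One common way to estimate such a quantity is from joint counts over paired frames; a minimal sketch follows, assuming both the acoustic frames and the activation frames have already been vector-quantized into integer labels (the quantization step itself is not shown, and the plug-in histogram estimator below is our choice, not necessarily the paper's).

    # Plug-in estimate of I(A; H_hat) in bits from paired sequences of discrete
    # labels: a_labels (quantized acoustic frames) and h_labels (quantized
    # activation frames), aligned frame by frame.
    import numpy as np

    def mutual_information(a_labels, h_labels):
        a_labels = np.asarray(a_labels)
        h_labels = np.asarray(h_labels)
        joint = np.zeros((a_labels.max() + 1, h_labels.max() + 1))
        np.add.at(joint, (a_labels, h_labels), 1.0)
        p = joint / joint.sum()
        pa = p.sum(axis=1, keepdims=True)      # marginal over acoustic labels
        ph = p.sum(axis=0, keepdims=True)      # marginal over activation labels
        nz = p > 0
        return float(np.sum(p[nz] * np.log2(p[nz] / (pa @ ph)[nz])))
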
Figure 10
Schematic depiction of the computation of the histogram of articulatory cooccurrences (HAC) representation. For a chosen lag value, τ, and a time-step t, if we find labels m and n occurring τ time-steps apart (marked in gold), we mark the entry of the lag-τ cooccurrence matrix corresponding to row (m, n) and the tth column with a 1 (the corresponding entry is also marked in gold). We sum across the columns of this matrix (i.e., across time) to obtain the lag-τ HAC representation.
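
To make the construction concrete, here is a tiny worked example for a single lag; the three-symbol label alphabet and the short label sequence are invented purely for illustration.

    # Toy example of Figure 10's construction for lag tau = 2, three labels
    # {0, 1, 2}, and a six-sample label sequence.
    import numpy as np

    labels = np.array([0, 2, 0, 1, 0, 2])       # quantized activation labels
    n_labels, tau = 3, 2

    # one row per ordered label pair (m, n), one column per time step t
    C = np.zeros((n_labels * n_labels, len(labels) - tau))
    for t in range(len(labels) - tau):
        m, n = labels[t], labels[t + tau]
        C[m * n_labels + n, t] = 1.0            # mark row (m, n), column t

    hac_tau = C.sum(axis=1)                     # sum across the columns (time)
    # hac_tau[0] == 2: pair (0, 0) is seen twice (t = 1 and t = 3, 1-indexed);
    # hac_tau[5] == 1: pair (1, 2) once; hac_tau[7] == 1: pair (2, 1) once.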
