Brain Behav. 2017 Apr 26;7(6):e00665. doi: 10.1002/brb3.665. eCollection 2017 Jun.

Vowel decoding from single-trial speech-evoked electrophysiological responses: A feature-based machine learning approach


Han G Yi et al. Brain Behav. 2017.

Abstract

Introduction: Scalp-recorded electrophysiological responses to complex, periodic auditory signals reflect phase-locked activity from neural ensembles within the auditory system. These responses, referred to as frequency-following responses (FFRs), have been widely used to index typical and atypical representation of speech signals in the auditory system. One major limitation of the FFR is its low signal-to-noise ratio at the level of single trials; for this reason, analysis typically relies on averaging across thousands of trials. The ability to examine the quality of single-trial FFRs would allow investigation of trial-by-trial dynamics of the FFR, which the averaging approach precludes.

Methods: In a novel, data-driven approach, we used machine learning principles to decode information related to the speech signal from single-trial FFRs. FFRs were collected from participants while they listened to two vowels produced by two speakers. The scalp-recorded responses were projected onto a low-dimensional spectral feature space independently derived from the same two vowels produced by 40 other speakers, whose recordings were not presented to the participants. A supervised machine learning classifier was trained to discriminate vowel tokens on a subset of FFRs from each participant and tested on the remaining subset.
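As a rough illustration of how such a feature space could be derived, the sketch below fits a 12-component PCA to a placeholder corpus of vowel spectra. The array shapes, the random placeholder data, and the use of scikit-learn are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder corpus: spectra of [ae] and [u] from 40 speakers
# (80 tokens x 1001 spectral bins, 0-4 kHz in 4-Hz steps; shapes assumed).
rng = np.random.default_rng(0)
corpus_spectra = rng.standard_normal((80, 1001))

# Derive the low-dimensional spectral feature space from the corpus alone,
# independently of any EEG data.
pca = PCA(n_components=12)
pca.fit(corpus_spectra)
pc_matrix = pca.components_.T  # (1001, 12): one column per principal component
```

Because the components are fit only to the acoustic corpus, projecting FFRs onto them keeps the feature space independent of the neural data being decoded.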

Results: We demonstrate reliable decoding of speech signals at the level of single trials by decomposing the raw FFR into independently derived, information-bearing spectral features of the speech signal.

Conclusions: Taken together, the ability to extract interpretable features at the level of single trials in a data-driven manner offers uncharted possibilities in the noninvasive assessment of human auditory function.

Keywords: EEG; frequency‐following responses; speech decoding; vowels.


Figures

Figure 1
(a) Spectra for [æ] and [u] vowels produced by two male native speakers of English. The x‐axis codes frequency ranging from 0 to 4 kHz, in 4‐Hz steps. The y‐axis codes relative amplitude at each spectral bin, scaled by the standard deviation of each of the four sound files. (b) Spectra of the frequency‐following responses collected from 25 participants, averaged across 1,000 trials. The x‐ and y‐axes are identical to those used in (a). (c) Overlaying the two sets of spectra reveals spectral similarity between the stimuli and the responses within each speech token.
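Spectra with 4‐Hz bins imply roughly a 250‐ms analysis window (frequency resolution = sampling rate / number of samples). A minimal sketch of such a spectrum computation follows; the sampling rate and waveform are placeholder assumptions, and the SD scaling mirrors the caption.

```python
import numpy as np

fs = 16000                               # assumed sampling rate (not given here)
t = np.arange(int(0.25 * fs)) / fs       # a 250-ms window yields 4-Hz bins
waveform = np.sin(2 * np.pi * 110 * t)   # placeholder for a vowel or FFR waveform

freqs = np.fft.rfftfreq(waveform.size, d=1 / fs)
amplitude = np.abs(np.fft.rfft(waveform))

band = freqs <= 4000                     # keep 0-4 kHz, as in the figure
spectrum = amplitude[band] / amplitude[band].std()  # scale by SD, per the caption
```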
Figure 2
Spectral projection of the single‐trial frequency‐following responses (FFRs) onto the spectral feature space. Figures are derived from a representative participant. The raw FFR spectra (left) were multiplied by a matrix of 12 vectors (center) corresponding to the top principal components independently derived from spectra of [æ] and [u] vowels produced by 40 male native speakers. This procedure projected the raw FFRs onto the 12‐dimensional spectral feature space (right).
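The projection itself reduces to a single matrix multiplication. The sketch below uses hypothetical array shapes and random stand-in data; in practice the PC matrix would be the corpus-derived components from the Methods step.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_bins = 4000, 1001
ffr_spectra = rng.standard_normal((n_trials, n_bins))  # single-trial FFR spectra
pc_matrix = rng.standard_normal((n_bins, 12))          # stand-in for the corpus PCs

# One matrix multiplication maps every raw spectrum onto the
# 12-dimensional spectral feature space.
features = ffr_spectra @ pc_matrix                     # (n_trials, 12)
```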
Figure 3
(a) Training‐test scheme for vowel (N = 2) decoding. For each participant, a classifier was trained to identify the [æ] and [u] labels from each trial, based on the 12 spectral features. The trained classifier was then tested on an independent subset. The resulting prediction vector contained the probability of each vowel. In this example from a representative participant, the classifier outputs reasonably accurate responses for [æ]1 and [u]2, but not for [æ]2 and [u]1. (b) Training‐test scheme for stimulus (N = 4) decoding. For each participant, a classifier was trained to identify the [æ]1, [æ]2, [u]1, and [u]2 labels from each trial, based on the 12 spectral features. The trained classifier was then tested on an independent subset. The resulting prediction vector contained the probability of each of the four stimuli. In this example from a representative participant, the classifier outputs reasonably accurate responses for [æ]1 and [æ]2, but not for [u]1 and [u]2. (c) Based on the aforementioned probability vectors, a receiver operating characteristic (ROC) curve was generated. The area under the curve (AUC) served as the metric of decoding performance. Note that for stimulus decoding, the ROC curve was constructed separately for each stimulus under a one‐versus‐all scheme.
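The abstract does not name the classifier, so the sketch below stands in a random forest (consistent with the decision trees mentioned in Figure 5) to illustrate the train/test split and the one‐versus‐all ROC AUC computation; all data and counts are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.standard_normal((4000, 12))  # 12 spectral features per trial
y = rng.integers(0, 4, size=4000)    # stimulus labels: [ae]1, [ae]2, [u]1, [u]2

# Train on one subset, test on an independent subset.
X_train, y_train = X[:3800], y[:3800]
X_test, y_test = X[3800:], y[3800:]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)    # per-trial probability of each stimulus

# One-versus-all ROC AUC per stimulus, then averaged (chance = 0.50).
aucs = [roc_auc_score(y_test == k, proba[:, k]) for k in range(4)]
print(np.mean(aucs))
```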
Figure 4
(a) Area under the curve (AUC) measures are displayed for vowel (mean = 0.67; SD = 0.15; median = 0.67) and stimulus (mean = 0.73; SD = 0.09; median = 0.71) decoding. In this box plot, the dark centerlines correspond to the median, and the top and bottom edges of the boxes correspond to the 75th and 25th percentiles across the 25 participants, respectively. Note that the stimulus decoding AUC is averaged across the individual one‐versus‐all AUCs calculated for each of the four stimuli, so the chance level corresponds to 0.50 rather than 0.25. (b) Vowel and stimulus decoding AUC across different training set sizes. The x‐axis corresponds to the number of trials per stimulus (from 50 to 950, in steps of 50) included in the training set. Note that the test set always consisted of the 50 trials per stimulus that immediately followed the training set.
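A minimal sketch of the growing‐training‐set analysis in panel (b), assuming trials are stored in presentation order with 1,000 trials per stimulus; the random forest is again a stand‐in classifier and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 1000, 12))  # stimulus x trial (presentation order) x feature

auc_by_size = {}
for n_train in range(50, 1000, 50):     # 50 to 950 trials per stimulus
    X_train = X[:, :n_train].reshape(-1, 12)
    y_train = np.repeat(np.arange(4), n_train)
    # Test set: the 50 trials per stimulus immediately following the training set.
    X_test = X[:, n_train:n_train + 50].reshape(-1, 12)
    y_test = np.repeat(np.arange(4), 50)
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
    proba = clf.predict_proba(X_test)
    auc_by_size[n_train] = np.mean(
        [roc_auc_score(y_test == k, proba[:, k]) for k in range(4)]
    )
```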
Figure 5
(a) Importance of spectral features during vowel and stimulus decoding (950‐trial training set). The x‐axis corresponds to the 12 principal components (PCs) used as input features for the classifier. The y‐axis corresponds to the percentage of times each feature was used by a given decision tree. (b) The top four PCs in the frequency domain. In PC1, which was disproportionately used by the classifiers, three extrema are readily identifiable (arrows). (c) Log‐transformed spectra of the original stimuli (left; black lines) and the grand‐average frequency‐following response (right; red lines). Three formant frequencies are identifiable (arrows), corresponding to the three extrema of PC1 marked with arrows in (b).
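One way to approximate the "percentage of times a feature was used" measure is to count split nodes across the trees of a fitted ensemble. The sketch below does this for a stand‐in random forest on synthetic data; it is an assumed reading of the caption, not the authors' exact computation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.standard_normal((2000, 12))
y = rng.integers(0, 4, size=2000)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Count how often each of the 12 PCs appears at a split node, across all trees.
usage = np.zeros(12)
total_splits = 0
for tree in forest.estimators_:
    split_features = tree.tree_.feature
    split_features = split_features[split_features >= 0]  # leaves are coded as -2
    usage += np.bincount(split_features, minlength=12)
    total_splits += split_features.size

usage_pct = 100 * usage / total_splits  # percent of splits using each PC
```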
