PLoS Biol. 2012 Jan;10(1):e1001251.

doi: 10.1371/journal.pbio.1001251. Epub 2012 Jan 31.

Reconstructing speech from human auditory cortex

Brian N Pasley¹, Stephen V David, Nima Mesgarani, Adeen Flinker, Shihab A Shamma, Nathan E Crone, Robert T Knight, Edward F Chang

PMID: 22303281
PMCID: PMC3269422
DOI: 10.1371/journal.pbio.1001251

Brian N Pasley et al. PLoS Biol. 2012 Jan.

Affiliation

¹ Helen Wills Neuroscience Institute, University of California Berkeley, Berkeley, California, United States of America. bpasley@berkeley.edu

Abstract

How the human auditory system extracts perceptually relevant acoustic features of speech is unknown. To address this question, we used intracranial recordings from nonprimary auditory cortex in the human superior temporal gyrus to determine what acoustic information in speech sounds can be reconstructed from population neural activity. We found that slow and intermediate temporal fluctuations, such as those corresponding to syllable rate, were accurately reconstructed using a linear model based on the auditory spectrogram. However, reconstruction of fast temporal fluctuations, such as syllable onsets and offsets, required a nonlinear sound representation based on temporal modulation energy. Reconstruction accuracy was highest within the range of spectro-temporal fluctuations that have been found to be critical for speech intelligibility. The decoded speech representations allowed readout and identification of individual words directly from brain activity during single trial sound presentations. These findings reveal neural encoding mechanisms of speech acoustic parameters in higher order human auditory cortex.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Experiment paradigm.**
Participants listened to words (acoustic waveform, top left), while neural signals were recorded from cortical surface electrode arrays (top right, red circles) implanted over superior and middle temporal gyrus (STG, MTG). Speech-induced cortical field potentials (bottom right, gray curves) recorded at multiple electrode sites were used to fit multi-input, multi-output models for offline decoding. The models take as input time-varying neural signals at multiple electrodes and output a spectrogram consisting of time-varying spectral power across a range of acoustic frequencies (180–7,000 Hz, bottom left). To assess decoding accuracy, the reconstructed spectrogram is compared to the spectrogram of the original acoustic waveform.
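
The multi-input, multi-output decoder described in this caption can be illustrated with a small sketch. It assumes ridge regression on time-lagged neural features, which is one common way to fit such a linear mapping and is not necessarily the authors' fitting procedure. The function names, the lag count, the ridge penalty, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def lagged_design(neural, n_lags):
    """Stack neural[t], neural[t+1], ..., neural[t+n_lags-1] for each time t.

    neural: array of shape (n_times, n_electrodes).
    Returns an array of shape (n_times, n_electrodes * n_lags).
    Samples past the end of the recording are zero-padded.
    """
    n_times, n_elec = neural.shape
    padded = np.vstack([neural, np.zeros((n_lags - 1, n_elec))])
    return np.hstack([padded[lag:lag + n_times] for lag in range(n_lags)])

def fit_decoder(neural, spec, n_lags=5, ridge=1.0):
    """Ridge-regression weights mapping lagged neural signals to a spectrogram."""
    X = lagged_design(neural, n_lags)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ spec)

def reconstruct(neural, weights, n_lags=5):
    """Apply fitted weights to new neural data."""
    return lagged_design(neural, n_lags) @ weights

def mean_channel_correlation(actual, predicted):
    """Pearson r per frequency channel, averaged over channels."""
    rs = [np.corrcoef(actual[:, k], predicted[:, k])[0, 1]
          for k in range(actual.shape[1])]
    return float(np.mean(rs))

# Synthetic stand-in data (not real recordings): each "electrode" is a noisy
# linear mixture of the spectrogram channels, delayed by two samples.
rng = np.random.default_rng(0)
n_times, n_freq, n_elec, delay = 2000, 8, 16, 2
spec = rng.standard_normal((n_times, n_freq))
mixing = rng.standard_normal((n_freq, n_elec))
neural = np.zeros((n_times, n_elec))
neural[delay:] = spec[:-delay] @ mixing
neural += 0.1 * rng.standard_normal(neural.shape)

# Fit on the first part of the data, evaluate on held-out samples.
train, test = slice(0, 1500), slice(1500, 2000)
weights = fit_decoder(neural[train], spec[train])
recon = reconstruct(neural[test], weights)
accuracy = mean_channel_correlation(spec[test], recon)
print(f"held-out reconstruction accuracy (mean r): {accuracy:.2f}")
```

The decoder looks at neural samples that follow each stimulus time point because the neural response lags the sound. Accuracy is computed on held-out samples, mirroring the comparison between reconstructed and original spectrograms described above.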

**Figure 2. Spectrogram reconstruction.**
(A) Top: spectrogram of six isolated words (deep, jazz, cause) and pseudowords (fook, ors, nim) presented aurally to an individual participant. Bottom: spectrogram-based reconstruction of the same speech segment, linearly decoded from a set of electrodes. Purple and green bars denote vowels and fricative consonants, respectively, and the spectrogram is normalized within each frequency channel for display. (B) Single trial high gamma band power (70–150 Hz, gray curves) induced by the speech segment in (A). Recordings are from four different STG sites used in the reconstruction. The high gamma response at each site is z-scored and plotted in standard deviation (SD) units. Right panel: frequency tuning curves (dark black) for each of the four electrode sites, sorted by peak frequency and normalized by maximum amplitude. Red bars overlay each peak frequency and indicate SEM of the parameter estimate. Frequency tuning was computed from spectro-temporal receptive fields (STRFs) measured at each individual electrode site. Tuning curves exhibit a range of functional forms including multiple frequency peaks (Figures S1B and S2B). (C) The anatomical distribution of fitted weights in the reconstruction model. Dashed box denotes the extent of the electrode grid (shown in Figure 1). Weight magnitudes are averaged over all time lags and spectrogram frequencies and spatially smoothed for display. Nonzero weights are largely focal to STG electrode sites. Scale bar is 10 mm.

**Figure 3. Individual participant and group average reconstruction accuracy.**
(A) Overall reconstruction accuracy for each participant using the linear spectrogram model. Error bars denote resampling SEM. Overall accuracy is reported as the mean over all acoustic frequencies. Participants are grouped by grid density (low or high) and stimulus set (isolated words or sentences). Statistical significance of the correlation coefficient for each individual participant was computed using a randomization test. Reconstructed trials were randomly shuffled 1,000 times and the correlation coefficient was computed for each shuffle to create a null distribution of coefficients. The p value was calculated as the proportion of elements greater than the observed correlation. (B) Reconstruction accuracy as a function of acoustic frequency averaged over all participants (N = 15) using the linear spectrogram model. Shaded region denotes SEM over participants.
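
The randomization test in panel (A) can be sketched as follows. This is a minimal illustration under stated assumptions: trials are stored as rows, the statistic is a single Pearson correlation over all concatenated trials, and the data are synthetic. The helper names `trial_correlation` and `shuffle_p_value` are hypothetical.

```python
import numpy as np

def trial_correlation(actual, recon):
    """Pearson r between actual and reconstructed trials, concatenated."""
    return float(np.corrcoef(actual.ravel(), recon.ravel())[0, 1])

def shuffle_p_value(actual, recon, n_shuffles=1000, seed=0):
    """Shuffle the reconstructed trials to build a null distribution of r.

    actual, recon: arrays of shape (n_trials, n_features).
    Returns the observed r and the proportion of null values above it.
    """
    rng = np.random.default_rng(seed)
    observed = trial_correlation(actual, recon)
    null = np.empty(n_shuffles)
    for i in range(n_shuffles):
        perm = rng.permutation(recon.shape[0])  # break the trial pairing
        null[i] = trial_correlation(actual, recon[perm])
    p_value = float(np.mean(null > observed))
    return observed, p_value

# Synthetic example: reconstructions are noisy copies of the actual trials.
rng = np.random.default_rng(1)
actual_trials = rng.standard_normal((40, 50))
recon_trials = actual_trials + rng.standard_normal((40, 50))
r_obs, p = shuffle_p_value(actual_trials, recon_trials)
print(f"observed r = {r_obs:.2f}, p = {p:.3f}")
```

With a finite number of shuffles, a proportion of zero only bounds the p value from above by roughly 1/`n_shuffles`.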

**Figure 4. Factors influencing reconstruction quality.**
(A) Group average t value map of informative electrodes, which are predominantly localized to posterior STG. For each participant, informative electrodes are defined as those associated with significant weights (p<0.05, FDR correction) in the fitted reconstruction model. To plot electrodes in a common anatomical space, spatial coordinates of significant electrodes are normalized to the MNI (Montreal Neurological Institute) brain template (Yale BioImage Suite, www.bioimagesuite.org). The dashed white line denotes the extent of electrode coverage pooled over participants. (B) Reconstruction accuracy is significantly greater than zero when using neural responses within the high gamma band (∼70–170 Hz; p<0.05, one sample t tests, df = 14, Bonferroni correction). Accuracy was computed separately in 10 Hz bands from 1–300 Hz and averaged across all participants (N = 15). (C) Mean reconstruction accuracy improves with increasing number of electrodes used in the reconstruction algorithm. Error bars indicate SEM over 20 cross-validated data sets of four participants with 4 mm high density grids. (D) Accuracy across participants is strongly correlated (r = 0.78, p<0.001, df = 13) with tuning spread (which varied by participant depending on grid placement and electrode density). Tuning spread was quantified as the fraction of frequency bins that included one or more peaks, ranging from 0 (no peaks) to 1 (at least one peak in all frequency bins, ranging from 180–7,000 Hz).
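
The tuning spread measure in panel (D) reduces to a short computation, sketched below. The caption does not state the number or spacing of frequency bins, so the eight log-spaced bins between 180 and 7,000 Hz used here are an assumption, as are the example peak frequencies.

```python
import numpy as np

def tuning_spread(peak_freqs_hz, bin_edges_hz):
    """Fraction of frequency bins that contain at least one tuning peak."""
    counts, _ = np.histogram(peak_freqs_hz, bins=bin_edges_hz)
    return float(np.mean(counts > 0))

# Assumed binning: 8 logarithmically spaced bins spanning 180-7,000 Hz.
edges = np.geomspace(180.0, 7000.0, 9)

# Hypothetical peak frequencies pooled over one participant's electrodes.
peaks = [200.0, 250.0, 1000.0, 1100.0, 4000.0]
spread = tuning_spread(peaks, edges)
print(spread)  # 3 of the 8 bins are occupied -> 0.375
```

A participant whose electrodes all peak near the same frequency yields a spread close to 0, while peaks covering every bin yield 1.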

**Figure 5. Comparison of linear and nonlinear coding of temporal fluctuations.**
(A) Mean reconstruction accuracy (r) as a function of temporal modulation rate, averaged over all participants (N = 15). Modulation-based decoding accuracy (red curve) is higher compared to spectrogram-based decoding (blue curve) for temporal rates ≥4 Hz. In addition, spectrogram-based decoding accuracy is significantly greater than zero for lower modulation rates (≤8 Hz), supporting the possibility of a dual modulation and envelope-based coding scheme for slow modulation rates. Shaded gray regions indicate SEM over participants. (B) Mean ensemble rate tuning curve across all predictive electrode sites (n = 195). Error bars indicate SEM. Overlaid histograms indicate proportion of sites with peak tuning at each rate. (C) Within-site differences between modulation and spectrogram-based tuning. Arrow indicates the mean difference across sites. Within-site, nonlinear modulation models are tuned to higher temporal modulation rates than the corresponding linear spectrogram models (p<10⁻⁷, two sample paired t test, df = 194).

**Figure 6. Schematic of nonlinear modulation model.**
(A) The input spectrogram (top left) is transformed by a linear modulation filter bank (right) followed by a nonlinear magnitude operation (not shown). This nonlinear operation extracts the modulation energy of the incoming spectrogram and generates phase invariance to local fluctuations in the spectrogram envelope. The input representation is the two-dimensional spectrogram S(*f,t*) across frequency f and time t. The output (bottom left) is the four-dimensional modulation energy representation M(*s,r,f,t*) across spectral modulation scale s, temporal modulation rate r, frequency f, and time t. In the full modulation representation, negative rates by convention correspond to upward frequency sweeps, while positive rates correspond to downward frequency sweeps. Accuracy for positive and negative rates was averaged unless otherwise shown. See Materials and Methods. (B) Schematic of linear (spectrogram envelope) and nonlinear (modulation energy) temporal coding. Left: acoustic waveform (black curve) and spectrogram of a temporally modulated tone. The linear spectrogram model (top) assumes that neural responses are a linear function of the spectrogram envelope (plotted for the tone center frequency channel, top right). In this case, the instantaneous output may be high or low and does not directly indicate the modulation rate of the envelope. The nonlinear modulation model (bottom) assumes that neural responses are a linear function of modulation energy. This is an amplitude-based coding scheme (plotted for the peak modulation channel, bottom right). The nonlinear modulation model explicitly estimates the modulation rate by taking on a constant value for a constant rate.
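
The contrast in panel (B) can be reproduced numerically for a single spectrogram channel. The sketch below assumes a complex Gabor filter as the linear modulation filter and takes its magnitude as the nonlinear step; the paper's actual filter bank is described in its Materials and Methods and may differ. The sampling rate, filter width, and test rates are illustrative choices.

```python
import numpy as np

def gabor_filter(rate_hz, fs, n_cycles=3.0):
    """Complex temporal filter centred on rate_hz with unit gain at that rate."""
    half_width = n_cycles / rate_hz                      # seconds
    tau = np.arange(-half_width, half_width, 1.0 / fs)
    window = np.exp(-0.5 * (tau / (half_width / 3.0)) ** 2)
    return window * np.exp(2j * np.pi * rate_hz * tau) / window.sum()

def modulation_energy(envelope, rate_hz, fs):
    """Linear filtering followed by a magnitude (phase-discarding) operation."""
    filtered = np.convolve(envelope, gabor_filter(rate_hz, fs), mode="same")
    return np.abs(filtered)

fs = 100.0                                   # envelope sampling rate (assumed)
t = np.arange(0.0, 4.0, 1.0 / fs)
envelope = 1.0 + np.cos(2 * np.pi * 4.0 * t)  # channel modulated at 4 Hz

energy_matched = modulation_energy(envelope, 4.0, fs)   # filter at the true rate
energy_other = modulation_energy(envelope, 16.0, fs)    # filter at another rate

interior = slice(100, 300)                   # skip convolution edge effects
print("envelope range:", np.ptp(envelope[interior]))
print("4 Hz energy, relative variation:",
      energy_matched[interior].std() / energy_matched[interior].mean())
print("mean energy at 4 Hz vs 16 Hz filters:",
      energy_matched[interior].mean(), energy_other[interior].mean())
```

The envelope itself swings between high and low values, whereas the modulation energy at the matched rate stays nearly constant and is much larger than the energy in the mismatched channel. This is the sense in which the magnitude operation yields a phase-invariant, rate-specific signal.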

**Figure 7. Example of nonlinear modulation coding and reconstruction.**
(A) Top: the spectrogram of an isolated word (“waldo”) presented aurally to one participant. Blue curve plots the spectrogram envelope, summed over all frequencies. Left panels: induced high gamma responses (black curves, trial averaged) at four different STG sites. Temporal modulation energy of the stimulus (dashed red curves) is overlaid (computed from 2, 4, 8, and 16 Hz modulation filters and normalized to maximum value). Dashed black lines indicate baseline response level. Right panels: nonlinear modulation rate tuning curves for each site (estimated from nonlinear STRFs). Shaded regions and error bars indicate SEM. (B) Original spectrogram (top), modulation-based reconstruction (middle), and spectrogram-based reconstruction (bottom), linearly decoded from a fixed set of STG electrodes. The modulation reconstruction is projected into the spectrogram domain using an iterative projection algorithm and an overcomplete set of modulation filters. The displayed spectrogram is averaged over 100 random initializations of the algorithm.

**Figure 8. Word identification.**
Word identification based on the reconstructed spectrograms was assessed using a set of 47 individual words and pseudowords from a single speaker in a high density 4 mm grid experiment. The speech recognition algorithm is described in the text. (A) Distribution of identification rank for all 47 words in the set. Median identification rank is 0.89 (black arrow), which is higher than 0.50 chance level (dashed line; p<0.0001; randomization test). Statistical significance was assessed by a randomization test in which a null distribution of the median was constructed by randomly shuffling the word pairs 10,000 times, computing median identification rank for each shuffle, and calculating the percentile rank of the true median in the null distribution. Best performance was achieved after smoothing the spectrograms with a 2-D box filter (500 ms, 2 octaves). (B) Receiver operating characteristic (ROC) plot of identification performance (red curve). Diagonal black line indicates no predictive power. (C) Examples of accurately (right) and inaccurately (left) identified words. Left: reconstruction of pseudoword “heef” is poor and leads to a low identification rank (0.13). Right: reconstruction of pseudoword “thack” is accurate and best matches the correct word out of 46 other candidate words (identification rank = 1.0). (D) Actual and reconstructed word similarity is correlated (r = 0.41). Pair-wise similarity between the original spectrograms of individual words is correlated with pair-wise similarity between the reconstructed and original spectrograms. Plotted values are computed prior to spectrogram smoothing used in the identification algorithm. Gray points denote the similarity between identical words.
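
The identification rank reported here can be illustrated with a simplified sketch. The recognition algorithm itself is described in the paper's text and is not reproduced in this caption, so the version below makes two assumptions: similarity is the Pearson correlation between a reconstruction and each candidate's original spectrogram, and the rank is the fraction of the other candidates that match less well than the correct word. The spectrograms are synthetic and the function name is hypothetical.

```python
import numpy as np

def identification_rank(recon, candidates, true_index):
    """Fraction of incorrect candidates that are less similar than the true word.

    recon: reconstructed spectrogram, shape (n_freq, n_times).
    candidates: original spectrograms, shape (n_words, n_freq, n_times).
    Returns 1.0 when the true word is the best match; about 0.5 is chance.
    """
    sims = np.array([np.corrcoef(recon.ravel(), c.ravel())[0, 1]
                     for c in candidates])
    others = np.delete(sims, true_index)
    return float(np.mean(others < sims[true_index]))

# Synthetic stand-in for a 47-item word set with noisy reconstructions.
rng = np.random.default_rng(0)
words = rng.standard_normal((47, 32, 50))
recons = words + 2.0 * rng.standard_normal(words.shape)

ranks = [identification_rank(recons[i], words, i) for i in range(len(words))]
median_rank = float(np.median(ranks))
print(f"median identification rank: {median_rank:.2f}")
```

Under this definition, a reconstruction that best matches the correct word out of 46 other candidates has rank 1.0, consistent with the "thack" example in panel (C).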
