Curr Biol. 2024 Sep 9;34(17):4021-4032.e5.
doi: 10.1016/j.cub.2024.07.073. Epub 2024 Aug 16.

Auditory cortex encodes lipreading information through spatially distributed activity

Ganesan Karthik et al. Curr Biol. 2024.

Abstract

Watching a speaker's face improves speech perception accuracy. This benefit is enabled, in part, by implicit lipreading abilities present in the general population. While it is established that lipreading can alter the perception of a heard word, it is unknown how these visual signals are represented in the auditory system or how they interact with auditory speech representations. One influential, but untested, hypothesis is that visual speech modulates the population-coded representations of phonetic and phonemic features in the auditory system. This model is largely supported by data showing that silent lipreading evokes activity in the auditory cortex, but these activations could alternatively reflect general effects of arousal or attention or the encoding of non-linguistic features such as visual timing information. This gap limits our understanding of how vision supports speech perception. To test the hypothesis that the auditory system encodes visual speech information, we acquired functional magnetic resonance imaging (fMRI) data from healthy adults and intracranial recordings from electrodes implanted in patients with epilepsy during auditory and visual speech perception tasks. Across both datasets, linear classifiers successfully decoded the identity of silently lipread words using the spatial pattern of auditory cortex responses. Examining the time course of classification using intracranial recordings, lipread words were classified at earlier time points relative to heard words, suggesting a predictive mechanism for facilitating speech. These results support a model in which the auditory system combines the joint neural distributions evoked by heard and lipread words to generate a more precise estimate of what was said.
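
To make the decoding approach concrete, the sketch below classifies word identity from trial-by-trial spatial response patterns with a cross-validated linear classifier. The data are synthetic and scikit-learn is an assumption; the paper does not specify its software, feature counts, or cross-validation scheme.

```python
# Minimal sketch (not the authors' pipeline): decode word identity from
# spatial response patterns with a cross-validated linear classifier.
# Synthetic data stand in for per-trial auditory cortex responses.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_trials, n_features, n_words = 120, 200, 4                # hypothetical sizes
labels = np.repeat(np.arange(n_words), n_trials // n_words)  # word per trial

# Spatial "patterns": noise plus a weak word-specific template.
templates = rng.normal(size=(n_words, n_features))
patterns = rng.normal(size=(n_trials, n_features)) + 0.3 * templates[labels]

clf = LinearSVC(max_iter=10000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc = cross_val_score(clf, patterns, labels, cv=cv)
print(f"mean accuracy = {acc.mean():.2f} (chance = {1 / n_words:.2f})")
```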

Keywords: ECoG; audiovisual; iEEG; multisensory; sEEG; speech.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Figure 1. fMRI task schematic.
Schematic of auditory and visual trials. Auditory trials began with a fixation cross followed by a consonant-vowel (CV) stimulus. Visual trials presented the visual components of these same recordings without the corresponding audio track. After stimulus offset, subjects were cued to identify which of the three phonemes (or visemes) they heard (or saw).
Figure 2. Univariate activations.
(a) Phonemes vs fixation, (b) visemes vs fixation, and (c) phonemes vs visemes. Colored regions reflect significant increases (red and yellow) or decreases (blue) in task-related activation. (d) BOLD time-series extracted from four regions (averaged across left and right hemispheres). Data were normalized by the maximum average value between 3.2 s (TR=5) and 9.6 s (TR=13) to better visualize the relative shapes of the time-series. Notably, phonemes and visemes showed a similar temporal response in the STG, but visemes elicited activity in the pSTS more quickly than phonemes. Stimulus onset occurred at TR=1. See also Figure S1, Tables S1–2.
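
As a small illustration of the scaling described for panel (d), the sketch below (with a synthetic series and hypothetical variable names) divides a mean BOLD time-series by its maximum between TR=5 (3.2 s) and TR=13 (9.6 s).

```python
# Sketch of the Figure 2d normalization: divide each region's mean BOLD
# time-series by its maximum between TR=5 (3.2 s) and TR=13 (9.6 s).
# The series is synthetic; only the indexing mirrors the caption.
import numpy as np

bold = np.random.default_rng(2).normal(loc=1.0, scale=0.2, size=20)  # 20 TRs

window = slice(4, 13)          # 1-indexed TRs 5..13 -> 0-indexed 4..12
bold_norm = bold / bold[window].max()
print(bold_norm.round(2))
```
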
Figure 3. fMRI decoding of phoneme and viseme information in an event-related design.
(a-c) Multiple-comparison-corrected, searchlight-based MVPA classification in n = 64 subjects. Classifiers were trained to identify (a) the phoneme heard in the auditory-only condition or (b) the viseme seen in the visual-only condition. Chance accuracy is 33.3%. (a) Peak phoneme decoding was observed in the bilateral STG. (b) Significant viseme decoding was observed in the bilateral STG, left pSTS, and visual regions. (c) Vertices with significant classification of phonemes but not visemes (red), visemes but not phonemes (blue), or of both phonemes and visemes (purple). There is a large overlap in the vertices at which visemes and phonemes could be classified. Restricted to just the STG, vertices with significant viseme classification covered roughly half of the STG area at which phonemes were successfully classified (48.1% overlap), with negligible area uniquely able to classify visemes. (d) ROIs used for hypothesis-driven classification at the single-subject level. (e) Results of classification at selected ROIs. Phonemes were significantly classified from the left STG and pSTS. Visemes were significantly classified from the left STG, pSTS, and visual cortex. Center line reflects the mean, the colored box the SE, and the tails the 95% confidence intervals. *p<.05, ***p<.001. See also Figure S2, Table S3.
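
For readers who want to see the shape of such an analysis, the following is a toy searchlight classification with three classes (chance 33.3%). nilearn is an assumption rather than the paper's stated toolbox, and the 4D volume here is random noise.

```python
# Sketch of a three-way searchlight classification in the spirit of
# Figure 3 (chance = 33.3%). nilearn is assumed -- the paper does not
# name its software -- and the 4D data below are pure noise.
import numpy as np
import nibabel as nib
from nilearn.decoding import SearchLight
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
shape, n_trials = (10, 10, 10), 60
affine = np.eye(4)
imgs = nib.Nifti1Image(rng.normal(size=shape + (n_trials,)), affine)
mask = nib.Nifti1Image(np.ones(shape, dtype=np.int8), affine)
labels = np.tile([0, 1, 2], n_trials // 3)        # 3 phonemes or visemes

sl = SearchLight(mask_img=mask, radius=2.0, estimator="svc",
                 cv=StratifiedKFold(n_splits=5), n_jobs=1)
sl.fit(imgs, labels)
print("peak searchlight accuracy:", sl.scores_.max().round(2))
```
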
Figure 4. iEEG results during an auditory-only (listening) and visual-only (lipreading) speech perception paradigm.
(a) Distribution of all recorded electrodes (those beneath the pial surface not shown) (n = 14 patients). Colors denote the hospital at which data were collected: yellow electrodes (n = 3 patients; HF) and purple electrodes (n = 11 patients; UM). (b-d) Event-related spectral perturbation (ERSP) plots from all STG electrodes, averaged across subjects. (e) HGp responses from two STG electrodes in response to auditory-only trials (phonemes; red lines) and visual-only trials (visemes; blue lines). Posterior STG electrodes showed increased HGp responses to visemes before the time at which speech sounds would be expected to begin. (f) HGp responses from two fusiform gyrus electrodes. Shaded regions reflect single-condition 95% confidence intervals. Light gray boxes show significant between-condition differences (corrected for multiple comparisons using FDR). See also Figure S3.
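
The high-gamma power (HGp) traces in panels (e-f) can be illustrated with the conventional band-pass-plus-Hilbert-envelope recipe sketched below; the 70-150 Hz band, sampling rate, and baseline window are common defaults assumed here, not parameters reported in the caption.

```python
# Rough sketch of computing a high-gamma power (HGp) time course from an
# iEEG voltage trace, as plotted in Figure 4e-f. The 70-150 Hz band and
# Hilbert-envelope approach are common conventions, assumed here.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000.0                                   # sampling rate in Hz (assumed)
t = np.arange(0, 2.0, 1 / fs)
# Synthetic electrode trace: low-frequency rhythm plus a high-gamma burst.
trace = np.sin(2 * np.pi * 8 * t)
trace[500:1000] += 0.5 * np.sin(2 * np.pi * 110 * t[500:1000])

# Band-pass 70-150 Hz, then take the analytic amplitude (Hilbert envelope).
b, a = butter(4, [70, 150], btype="bandpass", fs=fs)
hgp = np.abs(hilbert(filtfilt(b, a, trace))) ** 2

# Express power as percent change relative to a pre-stimulus baseline.
baseline = hgp[:500].mean()
hgp_pct = 100 * (hgp - baseline) / baseline
print("peak HGp (% of baseline):", hgp_pct.max().round(1))
```
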
Figure 5. iEEG classification of phoneme and viseme identities from auditory (n = 14 patients) and visual (n = 5 patients) regions.
(a) SVM classifier accuracy for the initial consonants (‘B’, ‘F’, ‘G’, or ‘D’) of either auditory-only or visual-only words, classified at the individual-subject level. Chance accuracy is 25%; plots show group-level boxplots. (b-c) Group-averaged confusion matrices. Cells denote the frequency at which each consonant-initial word was predicted (x-axis) relative to the true labels (y-axis). (d) Group-averaged classification at individual time points from STG electrodes (phoneme onset at 0 s), showing significant classification accuracy for both auditory-only and visual-only trials shortly after phoneme onset; in the visual-only condition, this time point reflected the associated speech onset time even though no auditory stimulus was presented. Shaded region reflects SEM. (e) Spatial distribution of electrodes at which auditory-only (red) or visual-only (blue) trials were reliably classified (p<.05 based on binomial statistics, corrected for multiple comparisons using FDR); purple electrodes reflect significant classification in both conditions and gray electrodes reflect non-significant classification in both conditions. Electrodes beneath the pial surface were projected out to the lateral surface for visualization. (f) Scatter plot quantifying the similarity of misclassification rates for auditory-only and visual-only trials. Data reflect pairwise classification values taken from the off-diagonal cells in panels b and c, with the first letter denoting the true consonant label and the second letter the predicted consonant label. For example, ‘G’ trials and ‘D’ trials were most readily confused by both the auditory-only and visual-only classifiers. (g) Group-level classification accuracy showing that responses in the fusiform gyrus can distinguish between different visemes but not phonemes. (h-i) Group-level confusion matrices for auditory-only and visual-only trials from fusiform gyrus electrodes. See also Tables S5–6.
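
A compressed sketch of the per-patient analysis summarized above: a four-way SVM over the consonant-initial words (‘B’, ‘F’, ‘G’, ‘D’; chance 25%), a normalized confusion matrix, and a binomial test of accuracy against chance. The features, fold counts, and use of scikit-learn/SciPy are assumptions; in the paper, electrode-level p-values are additionally FDR-corrected.

```python
# Sketch of the Figure 5 classification: 4-way SVM over consonant-initial
# words (chance = 25%), confusion matrix, and a binomial test of accuracy.
# Data are synthetic; the paper's features and folds may differ.
import numpy as np
from scipy.stats import binomtest
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
words = ["B", "F", "G", "D"]
n_per_word, n_features = 30, 50
labels = np.repeat(words, n_per_word)

# Synthetic features: a weak word-specific template plus noise per trial.
templates = {w: rng.normal(size=n_features) for w in words}
X = np.stack([0.4 * templates[w] + rng.normal(size=n_features) for w in labels])

pred = cross_val_predict(SVC(kernel="linear"), X, labels,
                         cv=StratifiedKFold(n_splits=5, shuffle=True,
                                            random_state=0))
cm = confusion_matrix(labels, pred, labels=words, normalize="true")
n_correct = int((pred == labels).sum())

# One-sided binomial test against chance (0.25); across electrodes the
# paper then corrects the resulting p-values with FDR.
p = binomtest(n_correct, n=labels.size, p=0.25, alternative="greater").pvalue
print(f"accuracy = {n_correct / labels.size:.2f}, p = {p:.3g}")
print(cm.round(2))
```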
