Sci Adv. 2023 Jun 9;9(23):eadh0478. doi: 10.1126/sciadv.adh0478. Epub 2023 Jun 9.

Decoding and synthesizing tonal language speech from brain activity

Yan Liu et al.

Abstract

Recent studies have shown the feasibility of speech brain-computer interfaces (BCIs) as a clinically valid treatment for restoring speech in nontonal language patients with communication disorders. However, tonal language speech BCI is challenging because producing lexical tones requires additional precise control of laryngeal movements; the model should therefore emphasize features from the tone-related cortex. Here, we designed a modularized multistream neural network that directly synthesizes tonal language speech from intracranial recordings. The network decoded lexical tones and base syllables independently via parallel streams of neural network modules inspired by neuroscience findings. The speech was synthesized by combining tonal syllable labels with nondiscriminant speech neural activity. Compared to commonly used baseline models, our proposed models achieved higher performance with modest training data and computational costs. These findings suggest a potential strategy for approaching tonal language speech restoration.


Figures

Fig. 1. Electrode coverage and category.
(A) Anatomical reconstructions of all participants. The locations of the ECoG electrodes are plotted as colored discs; the colors indicate the electrode categories (see Materials and Methods). (B) Venn diagram of all speech-responsive electrodes in all participants, broken down into four categories (the nonresponsive category was not plotted). (C) The averaged high-γ responses for different lexical tones during tone production from five example electrodes, time-locked to speech onsets [electrode locations are plotted in (A) as stars in different colors]. Black dots indicate time points of significance. For the top row in (C), black dots indicate the time points with a significant difference in mean high-γ activity between the two syllables (t test, P < 0.05, Bonferroni corrected). For the bottom row in (C), black dots indicate the time points with a significant difference in mean high-γ activity between the four tones (F test, P < 0.05, Bonferroni corrected).
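The per-time-point statistics in (C) can be sketched as follows: an independent-samples t test (two syllables) and a one-way F test (four tones) at each time sample, with Bonferroni correction over time points. The simulated high-γ traces, array shapes, and SciPy-based implementation below are illustrative assumptions, not the authors' code or data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_times = 40, 100

# Simulated high-gamma traces for two base syllables; a condition
# difference appears only after time index 50.
syl_a = rng.normal(0.0, 1.0, (n_trials, n_times))
syl_b = rng.normal(0.0, 1.0, (n_trials, n_times))
syl_b[:, 50:] += 1.5

# Simulated traces for the four lexical tones; tone 4 diverges after t=50.
tones = [rng.normal(0.0, 1.0, (n_trials, n_times)) for _ in range(4)]
tones[3][:, 50:] += 1.5

alpha = 0.05 / n_times  # Bonferroni correction over time points

# Per-time-point two-sample t test (cf. top row of Fig. 1C).
syl_sig = stats.ttest_ind(syl_a, syl_b, axis=0).pvalue < alpha

# Per-time-point one-way F test across the four tones (bottom row).
tone_p = np.array([stats.f_oneway(*(g[:, i] for g in tones)).pvalue
                   for i in range(n_times)])
tone_sig = tone_p < alpha
```

With this construction, significant time points cluster after the simulated effect onset, mirroring how the black dots trail speech onset in the figure.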
Fig. 2. The model architecture and speech synthesis pipeline.
(A) Each participant articulated the eight tonal syllables while their neural activity was recorded with ECoG grids (256 electrodes) covering the peri-Sylvian cortices. The analytic amplitudes of the high-γ activity (70 to 150 Hz) were extracted, clipped to 1-s segments, and supplied as input to the speech decoding model. The electrodes were classified into one of five categories and then fed into different decoding streams according to their category assignments. The illustration of the five-level tone marks demonstrates the pitch contours of the flat-high tone (tone 1), medium-rising tone (tone 2), low-dipping tone (tone 3), and high-falling tone (tone 4) in Mandarin. (B) Tone discriminative electrodes (e.g., electrode E1 in Fig. 1C) and tone and syllable discriminative electrodes (e.g., electrode E3 in Fig. 1C) were fed into a parallel CNN-LSTM network to generate the tone label. (C) Syllable discriminative electrodes (e.g., electrode E2 in Fig. 1C) and tone and syllable discriminative electrodes (e.g., electrode E3 in Fig. 1C) were fed into a sequential CNN-LSTM network to generate the syllable label. (D) The synthesis network combined the signals from nondiscriminative electrodes with the outputs of (B) and (C) to generate the Mel spectrogram of the speech sound. (E) The sound wave was synthesized from the Mel spectrogram via the Griffin-Lim algorithm (see audio S1).
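The magnitude-only waveform reconstruction in (E) can be illustrated with a minimal Griffin-Lim sketch in NumPy: starting from random phase, the algorithm alternates between the time domain and the STFT domain, keeping the target magnitude and updating only the phase. The STFT parameters and windowing here are assumptions for illustration, not the authors' settings, and the sketch operates on a plain magnitude spectrogram rather than the Mel spectrogram the paper inverts.

```python
import numpy as np

N_FFT, HOP = 256, 64

def stft(x):
    """Hann-windowed short-time Fourier transform, frames x freq bins."""
    win = np.hanning(N_FFT)
    frames = np.array([x[i:i + N_FFT] * win
                       for i in range(0, len(x) - N_FFT + 1, HOP)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Overlap-add inverse STFT with window-power normalization."""
    win = np.hanning(N_FFT)
    n = (spec.shape[0] - 1) * HOP + N_FFT
    out, norm = np.zeros(n), np.zeros(n)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    for k in range(spec.shape[0]):
        out[k * HOP:k * HOP + N_FFT] += frames[k] * win
        norm[k * HOP:k * HOP + N_FFT] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32):
    """Estimate a waveform whose STFT magnitude matches `mag`."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random init
    for _ in range(n_iter):
        audio = istft(mag * phase)                 # project to time domain
        phase = np.exp(1j * np.angle(stft(audio))) # keep phase, drop magnitude
    return istft(mag * phase)

# Usage: round-trip a test tone through magnitude-only reconstruction.
t = np.arange(2048) / 8000.0
tone = np.sin(2 * np.pi * 440.0 * t)
mag = np.abs(stft(tone))
recon = griffin_lim(mag)
err = np.linalg.norm(np.abs(stft(recon)) - mag) / np.linalg.norm(mag)
```

The relative spectral error `err` shrinks with each iteration because both projection steps are non-expansive; in practice a few dozen iterations suffice for intelligible audio.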
Fig. 3. The tonal syllable decoding accuracy of different models.
(A and B) Bar plots show the decoding accuracy (means ± SEM) of the sequential (red bars) and parallel (blue bars) CNN-LSTM networks using (A) syllable discriminative electrodes and (B) tone discriminative electrodes. The blue dashed line indicates the syllable chance level; the red dashed line indicates the tone chance level. *P < 0.05 and **P < 0.01; ns, nonsignificant; two-sided t test for independent samples. (C) Bar plots show the decoding accuracy (means ± SEM) of the label generator, the sequential and parallel CNN-LSTM networks, a CNN, and the VGG16 model (bars are color-coded by model). *P < 0.05, **P < 0.01, ***P < 0.001, and ****P < 0.0001; compared to the accuracy of the label generator for the same subject, two-sided t test for independent samples. See Fig. 2 (B and C) for the architectures of the sequential and parallel CNN-LSTM networks and the whole label generator.
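The summary statistics in this legend (means ± SEM, two-sided independent-samples t test between models) can be reproduced on placeholder numbers. The per-fold accuracy values below are fabricated solely for illustration and are not the paper's results.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold decoding accuracies for two models.
acc_parallel = np.array([0.82, 0.79, 0.85, 0.80, 0.83])
acc_sequential = np.array([0.70, 0.68, 0.74, 0.66, 0.71])

def mean_sem(a):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    return a.mean(), a.std(ddof=1) / np.sqrt(len(a))

m_p, sem_p = mean_sem(acc_parallel)
m_s, sem_s = mean_sem(acc_sequential)

# Two-sided t test for independent samples, as in the figure legend.
t_stat, p_val = stats.ttest_ind(acc_parallel, acc_sequential)
```

Plotted as bars with SEM error bars, a p value below 0.05 or 0.01 would earn the * or ** annotation used in the figure.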
Fig. 4. Evaluation of the synthesized speech sound quality.
(A) A pair of examples of the synthesized and original syllable sound spectrograms (compressed to an 80 × 44 matrix Mel spectrogram). Their MCD was 2.16 dB. (B) Bar plot showing the MCD (means ± SEM) of different lexical tones in five participants. (C) Bar plot showing the tone IAs (means ± SEM) of the synthesized sound wave and the ground truth by 31 listeners. (D) Scatterplot showing the correlation between the tone IAs of the synthesized sound wave and the ground truth (n = 160, Pearson’s correlation r = 0.70, P = 3.15 × 10−25). (E) Bar plot showing the MOS (means ± SEM) of the synthesized sound wave and the ground truth evaluated by 31 listeners. (F) Scatterplot showing the correlation between the MOS of the synthesized sound wave and the ground truth (n = 160, Pearson’s correlation r = 0.68, P = 3.93 × 10−23).
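The two quantitative measures in this figure, mel-cepstral distortion (MCD) and Pearson's correlation, can be sketched as below. The MCD scaling constant (10√2 / ln 10) is the commonly used convention and is an assumption about the paper's exact definition; the cepstra are assumed to be frame-aligned with the 0th coefficient excluded.

```python
import numpy as np

def mcd(c_ref, c_syn):
    """Mean mel-cepstral distortion in dB over aligned frames.

    c_ref, c_syn: (n_frames, n_coeffs) mel cepstra (0th coefficient
    excluded). Uses the common 10*sqrt(2)/ln(10) scaling.
    """
    diff = c_ref - c_syn
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

def pearson_r(x, y):
    """Pearson correlation coefficient of two 1-D score arrays."""
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))

# Usage on toy data: identical cepstra give 0 dB distortion, and a
# perfectly linear relation between score arrays gives r = 1.
c_ref = np.zeros((10, 13))
same = mcd(c_ref, c_ref)
r_perfect = pearson_r(np.array([1.0, 2.0, 3.0]), np.array([3.0, 5.0, 7.0]))
```

In the figure, lower MCD indicates a closer spectral match to the original recording, and the scatterplot correlations compare per-item listener ratings of synthesized versus ground-truth audio.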
