Decoding and synthesizing tonal language speech from brain activity

Yan Liu^{1,2,3,4}, Zehao Zhao^{1,2,3,4}, Minpeng Xu^{5,6}, Haiqing Yu⁵, Yanming Zhu^{1,2,3,4}, Jie Zhang^{1,2,3,4}, Linghao Bu^{1,2,3,4,7}, Xiaoluo Zhang^{1,2,3,4}, Junfeng Lu^{1,2,3,4,8}, Yuanning Li⁹, Dong Ming^{5,6}, Jinsong Wu^{1,2,3,4}

Affiliations

¹ Department of Neurosurgery, Huashan Hospital, Shanghai Medical College, Fudan University, Shanghai 200040, China.
² National Center for Neurological Disorders, Shanghai 200052, China.
³ Shanghai Key Laboratory of Brain Function Restoration and Neural Regeneration, Shanghai 200040, China.
⁴ Neurosurgical Institute of Fudan University, Shanghai 200052, China.
⁵ Department of Biomedical Engineering, College of Precision Instruments and Optoelectronics Engineering, Tianjin University, Tianjin 300041, China.
⁶ Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin 300041, China.
⁷ Department of Neurosurgery, First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou 310000, China.
⁸ MOE Frontiers Center for Brain Science, Huashan Hospital, Fudan University, Shanghai 200040, China.
⁹ School of Biomedical Engineering, ShanghaiTech University, Shanghai 201210, China.

PMID: 37294753
PMCID: PMC10256166
DOI: 10.1126/sciadv.adh0478

Yan Liu et al. Sci Adv. 2023 Jun 9;9(23):eadh0478. doi: 10.1126/sciadv.adh0478. Epub 2023 Jun 9.

Abstract

Recent studies have shown the feasibility of speech brain-computer interfaces (BCIs) as a clinically valid treatment for helping patients with communication disorders who speak nontonal languages restore their speech ability. However, a tonal language speech BCI is challenging because it additionally requires precise control of laryngeal movements to produce lexical tones. Thus, the model should emphasize features from the tone-related cortex. Here, we designed a modularized multistream neural network that directly synthesizes tonal language speech from intracranial recordings. The network decoded lexical tones and base syllables independently via parallel streams of neural network modules inspired by neuroscience findings. The speech was synthesized by combining tonal syllable labels with nondiscriminant speech neural activity. Compared to commonly used baseline models, our proposed models achieved higher performance with modest training data and computational costs. These findings suggest a potential strategy for approaching tonal language speech restoration.

Figures

**Fig. 1. Electrode coverage and category.**
(A) Anatomical reconstructions of all participants. The locations of the ECoG electrodes were plotted with colored discs. The colors indicated the electrode categories (see Materials and Methods). (B) Venn diagram of all speech-responsive electrodes in all participants, broken down into four categories (nonresponsive category was not plotted). (C) The averaged high-γ responses regarding different lexical tones during tone production from five example electrodes, time-locked to speech onsets [electrode locations were plotted in (A) as stars in different colors]. Black dots indicated time points of significance. For the top row in (C), black dots indicated the time when there was a significant difference in the mean high-γ activity between the two syllables (t test, P < 0.05, Bonferroni corrected). For the bottom row in (C), black dots indicated the time when there was a significant difference in the mean high-γ activity between the four tones (F test, P < 0.05, Bonferroni corrected).

**Fig. 2. The model architecture and speech synthesis pipeline.**
(A) Each participant articulated the eight tonal syllables and their neural activity was recorded with ECoG grids (256 electrodes) covering the peri-Sylvian cortices. The analytic amplitudes of the high-γ activity (70 to 150 Hz) were extracted and clipped to the length of 1 s and supplied as input to the speech decoding model. The electrodes were classified into one of five categories and then fed into different decoding streams according to their category assignments. The illustration of the five-level tone marks demonstrated the pitch contours of flat-high tone (tone 1), medium-rising tone (tone 2), low-dipping tone (tone 3), and high-falling tone (tone 4) in Mandarin. (B) Tone discriminative electrodes (e.g., electrode E1 in Fig. 1C) and tone and syllable discriminative electrodes (e.g., electrode E3 in Fig. 1C) were fed into a parallel CNN-LSTM network to generate the tone label. (C) Syllable discriminative electrodes (e.g., electrode E2 in Fig. 1C) and tone and syllable discriminative electrodes (e.g., electrode E3 in Fig. 1C) were fed into a sequential CNN-LSTM network to generate the syllable label. (D) The synthesis network combined the signals from nondiscriminative electrodes and the outputs of (B) and (C) to generate the Mel spectrogram of speech sound. (E) The sound wave was synthesized from the Mel spectrogram via the Griffin-Lim algorithm (see audio S1).
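
The preprocessing step in (A) can be illustrated with a minimal sketch. It assumes a NumPy array of raw ECoG samples and an FFT-based band-limited analytic signal; the caption does not state the sampling rate, the filter design, or how the 1-s window is placed, so `fs`, `onset_sample`, and the function names below are hypothetical and this is not necessarily the authors' implementation.

```python
import numpy as np

def high_gamma_amplitude(x, fs, band=(70.0, 150.0)):
    """Analytic amplitude of x restricted to `band` (Hz).

    x has shape (..., n_samples). Negative frequencies and frequencies
    outside the band are zeroed and in-band positive frequencies are
    doubled, which gives the analytic signal of the band-passed data.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    spectrum = np.fft.fft(x, axis=-1)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    mask = np.where((freqs >= band[0]) & (freqs <= band[1]), 2.0, 0.0)
    analytic = np.fft.ifft(spectrum * mask, axis=-1)
    return np.abs(analytic)

def clip_trial(amplitude, fs, onset_sample, duration_s=1.0):
    """Cut a fixed-length window starting at onset_sample (assumed placement)."""
    n = int(round(duration_s * fs))
    return amplitude[..., onset_sample:onset_sample + n]

# Toy example: one channel with a 100 Hz component (in band, amplitude 3)
# and a 10 Hz component (out of band, amplitude 5).
fs = 1000.0                      # hypothetical sampling rate in Hz
t = np.arange(2000) / fs
raw = 3.0 * np.cos(2 * np.pi * 100 * t) + 5.0 * np.cos(2 * np.pi * 10 * t)
amp = high_gamma_amplitude(raw[None, :], fs)
trial = clip_trial(amp, fs, onset_sample=500)
```

In this toy signal only the 100 Hz component survives the band restriction, so the recovered envelope stays at that component's amplitude. Real recordings would also need artifact rejection and per-electrode normalization, which are outside this sketch.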

**Fig. 3. The tonal syllable decoding accuracy of different models.**
(A and B) Bar plots showed the decoding accuracy (means ± SEM) of sequential (red bar) and parallel (blue bar) CNN-LSTM network using (A) syllable discriminative electrodes and (B) tone discriminative electrodes. The blue dashed line indicates the syllable chance level. The red dashed line indicates the tone chance level. *P < 0.05 and **P < 0.01; ns, nonsignificant; two-sided t test for independent samples. (C) Bar plots showed the decoding accuracy (means ± SEM) of the label generator, sequential and parallel CNN-LSTM network, CNN, and VGG16 model (bars are color-coded by models). *P < 0.05, **P < 0.01, ***P < 0.001, and ****P < 0.0001; compared to the accuracy of the label generator of the same subject, two-sided t test for independent samples. See Fig. 2 (B and C) for the architectures of sequential and parallel CNN-LSTM networks and the whole label generator.
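
As a rough illustration of how separate tone and syllable predictions relate to the eight tonal syllables and to the chance levels drawn in the plots, the sketch below combines two label streams into one tonal syllable index and scores it. The index encoding, the function names, and the assumption of balanced classes (chance = 1 / number of classes) are illustrative choices, not details taken from the paper.

```python
import numpy as np

N_SYLLABLES = 2  # two base syllables (see Fig. 1C)
N_TONES = 4      # four Mandarin lexical tones

def combine_labels(syllable, tone):
    """Map (syllable index, tone index) pairs to a tonal syllable id in 0..7.

    The encoding syllable * N_TONES + tone is a hypothetical convention.
    """
    return np.asarray(syllable) * N_TONES + np.asarray(tone)

def accuracy(pred, truth):
    """Fraction of trials where the predicted label matches the true label."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))

# Chance levels under the assumption of balanced classes.
CHANCE = {
    "syllable": 1.0 / N_SYLLABLES,
    "tone": 1.0 / N_TONES,
    "tonal_syllable": 1.0 / (N_SYLLABLES * N_TONES),
}

# Toy predictions from the two streams and the corresponding ground truth.
pred = combine_labels([0, 1, 1], [0, 3, 2])
truth = combine_labels([0, 1, 1], [0, 3, 1])
acc = accuracy(pred, truth)
```

A trial counts as correct for the combined label only when both streams are right, which is why the tonal syllable chance level is lower than either stream's chance level.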

**Fig. 4. Evaluation of the synthesized speech sound quality.**
(A) A pair of examples of the synthesized and original syllable sound spectrograms (compressed to an 80 × 44 matrix Mel spectrogram). Their MCD was 2.16 dB. (B) Bar plot showing the MCD (means ± SEM) of different lexical tones in five participants. (C) Bar plot showing the tone IAs (means ± SEM) of the synthesized sound wave and the ground truth by 31 listeners. (D) Scatterplot showing the correlation between the tone IAs of the synthesized sound wave and the ground truth (n = 160, Pearson’s correlation r = 0.70, P = 3.15 × 10⁻²⁵). (E) Bar plot showing the MOS (means ± SEM) of the synthesized sound wave and the ground truth evaluated by 31 listeners. (F) Scatterplot showing the correlation between the MOS of the synthesized sound wave and the ground truth (n = 160, Pearson’s correlation r = 0.68, P = 3.93 × 10⁻²³).
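
For readers unfamiliar with the Mel cepstral distortion (MCD) reported in (A) and (B), the sketch below implements one commonly used definition, averaged over frames. The caption does not say which MCD variant, cepstral order, or alignment procedure was used, so the function name and the input conventions here (time-aligned frames, energy coefficient already removed) are assumptions.

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB between two mel-cepstral sequences.

    ref, syn: arrays of shape (n_frames, n_coeffs), already time-aligned,
    with the 0th (energy) coefficient assumed to be excluded by the caller.
    Per frame: (10 / ln 10) * sqrt(2 * sum_d (ref_d - syn_d)^2).
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("ref and syn must have the same shape")
    squared_dist = np.sum((ref - syn) ** 2, axis=1)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * squared_dist)
    return float(np.mean(per_frame))

# Toy example: the synthesized frames differ from the reference by 1.0
# in a single coefficient on every frame.
reference = np.zeros((5, 13))
synthesized = reference.copy()
synthesized[:, 0] = 1.0
mcd = mel_cepstral_distortion(reference, synthesized)
```

Lower values indicate that the synthesized cepstra are closer to the reference; identical sequences give 0 dB.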
