. 2019 Jun;16(3):036019.
doi: 10.1088/1741-2552/ab0c59. Epub 2019 Mar 4.

Speech synthesis from ECoG using densely connected 3D convolutional neural networks


Miguel Angrick et al. J Neural Eng. 2019 Jun.

Abstract

Objective: Direct synthesis of speech from neural signals could provide a fast and natural means of communication for people with neurological diseases. Invasively measured brain activity (electrocorticography; ECoG) supplies the temporal and spatial resolution necessary to decode fast and complex processes such as speech production. A number of impressive advances in speech decoding from neural signals have been achieved in recent years, but the complex dynamics are still not fully understood, and it is unlikely that simple linear models can capture the relation between neural activity and continuous spoken speech.

Approach: Here we show that deep neural networks can be used to map ECoG from speech production areas onto an intermediate representation of speech (logMel spectrogram). The proposed method uses a densely connected convolutional neural network topology which is well-suited to work with the small amount of data available from each participant.
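The intermediate target here is a logarithmic mel-scaled spectrogram. As a minimal sketch of how such a target could be computed from an audio signal, the following pure-numpy implementation frames the waveform, applies an FFT, and pools the magnitudes through a triangular mel filterbank; the sample rate, window length, hop size, and number of mel bands are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def hz_to_mel(f):
    # Convert Hz to the mel scale (O'Shaughnessy formula).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Frame the signal and compute magnitude spectra via the FFT.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))      # (frames, n_fft//2 + 1)

    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)

    # Log-compress the mel energies (small floor avoids log(0)).
    return np.log(mag @ fbank.T + 1e-10)           # (frames, n_mels)

audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
S = log_mel_spectrogram(audio)
print(S.shape)  # (97, 40): 97 frames x 40 mel bands
```

Regressing onto this compact representation, rather than raw audio samples, greatly reduces the output dimensionality the network must predict.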

Main results: In a study with six participants, we achieved correlations up to r = 0.69 between the reconstructed and original logMel spectrograms. We transferred our predictions back into an audible waveform by applying a Wavenet vocoder. The vocoder was conditioned on logMel features, which harnessed a much larger, pre-existing data corpus to provide the most natural acoustic output.

Significance: To the best of our knowledge, this is the first time that high-quality speech has been reconstructed from neural recordings during speech production using deep neural networks.


Figures

Figure 1.
Illustration of the experiment. Participants are asked to repeat words shown on a screen. During speech production, ECoG data and the acoustic stream are recorded simultaneously.
Figure 2.
Overview of the decoding approach, illustrating the transformation of neural data into an audible waveform. ECoG features for each time window are fed into a DenseNet regression model to reconstruct the logarithmic mel-scaled spectrogram. A Wavenet vocoder is then used to reconstruct an audio waveform from the spectrogram.
Figure 3.
Overview of the DenseNet network structure. Input samples are preprocessed features of the neural signal with the shape 8 × 8 × 9. The first two dimensions are used for the spatial alignment of the electrodes, while the third dimension comprises the temporal dynamics. The network architecture consists of three Dense Blocks to map the neural features onto the speech spectrogram.
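The defining property of a Dense Block is that each layer receives the concatenation of all preceding feature maps and contributes a fixed number of new ones (the growth rate). The following numpy sketch illustrates only this connectivity pattern on an 8 × 8 × 9 input; the pointwise linear layer is a hypothetical stand-in for the paper's 3D convolutions, and the layer count and growth rate are illustrative, not the published configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, k):
    # Stand-in for a 3D convolution: a pointwise linear map over channels
    # producing k new feature maps, followed by ReLU.
    w = rng.standard_normal((x.shape[-1], k)) * 0.1
    return np.maximum(x @ w, 0.0)

def dense_block(x, n_layers=4, growth=12):
    # Dense connectivity: every layer sees the concatenation of all
    # earlier outputs and appends `growth` new feature maps.
    for _ in range(n_layers):
        x = np.concatenate([x, layer(x, growth)], axis=-1)
    return x

# One ECoG input sample in the paper's 8 x 8 x 9 layout
# (spatial electrode grid x temporal context), one feature channel.
x = rng.standard_normal((8, 8, 9, 1))
y = dense_block(x)
print(y.shape)  # channels grow from 1 to 1 + 4 * 12 = 49
```

Because every layer reuses all earlier features instead of relearning them, dense connectivity keeps the parameter count low, which suits the small per-participant datasets described above.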
Figure 4.
Overview of the Wavenet vocoder architecture. The network comprises a stack of 30 residual blocks that learn a mapping from the acoustic speech signal x onto itself, conditioned on the extracted features c. Each block has a separate output; these outputs are summed to form the actual prediction. We use a 10-component mixture of logistic distributions (MoL) to predict audio samples.
Figure 5.
Reconstruction performance of DenseNet compared to random chance. (a) Pearson correlation coefficients between original and reconstructed spectrograms for each participant. Bars indicate the mean over all logarithmic mel-scaled coefficients while whiskers denote the standard deviation. (b) Detailed performance across all spectral bins for participant 5. (c) STOI scores as an objective intelligibility measure in comparison to the chance level.
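The evaluation in panel (a) computes a Pearson correlation separately for each mel-scaled coefficient. A minimal numpy sketch of that per-bin metric follows; the frame count, bin count, and the synthetic "reconstruction" are assumptions for illustration, not the study's data.

```python
import numpy as np

def per_bin_correlation(original, reconstructed):
    # Pearson r between original and reconstructed spectrograms,
    # computed independently for each mel-scaled coefficient (column).
    o = original - original.mean(axis=0)
    r = reconstructed - reconstructed.mean(axis=0)
    num = (o * r).sum(axis=0)
    den = np.sqrt((o ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    return num / den

rng = np.random.default_rng(1)
orig = rng.standard_normal((500, 40))                # frames x mel bins
recon = 0.7 * orig + 0.3 * rng.standard_normal((500, 40))
r = per_bin_correlation(orig, recon)
print(r.shape)  # (40,): one coefficient per mel bin
```

A chance level like the one plotted in the figure can be estimated by repeating this computation after randomly shuffling the reconstructed frames in time.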
Figure 6.
Reconstruction example for visual inspection. (a) compares a time-aligned excerpt in the spectral domain for participant 5 and emphasizes the quality of the reconstructed acoustic speech characteristics. (b) shows the generated waveform representation of the same excerpt as in the spectrogram comparison. Spoken words are given below.
