Brain-to-text: decoding spoken phrases from phone representations in the brain

Christian Herff et al. Front Neurosci. 2015 Jun 12;9:217. doi: 10.3389/fnins.2015.00217. eCollection 2015.

Abstract

It has long been speculated whether communication between humans and machines based on natural speech-related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones, or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech.
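The word error rates quoted above are the standard ASR metric: the minimum number of substitutions, deletions, and insertions needed to turn the decoded word sequence into the reference transcript, divided by the reference length. As a minimal illustrative sketch (not the paper's code), this can be computed with edit-distance dynamic programming:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

A phone error rate is computed identically over phone sequences instead of word sequences.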

Keywords: ECoG; automatic speech recognition; brain-computer interface; broadband gamma; electrocorticography; pattern recognition; speech decoding; speech production.


Figures

Figure 1
Electrode positions for all seven subjects. Captions include age in years (y/o) and sex of the subjects. Electrode locations were identified in a post-operative CT and co-registered to the preoperative MRI. Electrodes for subject 3 are shown on an average Talairach brain. Combined electrode placement is shown in joint Talairach space for comparison across subjects: subject 1 (yellow), subject 2 (magenta), subject 3 (cyan), subject 5 (red), subject 6 (green), and subject 7 (blue). Subject 4 was excluded from the joint analysis as the data did not yield sufficient activations related to speech activity (see Section 2.4).
Figure 2
Synchronized recording of ECoG and acoustic data. Acoustic data are labeled using our in-house decoder BioKIT, i.e., the acoustic data samples are assigned to corresponding phones. These phone labels are then imposed on the neural data.
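The caption describes a label transfer: BioKIT aligns each stretch of the acoustic signal to a phone, and those labels are then imposed on the time-synchronized neural segments. A minimal sketch of that step, assuming a hypothetical `phone_intervals` list of `(phone, start_s, end_s)` triples from the acoustic alignment and fixed 50 ms neural segments:

```python
def label_segments(phone_intervals, seg_len_s=0.05):
    """Assign each fixed-length neural segment the phone whose acoustic
    alignment interval covers the segment's midpoint."""
    total = phone_intervals[-1][2]  # end time of the last phone
    labels = []
    t = seg_len_s / 2  # midpoint of the first segment
    i = 0
    while t < total:
        while phone_intervals[i][2] <= t:  # advance to the covering interval
            i += 1
        labels.append(phone_intervals[i][0])
        t += seg_len_s
    return labels
```

For example, `label_segments([("AH", 0.0, 0.1), ("B", 0.1, 0.2)])` yields two "AH" segments followed by two "B" segments.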
Figure 3
Overview of the Brain-to-Text system: ECoG broadband gamma activity (50 ms segments) is recorded for every electrode. Stacked broadband gamma features are calculated (signal processing). Phone likelihoods over time are calculated by evaluating all Gaussian ECoG phone models for every segment of ECoG features. Using the ECoG phone models, a dictionary, and an n-gram language model, phrases are decoded with the Viterbi algorithm. The most likely word sequence and corresponding phone sequence are calculated, and the phone likelihoods over time can be displayed. Red-marked areas in the phone likelihoods show the most likely phone path. See also the Supplementary Video.
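The decoding step in Figure 3 is standard HMM-style Viterbi search over per-segment phone likelihoods. The following is a minimal sketch in log-space, not the paper's implementation (which additionally integrates the dictionary and n-gram language model into the search graph):

```python
import numpy as np

def viterbi(log_likes, log_trans, log_init):
    """Most likely state (phone) path through a sequence of observations.

    log_likes: (T, N) per-segment log-likelihood of each of N phones.
    log_trans: (N, N) log transition probabilities between phones.
    log_init:  (N,)  log initial phone probabilities.
    """
    T, N = log_likes.shape
    delta = np.zeros((T, N))          # best path score ending in each state
    psi = np.zeros((T, N), dtype=int) # backpointers
    delta[0] = log_init + log_likes[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (from, to)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_likes[t]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

With uniform transitions this reduces to picking the per-segment maximum; the language model and dictionary constrain the transitions so that only valid phone-to-word paths survive.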
Figure 4
Mean Kullback-Leibler divergences between phone models for every electrode position of every subject, shown on the combined electrode montage of all subjects except subject 4 in common Talairach space. Heat maps on the rendered average brain show regions of high discriminability (red). All shown discriminability exceeds chance level (larger than 99% of randomized discriminabilities). The temporal course of regions with high discriminability between phone models shows early differences in diverse areas up to 200 ms before the actual phone production. Phone models show high discriminability in sensorimotor cortex 50 ms before production and yield distinct models in auditory regions of the superior temporal gyrus 100 ms after production.
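The discriminability measure in Figure 4 is the Kullback-Leibler divergence between per-phone models. Assuming diagonal-covariance Gaussian phone models (an assumption here; the exact model form is described in the paper's methods), the divergence has a closed form, sketched below:

```python
import numpy as np

def kl_gaussian(mu0, var0, mu1, var1):
    """KL(N(mu0, var0) || N(mu1, var1)) for diagonal-covariance Gaussians,
    summed over feature dimensions:
        0.5 * sum( log(var1/var0) + (var0 + (mu0 - mu1)^2)/var1 - 1 )
    """
    mu0, var0, mu1, var1 = map(np.asarray, (mu0, var0, mu1, var1))
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1
                        - 1.0)
```

Identical models give a divergence of zero; the larger the divergence between two phone models at an electrode, the better that electrode discriminates the two phones.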
Figure 5
Results: (A) Frame-wise accuracy for all sessions. All sessions of all subjects show significantly higher true-positive rates for Brain-To-Text (green bars) than for the randomized models (orange bars). (B) Confusion matrix for subject 7, session 1. The clearly visible diagonal indicates that all phones are decoded reliably. (C) Word error rates as a function of dictionary size (lines). Word error rates for Brain-To-Text (green line) are lower than those of the randomized models for all dictionary sizes. Average true-positive rates across phones as a function of dictionary size (bars) for subject 7, session 1. Phone true-positive rates remain relatively stable across all dictionary sizes and are always much higher for Brain-To-Text than for the randomized models.
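The per-phone true-positive rates in panels (A) and (C) can be read off a confusion matrix like the one in panel (B): the diagonal counts divided by the row totals. A minimal sketch (hypothetical function name; rows are true phones, columns are decoded phones):

```python
import numpy as np

def framewise_tpr(conf):
    """Per-class true-positive rate from a confusion matrix whose rows are
    true phones and whose columns are decoded phones."""
    conf = np.asarray(conf, dtype=float)
    return np.diag(conf) / conf.sum(axis=1)
```

For a two-phone matrix `[[8, 2], [1, 9]]` this gives rates of 0.8 and 0.9.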
