Review
Nat Rev Neurosci. 2024 Jul;25(7):473-492. doi: 10.1038/s41583-024-00819-9. Epub 2024 May 14.

The speech neuroprosthesis


Alexander B Silva et al. Nat Rev Neurosci. 2024 Jul.

Abstract

Loss of speech after paralysis is devastating, but circumventing motor-pathway injury by directly decoding speech from intact cortical activity has the potential to restore natural communication and self-expression. Recent discoveries have defined how key features of speech production are facilitated by the coordinated activity of vocal-tract articulatory and motor-planning cortical representations. In this Review, we highlight such progress and how it has led to successful speech decoding, first in individuals implanted with intracranial electrodes for clinical epilepsy monitoring and subsequently in individuals with paralysis as part of early feasibility clinical trials to restore speech. We discuss high-spatiotemporal-resolution neural interfaces and the adaptation of state-of-the-art speech computational algorithms that have driven rapid and substantial progress in decoding neural activity into text, audible speech, and facial movements. Although restoring natural speech is a long-term goal, speech neuroprostheses already have performance levels that surpass communication rates offered by current assistive-communication technology. Given this accelerated rate of progress in the field, we propose key evaluation metrics for speed and accuracy, among others, to help standardize across studies. We finish by highlighting several directions to more fully explore the multidimensional feature space of speech and language, which will continue to accelerate progress towards a clinically viable speech neuroprosthesis.


Conflict of interest statement

Competing interests

D.A.M., J.R.L. and E.F.C. are inventors on a pending provisional UCSF patent application that is relevant to the neural-decoding approaches surveyed in this work. E.F.C. is an inventor on patent application PCT/US2020/028926, D.A.M. and E.F.C. are inventors on patent application PCT/US2020/043706 and E.F.C. is an inventor on patent US9905239B2, which are broadly relevant to the neural-decoding approaches surveyed in this work. E.F.C. is a co-founder of Echo Neurotechnologies, LLC. All other authors declare no competing interests.

Figures

Fig. 1 | Key milestones in speech decoding.
Timeline of key advancements that have ultimately led to proof-of-concept speech neuroprostheses for individuals with paralysis. Advancements are labelled based on their study population, neural-recording technology and research goal.
Fig. 2 | Articulatory control of speech.
Speech articulation relies on the corticobulbar system. At a broad level, this system is composed of cortical neuronal populations that project axons to the brainstem, where activations are relayed through cranial nerves to the speech articulators (muscles). a, Neural populations, arranged somatotopically on the ventral sensorimotor cortex (vSMC) and middle precentral gyrus (midPrCG), control the movements of key vocal-tract articulators, such as the larynx, tongue, jaw and lips. These neural populations may receive input from other regions involved in speech, a few of which are highlighted (the superior temporal gyrus (STG) and the supramarginal gyrus (SMG)). The vSMC and midPrCG send motor-control signals through axons that bundle to form the corticobulbar white-matter tract and terminate in cranial nerve nuclei in the brainstem. Neurons in the cranial nerve nuclei in turn send axons that bundle to form cranial nerves and innervate the speech articulators (larynx, tongue, jaw and lips). b, Cortical-activity patterns, ultimately transmitted by the cranial nerves, lead to contraction of the vocal-tract articulators and defined vocal-tract configurations that can broadly be grouped, based on the place of air constriction, into four classes: vocalic, back tongue, front tongue and labial. c, Continuous movements of the vocal-tract articulators between these configurations, along with air from respiratory structures, turn neural activity related to intended speech into vocalized sound waves. Continuous articulatory features can be measured for different landmarks in the vocal tract over time. Here, the visualized articulatory features are inferred from the produced acoustic waveform. d, The produced speech can be represented as an acoustic waveform (amplitude over time). The envelope can be estimated from the acoustic waveform and represents the intensity of speech over time, an important measurement that correlates with speech rate, stress patterns and loudness. e, Speech can also be represented as a mel-spectrogram, in which the power of different perceptually salient frequency bands is shown over time. Pitch can be computed by algorithms that estimate the fundamental frequency of the signal. f, Defined patterns in the produced sound, visible on a mel-spectrogram, form the basis of meaning. Phonemes are a type of linguistic feature and refer to the smallest perceptually distinct units of sound (epochs denoted by dotted lines in panels d and e) that form a language. Phonemes, along with words, can be annotated and inferred based on the produced sound during continuous speech. In addition, the vocal-tract configuration that gives rise to a distinct unit of sound can be used to group phonemes. Visualization of articulatory control of speech in panels c–f from produced sound was created using algorithms from refs. –.
Fig. 3 | Decoding speech from neural activity.
Speech-decoding systems follow a similar heuristic: neural activity during intended speech is captured with an interface of choice, and relevant features are extracted and processed by a decoding model. This decoding model can be trained to transform neural activity into text, audible speech or orofacial movements. a, Recording of neural activity can be achieved using different neural interfaces, such as electrocorticography (ECoG), microelectrode arrays (MEAs) and stereoelectroencephalography (SEEG). Recorded neural activity is processed into neural features, which are then passed to a speech-feature decoder that may be trained to output linguistic, acoustic or articulatory features as intermediate speech representations. b, For text decoding, models can be trained to decode neural features into sequences of linguistic features, such as phonemes (colours indicate their vocal-tract configuration); a defined vocabulary (via a lexicon constraint) and natural-language modelling can then be used to transform phoneme sequences into text sequences of plausible words and sentences. c, For speech synthesis, models can be trained to decode neural features into sequences of acoustic features, such as the mel-spectrogram, which can then be vocoded into an audible speech waveform, often using pretrained models from the field of speech processing. Importantly, the vocoder can be personalized to capture the previous intact voice of the individual. d, Models may also be trained to decode neural features into sequences of articulatory features, such as the relative displacement of different locations in the vocal tract over time (gesture activations). A gesture-animation system may be applied to the gesture-activation sequences to animate a digital avatar. Similar to speech synthesis, the avatar may be personalized to reflect the likeness of the user, using digital face-capture software. Optional conversion between text, speech and facial-avatar animation outputs is feasible using pretrained speech-processing models (dashed arrows). Visualization of acoustic and articulatory features in panels b–d was created using data and algorithms from ref. . The personalized avatar mesh animations in panel d were generated using Unreal Engine animation software (Speech Graphics, Edinburgh, UK). Panel d adapted from ref. , Springer Nature Limited.
Fig. 4 | Evaluating and standardizing speech neuroprostheses.
As the field of speech neuroprosthetics continues to accelerate, it is important to define standardized methods of reporting and evaluating speech-decoding results. Here, we propose several metrics to become standardized, specifically relating to speech instructions, decoded outputs, the utterance sets used to train and evaluate models and, finally, user training times. a, Common types of speech instructions given to individuals in speech-decoding studies include imagined, silently attempted and attempted speech (right). These intended speech types are distinct from the internal thoughts or monologue of an individual (left). b, For text decoding, metrics related to the accuracy, speed and the lexicon and language model of the system should be reported and standardized. To measure accuracy, the word error rate (WER) and phoneme error rate (PER) should be used (alternatively, the character error rate can be reported if the model is trained to decode sequences of characters); these are defined as the edit distance between ground-truth and decoded word and phoneme sequences. To measure speed, the words per minute (WPM) can be reported, computed over the time between the onset of the speech attempt and the final processed sample of neural activity. Finally, the vocabulary size used to define the language model and lexicon should be reported. c, For speech synthesis, metrics related to the accuracy and latency of the system should be reported and standardized. To measure accuracy, the mel-cepstral distortion (MCD) should be computed between ground-truth and synthesized speech, capturing distortion in perceptually salient frequency bands. For a more interpretable measure of performance, the human-transcribed WER should be reported by having volunteers transcribe synthesized speech into text, which can then be compared with the ground truth. Finally, the latency of the speech-synthesis system should be reported and can be defined as the time between the onset of a speech attempt and the first sample of synthesized audio that is played back to the participant. Visualizations of acoustic waveforms and mel-spectrograms were created using data and algorithms from ref. . d, Utterance-set metrics, common to text-decoding and speech-synthesis systems, should also be reported. First, qualities of the training-utterance set, such as the number of unique words (vocabulary size) and the number of sentences, should be reported. Next, the lexicon and language-model vocabulary size (as in panel b) should be reported. Finally, key characteristics of the evaluation-utterance set should also be reported, including its vocabulary size, the number of sentences and the length of sentences. Importantly, the overlap between the training and evaluation sets should also be quantified and reported as the number of overlapping words and sentences between the two sets. e, Training-time metrics, also common to text-decoding and speech-synthesis systems, should be reported. The total amount of training data, quantified as the number of hours needed to reach usable and/or goal performance, should be reported. The number of days the system can maintain usable performance without supervised re-training (either by not re-training or by using self-supervised recalibration techniques) should also be reported. Panel e adapted from ref. , Springer Nature Limited.

References

    1. Felgoise SH, Zaccheo V, Duff J & Simmons Z. Verbal communication impacts quality of life in patients with amyotrophic lateral sclerosis. Amyotroph. Lateral Scler. Front. Degener. 17, 179–183 (2016).
    2. Das JM, Anosike K & Asuncion RMD. Locked-in syndrome. StatPearls https://www.ncbi.nlm.nih.gov/books/NBK559026/ (StatPearls, 2021).
    3. Lulé D et al. Life can be worth living in locked-in syndrome. Prog. Brain Res. 177, 339–351 (2009).
    4. Pels EGM, Aarnoutse EJ, Ramsey NF & Vansteensel MJ. Estimated prevalence of the target population for brain–computer interface neurotechnology in the Netherlands. Neurorehabil. Neural Repair 31, 677–685 (2017).
    5. Koch Fager S, Fried-Oken M, Jakobs T & Beukelman DR. New and emerging access technologies for adults with complex communication needs and severe motor impairments: state of the science. Augment. Altern. Commun. 35, 13–25 (2019).