Generalizable spelling using a speech neuroprosthesis in an individual with severe limb and vocal paralysis

Sean L Metzger et al. Nat Commun. 2022 Nov 8;13(1):6510. doi: 10.1038/s41467-022-33611-3.

Abstract

Neuroprostheses have the potential to restore communication to people who cannot speak or type due to paralysis. However, it is unclear if silent attempts to speak can be used to control a communication neuroprosthesis. Here, we translated direct cortical signals in a clinical-trial participant (ClinicalTrials.gov; NCT03698149) with severe limb and vocal-tract paralysis into single letters to spell out full sentences in real time. We used deep-learning and language-modeling techniques to decode letter sequences as the participant attempted to silently spell using code words that represented the 26 English letters (e.g. "alpha" for "a"). We leveraged broad electrode coverage beyond speech-motor cortex to include supplemental control signals from hand cortex and complementary information from low- and high-frequency signal components to improve decoding accuracy. We decoded sentences using words from a 1,152-word vocabulary at a median character error rate of 6.13% and speed of 29.4 characters per minute. In offline simulations, we showed that our approach generalized to large vocabularies containing over 9,000 words (median character error rate of 8.23%). These results illustrate the clinical viability of a silently controlled speech neuroprosthesis to generate sentences from a large vocabulary through a spelling-based approach, complementing previous demonstrations of direct full-word decoding.
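For context on the reported metrics: the character error rate (CER) is conventionally the Levenshtein edit distance between the decoded and target sentences divided by the target length, and the word error rate (WER) is the same quantity computed over word tokens. The following minimal Python sketch implements this standard definition; the function names and example sentences are illustrative and not taken from the paper’s evaluation code.

    # Standard edit-distance-based error rates (a sketch; names are hypothetical).
    def edit_distance(ref, hyp):
        """Levenshtein distance between two sequences (characters or words)."""
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, start=1):
            prev_diag, d[0] = d[0], i
            for j, h in enumerate(hyp, start=1):
                prev_diag, d[j] = d[j], min(
                    d[j] + 1,               # deletion
                    d[j - 1] + 1,           # insertion
                    prev_diag + (r != h),   # substitution (0 if symbols match)
                )
        return d[len(hyp)]

    def error_rate(reference, hypothesis, level="char"):
        ref = list(reference) if level == "char" else reference.split()
        hyp = list(hypothesis) if level == "char" else hypothesis.split()
        return edit_distance(ref, hyp) / len(ref)

    print(error_rate("i am thirsty", "i am thirty"))                # CER ≈ 0.083
    print(error_rate("i am thirsty", "i am thirty", level="word"))  # WER ≈ 0.333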


Conflict of interest statement

S.L.M., J.R.L., D.A.M., and E.F.C. are inventors on a pending provisional patent application that is directly relevant to the neural-decoding approach used in this work. G.K.A. and E.F.C. are inventors on patent application PCT/US2020/028926; D.A.M. and E.F.C. are inventors on patent application PCT/US2020/043706; and E.F.C. is an inventor on patent US9905239B2, all of which are broadly relevant to the neural-decoding approach in this work. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Schematic depiction of the spelling pipeline.
a At the start of a sentence-spelling trial, the participant attempts to silently say a word to volitionally activate the speller. b Neural features (high-gamma activity and low-frequency signals) are extracted in real time from the recorded cortical data throughout the task. The features from a single electrode (electrode 0, Fig. 5a) are depicted. For visualization, the traces were smoothed with a Gaussian kernel with a standard deviation of 150 milliseconds. The microphone signal shows that there is no vocal output during the task. c The speech-detection model, consisting of a recurrent neural network (RNN) and thresholding operations, processes the neural features to detect a silent-speech attempt. Once an attempt is detected, the spelling procedure begins. d During the spelling procedure, the participant spells out the intended message through letter-decoding cycles that occur every 2.5 s. In each cycle, the participant is visually presented with a countdown followed by a go cue. At the go cue, the participant attempts to silently say the code word representing the desired letter. e High-gamma activity and low-frequency signals are computed throughout the spelling procedure for all electrode channels and parceled into 2.5-s non-overlapping time windows. f An RNN-based letter-classification model processes each of these neural time windows to predict the probability that the participant was attempting to silently say each of the 26 possible code words or attempting to perform a hand-motor command (g). Prediction of the hand-motor command with at least 80% probability ends the spelling procedure (i). Otherwise, the predicted letter probabilities are processed by a beam-search algorithm in real time and the most likely sentence is displayed to the participant. g After the participant spells out his intended message, he attempts to squeeze his right hand to end the spelling procedure and finalize the sentence. h The neural time window associated with the hand-motor command is passed to the classification model. i If the classifier confirms that the participant attempted the hand-motor command, a neural network-based language model (DistilGPT-2) rescores valid sentences. The most likely sentence after rescoring is used as the final prediction.
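To make the control flow of panels d–i concrete, here is a heavily simplified, hypothetical Python sketch of the letter-decoding loop: a stand-in classifier emits probabilities over the 26 code words plus a hand-motor command each 2.5-s cycle, a hand-command probability of at least 80% finalizes the sentence, and a toy beam search keeps only letter sequences that remain valid word prefixes. The RNN models, the 1,152-word vocabulary, word-boundary handling, and the DistilGPT-2 rescoring step are all replaced by placeholders.

    import numpy as np

    LETTERS = "abcdefghijklmnopqrstuvwxyz"
    HAND = 26                      # index of the hand-motor command class
    VOCAB = {"hi", "hit", "ham"}   # toy stand-in for the real vocabulary
    PREFIXES = {w[:i] for w in VOCAB for i in range(1, len(w) + 1)}

    def beam_step(beams, letter_probs, width=4):
        """Extend each beam by one letter, keeping only valid word prefixes."""
        scored = []
        for text, logp in beams:
            for k, p in enumerate(letter_probs):
                cand = text + LETTERS[k]
                if cand in PREFIXES:                  # vocabulary constraint
                    scored.append((cand, logp + np.log(p + 1e-12)))
        scored.sort(key=lambda b: -b[1])
        return scored[:width] or beams                # keep old beams if none valid

    rng = np.random.default_rng(0)
    beams = [("", 0.0)]
    for cycle in range(10):                           # one iteration per 2.5-s cycle
        probs = rng.dirichlet(np.ones(27))            # stand-in for RNN probabilities
        if probs[HAND] >= 0.80:                       # attempted hand squeeze detected
            break                                     # finalize; LM rescoring would follow
        beams = beam_step(beams, probs[:26])
    print("most likely spelling so far:", beams[0][0])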
Fig. 2
Fig. 2. Performance summary of the spelling system during the copy-typing task.
a Character error rates (CERs) observed during real-time sentence spelling with a language model (LM), denoted as ‘+LM (Real-time results)’, and offline simulations in which portions of the system were omitted. In the ‘Chance’ condition, sentences were created by replacing the outputs from the neural classifier with randomly generated letter probabilities without altering the remainder of the pipeline. In the ‘Only neural decoding’ condition, sentences were created by concatenating together the most likely character from each of the classifier’s predictions during a sentence trial (no whitespace characters were included). In the ‘+Vocab. constraints’ condition, the predicted letter probabilities from the neural classifier were used with a beam search that constrained the predicted character sequences to form words within the 1,152-word vocabulary. The final condition, ‘+LM (Real-time results)’, incorporates language modeling. The sentences decoded with the full system in real time exhibited lower CERs than sentences decoded in the other conditions (***P < 0.0001, P-values provided in Table S2, two-sided Wilcoxon Rank-Sum test with 6-way Holm-Bonferroni correction). b Word error rates (WERs) for real-time results and the corresponding offline omission simulations from a (***P < 0.0001, P-values provided in Table S3, two-sided Wilcoxon Rank-Sum test with 6-way Holm-Bonferroni correction). c The decoded characters per minute during real-time testing. d The decoded words per minute during real-time testing. In a–d, the distribution depicted in each boxplot was computed across n = 34 real-time blocks (in each block, the participant attempted to spell between 2 and 5 sentences), and each boxplot depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points more than 1.5 times the interquartile range from the box edges, which are individually plotted). e Number of excess characters in each decoded sentence. f Example sentence-spelling trials with decoded sentences from each non-chance condition. Incorrect letters are colored red. Superscripts 1 and 2 denote the correct target sentence for the two decoded sentences with errors. All other example sentences did not contain any errors. Data to recreate panels a–e are provided as a Source Data file.
Fig. 3
Fig. 3. Characterization of high-gamma activity (HGA) and low-frequency signals (LFS) during silent-speech attempts.
a 10-fold cross-validated classification accuracy on silently attempted NATO code words when using HGA alone, LFS alone, and both HGA+LFS simultaneously. Classification accuracy using only LFS is significantly higher than using only HGA, and using both HGA+LFS results in significantly higher accuracy than either feature type alone (**P = 4.71 × 10⁻⁴, z = 3.78 for each comparison, two-sided Wilcoxon Rank-Sum test with 3-way Holm-Bonferroni correction). Chance accuracy is 3.7%. Each boxplot corresponds to n = 10 cross-validation folds (which are also plotted as dots) and depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points more than 1.5 times the interquartile range from the box edges). b–e Electrode contributions. Electrodes that appear larger and more opaque provide more important features to the classification model. b, c show contributions associated with HGA features using a model trained on HGA alone (b) vs. the combined LFS+HGA feature set (c). d, e depict contributions associated with LFS features using a model trained on LFS alone (d) vs. the combined LFS+HGA feature set (e). f Histogram of the minimum number of principal components (PCs) required to explain more than 80% of the total variance (σ²) in the spatial dimension for each feature set over 100 bootstrap iterations. The number of PCs required was significantly different for each feature set (***P < 0.0001, P-values provided in Table S5, two-sided Wilcoxon Rank-Sum test with 3-way Holm-Bonferroni correction). g Histogram of the minimum number of PCs required to explain more than 80% of the variance in the temporal dimension for each feature set over 100 bootstrap iterations (***P < 0.0001, P-values provided in Table S6; *P < 0.01; two-sided Wilcoxon Rank-Sum tests with 3-way Holm-Bonferroni correction). h Effect of temporal smoothing on classification accuracy. Each point represents the median, and error bars represent the 99% confidence interval around bootstrapped estimates of the median. Data to recreate all panels are provided as a Source Data file.
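The dimensionality analysis in panels f and g amounts to counting principal components until the cumulative explained variance passes 80%. A brief Python sketch, with assumed data shapes standing in for the actual recordings:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    # rows = observations (e.g., time samples), columns = electrodes (spatial dim)
    features = rng.standard_normal((500, 128))

    pca = PCA().fit(features)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    n_pcs = int(np.searchsorted(cumvar, 0.80, side="right") + 1)  # first count with >80%
    print(f"{n_pcs} PCs explain more than 80% of the variance")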
Fig. 4
Fig. 4. Comparison of neural signals during attempts to silently say English letters and NATO code words.
a Classification accuracy (across n = 10 cross-validation folds) using models trained with HGA+LFS features is significantly higher for NATO code words than for English letters (**P = 1.57 × 10⁻⁴, z = 3.78, two-sided Wilcoxon Rank-Sum test). The dotted horizontal line represents chance accuracy. b Nearest-class distance is significantly larger for NATO code words than for letters (boxplots show values across the n = 26 code words or letters; *P = 2.85 × 10⁻³, z = 2.98, two-sided Wilcoxon Rank-Sum test). In a and b, each data point is plotted as a dot, and each boxplot depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points more than 1.5 times the interquartile range from the box edges). c The nearest-class distance is greater for the majority of code words than for the corresponding letters. In b and c, nearest-class distances are computed as the Frobenius norm between trial-averaged HGA+LFS features. Data to recreate all panels are provided as a Source Data file.
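As a rough illustration of the nearest-class distance in panels b and c, the sketch below computes, for each class, the Frobenius-norm distance from its trial-averaged feature matrix to the closest other class. The shapes and random data are placeholders, not the study’s features.

    import numpy as np

    rng = np.random.default_rng(2)
    n_classes, n_time, n_elec = 26, 40, 128
    # trial-averaged HGA+LFS features per code word (or letter): time x electrodes
    class_means = rng.standard_normal((n_classes, n_time, n_elec))

    def nearest_class_distance(means):
        """Frobenius-norm distance from each class to its closest other class."""
        out = []
        for i in range(len(means)):
            dists = [np.linalg.norm(means[i] - means[j])   # Frobenius norm on 2-D arrays
                     for j in range(len(means)) if j != i]
            out.append(min(dists))
        return np.array(out)

    print(nearest_class_distance(class_means).round(2))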
Fig. 5
Fig. 5. Differences in neural signals and classification performance between overt- and silent-speech attempts.
a MRI reconstruction of the participant’s brain overlaid with implanted electrode locations. The locations of the electrodes used in b and c are bolded and numbered in the overlay. b Evoked high-gamma activity (HGA) during silent (orange) and overt (green) attempts to say the NATO code word kilo. c Evoked HGA during silent (orange) and overt (green) attempts to say the NATO code word tango. Evoked responses in b and c are aligned to the go cue, which is marked as a vertical dashed line at time 0. Each curve depicts the mean ± standard error across n = 100 speech attempts. d Code-word classification accuracy for silent- and overt-speech attempts with various model-training schemes. All comparisons revealed significant differences between the result pairs (P < 0.01, two-sided Wilcoxon Rank-Sum test with 28-way Holm-Bonferroni correction) except for those marked as ‘ns’. Each boxplot corresponds to n = 10 cross-validation folds (which are also plotted as dots) and depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points more than 1.5 times the interquartile range from the box edges). Chance accuracy is 3.84%. Data to recreate all panels are provided as a Source Data file.
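The evoked traces in panels b and c are, in essence, trial-aligned averages with standard errors, akin to the following hypothetical sketch, which also applies the 150-millisecond Gaussian smoothing that Fig. 1 uses for visualization. The sampling rate and data here are assumptions, not values from the study.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    rng = np.random.default_rng(3)
    fs = 200                                      # assumed feature sampling rate (Hz)
    trials = rng.standard_normal((100, 3 * fs))   # n = 100 attempts, 3-s windows
    t = np.arange(trials.shape[1]) / fs - 1.0     # align so the go cue falls at t = 0 s

    mean = trials.mean(axis=0)                               # evoked response
    sem = trials.std(axis=0, ddof=1) / np.sqrt(len(trials))  # standard error of the mean
    smoothed = gaussian_filter1d(mean, sigma=0.150 * fs)     # 150-ms Gaussian kernel
    print(f"peak evoked value: {smoothed.max():.3f} ± {sem[smoothed.argmax()]:.3f}")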
Fig. 6
Fig. 6. The spelling approach can generalize to larger vocabularies and conversational settings.
a Simulated character error rates from the copy-typing task with different vocabularies, including the original vocabulary used during real-time decoding. b Word error rates from the corresponding simulations in a. In a and b, each boxplot corresponds to n = 34 blocks (in each of these blocks, the participant attempted to spell between two and five sentences). c Character and word error rates across the volitionally chosen responses and messages decoded in real time during the conversational task condition. Each boxplot corresponds to n = 9 blocks (in each of these blocks, the participant attempted to spell between two and four conversational responses; each dot corresponds to a single block). In a–c, each boxplot depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points more than 1.5 times the interquartile range from the box edges, which are individually plotted). d Examples of presented questions from trials of the conversational task condition (left) along with corresponding responses decoded from the participant’s brain activity (right). In the final example, the participant spelled out his intended message without being prompted with a question. Data to recreate panels a–c are provided as a Source Data file.
