Inner speech in motor cortex and implications for speech neuroprostheses

Erin M Kunz et al. Cell. 2025 Jul 9:S0092-8674(25)00681-6. doi: 10.1016/j.cell.2025.06.015. Online ahead of print.

Abstract

Speech brain-computer interfaces (BCIs) show promise in restoring communication to people with paralysis but have also prompted discussions regarding their potential to decode private inner speech. Separately, inner speech may be a way to bypass the current approach of requiring speech BCI users to physically attempt speech, which is fatiguing and can slow communication. Using multi-unit recordings from four participants, we found that inner speech is robustly represented in the motor cortex and that imagined sentences can be decoded in real time. The representation of inner speech was highly correlated with attempted speech, though we also identified a neural "motor-intent" dimension that differentiates the two. We investigated the possibility of decoding private inner speech and found that some aspects of free-form inner speech could be decoded during sequence recall and counting tasks. Finally, we demonstrate high-fidelity strategies that prevent speech BCIs from unintentionally decoding private inner speech.

Keywords: brain-computer interface; covert speech; inner speech; motor cortex; speech neuroprosthesis.

Conflict of interest statement

Declaration of interests: The MGH Translational Research Center has a clinical research support agreement (CRSA) with Axoft, Neuralink, Neurobionics, Paradromics, Precision Neuro, Synchron, and Reach Neuro, for which L.R.H. provides consultative input. L.R.H. is a non-compensated member of the Board of Directors of a nonprofit assistive communication device technology foundation (Speak Your Mind Foundation). Mass General Brigham (MGB) is convening the Implantable Brain-Computer Interface Collaborative Community (iBCI-CC). Charitable gift agreements to MGB, including those received to date from Paradromics, Synchron, Precision Neuro, Neuralink, and Blackrock Neurotech, support the iBCI-CC, for which L.R.H. provides effort. S.D.S. is an inventor on intellectual property licensed by Stanford University to Blackrock Neurotech and Neuralink Corp. He is an advisor to Sonera. He also has equity in Wispr.ai. C.P. is an employee at Meta (Reality Labs). D.M.B. is a surgical consultant for Paradromics Inc. D.M.B. and D.B.R. are principal investigators for the Connexus BCI clinical trial for a Paradromics Inc. clinical product. S.D.S. and D.M.B. are inventors of intellectual property related to speech neuroprostheses owned by the University of California, Davis that has been licensed to a neurotechnology startup. J.M.H. is a consultant for Paradromics, serves on the Medical Advisory Board of Enspire DBS, and is a shareholder in Maplight Therapeutics. He is also the co-founder of Re-EmergeDBS. He is also an inventor on intellectual property licensed by Stanford University to Blackrock Neurotech and Neuralink Corp. F.R.W. is an inventor on intellectual property licensed by Stanford University to Blackrock Neurotech and Neuralink Corp.

Figures

Figure 1: Inner speech, perceived speech, and silent reading are represented in ventral and mid precentral gyrus
A) To assess tuning to different verbal behaviors, neural activity was recorded during attempted speech, inner speech, reading, and listening for a set of 7 words (Tables S1 and S2). B) Example trial structure and visual cues shown on the screen for active (attempted or inner speech) and passive (silent reading or listening) behavior conditions. No text was displayed during listening blocks. C) Neural activity was recorded from microelectrode arrays chronically implanted along the precentral gyrus in four participants. A white X indicates that decoding accuracy was not above chance for any behavior (the 95% confidence interval for accuracy intersected chance) and that the array was excluded from further analysis. D) The mean firing rate for each cued word for each behavior is shown for an example electrode channel from each participant (estimated from threshold crossings). Shaded regions indicate 95% CIs. E) Ten-fold cross-validated decoding accuracy is displayed by array and behavior (Gaussian Naive Bayes, 500 ms window); red Xs denote that the 95% CI for accuracy intersected chance level (14.3%). Participant arrays that lacked significance for all behaviors were excluded from further analysis and marked with a white X in C. Notably, while T16’s PEF and 6d arrays recorded spiking activity on many electrodes that were not tuned to speech, T17’s 55b arrays recorded very little spiking activity in general. F) Example confusion matrices for T16’s listening trials (92.1% accuracy, 95% CI [86.4%, 96.0%]) and T12’s motoric inner speech trials (72.6% accuracy, 95% CI [65.7%, 78.8%]). See also Figure S1.
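
The decoding analysis summarized in panel E (10-fold cross-validated Gaussian Naive Bayes on a 500 ms window, chance 1/7 ≈ 14.3%) follows a standard pattern. The sketch below illustrates that pattern on synthetic data; the trial counts, channel counts, and feature construction are assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch of the Figure 1E analysis: 10-fold cross-validated
# Gaussian Naive Bayes classification of 7 cued words from firing-rate
# features in a 500 ms window. Data here are synthetic.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
n_trials, n_channels = 140, 64                    # assumed sizes, illustration only
words = np.repeat(np.arange(7), n_trials // 7)    # 7-word label set
# Synthetic firing-rate features (trials x channels), weakly word-tuned.
X = rng.normal(size=(n_trials, n_channels)) + 0.5 * words[:, None]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc = cross_val_score(GaussianNB(), X, words, cv=cv)
print(f"mean accuracy: {acc.mean():.1%} (chance = {1/7:.1%})")
```
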
Figure 2: Inner speech and perceived speech as scaled-down versions of attempted speech in motor cortex
A) Each (i, j) entry in the matrix is the Pearson correlation between the average N × 1 neural feature vectors for word-behavior i and j, where N is the number of neural features from an array. The off-diagonal banding shows that the same word across behaviors is correlated. Similar across-word correlation patterns also suggest that neural geometry is shared among behaviors; for example, “though” and “were” consistently correlate positively, while “though” and “ban” correlate negatively. A cross-validated metric was used to reduce bias. B) Each (i, j) entry represents the correlation of neural representations across all 7 words for behaviors i and j. Since a cross-validated estimator of correlation was used, values can be greater than 1 (see Methods). C) Projections of average word representations into the subspace defined by the top three principal components for attempted vocalized speech visually demonstrate the shared structure and relative sizes of word representations across attempted and inner speech behaviors. The top three PCs captured 75–82% of the variance. D) Average neural distances between words within each behavior, normalized to the largest (attempted vocalized speech), represent the modulation magnitude relative to fully attempted speech (e.g., T12-i6v motoric inner speech is about 52% of attempted speech). The pink box highlights the inner speech behavior shown in C.
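
Panels A and D rest on two simple computations: correlating condition-averaged neural feature vectors across word-behavior pairs, and comparing mean within-behavior word distances normalized to attempted vocalized speech. The sketch below shows plain (not cross-validated) versions of both on synthetic data; the paper's bias-reduced cross-validated estimators are not reproduced here.

```python
# Sketch of Figure 2A/2D: correlate condition-averaged neural feature
# vectors across word-behavior pairs, and compare within-behavior word
# distances normalized to attempted vocalized speech. Synthetic data;
# the paper's cross-validated (bias-reduced) estimators are not shown.
import numpy as np

rng = np.random.default_rng(1)
n_features, n_words = 256, 7
shared = rng.normal(size=(n_words, n_features))            # shared word geometry
means = {
    "attempted": 1.0 * shared + 0.1 * rng.normal(size=shared.shape),
    "inner":     0.5 * shared + 0.1 * rng.normal(size=shared.shape),  # scaled down
}

# Panel A analogue: correlation between every (word, behavior) pair.
stack = np.vstack([means["attempted"], means["inner"]])    # 14 x n_features
corr = np.corrcoef(stack)
print("attempted vs inner, same word:", np.diag(corr[:n_words, n_words:]).round(2))

# Panel D analogue: mean pairwise word distance, normalized to attempted speech.
def mean_pairwise_dist(M):
    d = [np.linalg.norm(M[i] - M[j]) for i in range(len(M)) for j in range(i + 1, len(M))]
    return np.mean(d)

ref = mean_pairwise_dist(means["attempted"])
for name, M in means.items():
    print(f"{name}: {mean_pairwise_dist(M) / ref:.0%} of attempted speech")
```
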
Figure 3: Real-time decoding of self-paced inner speech
A) Neural features were fed into a recurrent neural network (RNN) that outputs probabilities for 39 phonemes and a silence token every 80 ms. These probabilities were decoded via a language model to yield the most likely word sequence, which was then displayed and converted to audio by a text-to-speech algorithm. B) T16 using an inner speech BCI to decode from a large 125,000-word vocabulary in real time (Video S1). A text cue appears above the green square and decoded text lies below. C) Example decoded sentences from T16’s inner speech from an evaluation block with an overall word error rate of 52% (95% CI: [42.1%, 61.8%]) for a 125,000-word vocabulary. D) Word error rates during online inner speech decoding for three participants for either a 50-word (blue) or a 125,000-word vocabulary. Chance values are indicated by dashed lines and denote the lower bound (2.5th percentile) of a chance word error rate distribution calculated by shuffling decoded outputs 100 times with respect to ground truth sentences. Error bars indicate 95% confidence intervals determined via bootstrap resampling (10,000 resamples). E) Offline performance of a decoder trained on attempted speech and evaluated on inner speech trials (green bars), compared to baseline decoding performance for attempted speech (dark blue) and inner speech (light blue). Dashed lines show chance levels, and error bars indicate 95% confidence intervals, computed as in D. Note: T12’s outlier cross-decoding error rate is high because many more words were predicted than were cued, including many duplicated words.
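
The word error rate and shuffle-based chance floor described in panel D can be illustrated with a short sketch: word-level edit distance divided by reference length, and a null distribution built by shuffling decoded sentences against the cued sentences. The sentences below are placeholders, not the study's stimuli, and the tiny sample is only for illustration.

```python
# Sketch of the Figure 3D evaluation: word error rate (word-level edit
# distance / reference length) and a chance floor obtained by shuffling
# decoded sentences with respect to the cued sentences. Placeholder data.
import numpy as np

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[-1, -1] / max(len(r), 1)

cued    = ["the dog ran home", "she opened the door", "we ate lunch outside"]
decoded = ["the dog ran home", "she opened a door",   "we ate lunch outside"]

observed = np.mean([wer(c, d) for c, d in zip(cued, decoded)])

rng = np.random.default_rng(2)
chance = []
for _ in range(100):                      # shuffle decoded outputs vs. cues
    perm = rng.permutation(len(decoded))
    chance.append(np.mean([wer(c, decoded[p]) for c, p in zip(cued, perm)]))
print(f"WER {observed:.1%}; chance floor (2.5th pct) {np.percentile(chance, 2.5):.1%}")
```
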
Figure 4: Uninstructed inner speech elicited by a serial recall task can be decoded from i6v
A) T12 performed three upper extremity motor tasks with varied cues and memory demands to elicit verbal short-term memory without explicit mental strategy instructions. The 3-element arrows task was designed to prompt verbal memory (eliciting inner speech) for serial recall, while the single-element arrow and 3-element lines tasks served as controls (designed not to elicit inner speech). B) Sequence position decodability was measured by training binary Linear Discriminant Analysis models to classify sequence pairs that differed in one position (e.g., first position: ↑ → ↑ vs. ↓ → ↑) using i6v neural activity from a 2-second delay period (pre-go) window. Box plots show cross-validated accuracy (the dotted line indicates chance). Only the 3-element arrows task produced significant decoding in all three positions (bootstrap-derived CIs vs. chance level of 0.5). C) Two versions of the serial recall task with explicit instructions to use either a verbal or visual short-term memory strategy (mental strategy instructions refer to how to memorize the sequence, while instructed behavior refers to the motor output during recall). D) Same as B but for tasks that only differed in instructed mental strategy. Decoding accuracy significantly increased in all sequence positions when T12 engaged in a verbal memory strategy. Significance was assessed via bootstrap-derived confidence intervals of the increase in decoding accuracy due to the verbal memory instruction, compared to a chance level of zero (i.e., no difference between verbal and visual memory). See also Figures S2 and S3.
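
The position decoding in panel B reduces to a binary classification problem: train a Linear Discriminant Analysis model to tell apart two cued sequences that differ at a single position, using delay-period neural features, and report cross-validated accuracy against a 50% chance level. A minimal sketch on synthetic data follows; the trial and feature counts are assumptions.

```python
# Sketch of the Figure 4B analysis: binary LDA classification of which of
# two sequences (differing in a single position) was cued, using delay-
# period neural features, with cross-validation. Data are synthetic.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(3)
n_trials, n_features = 60, 40             # assumed sizes, for illustration
y = np.repeat([0, 1], n_trials // 2)      # e.g. sequences differing in position 1
X = rng.normal(size=(n_trials, n_features)) + 0.3 * y[:, None]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv)
print(f"cross-validated accuracy: {acc.mean():.1%} (chance = 50%)")
```
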
Figure 5: Neural activity recorded during a counting task can be decoded into a sequence of increasing numbers
A) Neural activity was recorded while participants performed a conjunctive counting task. Participants were instructed to silently count a specified shape and color, and then speak the number aloud during a separate "go" epoch. No specific mental strategy was instructed. B) The neural activity during the "counting" epoch was passed through an inner speech RNN decoder, which was trained on a 125,000-word vocabulary from the same session. Instead of using a standard language model, a unigram language model trained only on number words (one to twenty) was used to generate word-level outputs. This model could only produce number words between one and twenty, and word output was independent of surrounding context, unlike larger models that use language statistics to predict words based on prior and subsequent words. C, D) For T15 and T16, decoded numbers showed a significant positive correlation with their position in the sequence (T15: slope = 0.48, p = 1.69×10⁻⁹; T16: slope = 0.33, p = 4.84×10⁻⁹), indicating sequential increases. Jitter was added along the x-axis for visualization. E, F) Same as C and D, except neural activity was recorded during instructed inner speaking of sentences from the Switchboard corpus (collected as training data for the RNN). The lack of significant slopes provides evidence that the increasing sequences of numbers in C and D are not likely to be decoded by chance. G) Histogram of slopes obtained from regression analyses of 1,000 resamples of T15’s large-vocabulary Switchboard inner speech sentences, with a red dashed line indicating the slope for the counting task. This shows that numbers decoded from the counting task increase sequentially significantly more than when the same analysis is performed on instructed inner speech trials using the Switchboard corpus. H) Same as G, but for T16. See also Figures S4 and S5.
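
The sequence analysis in panels C-H fits a line relating each decoded number word to its position in the decoded output and asks whether the slope is reliably positive relative to a null distribution. The sketch below uses invented decoded sequences and a simple permutation null in place of the paper's Switchboard-based resampling, so it illustrates the logic rather than the exact procedure.

```python
# Sketch of the Figure 5C-H analysis: regress the decoded number word
# against its position in the decoded sequence, test whether the slope is
# positive, and compare against a null distribution. Invented data.
import numpy as np
from scipy.stats import linregress

# Decoded number words (as integers) per position, pooled over trials.
positions = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
numbers   = np.array([1, 3, 2, 4, 5, 2, 2, 4, 4, 6])   # noisy but increasing

fit = linregress(positions, numbers)
print(f"counting task: slope = {fit.slope:.2f}, p = {fit.pvalue:.2g}")

# Null comparison: slopes from shuffled decodes (permutation stand-in for
# the paper's resampled Switchboard inner-speech controls).
rng = np.random.default_rng(4)
null_slopes = [linregress(positions, rng.permutation(numbers)).slope
               for _ in range(1000)]
print(f"fraction of null slopes >= observed: "
      f"{np.mean(np.array(null_slopes) >= fit.slope):.3f}")
```
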
Figure 6: Motor cortex contains a neural dimension representing motor intention that can help distinguish attempted speech from inner speech
A) For T12, PCA projections of all 14 conditions (7 words each for attempted and inner speech) show that the concentric view (left) reveals shared word structure (attempted: solid; inner: dashed), while the rotated view (right) highlights a clear separation along the motor-intent dimension, defined as the vector between the two behaviors’ centroids (see Methods 8.4). B-D) PCA projections for T15, T16, and T17. E) Cross-validated Euclidean distances reveal that word-related modulation (within behaviors, turquoise/pink) is similar to (T12, T16, T17) or smaller than (T15) the motor-intent modulation (across behaviors, yellow). F) Example confusion matrices for T16 show that removing the motor-intent dimension increases cross-behavior confusion. G) Left: Word accuracy (indicating correct word decoding irrespective of the decoded behavior) remains similar before and after removal of the motor-intent dimension (95% confidence intervals intersect). Right: Behavior accuracy (indicating correct behavior decoding irrespective of the decoded word) drops significantly after removal (non-overlapping 95% confidence intervals for all participants).
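
The motor-intent dimension in panel A is defined as the vector between the attempted-speech and inner-speech centroids, and panels F-G test what happens when that single dimension is projected out of the neural features. A rough sketch of both steps on synthetic data is shown below; it is not the authors' pipeline.

```python
# Sketch of the Figure 6 "motor-intent dimension": the unit vector between
# the attempted-speech and inner-speech centroids, which can be projected
# out of the features to test its role in separating the two behaviors.
# Data are synthetic.
import numpy as np

rng = np.random.default_rng(5)
n_trials, n_features = 100, 64
attempted = rng.normal(size=(n_trials, n_features)) + 1.5   # offset across features
inner     = rng.normal(size=(n_trials, n_features))

# Motor-intent axis: difference between the two behavior centroids.
axis = attempted.mean(axis=0) - inner.mean(axis=0)
axis /= np.linalg.norm(axis)

def remove_dimension(X, v):
    """Project out the component of each trial along unit vector v."""
    return X - np.outer(X @ v, v)

X = np.vstack([attempted, inner])
labels = np.repeat([1, 0], n_trials)
X_removed = remove_dimension(X, axis)
sep_before = abs((X @ axis)[labels == 1].mean() - (X @ axis)[labels == 0].mean())
sep_after  = abs((X_removed @ axis)[labels == 1].mean() - (X_removed @ axis)[labels == 0].mean())
print(f"behavior separation along axis: before = {sep_before:.2f}, after = {sep_after:.2f}")
```
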
Figure 7: Simple strategies can robustly prevent private inner speech from being decoded by a speech BCI
A) The imagery-silenced strategy augments the standard imagery-naive approach (which uses only attempted speech trials) by including inner speech trials labeled as silence. This strategy largely preserves offline decoding performance (measured by word error rate) on attempted speech trials, as indicated by dots (10 RNN training seeds) with 95% confidence intervals. B) Imagery-silenced training robustly prevents false outputs during inner speech. C) Correlations between RNN outputs for matched inner and attempted speech sentences are much higher with imagery-naive training than with imagery-silenced training (dotted lines show chance-level correlations). D) Visualizations of phoneme logit outputs for a T16 sentence illustrate that, in the imagery-naive strategy (left), attempted (top) and inner speech (bottom) produce similar outputs, while the imagery-silenced strategy (right) correctly quiets the output on inner speech trials. E) With the keyword strategy, the inner-speech BCI remains in a "locked" mode and does not decode until the unlocking keyword "chittychittybangbang" is detected. In real-time tests with T12, this approach achieved a keyword detection accuracy of 98.75% and a word error rate of 43.45% (95% CI: [35.7%, 49.4%]) for a 50-word vocabulary.
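
Both safeguards have simple logical cores: the imagery-silenced scheme relabels inner-speech training trials as silence so the decoder learns to output nothing for inner speech, and the keyword strategy gates all decoder output behind detection of an unlikely keyword. The toy sketch below illustrates those two ideas; the function names and data structures are invented for illustration and do not reflect the real decoder.

```python
# Toy sketch of the two safeguards in Figure 7: imagery-silenced training
# labels and a keyword-unlock gate on decoder output. Illustration only.

SILENCE = "<sil>"

def imagery_silenced_labels(trials):
    """Relabel inner-speech trials' targets as silence for decoder training."""
    return [(feats, SILENCE if behavior == "inner" else text)
            for feats, text, behavior in trials]

def keyword_gate(decoded_words, keyword="chittychittybangbang"):
    """Suppress output until the unlocking keyword is decoded."""
    unlocked, output = False, []
    for word in decoded_words:
        if not unlocked:
            unlocked = (word == keyword)      # stay silent until keyword appears
            continue
        output.append(word)
    return output

print(imagery_silenced_labels([([0.1], "hello", "attempted"), ([0.2], "hello", "inner")]))
print(keyword_gate(["hello", "chittychittybangbang", "i", "am", "thirsty"]))
```
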

