bioRxiv [Preprint]. 2025 Feb 6:2025.02.02.634313. doi: 10.1101/2025.02.02.634313.

A Generalist Intracortical Motor Decoder


Joel Ye et al. bioRxiv.

Abstract

Mapping the relationship between neural activity and motor behavior is a central aim of sensorimotor neuroscience and neurotechnology. While most progress to this end has relied on restricting complexity, the advent of foundation models instead proposes integrating a breadth of data as an alternate avenue for broadly advancing downstream modeling. We quantify this premise for motor decoding from intracortical microelectrode data, pretraining an autoregressive Transformer on 2000 hours of neural population spiking activity paired with diverse motor covariates from over 30 monkeys and humans. The resulting model is broadly useful, benefiting 8 downstream decoding tasks and generalizing to a variety of neural distribution shifts. However, we also highlight that scaling autoregressive Transformers seems unlikely to resolve limitations stemming from sensor variability and output stereotypy in neural datasets. Code: https://github.com/joel99/ndt3.


Figures

Figure 1.
A. NDT3 is a deep network for decoding intracortical spiking activity into low-dimensional time series for various motor effectors. B. We aggregate decoding performance on downstream tasks with variable amounts of data (from Fig. 11). Pretrained NDT3 models reliably outperform from-scratch models and linear baselines, up to 1.5 hrs of downstream data.
Figure 2. NDT3 Data and Model Design:
A. NDT3 models paired neural spiking activity and behavioral covariate timeseries. We plot the distribution of 2000 hours of pretraining data by subjects (top) and covariate dimensionality (bottom). B. Examples of the neural and behavioral data for each of the three types of behavioral covariates in pretraining: kinematics, EMG (electromyography), or forces. Not all modeled dimensions in data are meaningfully task-related (right, grey behavior). C. Neural spiking activity is tokenized in time by binning the number of spikes every 20 ms, and in “space” using patches of channels (usually 32), as in NDT2 (Ye et al., 2023). Behavior is low-dimensional in our data, so we use 1 token per behavior dimension, also per 20 ms timestep. NDT3 also pretrains on data from BCI control, which we annotate with two additional tokens. The phase token indicates whether the user is controlling or observing the behavior and the reward token indicates if the BCI task was completed. D. NDT3 models tokens in a single flat stream with linear readins and readouts. Every real-world timestep (shown by the blue cutout) yields several tokens, which we order to allow causal decoding at inference-time. At inference, we omit reward and phase tokens and zero-mask behavior tokens.
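The tokenization described in panel C can be sketched with a few array operations. This is a minimal illustration of the stated scheme (20 ms bins, 32-channel patches, one token per behavior dimension), not the released NDT3 code; the function name and the zero-padding of partial patches are our assumptions.

```python
import numpy as np

def tokenize_timestep(spike_counts, behavior, patch_size=32):
    """Tokenize one 20 ms timestep of paired neural/behavioral data.

    spike_counts: (n_channels,) binned spike counts for this timestep.
    behavior: (n_dims,) behavioral covariate values for this timestep.
    Returns neural tokens (fixed-size channel patches) and behavior
    tokens (one per dimension), ordered so neural tokens can precede
    behavior tokens within the timestep for causal decoding.
    """
    n_channels = spike_counts.shape[0]
    # Zero-pad channels so they divide evenly into fixed-size patches
    # (padding behavior is an assumption for this sketch).
    pad = (-n_channels) % patch_size
    padded = np.pad(spike_counts, (0, pad))
    neural_tokens = padded.reshape(-1, patch_size)  # one token per patch
    behavior_tokens = behavior.reshape(-1, 1)       # one token per dimension
    return neural_tokens, behavior_tokens

# Example: a 96-channel array and a 4D behavioral covariate.
neural, behav = tokenize_timestep(np.random.poisson(1.0, size=96), np.zeros(4))
```

At 96 channels this yields 3 neural tokens and 4 behavior tokens per 20 ms step; in pretraining on BCI-control data, the phase and reward tokens would be appended alongside these.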
Figure 3. Evaluation on diverse motor tasks:
A single legend and color scheme is used throughout. A. Test-split pretraining R2 compared for 3 models. All model pretraining data includes 1.5 hours of calibration data for the test dataset. We compare a model with just this data (Test dataset only) vs using 200 hours of additional data either from the test monkey or from over 10 other monkeys. Only the additional test monkey data improves over the calibration model. Models terminate at different points due to early stopping. B. Pretraining R2 for models with up to 2000 hours (2 khr) of pretraining data. The 2 khr model degrades in performance vs the 200 hr model at 45M parameters and merely maintains performance at 350M parameters. C. Examples of good and bad data-scaling in downstream multiscale evaluation on two datasets. The bottom right text shows time in each evaluation session and total time in each dataset. The x-axis scales this full dataset down by random subsampling. Shading shows standard deviation over 3 tuning seeds. Increasing pretraining data yields performance gains at all downstream scales in the 4D task, but effects are unclear in the self-paced reach task. ▼ indicates outliers clipped for clarity. D. Downstream performance averaged over 31 settings comprising different downstream datasets and scales, for different NDT3s and baselines. 45M NDT3s improve with data from 1.5 hrs to 200 hrs but saturate at 2 khrs. Increasing model size to 350M parameters enables further gains. E. p-values computed from FDR-corrected pairwise t-tests for each pair of models. The 350M 2 khr NDT3 significantly outperforms other pretrained NDT3s, except the 350M 200 hr NDT3, and is the only model to do so. NDT2s omitted for brevity; see Fig. 13. F. Per-task performance, normalized by the 350M 200 hr NDT3 performance, is shown against task time for different NDT3 models. Each vertical band shows models trained on the same evaluation setting, e.g. dashed lines show the evaluations from the self-paced reaching dataset. Model variability vanishes by 1.5 hours.
Figure 4. NDT3 fails in certain novel input and output configurations.
A. Cross-session transfer persists after pretraining, but cross-subject transfer does not. We test NDT3 on one evaluation session from a monkey self-paced reaching dataset. Training uses 1 minute from the evaluation session plus additional data from other sessions with the same monkey (Cross-Session) or from sessions with a different monkey performing the same behavior (Cross-Subject). Train/Test bars at bottom show how inputs are arranged in each setting. Single bars correspond to different data channels or dimensions, with color marking a constant physical electrode source. Squares show how channels are grouped into tokens. B. Shuffling inputs ablates cross-session data to resemble cross-subject transfer. Each shuffle uses the same cross-session neural data but permutes input dimensions (recording channels). Shuffle channel randomly permutes inputs, half-token shift rolls channels so that each channel i uses data from channel i + 16, and shuffle token permutes data patchwise, keeping channels from the same patch together. Channel shuffling and half-token shifts are both sufficient to reduce cross-session transfer to the level of cross-subject transfer. All panels show the baseline performance achieved by the model with just 1 minute of test-session data; the x-axis shows additional cross-context data provided. C. We study angular extrapolation in an isometric monkey dataset where exerted forces are mapped to cursor positions in 8 different angles. We use three held-in and five held-out angles. Behavior is cleanly separated across conditions. The neural data for each condition can also be visualized separably by projecting it to a 2D plane computed by combining PCA and LDA. D. Predictions derived either by fitting a Wiener Filter to the projected neural data (Linear, PCA-LDA) or from NDT3 (Scratch, 350M 2 khr). While the linear model generalizes to held-out angles, NDT3 predictions are restricted to held-in ranges. E. Pretraining quantifiably improves over from-scratch training in all conditions, but far underperforms the generalization of PCA-LDA.
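The three input ablations in panel B amount to simple permutations of the channel axis. A hedged sketch, assuming neural input of shape (time, channels) with 32-channel patches as in Fig. 2; function names are illustrative, not from the NDT3 codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_channels(x, rng):
    # "Shuffle channel": randomly permute all recording channels,
    # ignoring patch structure entirely.
    return x[:, rng.permutation(x.shape[1])]

def half_token_shift(x, shift=16):
    # "Half-token shift": roll channels so that channel i uses data
    # from channel i + shift (half of a 32-channel patch).
    return np.roll(x, -shift, axis=1)

def shuffle_tokens(x, patch_size=32, rng=rng):
    # "Shuffle token": permute patchwise, keeping channels from the
    # same patch together.
    t, c = x.shape
    patches = x.reshape(t, c // patch_size, patch_size)
    return patches[:, rng.permutation(patches.shape[1])].reshape(t, c)
```

The contrast between the three is the point of the ablation: token shuffling preserves within-patch channel identity while the other two destroy it, isolating what the model relies on for cross-session transfer.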
Figure 5. Generalizability of pretraining gains.
A. Models fine-tuned in one distribution of data are evaluated in-distribution (ID) and out-of-distribution (OOD). Top plots show the distribution across channels of neural firing rates from OOD and ID trials, normalized by average ID firing rates. Lower plots scatter OOD vs ID performance, with each point being a single model with different hyperparameters. The time shift uses two human cursor datasets collected one hour apart; models were tuned on the first block and evaluated on the second. Pose shift uses a monkey center-out reach task which was performed with the hand starting in different locations in the workspace. Spring load uses a dataset of monkey 1D finger motion with or without spring force feedback. B. Models are evaluated on a human open-loop cursor dataset prepared in two ways. Trialized training receives inputs according to trial boundaries, varying from 2–4 seconds in length. Continuous training receives random 1 second snippets (that can cross trial boundaries). Trialized evaluation matches trialized training, and continuous evaluation is done by streaming up to 1 second of history. ▼ indicates points below 0.0. Continuously trained models perform well in both evaluation settings, while models trained on trialized data fail in continuous evaluation. C. Multiscale fine-tuning performance of NDT3 on datasets recorded outside motor cortex, namely S1 (Somatosensory) and FEF/MT (Oculomotor).
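The continuous preparation in panel B reduces to sampling fixed-length windows that ignore trial boundaries. A minimal sketch, assuming 20 ms bins (so 1 second = 50 timesteps) and trials concatenated along the time axis; the function name is ours.

```python
import numpy as np

def sample_continuous_snippet(session, snippet_len, rng):
    """Draw a random fixed-length snippet that may cross trial boundaries.

    session: (T, ...) timesteps concatenated across all trials.
    snippet_len: window length in timesteps (50 timesteps = 1 s at 20 ms).
    """
    start = rng.integers(0, session.shape[0] - snippet_len + 1)
    return session[start:start + snippet_len]
```

Trialized preparation would instead slice at annotated trial boundaries, which is what makes the trialized-trained models in panel B brittle under streaming evaluation.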
