J Neural Eng. 2025 Jul 3;22(4):046003. doi: 10.1088/1741-2552/ade402.

Annotating neurophysiologic data at scale with optimized human input


Zhongchuan Xu et al. J Neural Eng.

Abstract

Objective. Neuroscience experiments and devices are generating unprecedented volumes of data, but analyzing and validating them presents practical challenges, particularly in annotation. While expert annotation remains the gold standard, it is time-consuming to obtain and often poorly reproducible. Automated annotation approaches exist, but they first require labeled data to train machine learning algorithms, which limits their scalability. A semi-automated annotation approach that integrates human expertise while optimizing efficiency at scale is critically needed. To address this, we present Annotation Co-pilot, a human-in-the-loop solution that leverages deep active learning (AL) and self-supervised learning (SSL) to improve intracranial EEG (iEEG) annotation, significantly reducing the number of human annotations required.

Approach. We automatically annotated iEEG recordings from 28 humans and 4 dogs with epilepsy, implanted with two neurodevices that telemetered data to the cloud for analysis. We processed 1500 h of unlabeled iEEG recordings to train a deep neural network with an SSL method, Swapping Assignments between Views (SwAV), to generate robust, dataset-specific feature embeddings for seizure detection. AL was used to select only the most informative data epochs for expert review. We benchmarked this strategy against standard methods.

Main result. Over 80 000 iEEG clips, totaling 1176 h of recordings, were analyzed. The algorithm matched the best published seizure detectors on two datasets (NeuroVista and NeuroPace responsive neurostimulation) but required, on average, only 1/6 of the human annotations to achieve similar accuracy (area under the ROC curve of 0.9628 ± 0.015) and demonstrated better consistency than human annotators (Cohen's kappa of 0.95 ± 0.04).

Significance. 'Annotation Co-pilot' demonstrated expert-level performance, robustness, and generalizability across two disparate iEEG datasets while reducing annotation time by an average of 83%. This method holds great promise for accelerating basic and translational research in electrophysiology, and potentially accelerating the pathway to clinical translation for AI-based algorithms and devices.
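The core active learning step the abstract describes can be sketched in a few lines. The snippet below is an illustrative outline, not the authors' implementation: it scores each unlabeled clip by the entropy of the model's predicted class probabilities and queues the highest-entropy clips for expert review.

    import torch

    def entropy_scores(probs: torch.Tensor) -> torch.Tensor:
        # Predictive entropy per clip; probs has shape (n_clips, n_classes).
        eps = 1e-12
        return -(probs * (probs + eps).log()).sum(dim=1)

    def select_for_annotation(probs: torch.Tensor, budget: int) -> torch.Tensor:
        # Indices of the `budget` clips the model is least certain about.
        return entropy_scores(probs).topk(budget).indices

    # Toy example: 8 unlabeled clips, 2 classes (seizure / non-seizure).
    probs = torch.softmax(torch.randn(8, 2), dim=1)
    print(select_for_annotation(probs, budget=3))

In each round, the selected clips are labeled by an expert and added to the training set, and the model is retrained before scoring the remaining pool again.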

Keywords: active learning; annotation; epilepsy; human-in-the-loop; iEEG; seizure detection; self-supervised learning.


Figures

Figure 1. Data annotation of seizure vs. non-seizure clips. (a) The number of clips (color shading in squares) annotated as seizure or non-seizure for each subject (x-axis) by each of 10 expert reviewers (y-axis, left). The shading scale is labeled on the right y-axis, from 0 to 100. (b) Inter-rater agreement between marking experts on 50 segments annotated by all experts, showing percentage agreement for each pair of experts.
Figure 2. Pipeline overview. The proposed pipeline for iEEG data classification has several stages. (a) The unlabeled data are first segmented into 1 s sliding windows, with voltage values converted into grayscale pixels. Following data augmentation, (b) these clips are used to train a ResNet-50 model with self-supervised learning (SwAV), without labels. (c) Active learning is then employed to train a long short-term memory (LSTM) prediction head for final classification, using the previously learned feature representations. In this human-in-the-loop process, the model is iteratively trained on the most informative labeled data, with the algorithm selecting the most important samples for human annotation in each round. Panel (c) shows the importance scores of the first iteration with entropy sampling. (d) This iterative cycle continues until satisfactory performance is achieved, resulting in final seizure and non-seizure annotations. This approach aims to achieve high classification performance with a significantly reduced number of annotations.
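Stage (a) of the pipeline can be illustrated with a short sketch. The sampling rate, the non-overlapping windowing, and the per-window min-max scaling below are assumptions for illustration, not the paper's exact preprocessing:

    import numpy as np

    def windows_to_grayscale(signal: np.ndarray, sample_rate: int) -> np.ndarray:
        # Cut the recording into 1 s windows and rescale each window's
        # voltages to 8-bit grayscale pixel values in [0, 255].
        n = len(signal) // sample_rate
        wins = signal[: n * sample_rate].reshape(n, sample_rate)
        lo = wins.min(axis=1, keepdims=True)
        hi = wins.max(axis=1, keepdims=True)
        scaled = (wins - lo) / np.maximum(hi - lo, 1e-9)
        return (scaled * 255).astype(np.uint8)

    clips = windows_to_grayscale(np.random.randn(10_000), sample_rate=250)
    print(clips.shape, clips.dtype)  # (40, 250) uint8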
Figure 3. Performance evaluation of SSL and active learning strategies in iEEG classification. (a) Linear classification performance of ResNet-50 backbones trained under various self-supervised learning (SSL) conditions on 1 s iEEG sliding windows, compared with a fully supervised benchmark. The ResNet-50 models were first trained with SSL methods to learn feature representations from unlabeled data. The model weights were then frozen, and a linear classifier was trained on labeled data to classify ictal and interictal states from 1 s iEEG windows. For the fully supervised benchmark, a ResNet-50 model pretrained on ImageNet (using PyTorch's IMAGENET1K_V2 weights [37, 38]) with unfrozen weights was directly fine-tuned on the labeled data. (b) Classification performance of the full pipeline using different active learning query strategies trained on different numbers of labeled segments. A ResNet-50 model pretrained on 28 patients with SSL, together with an LSTM prediction head, was used to perform the same task as in (a). The 'Random Sampling' benchmark simulates randomly annotating iEEG clips using SSL-pretrained weights, while 'Random Sampling without SSL (Transfer Learning)' represents randomly annotating samples with ImageNet-pretrained weights. Only the best-performing strategies are shown; for the full list of performances see appendix F.
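The linear-evaluation protocol in panel (a) follows a standard recipe: freeze the pretrained backbone and train only a linear head. A minimal PyTorch sketch, using the ImageNet IMAGENET1K_V2 weights mentioned above as a stand-in for SwAV-pretrained weights:

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50, ResNet50_Weights

    backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
    backbone.fc = nn.Identity()          # expose the 2048-d features
    for p in backbone.parameters():
        p.requires_grad = False          # frozen backbone, as in panel (a)

    linear_head = nn.Linear(2048, 2)     # ictal vs. interictal
    optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)

    x = torch.randn(4, 3, 224, 224)      # placeholder batch of clip images
    with torch.no_grad():
        feats = backbone(x)
    logits = linear_head(feats)          # train with cross-entropy on labels

The input size and learning rate are illustrative. The fully supervised benchmark instead leaves the backbone weights unfrozen and fine-tunes end to end.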
Figure 4. Active learning query visualization and prediction analysis. (a) Feature representations of 1 s iEEG windows, with a color-coded scale indicating the importance scores of unlabeled samples. In the active learning framework, samples with the highest importance scores are prioritized for human annotation during each training iteration. Different active learning strategies generate varying importance scores, which are updated in successive training rounds. (b) Ground-truth labels of 1 s iEEG windows annotated by a human expert. (c) Ground-truth and predicted annotations for example RNS episodes from three patients, visualizing the algorithm's process for selecting important regions for annotation. Kadane's algorithm aggregates the importance scores, identifying and prioritizing the contiguous regions with the highest overall importance for annotation. More examples of classification can be found in appendix G. (d) Per-patient feature-representation distributions. Examples from three patients with different prediction accuracies are shown to visualize the embedding of misclassified data.
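The region-selection step in panel (c) rests on Kadane's maximum-subarray algorithm. A plain-Python sketch: given per-window importance scores (mean-centered here so uninformative windows score negative, an assumption for illustration), it returns the contiguous run of windows with the highest total importance to hand to an expert.

    def max_importance_region(scores):
        # Kadane's algorithm: (start, end, total) of the best contiguous run.
        best_sum = cur_sum = scores[0]
        best = (0, 0)
        start = 0
        for i in range(1, len(scores)):
            if cur_sum < 0:
                cur_sum, start = scores[i], i
            else:
                cur_sum += scores[i]
            if cur_sum > best_sum:
                best_sum, best = cur_sum, (start, i)
        return best[0], best[1], best_sum

    raw = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.3]
    centered = [s - 0.5 for s in raw]
    print(max_importance_region(centered))  # windows 2..4, total ~0.9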
Figure 5. A sample of data randomly selected from the NeuroVista dataset. The original NeuroVista data were reformatted into discrete 1 s sliding windows for the competition. The panel shows a 1 s sliding window along with its corresponding class labels.
Figure 6. Data usage of the RNS and NeuroVista datasets in this study. For the RNS dataset, unannotated episodes were used for unsupervised pretraining. Then, 80% of the annotated data was used for the active learning training process, and 20% was held out as a test set. A separate test set, annotated by multiple raters, was used for inter-rater agreement analysis. During the retrospective active learning process, new samples were iteratively introduced until convergence. The F1 score and AUC are reported for the test data and the unselected training data, while Cohen's κ is reported for the test set annotated by multiple raters. For the NeuroVista dataset, the entire dataset was used for self-supervised training. Annotated and test splits were defined by the dataset publisher, and performance was evaluated using the metrics specified by the publisher.
Figure 7. Randomly selected prediction examples from the RNS dataset. The upper panel presents randomly selected examples of the four prediction outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
Figure 8. Individual machine-annotator performance on a test set of EEG clips from three patients. The average Cohen's kappa agreement between machine predictions and human annotations was 0.95 ± 0.04.
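For context, an agreement value like the one in this figure can be computed with scikit-learn's cohen_kappa_score between machine predictions and one expert's labels (toy labels below, not the study's data):

    from sklearn.metrics import cohen_kappa_score

    machine = [1, 0, 1, 1, 0, 0, 1, 0]
    expert  = [1, 0, 1, 0, 0, 0, 1, 0]
    print(cohen_kappa_score(machine, expert))  # 0.75 for this toy example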
Figure 9. Classification accuracy on the validation set for the best-performing query strategies applied to the NeuroVista dataset. Similar to the results observed on the RNS dataset, these strategies demonstrate a clear improvement over the random sampling baseline.


