Epilepsia. 2025 Sep;66(9):3411-3425.
doi: 10.1111/epi.18487. Epub 2025 Jun 4.

Harvard Electroencephalography Database: A comprehensive clinical electroencephalographic resource from four Boston hospitals

Chenxi Sun et al. Epilepsia. 2025 Sep.

Abstract

Objective: This article presents the Harvard Electroencephalography Database (HEEDB), a large-scale, deidentified, and standardized electroencephalographic (EEG) resource supporting artificial intelligence-driven and reproducible research in epilepsy and broader clinical neuroscience.

Methods: HEEDB aggregates more than 280 000 EEG recordings from more than 108 000 patients across four Harvard-affiliated hospitals. Data are harmonized using the Brain Imaging Data Structure and hosted on the Brain Data Science Platform. EEG data are linked with clinical notes, International Classification of Diseases, 10th Revision codes, medications, and EEG reports. Deidentification follows Health Insurance Portability and Accountability Act Safe Harbor standards.

Results: The database includes routine, epilepsy monitoring unit, and intensive care unit EEGs across all age groups, with 73% linked to deidentified clinical reports and 96% of those matched to recordings. Findings are extracted using expert curation, regular expressions, and medical natural language processing models. Auxiliary data include diagnoses, medications, and hospital course, supporting multimodal analysis.

Significance: HEEDB fills a critical gap in EEG data availability for epilepsy research. By enabling large-scale, privacy-compliant, and clinically relevant analysis, it accelerates the development of diagnostic tools, improves training datasets for machine learning, and promotes data-sharing in alignment with FAIR (Findable, Accessible, Interoperable, Reusable) and National Institutes of Health data policies.

Keywords: AI for neurology; Data‐driven EEG analysis; Deidentified clinical data; EEG data platform; EEG large‐scale database.

Conflict of interest statement

M.B.W. is a cofounder of and consultant to Beacon Biosignals, with personal equity, and receives royalties from Wolters Kluwer and Demos Medical. D.M.G. is an unpaid advisor for Epilepsy AI and Eysz, and a paid advisor for Magic Leap. He has received speaker fees from AAN, AES, ACNS, NNS, and AI in Epilepsy and Neurology, and served as a consultant for Neuro Event Labs, IDR, LivaNova, and Health Advances. T.L. is an inventor on patents and patent applications related to the detection, prediction, management, and treatment of epilepsy and seizures; has received device donations from Epitel and Empatica; has received travel support from academic and scientific organizations; and hosts international fellows. C.T.S. and B.G. are employed by and hold equity in Amazon Web Services. J.R. is the founder of the Global Brain Care Coalition and cofounder of McCance for Brain Health, has consulted for the NFL and Eli Lilly, and holds leadership roles at Columbia University, The European Stroke Journal, and The Lancet Neurology. The remaining authors report no conflicts of interest. We confirm that we have read the Journal's position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.

Figures

FIGURE 1
Demographics. (A) Age distribution of patients. (B) Race and ethnicity distribution of patients. (C) Findings distribution of electroencephalographic (EEG) recordings. In each plot, the stacked bars show contributions to the overall counts from each hospital. In panels A and B, statistics are presented at the patient level, whereas in panel C, they are based on individual EEG recordings. BCH, Boston Children's Hospital; BECTS, Benign Epilepsy with Centrotemporal Spikes; BETS, Benign Epileptiform Transients of Sleep; BIDMC, Beth Israel Deaconess Medical Center; BIPD, Bilateral Independent Periodic Discharges; BWH, Brigham and Women's Hospital; CJD, Creutzfeldt‐Jakob Disease; ESES, Electrical Status Epilepticus in Sleep; GPD, generalized periodic discharge; GRDA, generalized rhythmic delta activity; JEA, Juvenile Absence Epilepsy; JME, Juvenile Myoclonic Epilepsy; LPD, lateralized periodic discharge; LRDA, lateralized rhythmic delta activity; MGH, Massachusetts General Hospital; PDR, Posterior Dominant Rhythm; POSTS, Positive Occipital Sharp Transients of Sleep; PPR, Photoparoxysmal Response.
FIGURE 2
Hierarchical organization of electroencephalographic (EEG) data in the Brain Imaging Data Structure (BIDS) format. The BIDS format (version 1.7.0) organizes EEG data into four hierarchical levels to ensure interoperability and clarity across datasets. At the root level, general metadata files (e.g., dataset_description.json, participants.json, participants.tsv, and a README) provide high‐level information about the dataset and participants. The subject level groups data by unique participant identifiers (e.g., sub‐SiteIdPatientId), containing all data related to a specific participant. Within each subject folder, the session level arranges EEG recordings chronologically into subfolders (e.g., ses‐01), representing individual recording sessions. The EEG data level includes essential files such as the raw EEG data files (.edf), annotations (_annotations.tsv), channel descriptions (_channels.tsv), and additional metadata (_eeg.json). This structure ensures consistency and facilitates efficient data analysis and retrieval.
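The four-level layout described in this caption can be sketched as a small path builder. This is only an illustration under stated assumptions: the function name `bids_eeg_paths`, the example root and subject label, and the `task` entity (which BIDS expects in EEG file names but which the caption does not mention) are hypothetical. The file suffixes are the ones listed in the caption.

```python
from pathlib import Path


def bids_eeg_paths(root, subject, session, task="rest"):
    """Build the expected file paths for one recording.

    Layout follows the caption: root / sub-<id> / ses-<nn> / eeg / files.
    The `task` label is an assumption; the caption does not specify it.
    """
    stem = f"sub-{subject}_ses-{session}_task-{task}"
    eeg_dir = Path(root) / f"sub-{subject}" / f"ses-{session}" / "eeg"
    suffixes = {
        "raw": "_eeg.edf",                  # raw EEG signal
        "sidecar": "_eeg.json",             # additional recording metadata
        "channels": "_channels.tsv",        # channel descriptions
        "annotations": "_annotations.tsv",  # annotations
    }
    return {name: eeg_dir / f"{stem}{suffix}" for name, suffix in suffixes.items()}


# Hypothetical subject label in the sub-SiteIdPatientId style.
paths = bids_eeg_paths("/data/heedb", "S0001123", "01")
for name, path in paths.items():
    print(name, path.as_posix())
```

Building every path from one stem keeps the raw file and its sidecar files in step, which is the consistency the BIDS convention is meant to provide.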
FIGURE 3
Medical diagnostic codes. (A) The number of International Classification of Diseases, 10th Revision (ICD‐10) codes per patient in intervals of 200. (B) The number of unique ICD‐10 types per patient in intervals of 50. (C) Counts of how many patients have ICD‐10 codes within key categories (ICD‐10, Clinical Modification for Neurology). In each plot, the stacked bars show contributions to the overall counts from each hospital. ATC, Anatomical Therapeutic Chemical; BCH, Boston Children's Hospital; BIDMC, Beth Israel Deaconess Medical Center; BWH, Brigham and Women's Hospital; MGH, Massachusetts General Hospital.
FIGURE 4
Medications. (A) The number of medications per patient in intervals of 50. (B) The number of unique medication types per patient in intervals of 20. (C) Counts of how many patients have been prescribed medications within key categories (Anatomical Therapeutic Chemical classification system). In each plot, the stacked bars show contributions to the overall counts from the four contributing hospitals. BCH, Boston Children's Hospital; BIDMC, Beth Israel Deaconess Medical Center; BWH, Brigham and Women's Hospital; ICD, International Classification of Diseases; MGH, Massachusetts General Hospital.
FIGURE 5
Seizure detection tool. (A) Spectrogram of an 8‐h continuous electroencephalographic (EEG) recording. The four panels depict the average spectral power over the left lateral (LL), right lateral (RL), left paracentral (LP), and right paracentral (RP) regions. The spectrogram shows a narrowband monotonous pattern of delta activity, interrupted at semiregular intervals by 28 distinct flame‐shaped electrographic seizures. Automated detections of seizures by a neural network model (SPaRCNet) are indicated in red in the color bar below. (B) A close‐up of one of the seizures shown in panel A. The spectrogram spans 10 min. A 20‐s sample of the raw EEG is shown in bipolar montage, centered at the location indicated in the spectrogram by the vertical dashed line and green triangle. GPD, generalized periodic discharge; GRDA, generalized rhythmic delta activity; LPD, lateralized periodic discharge; LRDA, lateralized rhythmic delta activity; SN, SPaRCNet, Seizure, Periodic and Rhythmic Continuum patterns Deep Neural Network.
FIGURE 6
Epileptiform discharge detection tool. An automatically detected bifrontal spike‐and‐wave epileptiform discharge is shown. The common average referential montage displays the electroencephalogram (EEG) over 15 s. The output of the SpikeNet 1.0 algorithm is shown above the EEG. The red overbar and transparent pink vertical bar indicate the period when the probability exceeds the detection threshold (theta = .42).
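The thresholding step described in this caption can be sketched as follows. This is a minimal illustration, not the SpikeNet implementation: the function name `detection_intervals`, the sampling rate `fs` of the probability trace, and the example values are assumptions. Only the threshold of .42 comes from the caption.

```python
def detection_intervals(probs, threshold=0.42, fs=1.0):
    """Return (start_s, end_s) spans where the probability exceeds threshold.

    `probs` is a sequence of per-sample probabilities and `fs` is the assumed
    number of probability samples per second. A span ends at the first sample
    that is no longer above the threshold.
    """
    intervals = []
    start = None
    for i, p in enumerate(probs):
        if p > threshold and start is None:
            start = i  # rising edge: detection begins
        elif p <= threshold and start is not None:
            intervals.append((start / fs, i / fs))  # falling edge: detection ends
            start = None
    if start is not None:
        # Trace ended while still above the threshold.
        intervals.append((start / fs, len(probs) / fs))
    return intervals


# Hypothetical probability trace sampled once per second.
trace = [0.10, 0.50, 0.60, 0.20, 0.43, 0.42]
print(detection_intervals(trace))  # [(1.0, 3.0), (4.0, 5.0)]
```

Each returned span corresponds to a period that would be marked by the red overbar in the figure. A value equal to the threshold is treated as not exceeding it.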
