Sleep. 2025 Aug 14;48(8):zsaf134.

doi: 10.1093/sleep/zsaf134.

CAISR: achieving human-level performance in automated sleep analysis across all clinical sleep metrics

Samaneh Nasiri^{1,2,3,4}, Wolfgang Ganglberger^{1,3}, Thijs Nassi^{1,3,5}, Erik-Jan Meulenbrugge^{1,3}, Valdery Moura Junior^{2,3}, Manohar Ghanta^{2,3}, Aditya Gupta^{2,3}, Katie L Stone^{6,7}, Magnus Ruud Kjaer⁸, Oliver Sum-Ping⁸, Emmanuel Mignot⁸, Dennis Hwang⁹, Lynn Marie Trotti¹⁰, Gari D Clifford^{4,11}, Umakanth Katwa^{3,12}, Haoqi Sun^{1,3}, Robert J Thomas^{3,13}, M Brandon Westover^{1,3}

Affiliations

¹ Department of Neurology, Beth Israel Deaconess Medical Center, Boston, MA, USA.
² Department of Neurology, Massachusetts General Hospital, Boston, MA, USA.
³ Department of Neurology, Harvard Medical School, Boston, MA, USA.
⁴ Department of Biomedical Informatics, Emory School of Medicine, Atlanta, GA, USA.
⁵ Cardiovascular and Respiratory Physiology Group, University of Twente, Enschede, The Netherlands.
⁶ Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA.
⁷ Epidemiology and Biostatistics, California Pacific Medical Center Research Institute, San Francisco, CA, USA.
⁸ School of Medicine, Stanford University, Palo Alto, CA, USA.
⁹ Kaiser Permanente, San Bernardino County Sleep Disorders Center, San Bernardino, CA, USA.
¹⁰ Department of Neurology and Emory Sleep Center, Emory University School of Medicine, Atlanta, GA, USA.
¹¹ Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
¹² Division of Sleep Medicine, Boston Children's Hospital, Boston, MA, USA.
¹³ Department of Medicine, Division of Pulmonary Critical Care & Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.

PMID: 40554678
PMCID: PMC12341759
DOI: 10.1093/sleep/zsaf134

Samaneh Nasiri et al. Sleep. 2025.

Abstract

Study objectives: To develop and validate a Complete Artificial Intelligence Sleep Report system (CAISR), a system for comprehensive automated sleep analysis, including sleep staging, arousal detection, apnea identification, and limb movement analysis.

Methods: We used a large, diverse dataset from four cohorts (MGH, MESA, MrOS, SHHS) comprising 25,749 participants to develop CAISR. Following American Academy of Sleep Medicine (AASM) guidelines, CAISR performs four tasks: it stages sleep into five categories (Wake, NREM 1, NREM 2, NREM 3, REM), detects arousals, detects and classifies breathing events (Obstructive Apnea, Central Apnea, Mixed Apnea, Hypopnea, and RERA), and detects limb movements and categorizes them as periodic or isolated. We tested CAISR against multiple datasets independently annotated by multiple experts: UPenn (69 subjects, 6 experts), BITS (98 subjects, 3 experts), and Stanford (100 subjects, 3 experts). Sleep staging and arousal detection used customized deep neural networks, while breathing event detection and classification and limb movement analysis used rule-based signal processing approaches. We quantified CAISR performance with three metrics: Cohen's Kappa, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). To determine whether CAISR performed on par with human experts, we compared expert inter-rater reliability (IRR) with algorithm-expert IRR.
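The comparison of expert IRR with algorithm-expert IRR can be illustrated with a minimal sketch. This is not the authors' code: the function names `cohen_kappa` and `irr_summary` are hypothetical, and summarizing pairwise Kappa values by their median is an assumption taken from the figure captions rather than from a stated procedure. It uses only the Python standard library.

```python
from collections import Counter
from itertools import combinations
from statistics import median


def cohen_kappa(a, b):
    """Cohen's Kappa between two equal-length sequences of labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: fraction of epochs where both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def irr_summary(expert_labels, model_labels):
    """Return (median expert-expert Kappa, median model-expert Kappa).

    expert_labels: list of label sequences, one per expert (hypothetical input).
    model_labels:  label sequence produced by the algorithm.
    """
    expert_pairs = [cohen_kappa(x, y) for x, y in combinations(expert_labels, 2)]
    model_pairs = [cohen_kappa(model_labels, x) for x in expert_labels]
    return median(expert_pairs), median(model_pairs)


# Toy example with made-up epoch labels for three experts and one model.
experts = [
    ["W", "W", "N2", "N2"],
    ["W", "N2", "N2", "N2"],
    ["W", "W", "N2", "N2"],
]
model = ["W", "W", "N2", "N2"]
expert_irr, model_irr = irr_summary(experts, model)
```

If the model-expert value is at least as high as the expert-expert value, the algorithm agrees with the experts about as well as they agree with each other, which is the sense of "on par" used in the Methods.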

Results: The CAISR model showed strong overall performance across the four tasks: sleep staging, arousal detection, apnea detection, and limb movement detection. In sleep staging, the model achieved AUROC values ranging from 0.82 to 0.97 and AUPRC values between 0.63 and 0.90 across the BITS, Stanford, and UPenn datasets, indicating high classification accuracy. The Kappa agreement analysis showed that in the BITS and Stanford datasets, CAISR outperformed human experts, with non-overlapping confidence intervals indicating superiority (Kappa values around 0.7 to 0.8 for CAISR vs. experts). In the UPenn dataset, the model's performance was comparable to experts, with overlapping confidence intervals suggesting non-inferiority. For arousal detection, the model maintained reliable performance, with AUROC values ranging from 0.83 to 0.94 and AUPRC values from 0.67 to 0.85, and Kappa analysis showing overlapping confidence intervals, indicating comparable performance to experts in both the BITS and Stanford datasets (Kappa values for CAISR around 0.6 to 0.75). In apnea detection, including the detection of obstructive, central, and mixed apnea, the CAISR model achieved competitive results in the BITS dataset with AUROC values between 0.81 and 0.95 and AUPRC values between 0.58 and 0.82, but in the Stanford dataset it underperformed compared to human experts, as shown by non-overlapping confidence intervals and lower Kappa values (around 0.55 to 0.65). Finally, in limb movement detection, the model demonstrated superior performance in the BITS dataset, with AUROC values of 0.90 to 0.96 and AUPRC values between 0.75 and 0.85, and Kappa analysis indicating significantly higher reliability compared to experts (CAISR Kappa around 0.8, with non-overlapping confidence intervals). In the Stanford dataset, CAISR's performance was comparable to experts, with overlapping confidence intervals suggesting non-inferiority (Kappa values around 0.65 to 0.7). Overall, the CAISR model consistently exhibited high classification performance and reliability across tasks, often matching or surpassing expert-level performance, with particularly strong results in sleep staging and limb movement detection.

Conclusions: The CAISR model demonstrated high classification accuracy and reliability across sleep staging, arousal, apnea, and limb movement detection tasks, matching or surpassing human expert performance. Human errors and systematic biases in the annotation of micro-events during sleep, such as arousals and apneas, likely contributed to variability in expert performance, while the CAISR model showed more consistent results, reducing the impact of these biases and increasing overall reliability across tasks.

Keywords: apnea detection; arousal detection; deep learning; few-shot learning; inter-rater reliability; limb movement; rule-based model; sleep staging; transfer learning.

© The Author(s) 2025. Published by Oxford University Press on behalf of Sleep Research Society. All rights reserved. For commercial re-use, please contact reprints@oup.com for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact journals.permissions@oup.com.

Figures

**Figure 1.**
Workflow of the Complete Artificial Intelligence Sleep Report System (CAISR). This figure illustrates the comprehensive operational process of CAISR. The system integrates various physiological sleep signals, including EEG, EOG, EMG, and respiratory signals, to perform sleep staging, arousal detection, apnea identification, and limb movement analysis. The flowchart highlights the data processing pipeline, starting from raw PSG data input, through preprocessing and feature extraction, to the application of deep neural networks and rule-based algorithms. CAISR’s predictions are then validated against expert (gold) and super-expert (platinum) annotations, ensuring robust performance across diverse datasets.

**Figure 2.**
Summary of CAISR results across all cohorts and tasks. **(A)** Kappa for CAISR sleep staging across all cohorts, showing the distribution for each subject within each cohort. Boxplots depict data distribution, with boxes extending from the first (Q1) to the third quartile (Q3), a median line, whiskers up to 1.5x the interquartile range (IQR), and fliers beyond the whiskers. **(B)** CAISR results across all cohorts, detailing performance in detecting arousal events. **(C)** CAISR results across all cohorts for detecting apnea events. **(D)** CAISR results across all cohorts for detecting limb movements. Bar plots show median ICC values with 95% confidence intervals, comparing CAISR performance against experts. The hatched-filled bars represent results against the platinum labels.

**Figure 3.**
Inter-rater reliability analysis results for sleep staging. Top: ROC and precision-recall curves for the CAISR sleep staging model and six experts across different sleep stages (Wake, REM, N1, N2, N3). The CAISR model achieves high AUROC values (0.82 to 0.97) and AUPRC values (0.63 to 0.90), indicating robust performance. Bottom: Barplot summarizing the median Kappa agreement between rater pairs across different sleep stages per dataset. 95% confidence intervals were computed using 10,000 bootstrap samples. Non-overlapping CIs indicate superiority of either the AI model or the expert, while overlapping CIs suggest non-inferiority of the AI model compared to the expert.
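The confidence-interval comparison described in this caption can be sketched as follows. The caption gives only the number of bootstrap samples (10,000), so the percentile method, the resampling over per-subject Kappa values, and the names `bootstrap_median_ci` and `intervals_overlap` are all assumptions made for illustration.

```python
import random
from statistics import median


def bootstrap_median_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median of `values` (assumed method)."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(median(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = medians[round((alpha / 2) * n_boot)]
    upper = medians[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def intervals_overlap(ci_a, ci_b):
    """True if two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Made-up per-subject Kappa values, for illustration only.
model_vs_expert = [0.78, 0.81, 0.74, 0.80, 0.77, 0.83, 0.79, 0.76]
expert_vs_expert = [0.70, 0.73, 0.68, 0.75, 0.71, 0.69, 0.72, 0.74]

ci_model = bootstrap_median_ci(model_vs_expert)
ci_expert = bootstrap_median_ci(expert_vs_expert)
comparable = intervals_overlap(ci_model, ci_expert)
```

Under the caption's reading, `comparable == True` corresponds to overlapping intervals, and `False` to one rater group being superior.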

**Figure 4.**
Inter-rater reliability analysis results for arousal event detection. Top: ROC and precision-recall curves for the CAISR arousal event detection model, overall and stratified by sleep stage. Bottom: Barplot summarizing the median Kappa agreement between rater pairs across different arousal classes per dataset. 95% confidence intervals were computed using 10,000 bootstrap samples. Non-overlapping CIs indicate superiority of either the AI model or the expert, while overlapping CIs suggest non-inferiority of the AI model compared to the expert.

**Figure 5.**
Inter-rater reliability analysis results for respiratory event detection. Top: ROC and precision-recall curves for the CAISR respiratory event detection model per respiratory class (obstructive apnea, central apnea, mixed apnea, hypopnea, RERA). Bottom: Barplot summarizing the median Kappa agreement between rater pairs across different apnea classes per dataset. 95% confidence intervals were computed using 10,000 bootstrap samples. Non-overlapping CIs indicate superiority of either the AI model or the expert, while overlapping CIs suggest non-inferiority of the AI model compared to the expert.

**Figure 6.**
Summary results for limb movement detection. Top: ROC and precision-recall curves for the CAISR limb movement event detection model per limb movement class (limb movement and no-limb movement). Bottom: Barplot summarizing the median Kappa agreement between rater pairs across different limb movement classes per dataset. 95% confidence intervals were computed using 10,000 bootstrap samples. Non-overlapping CIs indicate superiority of either the AI model or the expert, while overlapping CIs suggest non-inferiority of the AI model compared to the expert.

Comment in

Cesari M, Brink-Kjaer A, Rechichi I. An important step toward automation of polysomnography analyses. Sleep. 2025 Aug 14;48(8):zsaf147. doi: 10.1093/sleep/zsaf147. PMID: 40577794.
