CAISR: achieving human-level performance in automated sleep analysis across all clinical sleep metrics
- PMID: 40554678
- PMCID: PMC12341759
- DOI: 10.1093/sleep/zsaf134
Abstract
Study objectives: To develop and validate a Complete Artificial Intelligence Sleep Report system (CAISR), a system for comprehensive automated sleep analysis, including sleep staging, arousal detection, apnea identification, and limb movement analysis.
Methods: We used a large, diverse dataset from four cohorts (MGH, MESA, MrOS, SHHS) comprising 25,749 participants to develop CAISR. Following American Academy of Sleep Medicine (AASM) guidelines, CAISR performs four tasks: it stages sleep into five categories (Wake, NREM 1, NREM 2, NREM 3, REM), detects arousals, detects and classifies breathing events (Obstructive Apnea, Central Apnea, Mixed Apnea, Hypopnea, and RERA), and detects limb movements and categorizes them as periodic or isolated. We tested CAISR against multiple datasets independently annotated by multiple experts, including UPenn (69 subjects, six experts), BITS (98 subjects, three experts), and Stanford (100 subjects, three experts). Sleep staging and arousal detection were accomplished using customized deep neural networks, while breathing event detection and classification and limb movement analysis were accomplished using rule-based signal processing approaches. We quantified CAISR performance with three metrics: Cohen's Kappa, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). To determine whether CAISR performed on par with human experts, we compared expert inter-rater reliability (IRR) with algorithm-expert IRR.
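As an illustration of the IRR comparison described above, the following is a minimal sketch of Cohen's Kappa computed between pairs of scorers and between an algorithm and each scorer. The function names (`cohen_kappa`, `mean_pairwise_kappa`, `mean_algorithm_kappa`) and the choice to average Kappa over pairs are assumptions made for illustration; the abstract does not state how the authors aggregated agreement across experts.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of epochs where both scorers agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each scorer's marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both scorers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(raters):
    """Expert-expert IRR: mean Kappa over all pairs of experts (assumed aggregation)."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(x, y) for x, y in pairs) / len(pairs)


def mean_algorithm_kappa(algorithm, raters):
    """Algorithm-expert IRR: mean Kappa between the algorithm and each expert."""
    return sum(cohen_kappa(algorithm, r) for r in raters) / len(raters)


# Hypothetical toy hypnograms (one label per epoch), not data from the study.
experts = [
    ["W", "N1", "N2", "N2", "N3", "R"],
    ["W", "N2", "N2", "N2", "N3", "R"],
    ["W", "N1", "N2", "N3", "N3", "R"],
]
algorithm = ["W", "N1", "N2", "N2", "N3", "R"]
print(mean_pairwise_kappa(experts), mean_algorithm_kappa(algorithm, experts))
```

Comparing the two printed values mirrors the question posed in the Methods: whether the algorithm agrees with the experts about as well as the experts agree with each other.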
Results: The CAISR model showed strong overall performance across the four tasks: sleep staging, arousal detection, apnea detection, and limb movement detection. In sleep staging, the model achieved AUROC values ranging from 0.82 to 0.97 and AUPRC values between 0.63 and 0.90 across the BITS, Stanford, and UPenn datasets, indicating high classification accuracy. The Kappa agreement analysis showed that in the BITS and Stanford datasets, CAISR outperformed human experts, with non-overlapping confidence intervals indicating superiority (Kappa values around 0.7 to 0.8 for CAISR vs. experts). In the UPenn dataset, the model's performance was comparable to experts, with overlapping confidence intervals suggesting non-inferiority. For arousal detection, the model maintained reliable performance, with AUROC values ranging from 0.83 to 0.94 and AUPRC values from 0.67 to 0.85, and Kappa analysis showing overlapping confidence intervals, indicating comparable performance to experts in both the BITS and Stanford datasets (Kappa values for CAISR around 0.6 to 0.75). In apnea detection, including the detection of obstructive, central, and mixed apnea, the CAISR model achieved competitive results in the BITS dataset with AUROC values between 0.81 and 0.95 and AUPRC values between 0.58 and 0.82, but in the Stanford dataset, it underperformed compared to human experts, as shown by non-overlapping confidence intervals and lower Kappa values (around 0.55 to 0.65). Finally, in limb movement detection, the model demonstrated superior performance in the BITS dataset, with AUROC values of 0.90 to 0.96 and AUPRC values between 0.75 and 0.85, and Kappa analysis indicating significantly higher reliability compared to experts (CAISR Kappa around 0.8, with non-overlapping confidence intervals). In the Stanford dataset, CAISR's performance was comparable to experts, with overlapping confidence intervals suggesting non-inferiority (Kappa values around 0.65 to 0.7).
Overall, the CAISR model consistently exhibited high classification performance and reliability across tasks, often matching or surpassing expert-level performance, with particularly strong results in sleep staging and limb detection.
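The Results interpret overlapping versus non-overlapping confidence intervals on Kappa. A minimal sketch of that comparison follows, assuming a percentile bootstrap over per-subject Kappa values; the resampling scheme, the helper names (`bootstrap_ci`, `intervals_overlap`), and the toy numbers are illustrative assumptions, since the abstract does not describe how the intervals were computed.

```python
import random


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-subject values."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample subjects with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo_idx = max(0, round((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return means[lo_idx], means[hi_idx]


def intervals_overlap(ci_a, ci_b):
    """True if two closed intervals (lo, hi) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical per-subject Kappa values, not data from the study.
algorithm_expert = [0.78, 0.74, 0.81, 0.69, 0.77, 0.80, 0.72, 0.76]
expert_expert = [0.70, 0.66, 0.73, 0.64, 0.71, 0.69, 0.62, 0.68]

ci_algorithm = bootstrap_ci(algorithm_expert)
ci_experts = bootstrap_ci(expert_expert)
print(ci_algorithm, ci_experts, intervals_overlap(ci_algorithm, ci_experts))
```

Under the reading used in the Results, overlapping intervals would be taken as comparable performance, and a non-overlapping interval that lies higher for the algorithm as superiority.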
Conclusions: The CAISR model demonstrated high classification accuracy and reliability across sleep staging, arousal, apnea, and limb movement detection tasks, matching or surpassing human expert performance. Human errors and systematic biases in the annotation of micro-events during sleep, such as arousals and apneas, likely contributed to variability in expert performance, while the CAISR model showed more consistent results, reducing the impact of these biases and increasing overall reliability across tasks.
Keywords: apnea detection; arousal detection; deep learning; few-shot learning; inter-rater reliability; limb movement; rule-based model; sleep staging; transfer learning.
© The Author(s) 2025. Published by Oxford University Press on behalf of Sleep Research Society. All rights reserved. For commercial re-use, please contact reprints@oup.com for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact journals.permissions@oup.com.
Comment in
- An important step toward automation of polysomnography analyses. Sleep. 2025 Aug 14;48(8):zsaf147. doi: 10.1093/sleep/zsaf147. PMID: 40577794.
Grants and funding
- U01 HL053916/HL/NHLBI NIH HHS/United States
- R01 NS131347/NS/NINDS NIH HHS/United States
- R01 AG073410/AG/NIA NIH HHS/United States
- U01 HL053938/HL/NHLBI NIH HHS/United States
- RF1 NS120947/NS/NINDS NIH HHS/United States
- R01 AG073598/AG/NIA NIH HHS/United States
- U01 HL053934/HL/NHLBI NIH HHS/United States
- R24 HL114473/HL/NHLBI NIH HHS/United States
- U01 HL053937/HL/NHLBI NIH HHS/United States
- U01 HL053931/HL/NHLBI NIH HHS/United States
- U01 HL063463/HL/NHLBI NIH HHS/United States
- R01 NS120947/NS/NINDS NIH HHS/United States
- U01 HL064360/HL/NHLBI NIH HHS/United States
- R01 HL161253/HL/NHLBI NIH HHS/United States
- U01 HL053941/HL/NHLBI NIH HHS/United States
- R01 NS130119/NS/NINDS NIH HHS/United States
- R01 NS126282/NS/NINDS NIH HHS/United States
- R01 NS107291/NS/NINDS NIH HHS/United States