Sleep. 2025 Aug 14;48(8):zsaf134.
doi: 10.1093/sleep/zsaf134.

CAISR: achieving human-level performance in automated sleep analysis across all clinical sleep metrics

Samaneh Nasiri et al. Sleep.

Abstract

Study objectives: To develop and validate a Complete Artificial Intelligence Sleep Report system (CAISR), a system for comprehensive automated sleep analysis, including sleep staging, arousal detection, apnea identification, and limb movement analysis.

Methods: We used a large, diverse dataset from four cohorts (MGH, MESA, MrOS, SHHS) comprising 25,749 participants to develop CAISR. Following American Academy of Sleep Medicine (AASM) guidelines, CAISR performs four tasks: it stages sleep into five categories (Wake, NREM 1, NREM 2, NREM 3, REM), detects arousals, detects and classifies breathing events (Obstructive Apnea, Central Apnea, Mixed Apnea, Hypopnea, and RERA), and detects limb movements and categorizes them as periodic or isolated. We tested CAISR against multiple datasets independently annotated by multiple experts: UPenn (69 subjects, six experts), BITS (98 subjects, three experts), and Stanford (100 subjects, three experts). Sleep staging and arousal detection were accomplished using customized deep neural networks, while breathing event detection and classification and limb movement analysis were accomplished using rule-based signal processing approaches. We quantified CAISR performance with three metrics: Cohen's Kappa, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). To determine whether CAISR performed on par with human experts, we compared expert inter-rater reliability (IRR) with algorithm-expert IRR.
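The agreement metric named above can be illustrated with a minimal sketch of Cohen's Kappa for two raters. The function name `cohen_kappa` and the ten epoch labels are hypothetical and chosen only for illustration; this is not the CAISR evaluation code, and the abstract does not specify how its Kappa values were computed in practice.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of epochs where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical sleep-stage labels for ten consecutive epochs (made-up data).
expert = ["W", "W", "N1", "N2", "N2", "N3", "N3", "REM", "REM", "N2"]
model = ["W", "N1", "N1", "N2", "N2", "N3", "N2", "REM", "REM", "N2"]
kappa = cohen_kappa(expert, model)
print(f"kappa = {kappa:.3f}")
```

In this toy example the raters agree on 8 of 10 epochs (observed agreement 0.80) and the chance agreement from the marginals is 0.22, so Kappa is 0.58 / 0.78. Applying the same function to an expert-expert pair and to an algorithm-expert pair gives the two quantities the comparison described above relies on.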

Results: CAISR showed strong overall performance across the four tasks: sleep staging, arousal detection, apnea detection, and limb movement detection. In sleep staging, the model achieved AUROC values ranging from 0.82 to 0.97 and AUPRC values between 0.63 and 0.90 across the BITS, Stanford, and UPenn datasets, indicating high classification accuracy. The Kappa agreement analysis showed that in the BITS and Stanford datasets, CAISR outperformed human experts, with non-overlapping confidence intervals indicating superiority (Kappa values around 0.7 to 0.8 for CAISR vs. experts). In the UPenn dataset, the model's performance was comparable to that of experts, with overlapping confidence intervals suggesting non-inferiority. For arousal detection, the model maintained reliable performance, with AUROC values ranging from 0.83 to 0.94 and AUPRC values from 0.67 to 0.85; Kappa analysis showed overlapping confidence intervals, indicating performance comparable to experts in both the BITS and Stanford datasets (Kappa values for CAISR around 0.6 to 0.75). In apnea detection, including the detection of obstructive, central, and mixed apnea, CAISR achieved competitive results in the BITS dataset, with AUROC values between 0.81 and 0.95 and AUPRC values between 0.58 and 0.82. In the Stanford dataset, however, it underperformed compared to human experts, as shown by non-overlapping confidence intervals and lower Kappa values (around 0.55 to 0.65). Finally, in limb movement detection, the model demonstrated superior performance in the BITS dataset, with AUROC values of 0.90 to 0.96 and AUPRC values between 0.75 and 0.85, and Kappa analysis indicating significantly higher reliability compared to experts (CAISR Kappa around 0.8, with non-overlapping confidence intervals). In the Stanford dataset, CAISR's performance was comparable to that of experts, with overlapping confidence intervals suggesting non-inferiority (Kappa values around 0.65 to 0.7).
Overall, the CAISR model consistently exhibited high classification performance and reliability across tasks, often matching or surpassing expert-level performance, with particularly strong results in sleep staging and limb movement detection.
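For readers less familiar with the two threshold-free metrics reported above, the following is a small sketch of how AUROC and one common AUPRC estimator (average precision) can be computed from binary labels and scores. The function names, the labels, and the scores are hypothetical; the abstract does not state which AUPRC estimator the authors used, so average precision is an assumption here, and it further assumes the scores contain no ties.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count pairwise wins; ties between a positive and a negative count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Average precision: mean of precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits

# Hypothetical per-epoch labels (1 = event present) and model scores (made-up data).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
roc_value = auroc(labels, scores)
ap_value = average_precision(labels, scores)
print(f"AUROC = {roc_value:.4f}, average precision = {ap_value:.4f}")
```

Here 11 of the 16 positive-negative pairs are ranked correctly, giving an AUROC of 0.6875. Unlike AUROC, average precision depends on the share of positive examples, which is one reason to report both for rare events.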

Conclusions: The CAISR model demonstrated high classification accuracy and reliability across sleep staging, arousal, apnea, and limb movement detection tasks, matching or surpassing human expert performance. Human errors and systematic biases in the annotation of micro-events during sleep, such as arousals and apneas, likely contributed to variability in expert performance, while the CAISR model showed more consistent results, reducing the impact of these biases and increasing overall reliability across tasks.

Keywords: apnea detection; arousal detection; deep learning; few-shot learning; inter-rater reliability; limb movement; rule-based model; sleep staging; transfer learning.


Figures

Figure 1.
Workflow of the Complete Artificial Intelligence Sleep Report System (CAISR). This figure illustrates the comprehensive operational process of CAISR. The system integrates various physiological sleep signals, including EEG, EOG, EMG, and respiratory signals, to perform sleep staging, arousal detection, apnea identification, and limb movement analysis. The flowchart highlights the data processing pipeline, starting from raw PSG data input, through preprocessing and feature extraction, to the application of deep neural networks and rule-based algorithms. CAISR’s predictions are then validated against expert (gold) and super-expert (platinum) annotations, ensuring robust performance across diverse datasets.
Figure 2.
Summary of CAISR results across all cohorts and tasks. (A) Kappa for CAISR sleep staging across all cohorts, showing the distribution for each subject within each cohort. Boxplots depict data distribution, with boxes extending from the first (Q1) to the third quartile (Q3), a median line, whiskers up to 1.5x the interquartile range (IQR), and fliers beyond the whiskers. (B) CAISR results across all cohorts, detailing performance in detecting arousal events. (C) CAISR results across all cohorts for detecting apnea events. (D) CAISR results across all cohorts for detecting limb movements. Bar plots show median ICC values with 95% confidence intervals, comparing CAISR performance against experts. The hatched bars represent results against the platinum labels.
Figure 3.
Inter-rater reliability analysis results for sleep staging. Top: ROC and precision-recall curves for the CAISR sleep staging model and six experts across different sleep stages (Wake, REM, N1, N2, N3). The CAISR model achieves high AUC values (0.82 to 0.97) and AUC-PR values (0.63 to 0.90), indicating robust performance. Bottom: Bar plot summarizing the median Kappa agreement between rater pairs across different sleep stages per dataset. 95% confidence intervals were computed using 10,000 bootstrap samples. Non-overlapping CIs indicate superiority of either the AI model or the expert, while overlapping CIs suggest non-inferiority of the AI model compared to the expert.
Figure 4.
Inter-rater reliability analysis results for arousal event detection. Top: ROC and precision-recall curves for the CAISR arousal event detection model, overall and stratified by sleep stage. Bottom: Bar plot summarizing the median Kappa agreement between rater pairs across different arousal classes per dataset. 95% confidence intervals were computed using 10,000 bootstrap samples. Non-overlapping CIs indicate superiority of either the AI model or the expert, while overlapping CIs suggest non-inferiority of the AI model compared to the expert.
Figure 5.
Inter-rater reliability analysis results for respiratory event detection. Top: ROC and precision-recall curves for the CAISR respiratory event detection model per respiratory class (obstructive apnea, central apnea, mixed apnea, hypopnea, RERA). Bottom: Bar plot summarizing the median Kappa agreement between rater pairs across different apnea classes per dataset. 95% confidence intervals were computed using 10,000 bootstrap samples. Non-overlapping CIs indicate superiority of either the AI model or the expert, while overlapping CIs suggest non-inferiority of the AI model compared to the expert.
Figure 6.
Summary of results for limb movement detection. Top: ROC and precision-recall curves for the CAISR limb movement event detection model per limb movement class (limb movement and no-limb movement). Bottom: Bar plot summarizing the median Kappa agreement between rater pairs across different limb movement classes per dataset. 95% confidence intervals were computed using 10,000 bootstrap samples. Non-overlapping CIs indicate superiority of either the AI model or the expert, while overlapping CIs suggest non-inferiority of the AI model compared to the expert.
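
The captions above describe 95% confidence intervals for the median Kappa computed from 10,000 bootstrap samples, with interval overlap used to judge superiority or non-inferiority. A minimal sketch of that procedure follows. It assumes a percentile bootstrap that resamples per-subject Kappa values with replacement, which the captions do not specify; the function names and the Kappa values are hypothetical.

```python
import random
import statistics

def bootstrap_median_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median (assumed method, not confirmed by the source)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample subjects with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def intervals_overlap(ci_a, ci_b):
    """True if two closed intervals (low, high) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical per-subject Kappa values (made-up data, not from the study).
model_vs_expert = [0.78, 0.81, 0.74, 0.83, 0.79, 0.76, 0.85, 0.80, 0.77, 0.82]
expert_vs_expert = [0.70, 0.66, 0.73, 0.68, 0.71, 0.64, 0.69, 0.72, 0.67, 0.65]

ci_model = bootstrap_median_ci(model_vs_expert)
ci_expert = bootstrap_median_ci(expert_vs_expert)
print("model-expert CI:", ci_model)
print("expert-expert CI:", ci_expert)
print("overlap:", intervals_overlap(ci_model, ci_expert))
```

Under the captions' reading rule, two intervals that do not overlap would be taken as superiority of whichever rater has the higher median, and overlapping intervals as non-inferiority of the model.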
