Interrater sleep stage scoring reliability between manual scoring from two European sleep centers and automatic scoring performed by the artificial intelligence-based Stanford-STAGES algorithm
- PMID: 33599203
- PMCID: PMC8314654
- DOI: 10.5664/jcsm.9174
Abstract
Study objectives: To evaluate interrater reliability between manual sleep stage scoring performed in 2 European sleep centers and automatic sleep stage scoring performed by the previously validated, artificial intelligence-based Stanford-STAGES algorithm.
Methods: Full-night polysomnographies of 1,066 participants were included. Sleep stages were scored manually at the Berlin and Innsbruck sleep centers and automatically with the Stanford-STAGES algorithm. For each participant, we compared (1) Innsbruck to Berlin scorings (INN vs BER); (2) Innsbruck to automatic scorings (INN vs AUTO); (3) Berlin to automatic scorings (BER vs AUTO); (4) epochs on which the Innsbruck and Berlin scorers agreed to automatic scorings (CONS vs AUTO); and (5) both manual scorings (MAN) to the automatic ones (MAN vs AUTO). Interrater reliability was evaluated with several measures, including overall and sleep stage-specific Cohen's κ.
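The abstract does not include the computation itself; below is a minimal sketch of how overall and stage-specific Cohen's κ could be derived from two epoch-by-epoch hypnograms of the same recording, assuming 30-second AASM epochs and a one-vs-rest binarization for the stage-specific values. The function and variable names are hypothetical, not the authors' code.

```python
# Sketch (not the authors' code): overall and per-stage Cohen's kappa
# between two scorings of the same night, one stage label per 30-s epoch.
import numpy as np
from sklearn.metrics import cohen_kappa_score

STAGES = ["W", "N1", "N2", "N3", "REM"]  # AASM sleep stages

def agreement(hypnogram_a, hypnogram_b):
    """Return overall kappa and a dict of stage-specific kappas."""
    a = np.asarray(hypnogram_a)
    b = np.asarray(hypnogram_b)
    overall = cohen_kappa_score(a, b)
    # Stage-specific kappa: collapse each scoring to the binary question
    # "is this epoch stage s?" and compute kappa on the binarized labels.
    per_stage = {s: cohen_kappa_score(a == s, b == s) for s in STAGES}
    return overall, per_stage

# Hypothetical example: two scorers labeling eight epochs.
inn = ["W", "N1", "N2", "N2", "N3", "N3", "REM", "REM"]
ber = ["W", "N2", "N2", "N2", "N3", "N2", "REM", "REM"]
overall, per_stage = agreement(inn, ber)
print(f"overall kappa = {overall:.2f}")
for stage, k in per_stage.items():
    print(f"kappa_{stage} = {k:.2f}")
```

The one-vs-rest treatment is one common way to obtain stage-specific κ; the paper may define it differently.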
Results: Overall agreement across participants was substantial for INN vs BER (κ = 0.66 ± 0.13), INN vs AUTO (κ = 0.68 ± 0.14), CONS vs AUTO (κ = 0.73 ± 0.14), and MAN vs AUTO (κ = 0.61 ± 0.14), and moderate for BER vs AUTO (κ = 0.55 ± 0.15). Human scorers disagreed most on N1 sleep (κ_N1 = 0.40 ± 0.16 for INN vs BER). Automatic scoring had the lowest agreement with manual scoring for N1 and N3 sleep (κ_N1 = 0.25 ± 0.14 and κ_N3 = 0.42 ± 0.32 for MAN vs AUTO).
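For reference, Cohen's κ is the observed epoch-level agreement p_o corrected for the agreement p_e expected by chance:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

For example, p_o = 0.80 with p_e = 0.50 gives κ = (0.80 − 0.50)/(1 − 0.50) = 0.60. The abstract does not name the benchmark behind its "moderate" and "substantial" labels, but they match the widely used Landis and Koch scale (0.41–0.60 moderate, 0.61–0.80 substantial).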
Conclusions: Interrater reliability for sleep stage scoring between human scorers was in line with previous findings, and the algorithm achieved overall substantial agreement with manual scoring. In this cohort, the Stanford-STAGES algorithm showed performance similar to that achieved in the original study, suggesting that it generalizes to new cohorts. Before it is integrated into clinical practice, independent studies should further evaluate it in other cohorts.
Keywords: automatic scoring; computerized analysis; deep neural networks; interrater variability; slow wave activity; Study of Health in Pomerania.
© 2021 American Academy of Sleep Medicine.
Conflict of interest statement
All authors have seen this manuscript and approved its submission. Work for this study was performed at the Department of Neurology, Medical University of Innsbruck. The Study of Health in Pomerania is part of the Community Medicine Research Network of the University Medicine Greifswald, which is supported by the German Federal State of Mecklenburg-West Pomerania. Polysomnography assessment was in part supported by the German RLS organization (Deutsche Restless Legs Vereinigung). The authors report no conflicts of interest.
