J Clin Sleep Med. 2021 Jun 1;17(6):1237-1247. doi: 10.5664/jcsm.9174.

Interrater sleep stage scoring reliability between manual scoring from two European sleep centers and automatic scoring performed by the artificial intelligence-based Stanford-STAGES algorithm


Matteo Cesari et al. J Clin Sleep Med.

Abstract

Study objectives: The objective of this study was to evaluate interrater reliability between manual sleep stage scoring performed in 2 European sleep centers and automatic sleep stage scoring performed by the previously validated artificial intelligence-based Stanford-STAGES algorithm.

Methods: Full night polysomnographies of 1,066 participants were included. Sleep stages were manually scored in Berlin and Innsbruck sleep centers and automatically scored with the Stanford-STAGES algorithm. For each participant, we compared (1) Innsbruck to Berlin scorings (INN vs BER); (2) Innsbruck to automatic scorings (INN vs AUTO); (3) Berlin to automatic scorings (BER vs AUTO); (4) epochs where scorers from Innsbruck and Berlin had consensus to automatic scoring (CONS vs AUTO); and (5) both Innsbruck and Berlin manual scorings (MAN) to the automatic ones (MAN vs AUTO). Interrater reliability was evaluated with several measures, including overall and sleep stage-specific Cohen's κ.
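The agreement measures used above can be sketched as follows. This is an illustrative, stdlib-only Python implementation of overall and stage-specific Cohen's κ from two scorers' epoch-by-epoch labels; the function names and the one-vs-rest collapsing for stage-specific κ are assumptions for illustration, not the authors' analysis code.

```python
def cohen_kappa(scorer_a, scorer_b):
    """Cohen's kappa between two equal-length sequences of stage labels."""
    assert len(scorer_a) == len(scorer_b) and scorer_a
    n = len(scorer_a)
    labels = sorted(set(scorer_a) | set(scorer_b))
    # Observed agreement: fraction of epochs scored identically.
    p_o = sum(a == b for a, b in zip(scorer_a, scorer_b)) / n
    # Chance agreement from each scorer's marginal stage frequencies.
    p_e = sum((scorer_a.count(l) / n) * (scorer_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

def stage_kappa(scorer_a, scorer_b, stage):
    """Stage-specific kappa: collapse labels to `stage` vs 'other'."""
    to_binary = lambda seq: [l if l == stage else "other" for l in seq]
    return cohen_kappa(to_binary(scorer_a), to_binary(scorer_b))
```

With four epochs scored ["W", "W", "N2", "N2"] by one scorer and ["W", "N2", "W", "N2"] by the other, observed agreement (0.5) equals chance agreement (0.5), so κ = 0, while identical scorings give κ = 1.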

Results: Overall agreement across participants was substantial for INN vs BER (κ = 0.66 ± 0.13), INN vs AUTO (κ = 0.68 ± 0.14), CONS vs AUTO (κ = 0.73 ± 0.14), and MAN vs AUTO (κ = 0.61 ± 0.14), and moderate for BER vs AUTO (κ = 0.55 ± 0.15). Human scorers had the highest disagreement for N1 sleep (κN1 = 0.40 ± 0.16 for INN vs BER). Automatic scoring had lowest agreement with manual scorings for N1 and N3 sleep (κN1 = 0.25 ± 0.14 and κN3 = 0.42 ± 0.32 for MAN vs AUTO).

Conclusions: Interrater reliability for sleep stage scoring between human scorers was in line with previous findings, and the algorithm achieved overall substantial agreement with manual scoring. In this cohort, the Stanford-STAGES algorithm showed performance similar to that achieved in the original study, suggesting that it generalizes to new cohorts. Before its integration into clinical practice, future independent studies should further evaluate it in other cohorts.

Keywords: automatic scoring; computerized analysis; deep neural networks; interrater variability; slow wave activity; study of health in Pomerania.


Conflict of interest statement

All authors have seen this manuscript and approved its submission. Work for this study was performed at the Department of Neurology, Medical University of Innsbruck. Study of Health in Pomerania is part of the Community Medicine Research Network of the University Medicine Greifswald, which is supported by the German Federal State of Mecklenburg-West Pomerania. Polysomnography assessment was in part supported by the German RLS organization (Deutsche Restless Legs Vereinigung). The authors report no conflicts of interest.

Figures

Figure 1. Schematic overview of the automatic sleep stage scoring with the Stanford-STAGES algorithm.
From the EDF files, the C3A2, C4A1, O2A1, and O1A2 electroencephalographic channels were extracted, as well as the electromyographic chin channel and the left and right electrooculographic channels. The algorithm automatically selected which of the 2 central and occipital channels to use. Then, the signals were resampled at 100 Hz, filtered between 0.2 and 49 Hz, and encoded with cross-correlation. The encoded signals were given as input to the deep neural network. For each 15-second segment, the network returned the probabilities that the segment was wakefulness (p(W)), N1 sleep (p(N1)), N2 sleep (p(N2)), N3 sleep (p(N3)), or rapid eye movement sleep (p(REM)). The figure reports an example epoch for which the obtained probability values are shown. For each 30-second sleep epoch, the probabilities of the two 15-second segments were averaged, yielding one probability vector per sleep epoch. The hypnodensity is the graphical representation of these sleep stage probabilities across epochs. From the hypnodensity, the hypnogram was built by scoring each sleep epoch as the sleep stage with the highest probability. EDF = European data format; EOGL = electrooculogram left; EOGR = electrooculogram right; N1 = non-REM stage 1 sleep; N2 = non-REM stage 2 sleep; N3 = non-REM stage 3 sleep; REM = rapid eye movement sleep; W = wakefulness.
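The last two steps of this pipeline (averaging the two 15-second segment probabilities per 30-second epoch, then assigning the most probable stage) can be sketched as below. The function names and the list-of-lists data layout are illustrative assumptions, not the actual Stanford-STAGES code.

```python
STAGES = ["W", "N1", "N2", "N3", "REM"]

def hypnodensity(segment_probs):
    """Average consecutive pairs of 15-s segment probability vectors
    into one probability vector per 30-s sleep epoch."""
    epochs = []
    for i in range(0, len(segment_probs) - 1, 2):
        first, second = segment_probs[i], segment_probs[i + 1]
        epochs.append([(p + q) / 2 for p, q in zip(first, second)])
    return epochs

def hypnogram(epoch_probs):
    """Score each epoch as the stage with the highest probability."""
    return [STAGES[max(range(len(STAGES)), key=probs.__getitem__)]
            for probs in epoch_probs]
```

For example, two 15-second segments with N2 probabilities 0.6 and 0.7 average to an epoch dominated by N2, which the argmax step then scores as N2.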
Figure 2. Visual comparison of hypnograms and hypnodensity for the same polysomnographic recording.
(A) Manual hypnogram scored in Innsbruck. (B) Manual hypnogram scored in Berlin. (C) Hypnogram obtained by applying the automatic Stanford-STAGES algorithm. For each epoch, the sleep stage assigned was the one having the highest probability in the hypnodensity (D). Color codes for probabilities in the hypnodensity: white, W; red, N1; light blue, N2; dark blue, N3; black, REM. W = wakefulness, REM = rapid eye movement sleep, N1 = non-REM stage 1 sleep, N2 = non-REM stage 2 sleep, N3 = non-REM stage 3 sleep.
Figure 3. Hypnograms and relative confusion matrix.
(Left) Hypnograms for the same PSG recording scored by human scorers in Innsbruck (A) and Berlin (B) are shown. The confusion matrix (C) reports on the diagonal the number of epochs on which the scorers agreed and, off the diagonal, the number and type of disagreements (eg, the element in {row 1, column 2} indicates that 10 epochs were scored as W in the INN hypnogram but as N1 in the BER hypnogram). BER = Berlin, INN = Innsbruck, PSG = polysomnography, W = wakefulness, REM = rapid eye movement sleep, N1 = non-REM stage 1 sleep, N2 = non-REM stage 2 sleep, N3 = non-REM stage 3 sleep.
Figure 4. Row-wise normalized confusion matrices across all participants.
The values are shown as mean and standard deviation across the participants. For each matrix element, a darker color represents a higher agreement. (A) INN vs BER: comparison of manual hypnograms scored in Innsbruck and Berlin. (B) INN vs AUTO: comparison of manual hypnograms scored in Innsbruck to the automatic ones. (C) BER vs AUTO: comparison of manual hypnograms scored in Berlin to the automatic ones. (D) CONS vs AUTO: comparison of the epochs where manual scorers from Innsbruck and Berlin were in consensus to the respective epochs automatically scored. (E) MAN vs AUTO: comparison of both manual hypnograms to the automatic one (in case of disagreement between manual scorers, an epoch was equally weighted between the 2 manually scored stages). As an example of how to interpret these row-wise confusion matrices, in A, the element in {row 1, column 1} indicates that 81 ± 16% of the epochs scored as W in Innsbruck were also scored as W in Berlin. Similarly, the element in {row 1, column 2} indicates that 16 ± 15% of the epochs scored as W in Innsbruck were scored as N1 in Berlin. W = wakefulness, REM = rapid eye movement sleep, N1 = non-REM stage 1 sleep, N2 = non-REM stage 2 sleep, N3 = non-REM stage 3 sleep.
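A row-wise normalized confusion matrix of the kind shown in this figure can be sketched as follows for a single recording; the function name and label handling are illustrative assumptions, not the authors' code.

```python
def row_normalized_cm(reference, other, labels):
    """Confusion matrix where element {row i, column j} is the fraction of
    epochs labeled labels[i] by `reference` that `other` labeled labels[j]."""
    index = {label: k for k, label in enumerate(labels)}
    n = len(labels)
    counts = [[0] * n for _ in range(n)]
    for r, o in zip(reference, other):
        counts[index[r]][index[o]] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # Each row sums to 1 (or stays all-zero if the stage never occurs).
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized
```

For instance, if the reference scorer marks two epochs as W and the other scorer agrees on one of them, row W reads [0.5, 0.5, ...], matching the percentage reading described in the caption.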
Figure 5. Example of sleep epochs where manual scorers agreed to score N3 sleep.
(A) Epoch correctly scored manually in both centers as N3 sleep (slow wave activity covers 24% of the epoch) but scored as N2 by the algorithm. (B) Epoch wrongly scored as N3 by the human scorers in the 2 sleep centers (17% of the epoch contains slow wave activity) but correctly scored as N2 by the algorithm. (C) Epoch correctly scored by both the algorithm and the human scorers as N3 sleep (42% of the epoch has slow wave activity). For each electroencephalographic channel, the red lines are drawn at −37.5 and +37.5 µV to highlight the 75-µV peak-to-peak amplitude criterion for slow waves. N2 = non–rapid eye movement (NREM) stage 2 sleep, N3 = NREM stage 3 sleep.
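Under the AASM rules, N3 is scored when slow wave activity occupies at least 20% of the 30-second epoch, which is why the 24% epoch qualifies as N3 and the 17% epoch does not. The coverage check can be sketched as below; the function and argument names are hypothetical, and slow-wave detection itself is assumed to have happened upstream.

```python
def n3_by_coverage(slow_wave_durations_s, epoch_len_s=30.0):
    """Return (is_n3, coverage): is_n3 is True when the detected slow waves
    cover at least 20% of the epoch, per the AASM N3 coverage rule."""
    coverage = sum(slow_wave_durations_s) / epoch_len_s
    return coverage >= 0.20, coverage
```

With slow waves totaling 7.2 s in a 30-second epoch, coverage is 24% and the epoch meets the N3 criterion; 5.1 s gives 17% and does not.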

