J Clin Sleep Med. 2016 Jun 15;12(6):885-94. doi: 10.5664/jcsm.5894.

Staging Sleep in Polysomnograms: Analysis of Inter-Scorer Variability

Magdy Younes et al. J Clin Sleep Med. 2016.

Abstract

Study objectives: To determine the reasons for inter-scorer variability in sleep staging of polysomnograms (PSGs).

Methods: Fifty-six PSGs were scored (5-stage sleep scoring) by 2 experienced technologists (first manual, M1). Months later, the technologists edited their own scoring (second manual, M2) based on feedback from the investigators that highlighted differences between their scores. The PSGs were then scored with an automatic system (Auto), and the technologists edited them epoch by epoch (Edited-Auto). This resulted in 6 different manual scores for each PSG. Epochs were classified as scorer errors (one M1 score differed from the other 5 scores), scorer bias (all 3 scores of each technologist were similar but differed from those of the other technologist), and equivocal (sleep scoring was inconsistent within and between technologists).
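
As an illustration of this classification, a minimal Python sketch (not the authors' software; the function name, argument names, and the exact tie-breaking rules are assumptions) could label each epoch from its six stage assignments, i.e., M1, M2, and Edited-Auto for each scorer:

def classify_epoch(s1_m1, s1_m2, s1_edit, s2_m1, s2_m2, s2_edit):
    """Label one 30-s epoch by its scoring pattern (illustrative sketch only)."""
    scorer1 = (s1_m1, s1_m2, s1_edit)
    scorer2 = (s2_m1, s2_m2, s2_edit)
    all_six = scorer1 + scorer2

    # Full agreement: all six scores are the same stage.
    if len(set(all_six)) == 1:
        return "agreement"

    # Scorer error: one M1 score differs while the remaining five scores agree.
    rest_if_s1 = {s1_m2, s1_edit, s2_m1, s2_m2, s2_edit}
    if len(rest_if_s1) == 1 and s1_m1 not in rest_if_s1:
        return "scorer_error"
    rest_if_s2 = {s1_m1, s1_m2, s1_edit, s2_m2, s2_edit}
    if len(rest_if_s2) == 1 and s2_m1 not in rest_if_s2:
        return "scorer_error"

    # Scorer bias: each technologist is internally consistent across M1, M2,
    # and Edited-Auto, but the two technologists disagree with each other.
    if len(set(scorer1)) == 1 and len(set(scorer2)) == 1 and scorer1[0] != scorer2[0]:
        return "scorer_bias"

    # Equivocal: scoring is inconsistent within and/or between technologists.
    return "equivocal"

# Hypothetical example: scorer 1 scored N1/N2/N2, scorer 2 scored N2/W/N2.
# classify_epoch("N1", "N2", "N2", "N2", "W", "N2") -> "equivocal"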

Results: Percent agreement after M1 was 78.9% ± 9.0% and was unchanged after M2 (78.1% ± 9.7%) despite numerous edits (≈40/PSG) by the scorers. Agreement in Edited-Auto was higher (86.5% ± 6.4%, p < 1E-9). Scorer errors (< 2% of epochs) and scorer bias (3.5% ± 2.3% of epochs) together accounted for < 20% of M1 disagreements. A large number of epochs (92 ± 44/PSG) with scoring agreement in M1 were subsequently changed in M2 and/or Edited-Auto. Equivocal epochs, which showed scoring inconsistency, accounted for 28% ± 12% of all epochs, and up to 76% of all epochs in individual patients. Disagreements were largely between awake/NREM, N1/N2, and N2/N3 sleep.
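
For context, the percent agreement reported above is simply the proportion of epochs assigned the same stage by both scorers in a given round; a minimal sketch (assuming plain lists of stage labels, not the authors' analysis code):

def percent_agreement(stages_scorer1, stages_scorer2):
    """Percent of epochs given the same stage by both scorers (illustrative)."""
    matches = sum(a == b for a, b in zip(stages_scorer1, stages_scorer2))
    return 100.0 * matches / len(stages_scorer1)

# Example with five hypothetical epochs:
# percent_agreement(["W", "N1", "N2", "N2", "N3"],
#                   ["W", "N2", "N2", "N2", "N3"])  -> 80.0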

Conclusion: Inter-scorer variability is largely due to epochs that are difficult to classify. Availability of digitally identified events (e.g., spindles) or calculated variables (e.g., depth of sleep, delta wave duration) during scoring may greatly reduce scoring variability.

Keywords: PSG; automated scoring; inter-observer variability; sleep stages.

Figures

Figure 1. Scoring patterns within the 4 manual and 2 post-edit scores in individual 30-s epochs.
The left 3 cells within each bar represent the stages assigned to the epoch by the first scorer in the first manual (S1M1), second manual (S1M2), and post-edit (S1Post) scoring rounds, respectively; the right 3 cells show the scores of the second scorer (S2) in the same sequence. Different shades indicate that different sleep stages were assigned but do not reflect any specific sleep stages.
Figure 2. Flow chart describing the evolution of different scoring patterns.
The numbers refer to the frequency of agreement/disagreement in an average PSG containing 732 epochs. Joined twin columns show the scoring of the 2 technologists at the different stages of the analysis. Similar shades in the twin columns indicate scoring agreement between the 2 technologists, while different shades indicate disagreement. Following the first manual scoring (M1), there were 149 epochs with disagreement and 583 epochs with agreement. The numbers above the columns in the Auto panels indicate the frequency with which the Auto score was similar to, or different from, the scores in one or both of the earlier manual scoring rounds. In the right Auto panel, a black Auto column reflects disagreement with the common manual score. In the left Auto panel, the gray column represents disagreement with both earlier manual scores, since black and white scores occurred in the manual phases. The frequency of each pattern is given in the first row of numbers below the Edited-Auto patterns. The last row of numbers gives the frequency of one scorer (left number) or both scorers (right number) altering the Auto score within each pattern.
Figure 3. Two examples of equivocal epochs.
C4/A1, C3/A2, and O1/A2 are electroencephalography electrodes (120 μV calibration bar common to all); EOG, electro-oculogram; S1 and S2, first and second scorers; ORP, odds ratio product. (A) In the first manual round, S1 scored the epoch as NREM sleep stage 1 (N1) while S2 scored it as N2. In the second manual round, S1 changed the stage to N2 while S2 changed the stage to awake (W). The automatic system (Auto) scored the epoch as awake, based primarily on a high average ORP (2.0). Neither scorer corrected the Auto score, resulting in a common score of W. It is difficult to determine whether the duration of the awake pattern in this epoch is more or less than 15 seconds; hence the difficulty of distinguishing W from NREM sleep. There is a brief period of high EEG frequency that may or may not be a spindle; hence the difficulty of distinguishing N1 from N2. (B) The EEG in this epoch could visually be either awake or asleep. Whether the epoch is scored W, N1, N2, or REM depends on whether one scores the eye movement as slow or rapid, whether the brief high-frequency bursts are considered spindles or brief beta bursts (subthreshold arousal), and whether the chin EMG is low or high for REM. All these features are questionable in this epoch; hence the 3 different stages assigned in the first 2 manual sessions. Auto scored the epoch as N2 because the ORP was closer to the definite-sleep level (average ORP 1.33), the eye movement was too slow, and the high-frequency events were confirmed as spindles. Nonetheless, both scorers overruled Auto even though N2 had been scored twice before manually.
Figure 4. Actions taken by technologists when they changed the Auto score of equivocal epochs.
In each case, only one technologist changed the Auto score while the other accepted it. The columns in the first and second rows show the scoring in the first and second manual stages (M1 and M2) of the same scorer who changed Auto. In patterns A to C, the technologist scored the same stage in both M1 and M2, while in D to F the 2 scores were different. In pattern A, the Auto score was different from the common manual score of this technologist (although in most cases it was similar to that of the other technologist). The technologist changed the Auto score to his/her earlier manual score, suggesting consistency. However, in an approximately equal number of epochs the change in scoring did not reflect consistency (patterns B to F).
Figure 5. Relationship between frequency of equivocal epochs in different polysomnograms and the percent agreement between the 2 scorers.
M1, first manual scoring.
