Comparative Study
Med Phys. 2008 Oct;35(10):4404-9.
doi: 10.1118/1.2977766.

Binary and multi-category ratings in a laboratory observer performance study: a comparison

David Gur et al. Med Phys. 2008 Oct.

Abstract

The authors investigated radiologists' performance during retrospective interpretation of screening mammograms when using a binary decision (whether or not to recall a woman for additional procedures) and compared it with their receiver operating characteristic (ROC) type performance curves obtained using a semi-continuous rating scale. Under an Institutional Review Board-approved protocol, nine experienced radiologists independently rated an enriched set of 155 examinations that they had not personally read in the clinic, mixed with other enriched sets of examinations that they had individually read in the clinic, using both a screening BI-RADS rating scale (recall/not recall) and a semi-continuous ROC-type rating scale (0 to 100). The vertical distance between the empirical ROC curve and the binary operating point, namely the difference in sensitivity at the same specificity, was computed for each reader. The vertical distance averaged over all readers was used to assess the proximity of the performance levels under the binary and ROC-type rating scales. There does not appear to be any systematic tendency of the readers towards better performance with either of the two rating approaches: four readers performed better using the semi-continuous rating scale, four readers performed better with the binary scale, and one reader had the point exactly on the empirical ROC curve. Only one of the nine readers had a binary "operating point" that was statistically distant from the same reader's empirical ROC curve. Reader-specific differences ranged from -0.046 to 0.128, with an average width of the corresponding 95% confidence intervals of 0.2 and p-values for individual readers ranging from 0.050 to 0.966. On average, radiologists performed similarly when using the two rating scales, in that the average distance between each individual reader's binary operating point and their ROC curve was close to zero. The 95% confidence interval for the fixed-reader average (0.016) was (-0.0206, 0.0631) (two-sided p-value 0.35). In conclusion, the authors found that in retrospective observer performance studies the use of a binary response or a semi-continuous rating scale led to consistent results in terms of performance as measured by sensitivity-specificity operating points.
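The signed vertical distance described in the abstract can be illustrated with a minimal Python sketch for a single reader. It is not the authors' implementation: the function names are hypothetical, and the use of straight-line interpolation between empirical ROC points (taking the upper end of any vertical step) is an assumption, since the abstract does not say how the curve is evaluated between points. The confidence intervals and p-values reported above are not reproduced here.

```python
import numpy as np

def empirical_roc(ratings, truth):
    """Empirical ROC points (fpf, tpf), starting at (0, 0) and ending at (1, 1).

    ratings: semi-continuous scores (e.g. 0 to 100), higher = more suspicious.
    truth:   True for cases that are actually positive, False otherwise.
    """
    ratings = np.asarray(ratings, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    # Each distinct rating is a candidate threshold, from strictest to laxest.
    thresholds = np.unique(ratings)[::-1]
    tpf = [float((ratings[truth] >= t).mean()) for t in thresholds]
    fpf = [float((ratings[~truth] >= t).mean()) for t in thresholds]
    return np.array([0.0] + fpf), np.array([0.0] + tpf)

def roc_tpf_at(fpf0, fpf, tpf):
    """TPF of the empirical ROC curve at false positive fraction fpf0.

    Assumption: linear interpolation between neighbouring empirical points;
    on a vertical step the upper end of the step is used.
    """
    # Index of the last ROC point whose FPF does not exceed fpf0.
    i = int(np.searchsorted(fpf, fpf0, side="right")) - 1
    if fpf[i] == fpf0:
        return float(tpf[i])
    slope = (tpf[i + 1] - tpf[i]) / (fpf[i + 1] - fpf[i])
    return float(tpf[i] + slope * (fpf0 - fpf[i]))

def signed_vertical_distance(ratings, recalls, truth):
    """TPF0 minus the ROC curve's TPF at FPF0, for one reader.

    recalls: the reader's binary recall / no-recall decisions.
    A positive value means the binary operating point lies above the curve.
    """
    truth = np.asarray(truth, dtype=bool)
    recalls = np.asarray(recalls, dtype=bool)
    tpf0 = recalls[truth].mean()    # sensitivity of the binary decision
    fpf0 = recalls[~truth].mean()   # 1 - specificity of the binary decision
    fpf, tpf = empirical_roc(ratings, truth)
    return float(tpf0 - roc_tpf_at(fpf0, fpf, tpf))

# Made-up toy data: 4 positive and 4 negative cases for one reader.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
ratings = [90, 70, 40, 20, 60, 30, 10, 5]
recalls = [1, 1, 1, 0, 1, 0, 0, 0]
distance = signed_vertical_distance(ratings, recalls, truth)
```

In this toy example the binary decisions coincide with thresholding the ratings at 40, so the operating point falls on the empirical curve and the distance is zero. Averaging such per-reader distances gives the kind of summary the abstract reports as the fixed-reader average.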

Figures

Figure 1
Reader-specific empirical ROC curves (dashed curves), binary operating points (small dots), average binary operating point (large dot), and the empirical ROC curve for the pooled set of ratings. Each operating point (small dot) is connected to the corresponding point on the empirical ROC curve with a vertical line. The lengths of the vertical segments correspond to the absolute value of the distances ($\left|\mathrm{TPF}_0 - \mathrm{TPF}|_{\mathrm{FPF}_0}\right|$). A point being above the related ROC curve (the vertical segment extending downward from the point) results in a positive value of the signed distance ($\mathrm{TPF}_0 - \mathrm{TPF}|_{\mathrm{FPF}_0} > 0$). The solid curve depicts the empirical ROC curve for the pooled ratings of all readers. This curve is shown solely for illustration purposes and was not used in the actual analysis. The vertical distance between the average binary operating point and the pooled ROC curve is 0.0093 and the average of the vertical distances actually used in the analysis is 0.0159.
