Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy

Jonathan Krause et al. Ophthalmology. 2018 Aug;125(8):1264-1272. doi: 10.1016/j.ophtha.2018.01.034. Epub 2018 Mar 13.

Abstract

Purpose: To use adjudication to quantify errors in diabetic retinopathy (DR) grading by individual graders and by majority decision, and to train an improved automated algorithm for DR grading.

Design: Retrospective analysis.

Participants: Retinal fundus images from DR screening programs.

Methods: Images were each graded by the algorithm, U.S. board-certified ophthalmologists, and retinal specialists. The adjudicated consensus of the retinal specialists served as the reference standard.

Main outcome measures: Agreement between graders, and between graders and the algorithm, was measured with the quadratic-weighted kappa score. To compare the performance of different forms of manual grading and of the algorithm at various DR severity cutoffs (e.g., mild or worse DR, moderate or worse DR), we measured area under the curve (AUC), sensitivity, and specificity.
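The quadratic-weighted kappa is a standard agreement statistic for ordinal grades. As a minimal sketch (not the study's code), it can be computed with scikit-learn's cohen_kappa_score; the grade arrays below are hypothetical.

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point DR grades (0 = none ... 4 = proliferative)
# assigned to the same images by two graders.
grader_a = np.array([0, 1, 2, 2, 3, 0, 4, 1])
grader_b = np.array([0, 1, 1, 2, 3, 0, 3, 2])

# Quadratic weighting penalizes disagreements by the squared distance
# between grades, so a 1-step disagreement costs far less than a 3-step one.
kappa = cohen_kappa_score(grader_a, grader_b, weights="quadratic")
print(f"Quadratic-weighted kappa: {kappa:.3f}")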

Results: Of the 193 discrepancies between adjudication by retinal specialists and the majority decision of ophthalmologists, the most common were missed microaneurysms (MAs; 36%), artifacts (20%), and misclassified hemorrhages (16%). Relative to the reference standard, kappa ranged from 0.82 to 0.91 for individual retinal specialists and from 0.80 to 0.84 for individual ophthalmologists, and was 0.84 for the algorithm. For moderate or worse DR, the majority decision of ophthalmologists had a sensitivity of 0.838 and specificity of 0.981. The algorithm had a sensitivity of 0.971, specificity of 0.923, and AUC of 0.986. For mild or worse DR, the algorithm had a sensitivity of 0.970, specificity of 0.917, and AUC of 0.986. By using a small number of adjudicated consensus grades as a tuning dataset and higher-resolution images as input, the algorithm improved in AUC from 0.934 to 0.986 for moderate or worse DR.
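The sensitivity, specificity, and AUC figures above are evaluated after binarizing the 5-point scale at a severity cutoff. A minimal sketch of that evaluation, assuming scikit-learn and the conventional "grade >= 2" definition of moderate or worse DR; all grades, model scores, and the 0.5 operating threshold here are illustrative, not from the study.

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical reference-standard grades (0-4) and model probability
# scores for "moderate or worse DR" on the same images.
reference = np.array([0, 1, 2, 3, 0, 4, 1, 2])
model_score = np.array([0.1, 0.4, 0.8, 0.9, 0.2, 0.95, 0.3, 0.7])

# Binarize at the cutoff: moderate or worse DR means grade >= 2.
positive = reference >= 2
predicted = model_score >= 0.5  # illustrative operating threshold

tp = np.sum(predicted & positive)
tn = np.sum(~predicted & ~positive)
fp = np.sum(predicted & ~positive)
fn = np.sum(~predicted & positive)

sensitivity = tp / (tp + fn)  # fraction of truly positive images detected
specificity = tn / (tn + fp)  # fraction of truly negative images cleared
auc = roc_auc_score(positive, model_score)  # threshold-free summary

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} auc={auc:.3f}")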

Conclusions: Adjudication reduces errors in DR grading. A small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists.
