Crowdsourced Assessment of Surgical Skill Proficiency in Cataract Surgery

Grace L Paley et al. J Surg Educ. 2021 Jul-Aug;78(4):1077-1088. doi: 10.1016/j.jsurg.2021.02.004. Epub 2021 Feb 25.

Abstract

Objective: To test whether crowdsourced lay raters can accurately assess cataract surgical skills.

Design: Two-armed study: independent cross-sectional and longitudinal cohorts.

Setting: Washington University Department of Ophthalmology.

Participants and methods: Sixteen cataract surgeons with varying experience levels submitted cataract surgery videos to be graded by 5 experts and 300+ crowdworkers masked to surgeon experience. Cross-sectional study: 50 videos from surgeons ranging from first-year resident to attending physician, pooled by years of training. Longitudinal study: 28 videos obtained at regular intervals as residents progressed through 180 cases. Surgical skill was graded using the modified Objective Structured Assessment of Technical Skill (mOSATS). Main outcome measures were overall technical performance, reliability indices, and correlation between expert and crowd mean scores.

Results: Experts demonstrated high interrater reliability and accurately predicted training level, establishing construct validity for the modified OSATS. Crowd scores correlated with expert scores (r = 0.865, p < 0.0001) but were consistently higher for first-, second-, and third-year residents (p < 0.0001, paired t-test). Longer surgery duration correlated negatively with training level (r = -0.855, p < 0.0001) and with expert score (r = -0.927, p < 0.0001). The longitudinal dataset reproduced the cross-sectional findings for crowd and expert comparisons. A regression equation transforming crowd score plus video length into expert score was derived from the cross-sectional dataset (r2 = 0.92) and demonstrated excellent predictive performance when applied to the independent longitudinal dataset (r2 = 0.80). A group of student raters who had edited the cataract videos also graded them, producing scores that more closely approximated the experts' scores than did the crowd's.

Conclusions: Crowdsourced rankings correlated with expert scores, but were not equivalent; crowd scores overestimated technical competency, especially for novice surgeons. A novel approach of adjusting crowd scores with surgery duration generated a more accurate predictive model for surgical skill. More studies are needed before crowdsourcing can be reliably used for assessing surgical proficiency.

Keywords: Crowdsourcing; cataract surgery; phacoemulsification; surgical assessment; surgical competence.


Conflict of interest statement

Declarations of interest: none relevant to this study.

Figures

FIGURE 1.
Crowd rater mean scores correlate with but do not agree with expert scores, and the crowd significantly overestimates surgeon ability for all 3 years of residency training. For the cross-sectional study arm ⊗ (n = 50): A: Reliability of blinded expert raters and crowd raters: the intraclass correlation coefficient (ICC), a measure of rater reliability by agreement rather than correlation, is shown for the sum scores of the experts and crowdworkers from the cross-sectional study. When scores are averaged across each group (mean ICC), the crowd performs nearly as well as the experts, but individual crowdworkers perform poorly compared with individual experts (individual ICC). B: Expert mean sum scores accurately predicted surgeon level of training, establishing construct validity for the modified OSATS grading rubric (Pearson’s r = 0.860, p < 0.0001). C: Crowd mean sum score also correlates with surgeon level (r = 0.729, p < 0.0001), but not as well as the expert scores. The crowd used a narrow range of the grading scale weighted toward superior scores for all surgeons, with a correspondingly elevated group mean, whereas the experts (B) used the full scoring range with a group mean close to the mid-range (Average-Maximum-Minimum plots). D: Crowd and expert mean scores were highly correlated (r = 0.865, p < 0.0001) but failed to show good absolute agreement (perfect agreement would plot linearly along y = x). E: Crowd and expert mean scores for individual surgery videos show discordance, especially for videos given lower scores by the experts. F: Crowd mean scores were higher than expert scores for first-, second-, and third-year residents (p < 0.0001, paired t-test) and approached significance for the PGY5 fellows (p = 0.055, paired t-test). PGY: postgraduate year. Ophthalmology residency begins in PGY2 after a year of general internship training. The table lists group means for each level of surgeon experience. Error bars indicate standard deviation.
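The comparisons in this figure rest on two standard statistics: a Pearson correlation between crowd and expert mean scores (panel D) and a paired t-test asking whether the crowd scores the same videos higher than the experts (panel F). The following is a minimal sketch of those two computations using SciPy, assuming the per-video mean scores are available as paired lists; the variable names and numeric values are illustrative placeholders, not study data.

    # Minimal sketch of the statistics reported above, using SciPy.
    # The example scores below are placeholders, not study data.
    from scipy import stats

    expert_means = [10.2, 14.5, 18.1, 21.3, 24.0]  # hypothetical per-video expert mean sum scores
    crowd_means = [16.0, 18.2, 20.5, 22.1, 24.3]   # hypothetical per-video crowd mean sum scores

    # Correlation between crowd and expert means (as in panel D).
    r, p_corr = stats.pearsonr(expert_means, crowd_means)

    # Paired t-test: are crowd scores systematically higher than expert
    # scores for the same videos (as in panel F)?
    t, p_paired = stats.ttest_rel(crowd_means, expert_means)

    print(f"Pearson r = {r:.3f} (p = {p_corr:.4f}); paired t = {t:.3f} (p = {p_paired:.4f})")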
FIGURE 2.
Cross-sectional cohort demonstrates that surgery duration correlates with surgeon training level and expert score, and that adding surgery duration improves the accuracy with which crowd score approximates expert score. For the cross-sectional study arm ⊗ (n = 50): A: Longer surgery length (as defined by phacoemulsification duration) was strongly correlated with lower training level (r = −0.855, p < 0.001). B: Longer surgery length was strongly correlated with lower expert mean score (r = −0.927, p < 0.0001). C-D: A regression equation to convert crowd score plus surgery length into predicted expert score was derived from the cross-sectional data (Predicted Expert Mean = −11.18 − 0.018*video_length_in_seconds + 1.643*crowdscore). This equation generated a predicted score (green markers) that more closely approximated the actual expert score (blue markers) (r2 = 0.92) than did crowd score alone (red markers), as illustrated by (C) absolute values and (D) correlation plots (n = 50, r = 0.959, p < 0.0001).
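As a worked illustration of the regression above, the sketch below applies the reported coefficients to convert a crowd mean score and a phacoemulsification duration into a predicted expert score. Only the coefficients (−11.18, −0.018, 1.643) come from the figure legend; the function name and the example inputs are assumptions made for the illustration.

    # Minimal sketch: applying the regression reported in the figure legend.
    # Coefficients (-11.18, -0.018, 1.643) are from the cross-sectional fit;
    # the function name and the example inputs are illustrative only.

    def predicted_expert_mean(crowd_score: float, video_length_seconds: float) -> float:
        """Predicted expert mean sum score from crowd score and surgery length."""
        return -11.18 - 0.018 * video_length_seconds + 1.643 * crowd_score

    # Hypothetical example: a crowd mean score of 20 for a 600-second video.
    print(predicted_expert_mean(crowd_score=20.0, video_length_seconds=600.0))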
FIGURE 3.
Longitudinal cohort reproduces correlation of expert scores with crowd scores and with surgery duration, and validates predictive model for estimating expert score. For the longitudinal study arm Ⓛ (n = 28): A: Crowd mean scores correlated with expert scores (r = 0.792, p < 0.0001). B: Resident physicians gain surgical experience with increasing case number, as measured by the blinded expert assessments. Similar to the cross-sectional study, crowd scores do not agree with expert scores and demonstrate significant over-estimation of skill for beginner-intermediate surgeons as compared to expert scores (30th-120th cases: p < 0.05, paired t-test) leading to higher averaged mean scores and a constricted grading range as compared to expert scores. Table lists group means for each level of surgeon experience. Error bars indicate standard deviation. C: Surgery length (as defined by phacoemulsification duration) inversely correlates with resident experience (r = −0.827, p < 0.0001). Hashed lines connect time points separated by missing data points to show overall trends. D: Longer surgery duration (mean of residents) was strongly correlated with lower expert mean sum score (r = −0.845, p < 0.0001). E-F: When applied to the independent longitudinal data set, the regression equation derived from the cross-sectional data set yields a predicted score that closely approximates the expert score (r2 = 0.80) as illustrated by (E) absolute values and (F) correlation plots (n = 28, r = 0.896, p < 0.0001). Hashed lines connect time points separated by missing data points to show overall trends.

