Cogn Res Princ Implic. 2024 May 20;9(1):31.

doi: 10.1186/s41235-024-00558-6.

Boosting wisdom of the crowd for medical image annotation using training performance and task features

Eeshan Hasan¹ ², Erik Duhaime³, Jennifer S Trueblood⁴ ⁵

Affiliations

¹ Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th St., Bloomington, IN, 47405-7007, USA. eehasan@iu.edu.
² Cognitive Science Program, Indiana University, Bloomington, USA. eehasan@iu.edu.
³ Centaur Labs, Boston, USA.
⁴ Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th St., Bloomington, IN, 47405-7007, USA. jstruebl@iu.edu.
⁵ Cognitive Science Program, Indiana University, Bloomington, USA. jstruebl@iu.edu.

PMID: 38763994
PMCID: PMC11102897
DOI: 10.1186/s41235-024-00558-6

Eeshan Hasan et al. Cogn Res Princ Implic. 2024.

Abstract

A crucial bottleneck in medical artificial intelligence (AI) is the scarcity of high-quality labeled medical datasets. In this paper, we test a wide variety of wisdom-of-the-crowd algorithms for labeling medical images that were initially classified by individuals recruited through an app-based platform. Individuals classified skin lesions from the International Skin Lesion Challenge 2018 into 7 different categories. The recruited individuals varied widely in geographical location, experience, training, and performance. We tested several wisdom-of-the-crowd algorithms of varying complexity, from a simple unweighted average to more complex Bayesian models that account for individual patterns of errors. Using a switchboard analysis, we observe that the best-performing algorithms rely on selecting top performers, weighting decisions by training accuracy, and taking into account the task environment. These algorithms far exceed expert performance. We conclude by discussing the implications of these approaches for the development of medical AI.
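As a rough illustration of the simpler end of the algorithm family named in the abstract, the sketch below aggregates the labels given to one image in three ways: an unweighted vote, a vote weighted by each rater's training accuracy, and a vote restricted to the top-k raters by training accuracy. It is not the authors' implementation. The function names, the tie-breaking rule, and the example data are all assumptions made for illustration, and the Bayesian models are not covered.

```python
from collections import defaultdict

def weighted_vote(votes, weights=None):
    """Return the label with the largest total weight.

    votes:   dict mapping rater id -> label chosen for one image
    weights: optional dict mapping rater id -> non-negative weight;
             when omitted, every rater counts equally (simple voting).
    """
    totals = defaultdict(float)
    for rater, label in votes.items():
        totals[label] += 1.0 if weights is None else weights.get(rater, 0.0)
    # Assumption: ties are broken alphabetically so results are deterministic.
    return max(sorted(totals), key=lambda label: totals[label])

def aggregate(votes, train_acc, k=None, weight_by_accuracy=False):
    """Combine one image's votes using training accuracy (hypothetical helper).

    k:                  if given, keep only the k raters with the highest
                        training accuracy among those who labeled the image
    weight_by_accuracy: if True, weight each kept vote by training accuracy
    """
    ranked = sorted(votes, key=lambda r: (-train_acc.get(r, 0.0), r))
    kept = ranked if k is None else ranked[:k]
    kept_votes = {r: votes[r] for r in kept}
    weights = None
    if weight_by_accuracy:
        weights = {r: train_acc.get(r, 0.0) for r in kept}
    return weighted_vote(kept_votes, weights)

# Made-up example: two weaker raters disagree with one strong rater.
votes = {"rater_a": "nevus", "rater_b": "nevus", "rater_c": "melanoma"}
train_acc = {"rater_a": 0.4, "rater_b": 0.3, "rater_c": 0.9}

print(aggregate(votes, train_acc))                           # simple voting
print(aggregate(votes, train_acc, weight_by_accuracy=True))  # accuracy weighting
print(aggregate(votes, train_acc, k=1))                      # top-1 selection
```

In this toy case the simple vote follows the majority, while both the weighted vote and the top-1 selection follow the rater with the highest training accuracy.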

Conflict of interest statement

Erik Duhaime is the CEO of and a stakeholder in Centaur Labs. Eeshan Hasan and Jennifer Trueblood do not hold any stakes in Centaur Labs and have no competing interests.

Figures

**Fig. 1**
The panels on the left and the middle show the distribution of mean accuracy of different individuals for the test and train images, respectively, across all images. The chance accuracy is calculated as 1/7 since there were 7 different classes. The panel on the right shows the relationship between the accuracy of an individual and the number of responses provided by the individual

**Fig. 2**
The top two panels show the confusion matrices when we pool the decisions from all individuals for the test and the train set, respectively. The bottom two panels show the confusion matrices for training for the individuals that provided the most and second most responses on the train set

**Fig. 3**
The performance of simple voting on the different metrics based on the size of the crowd. The left panel shows the accuracy and balanced accuracy metrics and the right panel shows the mean ROC-AUC and malignant ROC-AUC. The 95% bootstrapped confidence intervals are depicted as transparent bands around the line
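The metrics and confidence bands named in this caption can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes balanced accuracy means the mean of per-class recall, that the bootstrap resamples images with replacement, and that the interval is a simple percentile interval. The caption does not state what was resampled, and ROC-AUC is omitted. All function and parameter names are hypothetical.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, counts = {}, {}
    for t, p in zip(y_true, y_pred):
        counts[t] = counts.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric (assumed procedure).

    Resamples items with replacement n_boot times and returns the
    alpha/2 and 1 - alpha/2 percentiles of the resampled metric values.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Made-up labels: the rare class is always missed.
y_true = ["nevus", "nevus", "nevus", "melanoma"]
y_pred = ["nevus", "nevus", "nevus", "nevus"]
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
print(bootstrap_ci(y_true, y_pred, accuracy))
```

The toy labels show why both metrics are reported: plain accuracy rewards always predicting the common class, while balanced accuracy penalizes missing the rare one.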

**Fig. 4**
The results of the algorithms based on selecting the top individuals using their training performance. The 95% bootstrapped confidence intervals are depicted as transparent bands around the line

**Fig. 5**
Inter-algorithm disagreement rate for accuracy weighting algorithms
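The caption does not define the disagreement rate. One plausible reading, sketched below, is the fraction of images on which two algorithms output different labels, computed for every pair of algorithms. The definition, the function names, and the example outputs are assumptions for illustration only.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of images on which two algorithms give different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both algorithms must label the same images")
    if not labels_a:
        raise ValueError("at least one image is required")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def disagreement_matrix(outputs):
    """Pairwise disagreement rates.

    outputs: dict mapping algorithm name -> list of labels, one per image,
             with all lists in the same image order.
    Returns a dict keyed by (name_1, name_2).
    """
    names = sorted(outputs)
    return {(m, n): disagreement_rate(outputs[m], outputs[n])
            for m in names for n in names}

# Made-up outputs from two hypothetical weighting schemes on four images.
outputs = {
    "linear_weights": ["nevus", "melanoma", "nevus", "nevus"],
    "squared_weights": ["nevus", "melanoma", "melanoma", "nevus"],
}
print(disagreement_matrix(outputs)[("linear_weights", "squared_weights")])
```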

**Fig. 6**
Switchboard analysis of all the different algorithms

Cited by

Human-AI collectives most accurately diagnose clinical vignettes.
Zöller N, Berger J, Lin I, Fu N, Komarneni J, Barabucci G, Laskowski K, Shia V, Harack B, Chu EA, Trianni V, Kurvers RHJM, Herzog SM. Proc Natl Acad Sci U S A. 2025 Jun 17;122(24):e2426153122. doi: 10.1073/pnas.2426153122. Epub 2025 Jun 13. PMID: 40512795.
