Cogn Res Princ Implic. 2024 May 20;9(1):31.
doi: 10.1186/s41235-024-00558-6.

Boosting wisdom of the crowd for medical image annotation using training performance and task features

Eeshan Hasan et al. Cogn Res Princ Implic.

Abstract

A crucial bottleneck in medical artificial intelligence (AI) is the availability of high-quality labeled medical datasets. In this paper, we test a large variety of wisdom of the crowd algorithms to label medical images that were initially classified by individuals recruited through an app-based platform. Individuals classified skin lesions from the International Skin Lesion Challenge 2018 into 7 different categories. There was a large dispersion in the geographical location, experience, training, and performance of the recruited individuals. We tested several wisdom of the crowd algorithms of varying complexity, from a simple unweighted average to more complex Bayesian models that account for individual patterns of errors. Using a switchboard analysis, we observe that the best-performing algorithms rely on selecting top performers, weighting decisions by training accuracy, and taking into account the task environment. These algorithms far exceed expert performance. We conclude by discussing the implications of these approaches for the development of medical AI.
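To make the three simplest ideas in the abstract concrete (unweighted voting, weighting by training accuracy, and selecting top performers), here is a minimal Python sketch. It is not the authors' implementation: the function names, the rater IDs, the labels `"mel"` and `"nv"`, and the choice to use raw training accuracy as the vote weight are all assumptions made for illustration. The Bayesian models mentioned in the abstract are not covered.

```python
from collections import Counter, defaultdict


def majority_vote(votes):
    """Unweighted vote. `votes` is a list of (rater_id, label) pairs.

    Ties go to the label that reached the top count first in input order.
    """
    counts = Counter(label for _, label in votes)
    return counts.most_common(1)[0][0]


def weighted_vote(votes, train_acc, default=1 / 7):
    """Each vote counts in proportion to the rater's training accuracy.

    Assumption: raters without a training score get chance-level
    weight (1/7 for a 7-class task).
    """
    scores = defaultdict(float)
    for rater, label in votes:
        scores[label] += train_acc.get(rater, default)
    return max(scores, key=scores.get)


def top_k_vote(votes, train_acc, k):
    """Keep only the k raters with the best training accuracy, then vote."""
    ranked = sorted(votes, key=lambda v: train_acc.get(v[0], 0.0), reverse=True)
    return majority_vote(ranked[:k])


# Hypothetical data: one accurate rater disagrees with two weaker ones.
train_acc = {"a": 0.90, "b": 0.40, "c": 0.35}
votes = [("a", "mel"), ("b", "nv"), ("c", "nv")]

print(majority_vote(votes))             # nv  (2 votes against 1)
print(weighted_vote(votes, train_acc))  # mel (0.90 against 0.40 + 0.35)
print(top_k_vote(votes, train_acc, 1))  # mel (only rater "a" is kept)
```

In this toy case the unweighted vote follows the two weaker raters, while both performance-aware variants follow the stronger one, which is the kind of effect the abstract attributes to the best-performing algorithms.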

Conflict of interest statement

Erik Duhaime is the CEO of and a stakeholder in Centaur Labs. Eeshan Hasan and Jennifer Trueblood do not hold any stakes in Centaur Labs and have no competing interests.

Figures

Fig. 1
The panels on the left and the middle show the distribution of mean accuracy of different individuals for the test and train images, respectively, across all images. The chance accuracy is calculated as 1/7 since there were 7 different classes. The panel on the right shows the relationship between the accuracy of an individual and the number of responses provided by the individual
Fig. 2
The top two panels show the confusion matrices when we pool the decisions from all individuals for the test and the train set, respectively. The bottom two panels show the confusion matrices for training for the individuals that provided the most and second most responses on the train set
Fig. 3
The performance of simple voting on the different metrics based on the size of the crowd. The left panel shows the accuracy and balanced accuracy metrics and the right panel shows the mean ROC-AUC and malignant ROC-AUC. The 95% bootstrapped confidence intervals are depicted as transparent bands around the line
Fig. 4
The results of the algorithms based on selecting the top individuals using their training performance. The 95% bootstrapped confidence intervals are depicted as transparent bands around the line
Fig. 5
Inter-algorithm disagreement rate for accuracy weighting algorithms
Fig. 6
Switchboard analysis of all the different algorithms
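
The captions for Figs. 3 and 4 mention 95% bootstrapped confidence intervals. One common way to obtain such an interval for an accuracy score is the percentile bootstrap, sketched below. This is an assumption about the general technique, not a description of the authors' exact procedure; `bootstrap_ci`, its parameters, and the example data are hypothetical.

```python
import random


def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean accuracy.

    `correct` is a list of 0/1 flags, one per image, marking whether the
    aggregated label matched the ground truth. Images are resampled with
    replacement n_boot times, and the alpha/2 and 1 - alpha/2 quantiles
    of the resampled accuracies are returned.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical outcome flags for 100 images, 80 of them labeled correctly.
flags = [1] * 80 + [0] * 20
low, high = bootstrap_ci(flags)
print(f"accuracy = {sum(flags) / len(flags):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Repeating this for each crowd size would give the kind of band drawn around each line in the figures.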

Cited by

  • Human-AI collectives most accurately diagnose clinical vignettes.
    Zöller N, Berger J, Lin I, Fu N, Komarneni J, Barabucci G, Laskowski K, Shia V, Harack B, Chu EA, Trianni V, Kurvers RHJM, Herzog SM. Proc Natl Acad Sci U S A. 2025 Jun 17;122(24):e2426153122. doi: 10.1073/pnas.2426153122. Epub 2025 Jun 13. PMID: 40512795.
