2022 May 24;9(5):211189.
doi: 10.1098/rsos.211189. eCollection 2022 May.

Human-supervised clustering of multidimensional data using crowdsourcing

Alexander Butyaev et al. R Soc Open Sci.

Abstract

Clustering is a central task in many data analysis applications. However, there is no universally accepted metric for deciding whether clusters are present. Ultimately, we have to resort to a consensus between experts. The problem is amplified with high-dimensional datasets, where classical distances become uninformative and the ability of humans to fully apprehend the distribution of the data is challenged. In this paper, we design a mobile human-computing game as a tool to query human perception for the multidimensional data clustering problem. We propose two clustering algorithms that partially or entirely rely on aggregated human answers, and we report the results of two experiments conducted on synthetic and real-world datasets. We show that our methods perform on par with or better than the most popular automated clustering algorithms. Our results suggest that hybrid systems leveraging annotations of partial datasets collected through crowdsourcing platforms can be an efficient strategy to capture collective wisdom for solving abstract computational problems.

Keywords: crowdsourcing; data clustering; games; human-computing.
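
The abstract does not spell out how human answers are aggregated, so the following is only a minimal sketch of one plausible way to turn many player selections into clusters. It is not the paper's hubCLIQUE or CloCworks algorithm. It assumes each answer is a set of point indices that one player grouped together, counts how often each pair of points was selected together, and merges pairs that reach a vote threshold. The names `consensus_clusters`, `selections`, and `min_votes` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def consensus_clusters(selections, n_points, min_votes=2):
    """Group points that at least `min_votes` players selected together.

    Illustrative assumption, not the method from the paper:
    `selections` is an iterable of sets of point indices in range(n_points).
    """
    # Count, for every pair of points, how many selections contain both.
    pair_votes = Counter()
    for selection in selections:
        for a, b in combinations(sorted(set(selection)), 2):
            pair_votes[(a, b)] += 1

    # Union-find over points; pairs with enough votes are merged.
    parent = list(range(n_points))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), votes in pair_votes.items():
        if votes >= min_votes:
            parent[find(a)] = find(b)

    # Collect the connected components as sorted lists of point indices.
    groups = {}
    for point in range(n_points):
        groups.setdefault(find(point), []).append(point)
    return sorted(groups.values())


# Six points, six hypothetical player answers.
answers = [{0, 1, 2}, {0, 1}, {1, 2}, {3, 4}, {3, 4, 5}, {2, 3}]
print(consensus_clusters(answers, n_points=6, min_votes=2))
```

Raising `min_votes` demands stronger agreement between players before two points are merged, which in this sketch is the only guard against a single noisy answer bridging two groups.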

Figures

Figure 1.
Illustration of multiple scenes of Colony B. (a) Clustering panel with stage/total scores and puzzle progress, (b) end-game screen showing the progress towards a new badge discovery, (c) list of available thematic badges, (d) educational information related to a badge, and (e) leaderboard with multiple leagues.
Figure 2.
Game flow showing the first two stages and the process of puzzle solving. The player: (a) is presented with initial stage data; (b) makes a selection of the most representative group of dots on the mobile screen; (c) receives feedback from the game (score); and (d) is presented with the next stage data. It also shows in blue the points selected by the player on the previous stage. The player selects a group of dots for this stage and (e) receives feedback from the game.
Figure 3.
Performance comparison of the human-based algorithms with automated clustering approaches applied to the synthetic dataset for (a) all clusters; (b) low-dimensional clusters; and (c) high-dimensional clusters. The colour scheme is used to separate groups of algorithms: hubCLIQUE (red), automated CLIQUE (orange), CloCworks (purple), algorithms that require a known number of clusters/components (green), and algorithms that do not require it (cyan). For GMM, the average value over 1000 runs is reported. For algorithms requiring a known number of clusters, we report their performance with the number of clusters estimated by the elbow heuristic (four).
Figure 4.
Analysis of the performance of CloCworks applied to a series of Colony B-simulated networks using the stochastic block models (SBM) network generator. CloCworks was applied to every set of networks.
Figure 5.
Performance comparison of the human-based algorithms with automated clustering approaches applied to the voice recognition dataset (VRD). The figure notations and chosen colour scheme are described in figure 3.
