PLoS One. 2021 Nov 12;16(11):e0259266.

doi: 10.1371/journal.pone.0259266. eCollection 2021.

Adaptive kernel fuzzy clustering for missing data

Anny K G Rodrigues¹, Raydonal Ospina¹, Marcelo R P Ferreira²

Affiliations

¹ Departamento de Estatística, CASTLab, CCEN, Universidade Federal de Pernambuco, Cidade Universitária, Recife, PE, Brazil.
² Departamento de Estatística, DataLab, Centro de Ciências Exatas e da Natureza, Universidade Federal da Paraíba, João Pessoa, PB, Brazil.

PMID: 34767560
PMCID: PMC8589222
DOI: 10.1371/journal.pone.0259266

Anny K G Rodrigues et al. PLoS One. 2021.

Abstract

Many machine learning procedures, including clustering analysis, are often affected by missing values. This work proposes and evaluates a Kernel Fuzzy C-means clustering algorithm based on kernelization of the metric with local adaptive distances (VKFCM-K-LP) under three strategies for dealing with missing data. The first, the Whole Data Strategy (WDS), performs clustering only on the complete part of the dataset, i.e., it discards all instances with missing data. The second, the Partial Distance Strategy (PDS), computes partial distances over all available (observed) feature values and then rescales them by the reciprocal of the proportion of observed values. The third, the Optimal Completion Strategy (OCS), computes missing values iteratively as auxiliary variables in the optimization of a suitable objective function. The clustering results were evaluated according to different metrics. The best performance of the clustering algorithm was achieved under the PDS and OCS strategies. Under the OCS approach, new datasets were derived and the missing values were estimated dynamically in the optimization process. Clustering under the OCS strategy also outperformed applying the VKFCM-K-LP algorithm to a version of the data in which missing values had previously been imputed by the mean or the median of the observed values.
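The WDS and PDS ideas described in the abstract can be illustrated with a minimal sketch. It assumes a plain squared Euclidean distance rather than the kernelized adaptive distance used by VKFCM-K-LP, and it encodes missing values as `NaN`; the function names `whole_data_rows` and `partial_sq_distance` and the toy data are hypothetical and not taken from the paper.

```python
import math

def whole_data_rows(rows):
    """WDS sketch: keep only instances with no missing (NaN) values."""
    return [r for r in rows if not any(math.isnan(x) for x in r)]

def partial_sq_distance(x, v):
    """PDS sketch: squared Euclidean distance over the observed coordinates
    of x, rescaled by the reciprocal of the proportion of observed values.

    Assumption: plain Euclidean geometry, not the paper's kernelized metric.
    """
    observed = [j for j, xj in enumerate(x) if not math.isnan(xj)]
    if not observed:
        raise ValueError("instance has no observed values")
    partial = sum((x[j] - v[j]) ** 2 for j in observed)
    # proportion observed = len(observed) / len(x); multiply by its reciprocal
    return (len(x) / len(observed)) * partial

# Toy data (hypothetical): the second instance is missing its middle feature.
nan = float("nan")
X = [
    [1.0, 2.0, 3.0],
    [4.0, nan, 6.0],
    [7.0, 8.0, 9.0],
]
center = [4.0, 5.0, 5.0]  # a hypothetical cluster prototype

complete = whole_data_rows(X)                   # WDS discards the second instance
d_partial = partial_sq_distance(X[1], center)   # PDS still uses it
d_full = partial_sq_distance(X[0], center)      # no rescaling when fully observed
```

In this sketch WDS loses one of three instances outright, whereas PDS keeps the incomplete instance and inflates its two-coordinate distance by 3/2 so that it stays comparable with distances computed over all three coordinates.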

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Types of missing data patterns.**
(a) *Multivariate*. (b) *Monotone*. (c) *General*. (d) *File-matching*.

**Fig 2. Scatter plots and boxplots for the *Iris Plant* dataset.**
(a) Length. (b) Width.

**Fig 3. Visualizations of the patterns and frequencies of the missing values by variable for the *Iris Plant* dataset.**
(a) 5% *missing*. (b) 10% *missing*. (c) 15% *missing*. (d) 20% *missing*.

**Fig 4. Average error rates after 100 repetitions for the *Iris Plant* dataset.**

**Fig 5. Scatter plots and boxplots for the *Thyroid Gland* dataset.**
(a) TST. (b) TTS.

**Fig 6. Graphs of missing value patterns and frequencies per variable for the *Thyroid Gland* dataset.**
(a) 5% *missing*. (b) 10% *missing*. (c) 15% *missing*. (d) 20% *missing*.

**Fig 7. Average results of 100 repetitions for the error rate with *Thyroid Gland* dataset.**

**Fig 8. Performance graphs of the methods for different percentages of missing values.**
(a) *Iris Plant*. (b) *Thyroid Gland*.

**Fig 9. Principal component analysis applied to both datasets.**
(a) *Iris Plant*. (b) *Thyroid Gland*.

**Fig 10. Scatter plots and boxplots for the *Thyroid Gland* dataset considering the different imputation methods.**
(a) Imputation via OCS with 5% of missing values. (b) Mean imputation with 5% of missing values. (c) Median imputation with 5% of missing values. (d) Imputation via OCS with 15% of missing values. (e) Mean imputation with 15% of missing values. (f) Median imputation with 15% of missing values.
