Sensors (Basel). 2023 Jan 20;23(3):1219.
doi: 10.3390/s23031219.

Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis

Diogo Azevedo et al. Sensors (Basel). 2023.

Abstract

The normalized compression distance (NCD) is a compression-based similarity measure between a pair of finite objects. Clustering methods usually use distances (e.g., Euclidean distance, Manhattan distance) to measure the similarity between objects. The NCD is another such distance, with particular characteristics, that can be used to build the starting distance matrix for methods such as hierarchical clustering or K-medoids. In this work, we propose Zgli, a novel Python module that enables the user to compute the NCD between files inside a given folder. Inspired by the CompLearn Linux command line tool, this module builds on it by providing new text file compressors, a new compression-by-column option for tabular data, such as CSV files, and an encoder for small files made up of categorical data. Our results demonstrate that compression by column can yield better results than previous methods in the literature when clustering tabular data. Additionally, the categorical encoder can augment categorical data, allowing the use of the NCD for new data types. One advantage of this feature is that it does not require knowledge or context of the data. Furthermore, because the proposed module is written in Python, one of the most popular programming languages for machine learning, developers can readily use it to tackle problems with a new approach based on compression. This pipeline was tested on clinical data and proved to be a promising computational strategy, providing patient stratification via clusters to aid precision medicine.
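
For reference, the NCD between two objects x and y is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length under a chosen compressor and xy is the concatenation of x and y. The following is a minimal Python sketch of that definition and of building a pairwise distance matrix from it. It assumes the standard-library bz2 compressor; the names `compressed_size`, `ncd`, and `ncd_matrix` are illustrative and are not Zgli's API. Setting the diagonal to zero and mirroring the upper triangle is a convention chosen here for simplicity.

```python
import bz2
from typing import Callable, List, Sequence

Compressor = Callable[[bytes], bytes]


def compressed_size(data: bytes, compress: Compressor = bz2.compress) -> int:
    """Length in bytes of the compressed representation of `data`."""
    return len(compress(data))


def ncd(x: bytes, y: bytes, compress: Compressor = bz2.compress) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x, compress)
    cy = compressed_size(y, compress)
    cxy = compressed_size(x + y, compress)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(objects: Sequence[bytes],
               compress: Compressor = bz2.compress) -> List[List[float]]:
    """Pairwise NCD matrix (assumed convention: zero diagonal, mirrored)."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(objects[i], objects[j], compress)
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

A matrix built this way can serve as the precomputed distance input to hierarchical clustering or K-medoids. Values near 0 indicate that one object adds little new information to the other, while values near 1 indicate largely unrelated content.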

Keywords: CompLearn; Kolmogorov complexity; Zgli; clustering by compression; clustering techniques; normalized compression distance.

Conflict of interest statement

Not applicable.

Figures

Figure 1
All possible positions of quartets of four nodes.
Figure 2
Looped rows example.
Figure 3
Binary tree generated using Zgli on the raw sizes of the Basketball data [29]. The shootings are the only type of data that is clustered together; the rest of the data is clustered linearly.
Figure 4
Binary tree generated using the looped data and bzlib without compression by column over the Basketball data set [29].
Figure 5
Binary tree generated using the looped data and bzlib with compression by column over the Basketball data set [29].

References

    1. Xu D., Tian Y. A Comprehensive Survey of Clustering Algorithms. Ann. Data Sci. 2015;2:165–193. doi: 10.1007/s40745-015-0040-1.
    2. Saxena A., Prasad M., Gupta A., Bharill N., Patel O.P., Tiwari A., Er M.J., Ding W., Lin C.T. A review of clustering techniques and developments. Neurocomputing. 2017;267:664–681. doi: 10.1016/j.neucom.2017.06.053.
    3. Henriques R., Madeira S.C. FleBiC: Learning classifiers from high-dimensional biomedical data using discriminative biclusters with non-constant patterns. Pattern Recognit. 2021;115:107900. doi: 10.1016/j.patcog.2021.107900.
    4. Soares D.F., Henriques R., Gromicho M., de Carvalho M., Madeira S.C. Learning prognostic models using a mixture of biclustering and triclustering: Predicting the need for non-Invasive ventilation in Amyotrophic Lateral Sclerosis. J. Biomed. Inform. 2022;134:104172. doi: 10.1016/j.jbi.2022.104172.
    5. Hendricks R.M., Khasawneh M.T. A Systematic Review of Parkinson’s Disease Cluster Analysis Research. Aging Dis. 2021;12:1567–1586. doi: 10.14336/AD.2021.0519.