PLoS Comput Biol. 2022 Oct 3;18(10):e1010349.
doi: 10.1371/journal.pcbi.1010349. eCollection 2022 Oct.

HAL-x: Scalable hierarchical clustering for rapid and tunable single-cell analysis


James Anibal et al. PLoS Comput Biol.

Abstract

Data clustering plays a significant role in biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g. T cells), whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for immense datasets, achieving a considerable improvement in computational efficiency on large datasets compared to existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single cell classification in a scalable, tunable and rapid manner.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. a) Overclustering (nCluster = 4) versus b) underclustering (nCluster = 2).
The ground truth has 3 distinct clusters. The black lines represent the decision boundaries obtained via a random forest classifier trained on the clustering labels. a) Using overclustered labels to fit a classifier leads to overfitted decision boundaries that separate regions of high density (red and green region separation). b) For underclustered labels, the classifier groups together two clusters separated by a region of low density.
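The effect illustrated here can be reproduced numerically: when three well-separated blobs are overclustered, at least one cluster boundary cuts through a dense region, so that pair of clusters becomes hard to separate out of sample. The sketch below uses scikit-learn with assumed dataset parameters (blob positions, tree counts); it is illustrative, not the paper's experiment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Ground truth: 3 well-separated Gaussian blobs (centers are assumptions).
X, _ = make_blobs(n_samples=900, centers=[(-5, 0), (0, 5), (5, 0)],
                  cluster_std=0.6, random_state=1)

def worst_pair_accuracy(n_clusters):
    """Cross-validated accuracy of a random forest on the hardest-to-separate
    pair of KMeans clusters."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    accs = []
    for a in range(n_clusters):
        for b in range(a + 1, n_clusters):
            mask = np.isin(labels, [a, b])
            clf = RandomForestClassifier(n_estimators=50, random_state=0)
            accs.append(cross_val_score(clf, X[mask], labels[mask], cv=3).mean())
    return min(accs)

# With the correct k, every pair of clusters is separated by a low-density
# region; overclustering (k=4) splits one blob through its dense core.
print(f"k=3 worst pair: {worst_pair_accuracy(3):.2f}")
print(f"k=4 worst pair: {worst_pair_accuracy(4):.2f}")
```

A drop in the worst-pair accuracy when k increases is the signal that a boundary has been drawn through a high-density region, i.e. overclustering.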
Fig 2
Fig 2. Out-of-sample accuracy vs. weighted F-score (comparison to ground-truth) for various clustering algorithms.
We used the out-of-the-box clustering algorithms DBSCAN, spectral clustering, K-means, and Mean-shift with various hyper-parameters to cluster 6 different benchmark two-dimensional datasets: Noisy Circles, Gaussian mixtures with different covariance structures, non-convex clusters (Noisy Moons), and a random noise map (No Structure). These datasets were taken from the scikit-learn clustering benchmarks (see S1 Methods for more details). The out-of-sample accuracy is computed by training a random forest classifier with 100 estimators on the labels provided by the clustering algorithms, using an 80/20 train/test split to train the classifiers and evaluate the out-of-sample error. Notice that the F-score and the out-of-sample accuracy as measured are monotonically related.
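The out-of-sample accuracy measurement described above can be sketched as follows. This is a minimal example using scikit-learn; the choice of K-means on the Noisy Moons benchmark is illustrative, not the paper's full protocol.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One of the scikit-learn benchmark datasets: non-convex "Noisy Moons".
X, _ = make_moons(n_samples=2000, noise=0.05, random_state=0)

# Run an out-of-the-box clustering algorithm and keep its labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 80/20 train/test split; train a 100-estimator random forest on the
# clustering labels and report its out-of-sample accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print(f"out-of-sample accuracy: {clf.score(X_te, y_te):.2f}")
```

A low out-of-sample accuracy indicates that the clustering boundary cuts through a dense region that the classifier cannot reproduce reliably on held-out points.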
Fig 3
Fig 3. Overview of the hierarchical agglomerative learning approach.
(a) For high-dimensional inputs, a dimensionality reduction step (e.g. UMAP [31], t-SNE) is required in order to obtain reliable density estimates. (b) In low-dimensional spaces, density maps can be easily computed. Initial clusters are selected to be the neighborhoods of the density modes (see [17]). (c) A k-nearest-neighbor graph is constructed by measuring similarity via out-of-sample accuracy, training classifiers in the original high-dimensional feature space: each node represents an individual cluster and each edge has an associated weight given by the accuracy of the classifier. (d) Pairs of nodes are successively merged following this procedure until a desired out-of-sample accuracy is reached. The end result is an interpretable hierarchical classifier and robust clustering assignments. The classifier can be used to predict the labels of new data and potentially identify outliers.
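The merging step in panels (c)–(d) can be paraphrased in code. The sketch below is illustrative and not the released HAL-x implementation; the function names and the choice of a 50-tree random forest with 3-fold cross-validation are assumptions.

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def pair_accuracy(X, labels, a, b):
    """Out-of-sample accuracy of a classifier trained to separate
    clusters a and b in the original feature space."""
    mask = np.isin(labels, [a, b])
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[mask], labels[mask], cv=3).mean()

def agglomerate(X, labels, cv_score):
    """Repeatedly merge the least-separable pair of clusters until every
    remaining pair reaches the target out-of-sample accuracy (cv_score)."""
    labels = labels.copy()
    while len(np.unique(labels)) > 1:
        pairs = {(a, b): pair_accuracy(X, labels, a, b)
                 for a, b in combinations(np.unique(labels), 2)}
        (a, b), worst = min(pairs.items(), key=lambda kv: kv[1])
        if worst >= cv_score:
            break  # all pairs are separable at the target accuracy
        labels[labels == b] = a  # merge the most confusable pair
    return labels
```

Raising cv_score demands sharper separability and so yields more, finer clusters; lowering it merges clusters whose boundary runs through dense, ambiguous regions.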
Fig 4
Fig 4. Clustering based on out-of-sample error.
(a) Clusters generated with a cv-score of 0.97. (b) Graph illustrating out-of-sample error for each pair of clusters. There are no pairs of clusters with an out-of-sample error that is greater than 0.03 (accuracy value of 0.97). (c) Clusters generated with a cv-score of 0.98. Clusters 1 and 4 have been merged because the out-of-sample error was greater than 0.02. (d) Graph illustrating accuracy for each pair of clusters. There are no pairs of clusters with an out-of-sample error that is greater than 0.02 (accuracy value of 0.98).
Fig 5
Fig 5. Comparison of run times for training and label prediction on synthetic datasets of increasing size, comparing HAL-x with other clustering algorithms routinely used on cytometry data (Parc, FlowSOM, HDBSCAN, and Phenograph).
The gray box indicates that the laptop's memory was exceeded on large datasets when using algorithms other than HAL-x.
Fig 6
Fig 6. Analysis of Lupus patients’ blood profile using single-cell mass cytometry and HAL-x.
(A) The pipeline for profiling immune cells from Lupus patients and healthy donors, beginning with blood collection, continuing with mass cytometry, and culminating with HAL-x processing. (B) t-SNE visualizations of clusterings generated by HAL-x that correspond to various levels of population specificity (cv-score). (C) Heatmap of the average expression levels for the 19 leukocyte clusters defined by HAL-x at a cv-score of 0.7. Cell type annotations were made by hand. (D) A graph of AUC (area under the receiver operating characteristic (ROC) curve) for different cv-scores shows that the clusters with the most discriminatory power are not observed at high cv-scores. The best cluster identified by HAL-x (cluster #42 at a cv-score of 0.7) separates individual neutrophils from Lupus patients and healthy donors with an AUC of 0.82. (E) Markers on neutrophils (cluster #42) that can be used to distinguish blood samples of Lupus patients from those of healthy donors. Note how subtle the difference is at the level of individual markers.

References

    1. Bellman R. Dynamic Programming. Courier Corporation; 2013.
    2. Bouveyron C, Brunet-Saumard C. Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis. 2014;71:52–78. doi: 10.1016/j.csda.2012.12.008
    3. Aggarwal CC, Reddy CK, editors. Data Clustering: Algorithms and Applications. 1st ed. Boca Raton: Chapman and Hall/CRC; 2013.
    4. Bendall SC, Simonds EF, Qiu P, Amir EaD, Krutzik PO, Finck R, et al. Single-Cell Mass Cytometry of Differential Immune and Drug Responses Across a Human Hematopoietic Continuum. Science. 2011;332(6030):687–696. doi: 10.1126/science.1198704
    5. Levine JH, Simonds EF, Bendall SC, Davis KL, Amir EaD, Tadmor MD, et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell. 2015;162(1):184–197. doi: 10.1016/j.cell.2015.05.047
