Gigascience. 2020 Nov 18;9(11):giaa127.

doi: 10.1093/gigascience/giaa127.

GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

Miroslav Kratochvíl^{1,2}, Oliver Hunewald³, Laurent Heirendt⁴, Vasco Verissimo⁴, Jiří Vondrášek¹, Venkata P Satagopam^{4,5}, Reinhard Schneider^{4,5}, Christophe Trefois^{4,5}, Markus Ollert^{3,6}

Affiliations

¹ Institute of Organic Chemistry and Biochemistry, Flemingovo náměstí 542/2, 160 00 Prague, Czech Republic.
² Charles University, Department of Software Engineering, Malostranské náměstí 25, 118 00 Prague, Czech Republic.
³ Luxembourg Institute of Health, Department of Infection and Immunity, 29 rue Henri Koch, L-4354 Esch-sur-Alzette, Luxembourg.
⁴ University of Luxembourg, Luxembourg Centre for Systems Biomedicine, 6 avenue du Swing, Campus Belval, L-4367 Belvaux, Luxembourg.
⁵ ELIXIR Luxembourg, University of Luxembourg, 6, avenue du Swing, Campus Belval, L-4367 Belvaux, Luxembourg.
⁶ Odense Research Center for Anaphylaxis, Department of Dermatology and Allergy Center, Odense University Hospital, University of Southern Denmark, Kløvervænget 15, DK-5000 Odense C, Denmark.

PMID: 33205814
PMCID: PMC7672468
DOI: 10.1093/gigascience/giaa127

Miroslav Kratochvíl et al. Gigascience. 2020.

Abstract

Background: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances allow the easy generation of data with hundreds of millions of single-cell data points with >40 parameters, originating from thousands of individual samples. Analyzing that amount of high-dimensional data places heavy demands on both the hardware and the software of high-performance computing resources. Current software tools often do not scale to datasets of this size; users are thus forced to downsample the data to manageable sizes, in turn losing accuracy and the ability to detect many underlying complex phenomena.

Results: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. The implementation of GigaSOM.jl in the high-level and high-performance programming language Julia makes it accessible to the scientific community and allows for efficient handling and processing of datasets with billions of data points using distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase the functionality on a large dataset from a recent study.

Conclusions: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-the-art software. Measurements indicate that the performance scales to much larger datasets. The example use on the data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.

Keywords: Julia; clustering; dimensionality reduction; high-performance computing; self-organizing maps; single-cell cytometry.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Figure 1**
Architecture of GigaSOM.jl. *Top:* Data distribution process divides the available FCS files into balanced slices; individual workers retrieve their respective slice data using a shared storage. *Bottom:* The SOM learning and visualization processes require only a minimal amount of data to be transferred between the master and worker nodes, consisting of the relatively small codebook in the case of SOM learning (blue arrows) and pre-rasterized graphics in the case of visualization (green arrows).
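
The two halves of this architecture can be illustrated with a minimal sketch. It is written in Python rather than Julia and does not use the GigaSOM.jl API: the names `balanced_slices`, `worker_pass`, and `merge_and_update` are hypothetical. It also assumes a simplified codebook update without neighborhood smoothing, which is enough to show that only codebook-sized sums and counts need to travel between the workers and the master.

```python
def balanced_slices(file_sizes, n_workers):
    """Split the concatenated cells of all files into nearly equal slices.

    Returns, for each worker, a list of (file_index, start, end) ranges
    with `end` exclusive. All names here are illustrative assumptions.
    """
    total = sum(file_sizes)
    # Global cut points over the concatenation of all files.
    bounds = [total * w // n_workers for w in range(n_workers + 1)]
    slices = []
    for w in range(n_workers):
        lo, hi = bounds[w], bounds[w + 1]
        parts, offset = [], 0
        for index, size in enumerate(file_sizes):
            start, end = max(lo, offset), min(hi, offset + size)
            if start < end:
                # Convert global positions to positions inside this file.
                parts.append((index, start - offset, end - offset))
            offset += size
        slices.append(parts)
    return slices


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def worker_pass(cells, codebook):
    """Run on a worker: summarize its slice as per-node sums and counts."""
    dim = len(codebook[0])
    sums = [[0.0] * dim for _ in codebook]
    counts = [0] * len(codebook)
    for cell in cells:
        # Brute-force search for the nearest codebook vector.
        best = min(range(len(codebook)),
                   key=lambda k: squared_distance(codebook[k], cell))
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += cell[d]
    return sums, counts


def merge_and_update(partials, codebook):
    """Run on the master: combine worker summaries into a new codebook.

    Simplification (assumption): no neighborhood smoothing is applied,
    so each node moves to the mean of the cells assigned to it.
    """
    updated = []
    for k, old in enumerate(codebook):
        n = sum(counts[k] for _, counts in partials)
        if n == 0:
            updated.append(list(old))  # keep nodes that received no cells
        else:
            updated.append([sum(sums[k][d] for sums, _ in partials) / n
                            for d in range(len(old))])
    return updated
```

In this sketch the raw cells never leave the worker that loaded them; each training epoch moves only the codebook out to the workers and one `(sums, counts)` pair per worker back to the master.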

**Figure 2**
Comparison of GigaSOM.jl results with manual gating of the Levine32 dataset. The confusion matrix is normalized in rows, showing the ratio of cells in each aggregate of GigaSOM-originating clusters that matches the cell types from manual analysis. Darker color represents better match. The mean F1 score is comparable to FlowSOM. A more comprehensive comparison is available in Supplementary Fig. S1.
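
As a rough illustration of the two quantities named in this caption, the sketch below row-normalizes a matrix of cell counts and computes a mean F1 score. It is not the authors' evaluation code. The function names are hypothetical, rows are assumed to be cluster aggregates and columns manually gated cell types, and the mean F1 is assumed to be the average, over cell types, of the best F1 achieved by any cluster aggregate; the source does not specify how its mean was formed.

```python
def row_normalize(counts):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def mean_best_f1(counts):
    """Average, over columns (cell types), of the best F1 among rows.

    For row i and column j, precision is counts[i][j] / row_sum[i] and
    recall is counts[i][j] / col_sum[j], so
    F1 = 2 * counts[i][j] / (row_sum[i] + col_sum[j]).
    This matching rule is an assumption, not taken from the source.
    """
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    best = []
    for j, col_total in enumerate(col_sums):
        scores = [2 * counts[i][j] / (row_sums[i] + col_total)
                  for i in range(len(counts))
                  if row_sums[i] + col_total > 0]
        best.append(max(scores, default=0.0))
    return sum(best) / len(best)


# Toy counts: 2 cluster aggregates (rows) x 2 manual cell types (columns).
toy_counts = [[8, 2], [1, 9]]
toy_ratios = row_normalize(toy_counts)
toy_f1 = mean_best_f1(toy_counts)
```

Row normalization answers "where do the cells of this cluster aggregate go?", whereas F1 also penalizes a cell type that is spread across several aggregates, which is why the two views are reported side by side.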

**Figure 3**
Performance dependency of distributed algorithms in GigaSOM on data dimensionality, SOM size, and number of available workers. Data processing performance is displayed as normalized to median speed in cells per second (c/s).

**Figure 4**
Effect of data-indexing structures on GigaSOM performance. The plotted points show relative speedup of the algorithms utilizing kd-trees (horizontal axis) and ball-trees (vertical axis) compared with brute-force neighbor search. Baseline (1× speedup) is highlighted by thick grid lines—a point plotted in the upper right quadrant represents a benchmark measurement that showed speedup for both kd-trees and ball-trees, upper left quadrant contains benchmark results where ball-trees provided speedup and kd-trees slowed the computation down, etc.

**Figure 5**
Raw IMPC Spleen T-cell dataset, processed by GigaSOM.jl and embedded by the Julia implementation of EmbedSOM. The figure shows an aggregate of 1,167,129,317 individual cells. Expression of 3 main markers is displayed in combination as mixed colors: CD8 in red, CD4 in green, and CD161 in blue. A more detailed, annotated version of the visualization is available in Supplementary Fig. S4.
