Front Oncol. 2016 Aug 31;6:188.
doi: 10.3389/fonc.2016.00188. eCollection 2016.

Immunophenotype Discovery, Hierarchical Organization, and Template-Based Classification of Flow Cytometry Samples

Ariful Azad et al. Front Oncol. 2016.

Abstract

We describe algorithms for discovering immunophenotypes from large collections of flow cytometry (FC) samples and using them to organize the samples into a hierarchy based on phenotypic similarity. The hierarchical organization is helpful for effective and robust cytometry data mining, including the creation of collections of cell populations characteristic of different classes of samples, robust classification, and anomaly detection. We summarize a set of samples belonging to a biological class or category with a statistically derived template for the class. Whereas individual samples are represented in terms of their cell populations (clusters), a template consists of generic meta-populations (each a group of homogeneous cell populations obtained from the samples in a class) that describe key phenotypes shared among all those samples. We organize an FC data collection in a hierarchical data structure that supports the identification of immunophenotypes relevant to clinical diagnosis. A robust template-based classification scheme is also developed, but our primary focus is on the discovery of phenotypic signatures and inter-sample relationships in an FC data collection. This collective analysis approach is more efficient and robust since templates describe phenotypic signatures common to cell populations in several samples while ignoring noise and small sample-specific variations. We have applied the template-based scheme to analyze several datasets, including one representing a healthy immune system and one of acute myeloid leukemia (AML) samples. The last task is challenging due to the phenotypic heterogeneity of the several subtypes of AML. However, we identified thirteen immunophenotypes corresponding to subtypes of AML and were able to distinguish acute promyelocytic leukemia (APL) samples with the markers provided. Clinically, this is helpful since APL has a different treatment regimen from other subtypes of AML.
Core algorithms used in our data analysis are available in the flowMatch package at www.bioconductor.org. The package has been downloaded nearly 6,000 times since 2014.

Keywords: classification; clusters; flow cytometry; matching; meta-clusters; template.

Figures

Figure 1
In our view, six major steps are involved in FC data analysis. An FC sample is represented by an n × p matrix, where n is the number of cells and p is the number of features measured in each cell. (1) The overlap of two spectra (green and yellow) emitted by two fluorochromes, which must be unmixed to correctly reconstruct the signals. (2) The density plots of a marker from several samples of a dataset after transforming data to stabilize the variance. (3) Four cell populations (marked with different colors) identified by a clustering algorithm. (4) Matching populations to register corresponding cell clusters across a pair of samples. (5) The hierarchical construction of a template from six samples belonging to the same class. (6) Classifying a sample based on its similarity with two templates.
Figure 2
Removing unintended events from an FC sample. (A) Single intact cells (inside the red polygon gate) are separated from the doublets (outside of the red polygon gate). (B) A viability marker (ViViD) is used to remove dead cells (outside of the red polygon gate). (C) Cells emitting very low or very high fluorescence signals (outside of the red vertical lines) are removed as potential outlying events.
Figure 3
Selecting the optimum number of cell populations in a sample from the HD dataset with the flowMeans package (18). The maximum number of clusters is set to: (A) 5 clusters (automatically selected by the algorithm), (B) 10 clusters, and (C) 20 clusters. The optimum number of clusters is selected by detecting the change point in the segmented regression lines and is shown with a red filled circle in each subfigure.
Figure 4
(A) Some of the terminology used in this paper. A cell cluster or cell population is a group of cells expressing similar features, and an FC sample is a collection of cell clusters. A meta-cluster is a set of similar cell clusters from different samples, and a template is a collection of meta-clusters. Cells are denoted by dots, clusters by solid ellipses, and meta-clusters by dashed ellipses. (B) An example of a hierarchical template tree created from four hypothetical samples S1, S2, S3, and S4. A leaf node of the template tree represents a sample and an internal (non-leaf) node represents a template created from its children in the tree. The children are templates if they are interior nodes and samples if they are leaves. (C) One step of the HM&M algorithm creating a template T(S3, S4) from a pair of samples. The algorithm first matches clusters (or meta-clusters) across samples (or templates) by the MEC algorithm and then merges the matched clusters to construct new meta-clusters.
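The match-then-merge step in (C) can be illustrated with a minimal Python sketch. It is not the HM&M or MEC algorithm: the matching here is a simple greedy nearest-center rule with a distance cutoff, used only as a stand-in, whereas a mixed edge cover can match one cluster to several clusters. The names `MetaCluster`, `merge_pair`, and `max_distance`, and all coordinates, are hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class MetaCluster:
    """A set of similar cell clusters, each stored here only by its center."""
    centers: list

    def center(self):
        # Average of the member cluster centers, one value per marker.
        n = len(self.centers)
        dims = len(self.centers[0])
        return tuple(sum(c[k] for c in self.centers) / n for k in range(dims))


def merge_pair(left, right, max_distance):
    """Greedy stand-in for one match-and-merge step (NOT the MEC matching).

    Each meta-cluster in `left` is merged with its nearest unused
    meta-cluster in `right` if their centers lie within `max_distance`;
    everything left unmatched is carried over as its own meta-cluster.
    """
    merged, used = [], set()
    for a in left:
        best = None
        for j, b in enumerate(right):
            if j in used:
                continue
            d = dist(a.center(), b.center())
            if d <= max_distance and (best is None or d < best[0]):
                best = (d, j)
        if best is None:
            merged.append(MetaCluster(list(a.centers)))
        else:
            used.add(best[1])
            merged.append(MetaCluster(a.centers + right[best[1]].centers))
    merged.extend(
        MetaCluster(list(b.centers)) for j, b in enumerate(right) if j not in used
    )
    return merged


# Two hypothetical samples described by two-marker cluster centers.
s3 = [MetaCluster([(1.0, 1.0)]), MetaCluster([(5.0, 5.0)])]
s4 = [MetaCluster([(1.2, 0.8)]), MetaCluster([(5.1, 4.9)]), MetaCluster([(9.0, 9.0)])]
template = merge_pair(s3, s4, max_distance=1.0)
```

Applying `merge_pair` repeatedly to the closest pair of samples or templates would build a tree of the kind shown in (B), with each internal node holding the merged template.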
Figure 5
(A) A predefined rectangular gate (red rectangle on the lower left corner) denotes an approximate boundary for the lymphocytes. (B) Inside the rectangular gate, lymphocytes are identified as a dense and normally distributed region (red ellipse). (C) Outlying cells fall outside of a pair of predefined thresholds shown with the red vertical lines and are removed. (D) Correlated CD4 and CD8 expressions due to the spectral overlap between PE and ECD fluorochromes. (E) CD8 vs. CD4 expressions after spectral unmixing. The inverse hyperbolic sine (asinh) transformation is used in (D,E) for visualization.
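The spectral unmixing between (D) and (E) can be sketched for two channels in Python. This assumes a linear spillover model in which each fluorochrome leaks a fixed fraction of its signal into the other channel; the function name and the spillover fractions are hypothetical and are not values from the paper.

```python
def unmix_two_channels(observed, spill_ab, spill_ba):
    """Undo linear spillover between two detector channels A and B.

    Assumed model (an illustration, not the paper's procedure):
        a_obs = a_true + spill_ba * b_true
        b_obs = spill_ab * a_true + b_true
    where spill_ab is the fraction of fluorochrome A's signal detected
    in channel B, and spill_ba the fraction of B detected in channel A.
    """
    det = 1.0 - spill_ab * spill_ba
    if det == 0:
        raise ValueError("spillover matrix is singular")
    a_obs, b_obs = observed
    a_true = (a_obs - spill_ba * b_obs) / det
    b_true = (b_obs - spill_ab * a_obs) / det
    return a_true, b_true


# A hypothetical cell that carries only fluorochrome A: its true signal is
# (1000, 0), but 20% of A leaks into channel B, so the detector reads (1000, 200).
a_true, b_true = unmix_two_channels((1000.0, 200.0), spill_ab=0.2, spill_ba=0.05)
```

After unmixing, the spurious signal in channel B disappears, which is why the correlation between the two markers in (D) is no longer visible in (E).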
Figure 6
Stabilizing the within-cluster variance for each channel of the HD dataset. (A) Variances of the clusters increase monotonically with their means before the variances are stabilized. Clusters in each marker are shown with the same symbol and color. (B) Variances are approximately stabilized for each marker/channel after the data are transformed by the asinh function with the optimum cofactor. (C) Densities of the variance-stabilized fluorescence channels are plotted, with different subjects denoted by different colors.
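The effect of the asinh transformation on within-cluster variance can be seen in a small Python sketch. The two clusters and the cofactor below are made up for illustration; the sketch does not show how the optimum cofactor is chosen.

```python
import math
from statistics import pvariance


def asinh_transform(values, cofactor):
    """Apply y = asinh(x / cofactor) to each raw fluorescence value."""
    if cofactor <= 0:
        raise ValueError("cofactor must be positive")
    return [math.asinh(v / cofactor) for v in values]


def variance_ratio(cluster_a, cluster_b):
    """Ratio of the larger to the smaller within-cluster variance."""
    va, vb = pvariance(cluster_a), pvariance(cluster_b)
    return max(va, vb) / min(va, vb)


# Two hypothetical clusters whose spread grows with their mean.
dim = [90.0, 100.0, 110.0]
bright = [9000.0, 10000.0, 11000.0]
cofactor = 5.0  # illustrative value only, not an optimum from the paper

raw_ratio = variance_ratio(dim, bright)
stabilized_ratio = variance_ratio(
    asinh_transform(dim, cofactor), asinh_transform(bright, cofactor)
)
```

Before the transformation the bright cluster's variance is several orders of magnitude larger than the dim cluster's; afterwards the two variances are nearly equal, which is the behavior panels (A) and (B) contrast.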
Figure 7
(A) Simultaneous optimization of five cluster validation criteria suggests that four cell populations are present in this sample. Here, three of the indices are maximized and two are minimized. (B) Bivariate projections of cell populations display four subsets of lymphocytes: red (natural killer cells), blue (B cells), black (helper T cells), and green (cytotoxic T cells). Each cell cluster is CD45+ since we pre-selected lymphocytes on the forward and side scatter channels.
Figure 8
(A) The template tree created by the HM&M algorithm from all samples of the HD dataset. Leaves of the dendrogram denote samples from five healthy individuals. An internal node represents a template, and the height of an internal node measures the dissimilarity between its left and right children. The sample-specific subtrees are drawn in different colors. (B) Bivariate projections of the combined template [the root of the tree in (A)] are drawn in terms of its meta-clusters. Here, each meta-cluster is represented by a homogeneous collection of cell clusters that are drawn with the 95th quantile contour lines. Clusters participating in a meta-cluster are drawn in the same color.
Figure 9
Hierarchical organization of HD samples using the HM&M algorithm. The Euclidean distance between cluster/meta-cluster centers is used when computing the mixed edge cover.
Figure 10
Hierarchical organization of HD samples with the UPGMA algorithm, using as the dissimilarity between every pair of samples (A) the mixed edge cover (with Mahalanobis distance as the distance between clusters) and (B) the earth mover’s distance.
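UPGMA itself only needs a matrix of pairwise sample dissimilarities, whichever measure produced it. The following is a minimal pure-Python sketch of average-linkage merging; the four-sample matrix is hypothetical and stands in for dissimilarities computed by the mixed edge cover or the earth mover's distance.

```python
def upgma(dist_matrix, labels):
    """Average-linkage (UPGMA) clustering on a symmetric dissimilarity matrix.

    Returns the merges in order as (left_subtree, right_subtree, height).
    """
    n = len(labels)
    # Active clusters: id -> (nested tuple of labels, number of samples).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise dissimilarities keyed by (smaller_id, larger_id).
    d = {(i, j): dist_matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merges.append((tree_a, tree_b, height))
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return merges


# Hypothetical dissimilarities between four samples.
labels = ["S1", "S2", "S3", "S4"]
matrix = [
    [0.0, 2.0, 10.0, 12.0],
    [2.0, 0.0, 10.0, 12.0],
    [10.0, 10.0, 0.0, 4.0],
    [12.0, 12.0, 4.0, 0.0],
]
merges = upgma(matrix, labels)
```

Unlike the template-based tree, this procedure never builds a template at the internal nodes; it only averages the precomputed sample-to-sample dissimilarities.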
Figure 11
Cell types identified on the side scatter (SS) and CD45 channels for a healthy and an AML-positive sample. Cell populations are discovered in the seven-dimensional samples with the clustering algorithm and then projected on these channels for visualization. A pair of clusters denoting the same cell type is marked with the same color. The proportion of myeloid blast cells (shown in red) increases significantly in the AML sample.
Figure 12
The healthy and AML templates created from Tube 6. (A) The template tree created from 156 healthy samples in the training set. (B) The template tree created from 23 AML samples in the training set. Samples in the red subtree exhibit the characteristics of acute promyelocytic leukemia (APL) as shown in (F). (C) Fraction of 156 healthy samples present in each of the 22 meta-clusters in the healthy template. Nine meta-clusters, each of them shared by at least 60% of the healthy samples, form the core of the healthy template. (D) Fraction of 23 AML samples present in each of the 40 meta-clusters in the AML template. The AML samples, unlike the healthy ones, are heterogeneously distributed over the meta-clusters. (E) The expression levels of markers in the meta-cluster shown with the blue bar in (D). [Each horizontal bar in (E,F) represents the average expression of a marker and the error bar shows its SD]. This meta-cluster represents lymphocytes, denoted by low SS and high CD45 expression, and therefore does not express the AML-related markers measured in Tube 6. (F) Expression of markers in the meta-cluster shown with the red bar in (D). This meta-cluster denotes myeloblast cells as defined by the SS and CD45 levels. This meta-cluster expresses HLA-DR−CD117+CD34−CD38+, a characteristic immunophenotype of APL. Five AML samples sharing this meta-cluster are similar to each other as shown in the red subtree in (B).
Figure 13
Bivariate contour plots (side scatter vs. individual marker) for four meta-clusters (one in each row) indicative of AML. The ellipses in a subplot denote the 95th quantile contour lines of cell populations included in the corresponding meta-cluster. Myeloblast cells have medium side scatter (SS) and CD45 expression. The red lines indicate approximate myeloblast boundaries (located on the left-most subfigures in each row and extended horizontally to the subfigures on the right) and confirm that these meta-clusters represent immunophenotypes of myeloblast cells. Blue vertical lines denote the ± boundaries of a marker. Gray subplots show contour plots of dominant markers defining the meta-cluster in a row. (A) CD13+CD56− meta-cluster shared by 17 AML samples in Tube 4. (B) CD4−CD11c+CD64−CD33+ meta-cluster shared by 18 AML samples in Tube 5. (C) HLA-DR+CD117+CD34+CD38+ meta-cluster shared by 11 AML samples in Tube 6. (D) HLA-DR−CD117±CD34−CD38+ meta-cluster shared by 5 AML samples in Tube 6. The last meta-cluster is indicative of acute promyelocytic leukemia (APL).
Figure 14
Average classification score from Tubes 4 to 6 for each sample in the (A) training set and (B) test set. Samples with scores above the horizontal line are classified as AML and as healthy otherwise. The actual class of each sample is also shown. An AML sample (subject id 116) is always misclassified in the training set, and this is discussed in the text.
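The decision rule in this figure, averaging a per-tube score and comparing it with a horizontal line, can be written as a short Python sketch. How each tube's score is computed from the templates is not shown here; the score values, the tube keys, and the threshold of 0.0 are all hypothetical.

```python
from statistics import mean


def classify(tube_scores, threshold=0.0):
    """Average the per-tube scores and label the sample by a cutoff.

    `tube_scores` maps a tube name to that tube's classification score.
    Both the scores and the default `threshold` are placeholders for
    illustration, not values taken from the paper.
    """
    average = mean(tube_scores.values())
    label = "AML" if average > threshold else "healthy"
    return label, average


# Two hypothetical samples scored on Tubes 4 to 6.
label_a, score_a = classify({"tube4": 0.4, "tube5": 0.1, "tube6": 0.3})
label_b, score_b = classify({"tube4": -0.2, "tube5": 0.1, "tube6": -0.5})
```

Averaging over tubes means a single tube with a weak or misleading score does not decide the label on its own.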
Figure 15
Cell populations in a sample from subject 116. This sample contains only 4.4% myeloid blast cells (shown in red).

