Review
Cytometry A. 2020 Aug;97(8):782-799.
doi: 10.1002/cyto.a.24158. Epub 2020 Jun 30.

A Cancer Biologist's Primer on Machine Learning Applications in High-Dimensional Cytometry

Timothy J Keyes et al. Cytometry A. 2020 Aug.

Abstract

The application of machine learning and artificial intelligence to high-dimensional cytometry data sets has increasingly become a staple of bioinformatic data analysis over the past decade. This is especially true in the field of cancer biology, where protocols for collecting multiparameter single-cell data in a high-throughput fashion are rapidly developed. As the use of machine learning methodology in cytometry becomes increasingly common, there is a need for cancer biologists to understand the basic theory and applications of a variety of algorithmic tools for analyzing and interpreting cytometry data. We introduce the reader to several keystone machine learning-based analytic approaches with an emphasis on defining key terms and introducing a conceptual framework for making translational or clinically relevant discoveries. The target audience consists of cancer cell biologists and physician-scientists interested in applying these tools to their own data, but who may have limited training in bioinformatics. © 2020 International Society for Advancement of Cytometry.

Keywords: cancer; computational cytometry; data science; machine learning; mass cytometry.

Conflict of interest statement

The authors have no conflicts of interest to declare.

Figures

Figure 1. An increasing number of studies are using machine learning to analyze biomedical data.
Bar graphs indicating the number of PubMed Central search results for (A) the query “machine learning” and “medicine” since 1997 and (B) the query “machine learning” and “cytometry” since 2000.
Figure 2. Schematic diagram representing the analysis strategy described in this review.
We encourage the reader to begin their analyses using exploratory approaches such as dimensionality reduction and basic visualization, later progressing to unsupervised clustering and predictive modeling/correlative biology. Bidirectional arrows are included in the diagram to emphasize that each stage of an analysis will influence multiple other stages, with results from earlier stages often informing the analytic approach in the subsequent stage (and vice versa). Throughout the figure, exploratory analyses are coded as blue, whereas more targeted analyses are coded as red.
Figure 3. Dimensionality reduction using 3 commonly used approaches: PCA, t-SNE, and UMAP.
10,000 cells were subsampled from 3 B-cell Progenitor Acute Lymphoblastic Leukemia (BCP-ALL) patient samples analyzed using mass cytometry. Data were obtained from the GitHub repository of Good et al. (2018). (A) Two-dimensional plot of the 3 patient samples along their first (PC1) and second (PC2) principal component axes. Note that PCA does not require the user to set any hyperparameters and will return the same result each time it is used. (B) Two-dimensional plots after performing t-SNE on the same cells as in A across several t-SNE hyperparameter values. Note that samples fail to separate from one another when the number of iterations is too low and that neither inter-sample distances nor dispersion is conserved across perplexity settings. (C) Two-dimensional UMAP plots computed using the same cells as in A-B across varying values of min-dist and n. Different hyperparameter settings yield slightly different embeddings, although global relationships are more robust to these changes than in the t-SNE embeddings.
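The point in panel A, that PCA has no hyperparameters to tune and returns the same result on every run, can be illustrated with a minimal sketch. It assumes synthetic values in place of the BCP-ALL data and uses only the Python standard library. The function names (`center`, `covariance`, `principal_axes`, `pca_scores`) are hypothetical and are not taken from the review or from any package. The leading axes are found by power iteration, and the iteration count is a numerical detail of this sketch rather than a PCA hyperparameter.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so every marker has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (markers x markers) of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(matrix, vector):
    """Matrix-vector product for plain nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def principal_axes(cov, n_axes=2, n_iter=500):
    """Leading eigenvectors of cov by power iteration with deflation."""
    d = len(cov)
    work = [row[:] for row in cov]
    axes = []
    for _ in range(n_axes):
        v = [1.0 / math.sqrt(d)] * d  # fixed start vector, so no random seed
        for _ in range(n_iter):
            w = matvec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, matvec(work, v)))
        axes.append(v)
        # Deflation removes the axis just found so the next pass finds the next one.
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
                for i in range(d)]
    return axes


def pca_scores(rows, n_axes=2):
    """Project each row onto the leading principal axes (PC1, PC2, ...)."""
    centered = center(rows)
    axes = principal_axes(covariance(centered), n_axes)
    scores = [[sum(a * b for a, b in zip(row, axis)) for axis in axes]
              for row in centered]
    return axes, scores


# Synthetic "cells" with three correlated "markers" (illustrative values only).
rng = random.Random(0)
cells = []
for _ in range(200):
    x = rng.gauss(0, 3)
    cells.append([x, 0.5 * x + rng.gauss(0, 1), rng.gauss(0, 0.5)])

axes_first, scores_first = pca_scores(cells)
axes_second, scores_second = pca_scores(cells)
print(axes_first == axes_second)  # repeated runs on the same cells agree
```

t-SNE and UMAP differ on exactly this point: as panels B and C describe, their embeddings change with settings such as perplexity, the number of iterations, and min-dist.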
Figure 4. Comparison of clustering results using SPADE, k-means clustering, PhenoGraph, and FlowSOM.
10,000 cells were subsampled from each of 3 B-cell Progenitor Acute Lymphoblastic Leukemia (BCP-ALL) patient samples analyzed using mass cytometry. Data were obtained from the GitHub repository from Good et al. (2018) and analyzed using R implementations of SPADE, k-means clustering, PhenoGraph, and FlowSOM. PhenoGraph automatically detected the presence of 4 clusters, so this number of clusters was specified for the 3 remaining algorithms in order to compare results; otherwise, default parameters were used. Contour plots were embedded within UMAP axes computed using all 30,000 subsampled cells, with distinct clusters identified by each algorithm represented with a unique color in each panel. Across all clustering methods, markers used for clustering were the following: CD19, CD20, CD24, CD34, CD38, CD127, CD179a, CD179b, IgM (intracellular and extracellular), and terminal deoxynucleotidyl transferase. Notably, different clustering approaches identify subtly different cellular subsets even within this relatively simple dataset. Often, iteratively testing different clustering approaches, visualizing the results, and adjusting hyperparameters can help to determine which method fits best for one’s particular dataset.
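To make concrete what it means to specify the number of clusters in advance, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. It is not one of the R implementations used for the figure. The data are synthetic two-marker points, and the names `assign` and `kmeans` and the `seed` parameter are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centers
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an emptied cluster's center
        if new_centroids == centroids:  # converged: centers stopped moving
            break
        centroids = new_centroids
    return assign(points, centroids), centroids


# Synthetic two-marker data drawn around four centers (illustrative only).
rng = random.Random(1)
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)]
          for cx, cy in centers for _ in range(50)]

labels, centroids = kmeans(points, k=4)  # k is chosen by the user, as in the figure
print(sorted(set(labels)))
```

Because the starting centers are drawn at random, a different `seed` can produce a different partition of the same cells, which is one reason the caption recommends visualizing results and iterating.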
Figure 5. Graph architectures can be used to represent cytometry data.
(A) Schematization of a “graph,” a data structure that expresses observations as nodes and the relationships between observations as edges. The red arrow points to a node; the blue arrow points to an edge. (B) Example graphs constructed from cytometry data collected via CyTOF (data taken from Good et al., 2018). The nodes represent clusters of single-cell observations and the edges represent relationships between those clusters. In this case, a k-nearest-neighbor graph was built, meaning that each cluster is connected to the k clusters to which it is most similar (using Euclidean distance and k = 3). Clustering was performed by applying PhenoGraph to healthy and leukemic samples. Each cluster’s expression level of phosphorylated Syk protein (pSyk), a relapse-predictive feature in pediatric BCP-ALL, is indicated colorimetrically for each node. These example graphs illustrate how biological parameters can be depicted using a graph-based representation.
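The k-nearest-neighbor construction described in panel B can be sketched in a few lines. This assumes each cluster is summarized by a centroid of marker values. The cluster names and coordinates below are invented for illustration, and `knn_graph` is a hypothetical helper rather than a function from PhenoGraph or any other package.

```python
import math


def knn_graph(centroids, k=3):
    """Map each cluster (node) to its k nearest clusters by Euclidean distance."""
    graph = {}
    for name, position in centroids.items():
        distances = sorted(
            (math.dist(position, other_position), other_name)
            for other_name, other_position in centroids.items()
            if other_name != name  # a node is not its own neighbor
        )
        graph[name] = [other_name for _, other_name in distances[:k]]
    return graph


# Hypothetical cluster centroids in a two-marker space (illustrative values only).
cluster_centroids = {
    "A": (0.0, 0.0),
    "B": (1.0, 0.0),
    "C": (0.0, 1.0),
    "D": (5.0, 5.0),
    "E": (6.0, 5.0),
}

edges = knn_graph(cluster_centroids, k=3)
for node, neighbors in edges.items():
    print(node, "->", neighbors)
```

Edges built this way are directed: one cluster can list another among its 3 nearest neighbors without the reverse being true. A per-node value such as a cluster's pSyk level could be stored in a second dictionary keyed by the same names and used to color the nodes.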
Figure 6. Schematization of Good et al.’s single-cell developmental classifier.
Using this approach, cancer cells are classified into their most analogous healthy cell type in normal lineage development in 2 steps. First, healthy populations across lineage development are manually gated, and a single-cell “barcode” of marker expression values is computed for each manually-gated subpopulation. Second, cancer cells are aligned with their most similar healthy subpopulation based on the Mahalanobis distance between their marker expression profile (or “barcode”) and that of each manually-gated population. Using this method, cancer cells can be classified into readily interpretable, “healthy-like” cell subtypes that each have unique properties.
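The two steps can be sketched as follows. This is a simplified illustration, not Good et al.'s implementation. It assumes only two markers, so that the covariance matrix can be inverted in closed form, and it summarizes each gated population's "barcode" as a mean vector plus a covariance matrix. The population names, the synthetic values, and all function names are hypothetical.

```python
import random


def mean_and_cov(cells):
    """Step 1: summarize a gated population by its mean and 2x2 covariance."""
    n = len(cells)
    mean_x = sum(c[0] for c in cells) / n
    mean_y = sum(c[1] for c in cells) / n
    var_x = sum((c[0] - mean_x) ** 2 for c in cells) / (n - 1)
    var_y = sum((c[1] - mean_y) ** 2 for c in cells) / (n - 1)
    cov_xy = sum((c[0] - mean_x) * (c[1] - mean_y) for c in cells) / (n - 1)
    return (mean_x, mean_y), (var_x, cov_xy, var_y)


def mahalanobis_sq(cell, mean, cov):
    """Squared Mahalanobis distance d^T S^-1 d, using the 2x2 inverse formula."""
    var_x, cov_xy, var_y = cov
    det = var_x * var_y - cov_xy ** 2
    dx, dy = cell[0] - mean[0], cell[1] - mean[1]
    return (var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det


def classify(cell, barcodes):
    """Step 2: assign a cell to the population with the smallest distance."""
    return min(barcodes, key=lambda name: mahalanobis_sq(cell, *barcodes[name]))


# Synthetic "manually gated" healthy populations (illustrative values only).
rng = random.Random(0)
healthy = {
    "population_A": [(rng.gauss(1, 0.5), rng.gauss(4, 0.5)) for _ in range(200)],
    "population_B": [(rng.gauss(4, 0.5), rng.gauss(1, 0.5)) for _ in range(200)],
}
barcodes = {name: mean_and_cov(cells) for name, cells in healthy.items()}

# Hypothetical cancer cells measured on the same two markers.
cancer_cells = [(1.2, 3.8), (3.9, 1.1)]
assigned = [classify(cell, barcodes) for cell in cancer_cells]
print(assigned)
```

Unlike plain Euclidean distance, the Mahalanobis distance scales each direction by the population's own spread, so a tightly gated population does not attract cells that lie far outside its typical range. A real panel has many markers and would need a general matrix inverse in place of the 2x2 formula used here.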
