Ann Appl Stat. 2023 Mar;17(1):357-377.
doi: 10.1214/22-aoas1631. Epub 2023 Jan 24.

MODELING CELL POPULATIONS MEASURED BY FLOW CYTOMETRY WITH COVARIATES USING SPARSE MIXTURE OF REGRESSIONS

By Sangwon Hyun et al. Ann Appl Stat. 2023 Mar.

Abstract

The ocean is filled with microscopic algae, called phytoplankton, which together are responsible for as much photosynthesis as all plants on land combined. Our ability to predict their response to the warming ocean relies on understanding how the dynamics of phytoplankton populations are influenced by changes in environmental conditions. One powerful technique for studying the dynamics of phytoplankton is flow cytometry, which measures the optical properties of thousands of individual cells per second. Today, oceanographers are able to collect flow cytometry data in real time onboard a moving ship, providing them with fine-scale resolution of the distribution of phytoplankton across thousands of kilometers. One of the current challenges is to understand how these small- and large-scale variations relate to environmental conditions, such as nutrient availability, temperature, light and ocean currents. In this paper we propose a novel sparse mixture of multivariate regressions model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations. We demonstrate the usefulness and interpretability of the approach using both synthetic data and real observations collected on an oceanographic cruise conducted in the northeast Pacific in the spring of 2017.

Keywords: Mixture of regressions; alternating direction method of multipliers; clustering; expectation-maximization; flow cytometry; gating; microbiome; ocean; phytoplankton; sparse regression.

Figures

Fig. 1.
A schematic showing the data setup. (Top) This figure shows the trajectory of the Gradients 2 cruise, which moves north and then south along a trajectory starting at Hawaii. (Middle) The individual three-dimensional particles are measured rapidly and continuously. From this we form $T$ cytograms $y^{(t)}$, $t = 1, \ldots, T$, at an hourly time resolution. The three data dimensions have simplified labels Red, Orange and Diameter; the first two represent fluorescence emission, and the last measures cell diameter. (Bottom panel) At each time $t = 1, \ldots, T$, environmental covariates $X^{(t)} \in \mathbb{R}^p$ are also available through remote sensing and on-board measurements. Only a few of the 30+ normalized covariates are highlighted here. Our proposed model identifies subpopulations by modeling them as Gaussian clusters whose means and probabilities are driven by environmental covariates.
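The last sentence of this caption can be written out as a model. What follows is a sketch of one standard mixture-of-regressions parameterization consistent with that description. The latent label $z_i^{(t)}$, the coefficients $\alpha_{0k}, \alpha_k, \beta_{0k}, \beta_k$, the covariances $\Sigma_k$, the particle count $n_t$ and the penalty weights $\lambda_\alpha, \lambda_\beta$ are notation assumed here for illustration; the $\ell_1$ penalty is one common way to obtain sparsity and is likewise an assumption, not a statement of the exact objective used in the paper.

```latex
% Particle i in cytogram t has a latent cluster label z_i^{(t)} in {1, ..., K}.
\begin{align*}
  \pi_{kt} &= P\bigl(z_i^{(t)} = k \mid X^{(t)}\bigr)
            = \frac{\exp\bigl(\alpha_{0k} + \alpha_k^\top X^{(t)}\bigr)}
                   {\sum_{l=1}^{K} \exp\bigl(\alpha_{0l} + \alpha_l^\top X^{(t)}\bigr)}, \\
  \mu_{kt} &= \beta_{0k} + \beta_k^\top X^{(t)},
            \qquad \beta_k \in \mathbb{R}^{p \times d}, \\
  y_i^{(t)} \mid z_i^{(t)} = k &\sim \mathcal{N}\bigl(\mu_{kt}, \Sigma_k\bigr).
\end{align*}

% One possible sparse estimator: penalized negative log-likelihood.
\begin{equation*}
  \min_{\alpha, \beta, \Sigma}\;
  -\sum_{t=1}^{T} \sum_{i=1}^{n_t}
     \log \sum_{k=1}^{K} \pi_{kt}\,
     \phi\bigl(y_i^{(t)}; \mu_{kt}, \Sigma_k\bigr)
  + \lambda_\alpha \sum_{k=1}^{K} \lVert \alpha_k \rVert_1
  + \lambda_\beta \sum_{k=1}^{K} \lVert \beta_k \rVert_1
\end{equation*}
```

Here $\phi(\cdot; \mu, \Sigma)$ denotes the Gaussian density and $d = 3$ for the Red, Orange and Diameter axes. Coefficients driven exactly to zero by the penalties mark covariates that are not predictive for that cluster.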
Fig. 2.
Our method produces estimates of cluster centers (shown as disks) and cluster probabilities (represented by the size of the disk) for every time point. The covariance of each mixture component (represented by an ellipse) is assumed to be constant over time. Blue and red show parameter estimates at two different time points. In the background, particles from only one time point are shown in (partially transparent) dark blue with the size of a point proportional to the particle’s biomass. The right figure takes a closer look at a subregion of the cytogram, shown in the lower left corner of the left figure, focusing on cluster E, which is a Prochlorococcus population. The change in the probability of cluster E is well predicted by sea surface temperature and phosphate, and the horizontal and vertical movements of cluster E’s center are each predicted by time-lagged sunlight and nitrate. Note that only five of the 10 clusters used for estimation are shown.
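To make the idea of time-varying centers and probabilities concrete, here is a minimal Python sketch that turns covariates into per-time cluster probabilities (via a softmax) and cluster means (via a linear map). The function name `cluster_parameters` and the coefficient arrays `alpha0`, `alpha`, `beta0`, `beta` are hypothetical names chosen for illustration; the softmax-plus-linear form is an assumed parameterization, not code from the paper.

```python
import numpy as np


def cluster_parameters(X, alpha0, alpha, beta0, beta):
    """Compute time-varying cluster probabilities and means.

    X      : (T, p) covariates, one row per time point
    alpha0 : (K,)   intercepts for cluster probabilities (assumed names)
    alpha  : (K, p) covariate coefficients for cluster probabilities
    beta0  : (K, d) intercepts for cluster means
    beta   : (K, p, d) covariate coefficients for cluster means
    Returns probs (T, K) and means (T, K, d).
    """
    logits = alpha0 + X @ alpha.T                   # (T, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)       # softmax over clusters
    means = beta0[None, :, :] + np.einsum("tp,kpd->tkd", X, beta)
    return probs, means


# Illustrative call with made-up sizes: T=5 hours, p=3 covariates,
# K=2 clusters, d=3 cytogram dimensions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))
probs_demo, means_demo = cluster_parameters(
    X_demo,
    alpha0=np.zeros(2),
    alpha=rng.normal(size=(2, 3)),
    beta0=rng.normal(size=(2, 3)),
    beta=rng.normal(size=(2, 3, 3)),
)
```

A zero row in `alpha` or `beta` means the corresponding covariate has no effect on that cluster, which is what a sparse fit is meant to reveal.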
Fig. 3.
Original particles (left), binned counts with $D = 40$ (middle) and binned biomass (right). In the middle and right plots, the size of each point is proportional to the multiplicity. The original cytogram on the left contains one hour’s worth of particles, for a total of $n_t = 36,757$ points, occupying a total of 0.86 Mb of memory. The binned cytogram in the middle occupies about one-eighth of that memory. The right-hand side shows binned biomass data, which has less imbalance in cluster distribution than the binned count data in the middle.
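The binning step described in this caption can be sketched with NumPy's `histogramdd`. This is an illustrative sketch only: the function name `bin_cytogram`, the equal-width grid and the per-particle `biomass` vector are assumptions, and the paper's actual binning code may differ.

```python
import numpy as np


def bin_cytogram(particles, biomass, D=40):
    """Bin (n, d) particles into a D^d equal-width grid.

    Returns (counts, binned_biomass, edges): the number of particles and
    the total biomass falling in each bin, plus the bin edges per axis.
    """
    d = particles.shape[1]
    # Use the observed range on each axis so every particle lands in a bin.
    ranges = [(particles[:, j].min(), particles[:, j].max()) for j in range(d)]
    counts, edges = np.histogramdd(particles, bins=D, range=ranges)
    binned_biomass, _ = np.histogramdd(
        particles, bins=D, range=ranges, weights=biomass
    )
    return counts, binned_biomass, edges


# Illustrative call on simulated particles (not real cytometry data).
rng = np.random.default_rng(1)
demo_particles = rng.normal(size=(1000, 3))
demo_biomass = rng.uniform(0.1, 1.0, size=1000)
demo_counts, demo_binned, demo_edges = bin_cytogram(demo_particles, demo_biomass)
n_occupied = np.count_nonzero(demo_counts)  # only these bins need storing
```

Only the occupied bins carry information, so storing bin centers with their counts or biomass totals is what yields the memory savings the caption mentions.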
Fig. 4.
(Left) The thick black line shows the first covariate $X_1 \in \mathbb{R}^T$, which is a smoothed and standardized version of the par (sunlight) covariate from Section 4.0.1. The three thin lines show the obscured sunlight variables for three different noise levels $\sigma_{\text{add}}$. The next covariate is a changepoint variable $X_2 \in \mathbb{R}^T$, shown as a thick red line. The remaining eight spurious covariates $\{X_i\}_{i=3}^{10}$ are generated as $T$ i.i.d. entries from $\mathcal{N}(0, 1 + \sigma_{\text{add}}^2)$; these are not shown here. (Right) An example of a generated dataset, whose particles are shown as grey points in the background. The two true cluster means are plotted as colored lines whose thickness is proportional to the cluster probabilities. Particles for both clusters are generated as $\mathcal{N}(0, 1)$ around the cluster means. Cluster 1 is only present in the second half and has one quarter of the number of particles in cluster 2 in those time points. A thin dashed line is shown in the first half where the cluster probability is zero.
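The covariate construction in the left panel can be sketched as follows. Several pieces are assumptions made only so the example runs: the clipped daily sinusoid standing in for the smoothed sunlight curve, the additive form of the obscuring noise, the changepoint at the midpoint, and the function name `make_covariates`. Only the $\mathcal{N}(0, 1 + \sigma_{\text{add}}^2)$ spurious covariates are taken directly from the caption.

```python
import numpy as np


def make_covariates(T=100, sigma_add=0.5, seed=0):
    """Generate a (T, 10) synthetic covariate matrix and the clean sunlight.

    Column 0   : standardized sunlight plus N(0, sigma_add^2) noise (assumed additive)
    Column 1   : changepoint indicator (assumed to switch at T // 2)
    Columns 2-9: eight spurious covariates, i.i.d. N(0, 1 + sigma_add^2)
    """
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    # Hypothetical stand-in for the smoothed sunlight covariate:
    # a 24-hour cycle clipped at zero, then standardized.
    sunlight = np.maximum(np.sin(2 * np.pi * t / 24), 0.0)
    x1 = (sunlight - sunlight.mean()) / sunlight.std()
    x1_obscured = x1 + rng.normal(0.0, sigma_add, size=T)
    x2 = (t >= T // 2).astype(float)
    spurious = rng.normal(0.0, np.sqrt(1.0 + sigma_add**2), size=(T, 8))
    X = np.column_stack([x1_obscured, x2, spurious])
    return X, x1


X_sim, x1_clean = make_covariates(T=100, sigma_add=0.5)
```

Giving the spurious covariates variance $1 + \sigma_{\text{add}}^2$ puts them on the same scale as the obscured sunlight variable, whose variance is also $1 + \sigma_{\text{add}}^2$ under the additive-noise assumption.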
Fig. 5.
(Left) Out-of-sample prediction performance using covariates obscured by Gaussian noise with variance $\sigma_{\text{add}}^2$ for the simulation setup described in Section 3.1.1. (Right) The probability of the sunlight covariate (the only relevant covariate for cluster means) being estimated as nonzero is shown in black lines. The corresponding probabilities for the eight spurious covariates are shown in red lines (thin red lines are individual covariates, and the thick red line is the average). The solid and dashed lines show results from cluster 1 and cluster 2, respectively. In both clusters the sunlight variable is more likely to be selected than the spurious variables. This advantage is more pronounced for cluster 1 than for cluster 2, which only has data in the second half of the time range.
Fig. 6.
Out-of-sample prediction performance for K-cluster models estimated from five-cluster pseudo-real datasets (which were each generated from a simplified version of a model estimated from real one-dimensional data, in Section 4.0.1). Models estimated with fewer than five clusters have sharply worse out-of-sample prediction performance. On the other hand, estimated models with 5 clusters or more have similar out-of-sample prediction performance, because the extra clusters are estimated to have zero probability, and play no role in the prediction.
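The claim that zero-probability clusters play no role in prediction can be checked numerically. The sketch below assumes out-of-sample performance is scored by a held-out negative log-likelihood of a one-dimensional Gaussian mixture; the caption does not name the metric, so this scoring rule and the function name `mixture_neg_loglik` are assumptions for illustration.

```python
import numpy as np


def mixture_neg_loglik(y, probs, means, sds):
    """Mean negative log-likelihood of 1D data under a Gaussian mixture."""
    y = np.asarray(y, dtype=float)[:, None]          # (n, 1)
    probs = np.asarray(probs, dtype=float)           # (K,)
    means = np.asarray(means, dtype=float)           # (K,)
    sds = np.asarray(sds, dtype=float)               # (K,)
    dens = np.exp(-0.5 * ((y - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return -np.mean(np.log(dens @ probs))            # mixture density per point


# Held-out points scored under a 2-cluster model and under the same model
# padded with a third cluster that has probability exactly zero.
y_test = np.array([0.1, -0.3, 3.8, 4.2])
score_2 = mixture_neg_loglik(y_test, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
score_3 = mixture_neg_loglik(y_test, [0.5, 0.5, 0.0], [0.0, 4.0, 9.0], [1.0, 1.0, 2.0])
```

Because the extra component is multiplied by a zero weight, `score_2` and `score_3` agree, which mirrors the flat performance curve for five or more clusters.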
Fig. 7.
(Top) The one-dimensional cell diameter biomass cytograms (log transformed) at an hourly time resolution are shown here. In the background, the one-dimensional biomass distribution of binned cell diameter data is shown in greyscale. (Bottom) The estimated five-cluster model is overlaid on the same plot; the five solid lines are the five estimated cluster means, whose thickness shows the values of the $K = 5$ cluster probabilities $\{\pi_{kt}\}_{k=1}^{K}$ over time $t = 1, \ldots, 296$ (individual hours). The shaded regions around the solid lines are the estimated ±2 standard deviations around the cluster means.
Fig. 8.
A one-dimensional slice of the estimated model of the full three-dimensional data, showing only the cell diameter axis. This figure is directly comparable to Figure 7, which uses only one-dimensional cell diameter data. The colored solid lines track the 10 estimated cluster means over time, and the line thickness shows the cluster probabilities over time. (The shaded 95% probability regions were omitted for clarity of presentation.) This model on three-dimensional data suggests finer movement of a larger number of cell populations that is not detectable using only the one-dimensional data. In particular, a clean separation of the heavily overlapping clusters 9 and 10 was not possible in the one-dimensional model but is clear in the three-dimensional model (see also Figure 9, where this separation is made apparent by the additional red axis).
Fig. 9.
The estimated three-dimensional 10-cluster model, described in Section 4.0.2, at one time point. The size of the blue points represents the biomass in each of the $40^3$ bins. The panels show various views of the cytograms: three 2D scatterplots and our estimated parameters (means, probabilities and covariances). The red dots mark the cluster centers at this time point, and the size (radius) of these red dots is proportional to the cluster probabilities. The red ellipses in dashed lines show the estimated 95% probability region of the data formed from the estimated Gaussian covariance of each cluster. The 10 estimated model clusters’ mean fluctuations and cluster probability dynamics over time can be seen in the full video at https://youtu.be/jSxgVvT2wr4; a single frame of this video is shown in Figure 9.
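A 95% probability ellipse of the kind drawn in these panels can be computed from a 2D Gaussian covariance with an eigendecomposition. The sketch below is a generic construction, not the paper's plotting code; the function name `probability_ellipse` is hypothetical. It uses the fact that the squared Mahalanobis distance of a 2D Gaussian follows a chi-square distribution with two degrees of freedom, whose quantile has the closed form $-2\ln(1 - p)$.

```python
import numpy as np


def probability_ellipse(mean, cov, prob=0.95, n_points=200):
    """Boundary of the region holding `prob` of a 2D Gaussian's mass.

    Returns an (n_points, 2) array of points on the ellipse.
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Chi-square quantile with 2 degrees of freedom (closed form).
    r2 = -2.0 * np.log(1.0 - prob)
    eigvals, eigvecs = np.linalg.eigh(cov)           # cov is symmetric
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    circle = np.stack([np.cos(theta), np.sin(theta)])  # unit circle, (2, n)
    # Stretch the circle along the principal axes, then rotate and shift.
    pts = eigvecs @ (np.sqrt(r2 * eigvals)[:, None] * circle)
    return (mean[:, None] + pts).T


# For a 2D panel of a 3D cluster, use the matching 2x2 block of the
# 3x3 covariance (the marginal of a Gaussian is Gaussian).
cov3 = np.array([[1.0, 0.3, 0.1],
                 [0.3, 2.0, 0.2],
                 [0.1, 0.2, 0.5]])
mean3 = np.array([0.0, 1.0, -1.0])
ellipse_xy = probability_ellipse(mean3[:2], cov3[:2, :2])
```

Every returned point has the same squared Mahalanobis distance from the mean, which is what makes the curve a constant-density contour.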
Fig. 10.
This figure shows the relative biomass of Prochlorococcus, measured in two ways: using traditional gating (black line) and using the estimated cluster probability of cluster 10 (purple) in the three-dimensional data in Section 4.0.2 and Figure 9. One noticeable discrepancy is on June 8th and 9th. The gating (black line) abruptly jumps from 0 to 0.5, due to flaws in automatic gating, while our model (purple) suggests a gradual increase on June 8th and onward. Visual inspection and expert annotation of this cluster in the cytogram suggest that our model cluster 10 is correctly tracking Prochlorococcus.
