[Preprint]. 2024 Dec 17:arXiv:2407.16727v2.

A study of animal action segmentation algorithms across supervised, unsupervised, and semi-supervised learning paradigms


Ari Blau et al. ArXiv.

Abstract

Action segmentation of behavioral videos is the process of labeling each frame as belonging to one or more discrete classes, and is a crucial component of many studies that investigate animal behavior. A wide range of algorithms exists to automatically parse discrete animal behavior, encompassing supervised, unsupervised, and semi-supervised learning paradigms. These algorithms, which include tree-based models, deep neural networks, and graphical models, differ widely in their structure and in their assumptions about the data. Using four datasets spanning multiple species (fly, mouse, and human), we systematically study how the outputs of these various algorithms align with manually annotated behaviors of interest. Along the way, we introduce a semi-supervised action segmentation model that bridges the gap between supervised deep neural networks and unsupervised graphical models. We find that fully supervised temporal convolutional networks with temporal information added to the observations perform best on our supervised metrics across all datasets.
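Adding "temporal information in the observations" amounts to giving the classifier a window of frames rather than a single frame. A minimal sketch of that windowing step is below; the function name `make_windows`, the window size, and the zero-padding at sequence edges are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def make_windows(feats, win=5):
    """Stack a symmetric window of frames around each time step.

    feats: (T, D) array of per-frame behavioral features (e.g. poses).
    Returns a (T, (2*win + 1) * D) array; sequence edges are zero-padded.
    """
    T, D = feats.shape
    padded = np.pad(feats, ((win, win), (0, 0)))
    return np.stack([padded[t:t + 2 * win + 1].ravel() for t in range(T)])

x = np.random.default_rng(0).normal(size=(100, 4))
w = make_windows(x, win=5)
print(w.shape)  # (100, 44)
```

Each row of `w` can then be fed to a frame-wise classifier, so temporal context enters through the features even when the classifier itself is not recurrent.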

Figures

FIGURE 1. Overview of the action segmentation pipeline.
Raw sensor data (e.g. video) is collected, then features are extracted (e.g. pose estimates), then an action segmentation model is trained to map those features to a discrete behavioral class for each frame.
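The final stage of this pipeline (features in, one class per frame out) can be sketched with a toy frame-wise classifier; the synthetic features and the choice of logistic regression are placeholders for illustration, not the paper's models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the pipeline's last stage: per-frame features -> per-frame class.
# (Illustrative only; in the paper, features are pose estimates extracted from video.)
rng = np.random.default_rng(0)
T = 300
labels = (np.sin(np.arange(T) / 20.0) > 0).astype(int)        # two alternating behaviors
feats = labels[:, None] + rng.normal(scale=0.3, size=(T, 3))  # noisy 3-D features

clf = LogisticRegression().fit(feats, labels)
frame_acc = (clf.predict(feats) == labels).mean()
```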
FIGURE 2. Overview of action segmentation models.
A: Top: Graphical model for supervised classification. Both discrete states yt and poses xt are observed. Bottom: Inference network for the supervised model. We use a window of observed behavioral features for state prediction. B: Top: Graphical model for an unsupervised recurrent switching dynamical system. The set of discrete states yt and continuous latents zt are unobserved. Bottom: The inference network uses a window of observed behavioral features to create a deterministic hidden representation ht (purple arrows); this is then used to predict the continuous latents zt (blue arrows) and discrete latents yt (red arrows). Note that the purple and red arrows together define a classifier for the discrete state at each time step. C: Graphical model and inference network for a semi-supervised recurrent switching dynamical system. A subset of the discrete states are observed. During inference, the observed discrete state is used for the inference of zt when possible.
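The generative side of panel B can be sketched as a toy switching linear dynamical system. This is a deliberately simplified version: it omits the recurrent dependence of the discrete state on the previous continuous latent that the paper's recurrent switching model uses, and all dimensions and noise scales are arbitrary:

```python
import numpy as np

def sample_switching_lds(T=200, K=3, Dz=2, Dx=4, seed=0):
    """Sample a toy switching LDS: the discrete state y_t follows a Markov
    chain, the continuous latent z_t evolves under state-specific linear
    dynamics A[y_t], and x_t is a linear readout of z_t.
    (A simplified sketch, not the paper's recurrent switching model.)
    """
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(5 * np.ones(K), size=K)    # K x K transition matrix
    A = rng.normal(scale=0.3, size=(K, Dz, Dz))  # per-state dynamics
    C = rng.normal(size=(Dx, Dz))                # shared emission matrix
    y = np.zeros(T, dtype=int)
    z = np.zeros((T, Dz))
    x = np.zeros((T, Dx))
    for t in range(1, T):
        y[t] = rng.choice(K, p=P[y[t - 1]])
        z[t] = A[y[t]] @ z[t - 1] + rng.normal(scale=0.1, size=Dz)
        x[t] = C @ z[t] + rng.normal(scale=0.05, size=Dx)
    return y, z, x

y, z, x = sample_switching_lds()
```

In the semi-supervised setting of panel C, some entries of `y` would be observed and clamped during inference rather than inferred.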
FIGURE 3. Supervised vs semi-supervised results for the head-fixed fly.
A: Example frame of the fly, overlaid with pose markers. B: Proportion of each labeled behavior in the training dataset. C: Sample of ground truth labels, along with predictions from both the TCN and the S3LDS models. Below is a subset of the corresponding features used as inputs to the models. D: F1 scores for the TCN and S3LDS models. We show results for the position features (solid lines) as well as the position-velocity features (dashed lines). Adding velocity improves performance for both models. The number of unlabeled frames used in the models with the smallest number of labeled frames is displayed in the upper right corner of the graph; this number decreases as we add labels for each consecutive set of models. Error bars represent the standard deviation of the F1 scores over five subsamples of the training data. E: Confusion matrices for the TCN and S3LDS models. F: Average entropy of the false positives (left) and true positives (right) for both models. Entropy results for the other datasets are shown in Fig. S3. Panels E and F show results from the models trained on all labeled frames with position-velocity features.
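The position-velocity features and the per-class F1 scores in panels C and D can be sketched as follows; `add_velocity` (first differences as a velocity proxy) and the toy labels are illustrative assumptions, not the paper's exact preprocessing:

```python
import numpy as np
from sklearn.metrics import f1_score

def add_velocity(pos):
    """Append frame-to-frame differences (a simple velocity proxy) to positions."""
    vel = np.diff(pos, axis=0, prepend=pos[:1])  # first frame gets zero velocity
    return np.concatenate([pos, vel], axis=1)

pos = np.cumsum(np.random.default_rng(0).normal(size=(100, 6)), axis=0)
posvel = add_velocity(pos)  # shape (100, 12)

# Per-class F1, as reported in panel D (toy labels here):
true = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 2, 2, 2])
print(f1_score(true, pred, average=None))  # class-wise F1: [1.0, ~0.667, 0.8]
```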
FIGURE 4. Supervised vs semi-supervised results across datasets.
Conventions as in Fig. 3. As in the head-fixed fly, we find that position-velocity features improve performance over position features for both model types, and in all datasets the TCN performs best. A: Results on the freely moving mouse dataset. Rather than using the raw poses, we compute the features introduced in Sturman et al. (2020), which are transformations of the poses, including distances and angles between different groups of keypoints. B: Results on the head-fixed mouse dataset. C: Results on the HuGaDB dataset. The data come from inertial sensors that already include velocity signals, so we use only one set of features.
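Distance and angle features of the kind used in panel A can be sketched as below; the function names are hypothetical and only two representative feature types are shown, not the full feature set of Sturman et al. (2020):

```python
import numpy as np

def keypoint_distance(a, b):
    """Euclidean distance between two keypoint trajectories of shape (T, 2)."""
    return np.linalg.norm(a - b, axis=-1)

def keypoint_angle(a, b, c):
    """Angle (radians) at keypoint b, formed by the segments b->a and b->c."""
    u, v = a - b, c - b
    cos = (u * v).sum(-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Right-angle sanity check on a single frame:
a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 0.0]])
c = np.array([[0.0, 1.0]])
print(keypoint_angle(a, b, c))  # [1.5707...] (pi / 2)
```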
FIGURE 5. Supervised and semi-supervised latent spaces more closely align with labels than unsupervised latents (head-fixed fly).
All models use position-velocity features and all available training videos for the head-fixed fly dataset. A: The top row shows a segment of ground truth labels. The next two rows show predictions from the TCN and S3LDS models. The fourth row shows the state outputs of keypoint-MoSeq (KPM), aligned to the ground truth class with highest overlap on the training data. The final row shows the raw state outputs of keypoint-MoSeq. B: F1 scores for the TCN, S3LDS and KPM models. Error bars represent the standard deviation of the F1 scores over five trained models (different initialization seeds). C: 2D UMAP embedding of continuous latents colored by discrete labels for three different models. D: The addition of hand labels produces more homogeneous clusters in the models’ latent spaces. Error bars represent the standard deviation of the cluster scores over five models. We use a range of cluster numbers to show that cluster scores are not biased by cluster size.
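The overlap-based alignment of unsupervised states to ground truth classes (panel A) can be sketched as follows; `align_states` is an illustrative name, and the homogeneity score shown is one standard choice of cluster score:

```python
import numpy as np
from sklearn.metrics import homogeneity_score

def align_states(pred, true, n_states, n_classes):
    """Relabel each unsupervised state as the ground-truth class it
    overlaps most often, via a state-by-class count matrix."""
    conf = np.zeros((n_states, n_classes), dtype=int)
    np.add.at(conf, (pred, true), 1)
    return conf.argmax(axis=1)[pred]

true = np.array([0, 0, 1, 1, 2, 2])
states = np.array([2, 2, 0, 0, 1, 1])  # a permuted but perfect labeling
aligned = align_states(states, true, 3, 3)
print(aligned, homogeneity_score(true, states))  # [0 0 1 1 2 2] 1.0
```

Note that the homogeneity score is invariant to the permutation, so it can compare unsupervised state sequences to labels without any alignment step.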
FIGURE 6. Keypoint-MoSeq performance on non-fly datasets: position-velocity features.
The mouse datasets (panels A and B) are trained with position-velocity features, while the HuGaDB dataset uses inertial sensor data (panel C). Other conventions as in Fig. 5. As in the fly dataset, we find that the TCN, which is purely supervised, achieves the highest alignment of the latent space with the ground truth labels, as measured by the cluster homogeneity score.
