[Preprint]. 2024 Dec 17:arXiv:2407.16727v2.

A study of animal action segmentation algorithms across supervised, unsupervised, and semi-supervised learning paradigms


Ari Blau et al. ArXiv.

Abstract

Action segmentation of behavioral videos is the process of labeling each frame as belonging to one or more discrete classes, and is a crucial component of many studies that investigate animal behavior. A wide range of algorithms exists to automatically parse discrete animal behavior, encompassing supervised, unsupervised, and semi-supervised learning paradigms. These algorithms, which include tree-based models, deep neural networks, and graphical models, differ widely in their structure and in their assumptions about the data. Using four datasets spanning multiple species (fly, mouse, and human), we systematically study how the outputs of these various algorithms align with manually annotated behaviors of interest. Along the way, we introduce a semi-supervised action segmentation model that bridges the gap between supervised deep neural networks and unsupervised graphical models. We find that fully supervised temporal convolutional networks with temporal information added to the observations perform best on our supervised metrics across all datasets.
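Adding "temporal information in the observations" amounts to giving the classifier a window of frames rather than a single frame. A minimal sketch of that windowing step is below; the function name `make_windows`, the window size, and the zero-padding at sequence edges are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def make_windows(feats, win=5):
    """Stack a symmetric window of frames around each time step.

    feats: (T, D) array of per-frame behavioral features (e.g. poses).
    Returns a (T, (2*win + 1) * D) array; sequence edges are zero-padded.
    """
    T, D = feats.shape
    padded = np.pad(feats, ((win, win), (0, 0)))
    return np.stack([padded[t:t + 2 * win + 1].ravel() for t in range(T)])

x = np.random.default_rng(0).normal(size=(100, 4))
w = make_windows(x, win=5)
print(w.shape)  # (100, 44)
```

Each row of `w` can then be fed to a frame-wise classifier, so temporal context enters through the features even when the classifier itself is not recurrent.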

Figures

FIGURE 1. Overview of the action segmentation pipeline.
Raw sensor data (e.g. video) is collected, then features are extracted (e.g. pose estimates), then an action segmentation model is trained to map those features to a discrete behavioral class for each frame.
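The final stage of this pipeline (features in, one class per frame out) can be sketched with a toy frame-wise classifier; the synthetic features and the choice of logistic regression are placeholders for illustration, not the paper's models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the pipeline's last stage: per-frame features -> per-frame class.
# (Illustrative only; in the paper, features are pose estimates extracted from video.)
rng = np.random.default_rng(0)
T = 300
labels = (np.sin(np.arange(T) / 20.0) > 0).astype(int)        # two alternating behaviors
feats = labels[:, None] + rng.normal(scale=0.3, size=(T, 3))  # noisy 3-D features

clf = LogisticRegression().fit(feats, labels)
frame_acc = (clf.predict(feats) == labels).mean()
```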
FIGURE 2. Overview of action segmentation models.
A: Top: Graphical model for supervised classification. Both discrete states yt and poses xt are observed. Bottom: Inference network for the supervised model. We use a window of observed behavioral features for state prediction. B: Top: Graphical model for an unsupervised recurrent switching dynamical system. The set of discrete states yt and continuous latents zt are unobserved. Bottom: The inference network uses a window of observed behavioral features to create a deterministic hidden representation ht (purple arrows); this is then used to predict the continuous latents zt (blue arrows) and discrete latents yt (red arrows). Note that the purple and red arrows together define a classifier for the discrete state at each time step. C: Graphical model and inference network for a semi-supervised recurrent switching dynamical system. A subset of the discrete states are observed. During inference, the observed discrete state is used for the inference of zt when possible.
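The generative side of panel B can be sketched as a toy switching linear dynamical system. This is a deliberately simplified version: it omits the recurrent dependence of the discrete state on the previous continuous latent that the paper's recurrent switching model uses, and all dimensions and noise scales are arbitrary:

```python
import numpy as np

def sample_switching_lds(T=200, K=3, Dz=2, Dx=4, seed=0):
    """Sample a toy switching LDS: the discrete state y_t follows a Markov
    chain, the continuous latent z_t evolves under state-specific linear
    dynamics A[y_t], and x_t is a linear readout of z_t.
    (A simplified sketch, not the paper's recurrent switching model.)
    """
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(5 * np.ones(K), size=K)    # K x K transition matrix
    A = rng.normal(scale=0.3, size=(K, Dz, Dz))  # per-state dynamics
    C = rng.normal(size=(Dx, Dz))                # shared emission matrix
    y = np.zeros(T, dtype=int)
    z = np.zeros((T, Dz))
    x = np.zeros((T, Dx))
    for t in range(1, T):
        y[t] = rng.choice(K, p=P[y[t - 1]])
        z[t] = A[y[t]] @ z[t - 1] + rng.normal(scale=0.1, size=Dz)
        x[t] = C @ z[t] + rng.normal(scale=0.05, size=Dx)
    return y, z, x

y, z, x = sample_switching_lds()
```

In the semi-supervised setting of panel C, some entries of `y` would be observed and clamped during inference rather than inferred.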
FIGURE 3. Supervised vs semi-supervised results for the head-fixed fly.
A: Example frame of the fly, overlaid with pose markers. B: Proportion of each labeled behavior in the training dataset. C: Sample of ground truth labels, along with predictions from both the TCN and the S3LDS models. Below is a subset of the corresponding features used as inputs to the models. D: F1 scores for the TCN and S3LDS models. We show results for the position features (solid lines) as well as the position-velocity features (dashed lines). Adding velocity improves performance for both models. The number of unlabeled frames used in the models with the smallest number of labeled frames is displayed in the upper right corner of the graph; this number decreases as we add labels for each consecutive set of models. Error bars represent the standard deviation of the F1 scores over five subsamples of the training data. E: Confusion matrices for the TCN and S3LDS models. F: Average entropy of the false positives (left) and true positives (right) for both models. Entropy results for the other datasets are shown in Fig. S3. Panels E and F show results from the models trained on all labeled frames with position-velocity features.
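The position-velocity features and the per-class F1 scores in panels C and D can be sketched as follows; `add_velocity` (first differences as a velocity proxy) and the toy labels are illustrative assumptions, not the paper's exact preprocessing:

```python
import numpy as np
from sklearn.metrics import f1_score

def add_velocity(pos):
    """Append frame-to-frame differences (a simple velocity proxy) to positions."""
    vel = np.diff(pos, axis=0, prepend=pos[:1])  # first frame gets zero velocity
    return np.concatenate([pos, vel], axis=1)

pos = np.cumsum(np.random.default_rng(0).normal(size=(100, 6)), axis=0)
posvel = add_velocity(pos)  # shape (100, 12)

# Per-class F1, as reported in panel D (toy labels here):
true = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 2, 2, 2])
print(f1_score(true, pred, average=None))  # class-wise F1: [1.0, ~0.667, 0.8]
```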
FIGURE 4. Supervised vs semi-supervised results across datasets.
Conventions as in Fig. 3. As in the head-fixed fly, we find that position-velocity features improve performance over position features for both model types, and in all datasets the TCN performs best. A: Results on the freely moving mouse dataset. Rather than using the raw poses, we compute the features introduced in Sturman et al. (2020), which are transformations of the poses, including distances and angles between different groups of keypoints. B: Results on the head-fixed mouse dataset. C: Results on the HuGaDB dataset. The data come from inertial sensors that already include velocity signals, so we use only one set of features.
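Distance and angle features of the kind used in panel A can be sketched as below; the function names are hypothetical and only two representative feature types are shown, not the full feature set of Sturman et al. (2020):

```python
import numpy as np

def keypoint_distance(a, b):
    """Euclidean distance between two keypoint trajectories of shape (T, 2)."""
    return np.linalg.norm(a - b, axis=-1)

def keypoint_angle(a, b, c):
    """Angle (radians) at keypoint b, formed by the segments b->a and b->c."""
    u, v = a - b, c - b
    cos = (u * v).sum(-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Right-angle sanity check on a single frame:
a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 0.0]])
c = np.array([[0.0, 1.0]])
print(keypoint_angle(a, b, c))  # [1.5707...] (pi / 2)
```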
FIGURE 5. Supervised and semi-supervised latent spaces more closely align with labels than unsupervised latents (head-fixed fly).
All models use position-velocity features and all available training videos for the head-fixed fly dataset. A: The top row shows a segment of ground truth labels. The next two rows show predictions from the TCN and S3LDS models. The fourth row shows the state outputs of keypoint-MoSeq (KPM), aligned to the ground truth class with highest overlap on the training data. The final row shows the raw state outputs of keypoint-MoSeq. B: F1 scores for the TCN, S3LDS and KPM models. Error bars represent the standard deviation of the F1 scores over five trained models (different initialization seeds). C: 2D UMAP embedding of continuous latents colored by discrete labels for three different models. D: The addition of hand labels produces more homogeneous clusters in the models’ latent spaces. Error bars represent the standard deviation of the cluster scores over five models. We use a range of cluster numbers to show that cluster scores are not biased by cluster size.
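The overlap-based alignment of unsupervised states to ground truth classes (panel A) can be sketched as follows; `align_states` is an illustrative name, and the homogeneity score shown is one standard choice of cluster score:

```python
import numpy as np
from sklearn.metrics import homogeneity_score

def align_states(pred, true, n_states, n_classes):
    """Relabel each unsupervised state as the ground-truth class it
    overlaps most often, via a state-by-class count matrix."""
    conf = np.zeros((n_states, n_classes), dtype=int)
    np.add.at(conf, (pred, true), 1)
    return conf.argmax(axis=1)[pred]

true = np.array([0, 0, 1, 1, 2, 2])
states = np.array([2, 2, 0, 0, 1, 1])  # a permuted but perfect labeling
aligned = align_states(states, true, 3, 3)
print(aligned, homogeneity_score(true, states))  # [0 0 1 1 2 2] 1.0
```

Note that the homogeneity score is invariant to the permutation, so it can compare unsupervised state sequences to labels without any alignment step.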
FIGURE 6. Keypoint-MoSeq performance on non-fly datasets: position-velocity features.
The mouse datasets (panels A and B) are trained with position-velocity features, while the HuGaDB dataset uses inertial sensor data (panel C). Other conventions as in Fig. 5. As in the fly dataset, we find that the TCN, which is purely supervised, achieves the highest alignment of the latent space with the ground truth labels, as measured by the cluster homogeneity score.
