[Preprint]. 2023 Dec 23:2023.03.16.532307.
doi: 10.1101/2023.03.16.532307.

Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics

Caleb Weinreb et al. bioRxiv.

Update in

  • Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics.
    Weinreb C, Pearl JE, Lin S, Osman MAM, Zhang L, Annapragada S, Conlin E, Hoffmann R, Makowska S, Gillis WF, Jay M, Ye S, Mathis A, Mathis MW, Pereira T, Linderman SW, Datta SR. Nat Methods. 2024 Jul;21(7):1329-1339. doi: 10.1038/s41592-024-02318-2. Epub 2024 Jul 12. PMID: 38997595. Free PMC article.

Abstract

Keypoint tracking algorithms have revolutionized the analysis of animal behavior, enabling investigators to flexibly quantify behavioral dynamics from conventional video recordings obtained in a wide variety of settings. However, it remains unclear how to parse continuous keypoint data into the modules out of which behavior is organized. This challenge is particularly acute because keypoint data is susceptible to high frequency jitter that clustering algorithms can mistake for transitions between behavioral modules. Here we present keypoint-MoSeq, a machine learning-based platform for identifying behavioral modules ("syllables") from keypoint data without human supervision. Keypoint-MoSeq uses a generative model to distinguish keypoint noise from behavior, enabling it to effectively identify syllables whose boundaries correspond to natural sub-second discontinuities inherent to mouse behavior. Keypoint-MoSeq outperforms commonly used alternative clustering methods at identifying these transitions, at capturing correlations between neural activity and behavior, and at classifying either solitary or social behaviors in accordance with human annotations. Keypoint-MoSeq therefore renders behavioral syllables and grammar accessible to the many researchers who use standard video to capture animal behavior.


Conflict of interest statement

Competing interests: S.R.D. sits on the scientific advisory boards of Neumora and Gilgamesh Therapeutics, which have licensed or sub-licensed the MoSeq technology.

Figures

Extended Data Figure 1: Mouse behavior exhibits sub-second syllable structure when keypoints are tracked from below.
a) 2D keypoints tracked using infrared video from a camera viewing the mouse through a transparent floor. b) Egocentrically aligned keypoint trajectories (bottom) and change scores derived from those keypoints (top, see Methods). Vertical dashed lines represent changepoints (peaks in the change score). c) Distribution of inter-changepoint intervals. d) Keypoint change score aligned to syllable transitions from depth MoSeq. Results in (c) and (d) are shown for the full dataset (black lines) and for each recording session (gray lines).
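
The change score and changepoints referenced throughout the legends are defined in the paper's Methods. As a rough illustration only, the sketch below computes a simple windowed-displacement score from egocentrically aligned keypoints and marks changepoints as peaks in that score; the window size, peak threshold, and function names are assumptions, not the authors' implementation.

    # Illustrative sketch only: a simple per-frame change score from aligned
    # keypoints, with changepoints taken as peaks in that score. The windowed
    # displacement used here is an assumption; the paper's exact score is
    # defined in its Methods.
    import numpy as np
    from scipy.signal import find_peaks

    def change_score(keypoints, window=5):
        """keypoints: (frames, n_keypoints, 2) array, egocentrically aligned."""
        flat = keypoints.reshape(len(keypoints), -1)
        disp = np.linalg.norm(flat[window:] - flat[:-window], axis=1)
        score = np.concatenate([np.zeros(window), disp])
        return (score - score.mean()) / score.std()   # z-score, as plotted

    def changepoints(score, min_gap=10, height=0.5):
        """Frames of peaks in the change score, at least min_gap frames apart."""
        peaks, _ = find_peaks(score, distance=min_gap, height=height)
        return peaks
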
Extended Data Figure 2: Markerless pose tracking exhibits fast fluctuations that are independent of behavior yet affect MoSeq output.
a) Noise-driven fast fluctuations are pervasive across camera angles and tracking methods. Cross-correlation between the spectral content of keypoint fluctuations and either error magnitude (left) or a measure of low-confidence keypoint detections (right) (see Methods). b-d) Tracking noise reflects ambiguity in keypoint locations. b) Magnitude of fast fluctuations in keypoint position for three different tracking methods, calculated as the per-frame distance from the measured trajectory of a keypoint to a smoothed version of the same trajectory, where smoothing was performed using a Gaussian kernel with width 100 ms. c) Inter-annotator variability, shown as the distribution of distances between different annotations of the same keypoint. d) Train- and test-error distributions for each keypoint tracking method. e) Fast fluctuations are weakly correlated between camera angles. Top: position of the nose and tail-base over a 10-second interval, shown for both the overhead and below-floor cameras. Bottom: fast fluctuations in each coordinate, obtained as residuals after median filtering. f) Cross-correlation between spectrograms obtained from two different camera angles for either the tail base or the nose, shown for each tracking method. g) Filtering keypoint trajectories does not improve MoSeq output. Cross-correlation of transition rates, comparing MoSeq (depth) and MoSeq applied to keypoints with various levels of smoothing using either a Gaussian or median filter.
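
Panel (b)'s fluctuation magnitude lends itself to a short sketch: the per-frame distance between a keypoint's raw trajectory and a Gaussian-smoothed copy. The 100 ms kernel width comes from the legend; treating that width as the filter's sigma, and the 30 fps frame rate, are assumptions made here for illustration.

    # Sketch of the panel (b) metric: distance from each raw keypoint position
    # to a Gaussian-smoothed version of the same trajectory. The 30 fps rate and
    # the use of the 100 ms width as the filter sigma are assumptions.
    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def fast_fluctuation_magnitude(trajectory, fps=30, kernel_ms=100):
        """trajectory: (frames, 2) array for a single keypoint."""
        sigma_frames = (kernel_ms / 1000) * fps
        smoothed = gaussian_filter1d(trajectory, sigma=sigma_frames, axis=0)
        return np.linalg.norm(trajectory - smoothed, axis=1)  # per-frame distance
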
Extended Data Figure 3: Keypoint-MoSeq partitions behavior into distinct, well-defined syllables.
a) Keypoint-MoSeq and depth MoSeq yield similar duration distributions. Relationship between mean and median syllable duration as the temporal stickiness hyper-parameter κ is varied, shown for keypoint-MoSeq (red dots), as well as original MoSeq applied to depth (dashed line) or keypoints (solid line). b) Keypoint-MoSeq syllables represent distinguishable pose trajectories. Syllable cross-likelihoods, defined as the probability, on average, that time-intervals assigned to one syllable (column) could have arisen from another syllable (row). Cross-likelihoods were calculated for keypoint-MoSeq and for depth MoSeq. The results for both methods are plotted twice, using either an absolute scale (left) or a log scale (right). Note that the off-diagonal cross-likelihoods apparent for keypoint-MoSeq on the log scale are practically negligible; we show them here to emphasize that MoSeq models have higher uncertainty when fed lower dimensional data like keypoints compared to depth data. c) Keypoint-MoSeq fails to distinguish syllables when input data lacks changepoints. Modeling results for synthetic keypoint data with a similar statistical structure as the real data but lacking in changepoints (see Methods). Left: example of synthetic keypoint trajectories. Middle: autocorrelation of keypoint coordinates for real vs. synthetic data, showing similar dynamics at short timescales. Right: distribution of syllable frequencies for keypoint-MoSeq models trained on real vs. synthetic data. d-e) Syllable-associated kinematics. d) Average pose trajectories for syllables identified by keypoint-MoSeq. Each trajectory includes ten evenly timed poses from 165ms before to 500ms after syllable onset. e) Kinematic and morphological parameters for each syllable.
Extended Data Figure 4: Method-to-method differences in sensitivity to behavioral changepoints are robust to parameter settings.
a) Output of unsupervised behavior segmentation algorithms across a range of parameter settings, applied to 2D keypoint data from two different camera angles. The median state duration (left) and the average (z-scored) keypoint change score aligned to state transitions (right) are shown for each method and parameter value. Gray pointers indicate default parameter values used for subsequent analysis. b) Distributions showing the number of transitions that occur during each rear. c) Accuracy of kinematic decoding models that were fit to state sequences from each method.
Extended Data Figure 5: Accelerometry reveals kinematic transitions at the onsets of keypoint-MoSeq states.
a) IMU signals aligned to state onsets from several behavior segmentation methods. Each row corresponds to a behavior state and shows the average across all onset times for that state. b) As (a) for acceleration but showing the median across all states.
Extended Data Figure 6: Striatal dopamine fluctuations are enriched at keypoint-MoSeq syllable onsets.
a) Keypoint-MoSeq best captures dopamine fluctuations for both high- and low-velocity behaviors. Derivative of the dopamine signal aligned to the onsets of high velocity or low velocity behavior states. States from each method were classified evenly as high or low velocity based on the mean centroid velocity during their respective frames. b) Distributions capturing the average of the dopamine signal across states from each method. c-d) Keypoint-MoSeq syllable onsets are meaningful landmarks for neural data analysis. c) Relationship between state durations and correlations from Fig 5f, showing that the impact of randomization is not a simple function of state duration. d) Average dopamine fluctuations aligned to state onsets (left) or aligned to random frames throughout the execution of each state (middle), as well as the absolute difference between the two alignment approaches (right), shown for each unsupervised behavior segmentation approach.
Extended Data Figure 7: Supervised behavior benchmark.
a-d) Keypoint-MoSeq captures sub-second syllable structure in two benchmark datasets. a,b) Distribution of inter-changepoint intervals for the open field dataset (Bohnslav, 2019) (a) and the CalMS21 social behavior benchmark (b), shown respectively for the full datasets (black lines) and for each recording session (gray lines). c,d) Distribution of state durations from each behavior segmentation method. e-g) Keypoint-MoSeq matches or outperforms other methods when quantifying the agreement between human annotations and unsupervised behavior labels. e,f) Three different similarity measures applied to the output of each unsupervised behavior analysis method (see Methods). g) Number of unsupervised states specific to each human-annotated behavior in the CalMS21 dataset, shown for 20 independent fits of each unsupervised method. A state was defined as specific if > 50% of frames bore the annotation.
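
Panel (g)'s specificity criterion is simple enough to state in code: an unsupervised state counts as specific to a human-annotated behavior when more than half of its frames carry that annotation. The sketch below assumes frame-wise integer label arrays; the array names and structure are hypothetical.

    # Sketch of the panel (g) criterion: a state is "specific" to an annotated
    # behavior if >50% of its frames bear that annotation. Inputs are
    # hypothetical frame-wise label arrays of equal length.
    import numpy as np

    def specific_state_counts(state_labels, annotations):
        counts = {}                     # behavior -> number of specific states
        for state in np.unique(state_labels):
            frames = annotations[state_labels == state]
            behaviors, n = np.unique(frames, return_counts=True)
            top = n.argmax()
            if n[top] / len(frames) > 0.5:
                counts[behaviors[top]] = counts.get(behaviors[top], 0) + 1
        return counts
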
Extended Data Figure 8: 3D and 2D keypoints provide qualitatively distinct pose representations yet share sub-second temporal structure.
a) 3D keypoints have smoother trajectories and exhibit oscillatory gait dynamics. Left: Keypoints tracked in 2D (top) or 3D (bottom) and corresponding egocentric coordinate axes. Right: example keypoint trajectories and transition rates from keypoint-MoSeq. Transition rate is defined as the posterior probability of a transition occurring on each frame. b) 2D keypoints, 3D keypoints and depth data provide increasingly high-dimensional pose representations. Cumulative fraction of explained variance for increasing number of principal components (PCs). PCs were fit to egocentrically aligned 2D keypoints, egocentrically aligned 3D keypoints, or depth videos respectively. c-d) 3D keypoints capture sub-second syllable structure. c) Distribution of inter-changepoint intervals in the 3D keypoint dataset. d) Cross-correlation between the 3D keypoint change score and change scores derived from 2D keypoints and depth respectively.
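
The dimensionality comparison in panel (b) is a standard cumulative explained-variance curve; a minimal sketch using scikit-learn is shown below, with a hypothetical input of flattened, egocentrically aligned poses.

    # Sketch of panel (b): cumulative fraction of explained variance versus the
    # number of principal components, for a hypothetical (frames, features)
    # matrix of aligned poses (2D keypoints, 3D keypoints, or depth pixels).
    import numpy as np
    from sklearn.decomposition import PCA

    def cumulative_explained_variance(poses):
        pca = PCA().fit(poses)
        return np.cumsum(pca.explained_variance_ratio_)
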
Extended Data Figure 9: Keypoint-MoSeq analysis of rat motion capture data.
a) Top: 3D marker positions in egocentric coordinates. Middle: change score derived from the marker trajectories. Bottom: keypoint-MoSeq syllables. b) Random sample of centroid locations during execution of the “lever-press” syllable shown in Fig 6o.
Figure 1: Keypoint trajectories exhibit sub-second to second structure during spontaneous behavior.
a) Left: sample frame from simultaneous depth and 2D infrared recordings. Right: centered and aligned pose representations using the depth data (top) or infrared (bottom, tracked keypoints indicated). b-c) Features extracted from depth or 2D keypoint data within a 4-second window. All rows are temporally aligned. b) Top: Representation of the mouse’s pose based on depth video. Each row shows a random projection of the high-dimensional depth time-series. Discontinuities in the visual pattern capture abrupt changes in the mouse’s movement. Middle: Rate of change in the depth signal as quantified by a change score (see Methods). Bottom: color-coded syllable sequence from MoSeq applied to the depth data [referred to as “MoSeq (depth)”]. c) Position of each keypoint in egocentric coordinates; vertical lines mark changepoints, defined as peaks in the keypoint change score. d) Left: average keypoint change score (z-scored) aligned to MoSeq (depth) transitions (gray), or to changepoints in the depth signal (black). Middle: cross-correlation between depth- and keypoint-change scores, shown for the whole dataset (black line) and for each session (gray lines). Right: Distribution of syllable durations, based either on modeling or changepoint analysis.
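
The cross-correlation in panel (d, middle) compares two per-frame change scores across a range of temporal lags. A minimal sketch is given below; it assumes both scores are equal-length arrays sampled on the same frames, and the lag range is an illustrative choice.

    # Sketch of the Fig 1d (middle) analysis: normalized cross-correlation
    # between depth- and keypoint-derived change scores over a symmetric range
    # of lags (in frames). Assumes equal-length, frame-aligned input arrays.
    import numpy as np

    def cross_correlation(a, b, max_lag=60):
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        lags = np.arange(-max_lag, max_lag + 1)
        corr = [np.mean(a[max(0, -l): len(a) - max(0, l)] *
                        b[max(0, l): len(b) - max(0, -l)]) for l in lags]
        return lags, np.array(corr)
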
Figure 2: Keypoint tracking noise challenges syllable inference.
a) Applying traditional MoSeq to keypoint trajectories [referred to as “MoSeq (keypoints)”] produces abnormally brief syllables when compared to MoSeq applied to depth data [“MoSeq (depth)”]. b) Keypoint change scores (left) or low-confidence detection scores (right, see Methods for how low-confidence keypoint detection was quantified), relative to the onset of MoSeq transitions (x-axis) derived from either depth (grey) or keypoint data (black). c) Left: example of keypoint detection errors, including high-frequency fluctuations in keypoint coordinates (top row) that coincide with low keypoint detection confidence (bottom row). Right: keypoint coordinates before (frame 1) and during (frame 2) an example keypoint position assignment error. This assignment error (occurring in the tail base keypoint) causes a shift in egocentric alignment, leading to coordinate changes across the other tracked keypoints. d) A five-second example behavioral interval in which the same keypoints are tracked using three different methods (indicated in the inset) reveals pervasive jitter during stillness. Left: egocentrically aligned keypoint trajectories. Right: path traced by each keypoint during the 5-second interval. e) Variability in keypoint positions assigned by eight human labelers (see Methods). f) Cross-correlation between various features and keypoint fluctuations at a range of frequencies. Each heatmap represents a different scalar time-series (such as “transition rate”, the likelihood of a syllable transition on each frame); each row shows the cross-correlation between that time-series and the time-varying power of keypoint fluctuations at a given frequency. g) Timing of syllable transitions when MoSeq is applied to smoothed keypoint data, from most smoothed (top) to least smoothed (bottom). Each row shows the cross-correlation of MoSeq transition rates between keypoints and depth (i.e., the relative timing and degree of overlap between syllable transitions from each model).
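
Panel (f) relates the time-varying power of keypoint fluctuations at each frequency to scalar time series such as the transition rate. The sketch below shows one way to compute such a correlation with a short-time spectrogram; the sampling rate, window length, and use of a Pearson correlation are assumptions, not the paper's exact procedure.

    # Sketch of the Fig 2f analysis: correlate the time-varying spectral power
    # of a keypoint coordinate with a scalar time series (e.g. transition rate).
    # The 30 fps rate, window length, and Pearson correlation are assumptions.
    import numpy as np
    from scipy.signal import spectrogram

    def power_vs_signal_correlation(coord, scalar_signal, fps=30, nperseg=64):
        freqs, times, power = spectrogram(coord, fs=fps, nperseg=nperseg)
        # Sample the scalar signal at the spectrogram's time bins.
        resampled = np.interp(times, np.arange(len(scalar_signal)) / fps,
                              scalar_signal)
        corr = [np.corrcoef(power[i], resampled)[0, 1] for i in range(len(freqs))]
        return freqs, np.array(corr)
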
Figure 3: Hierarchical modeling of keypoint trajectories decouples noise from pose dynamics.
a) Graphical models illustrating traditional and keypoint-MoSeq. In both models, a discrete syllable sequence governs pose dynamics; these pose dynamics are either described using PCA (as in “MoSeq”, left) or are inferred from keypoint observations in conjunction with the animal’s centroid and heading, as well as a noise scale that discounts keypoint detection errors (as in “keypoint-MoSeq”, right). b) Example of error correction by keypoint-MoSeq. Left: Before fitting, all variables (y axis) are perturbed by incorrect positional assignment of the tail base keypoint (whose erroneous location is shown in the bottom inset). Right: Keypoint-MoSeq infers plausible trajectories for each variable (shading represents the 95% confidence interval). The inset shows several likely keypoint coordinates for the tail base inferred by the model. c) Top: Average values of various features aligned to syllable transitions from keypoint-MoSeq (red) vs. traditional MoSeq applied to keypoint data (black). Bottom: cross-correlation of syllable transition rates between each model and depth MoSeq. Peak height represents the relative frequency of overlap in syllable transitions. d) Duration distribution of the syllables from each of the indicated models. e) Average pose trajectories for example keypoint-MoSeq syllables. Each trajectory includes ten poses, starting 165ms before and ending 500ms after syllable onset.
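
The hierarchy described in panel (a), in which a sticky discrete syllable sequence drives autoregressive pose dynamics that are observed as noisy keypoints, can be made concrete with a small forward simulation. The sketch below is purely illustrative: all dimensions and parameters, and the omission of the centroid and heading variables, are simplifying assumptions, not the fitted keypoint-MoSeq model.

    # Illustrative forward simulation of the Fig 3a hierarchy: sticky discrete
    # syllables -> per-syllable autoregressive pose dynamics -> noisy keypoint
    # observations with a per-frame noise scale. Centroid and heading are
    # omitted for brevity; all sizes and parameters are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    T, K, D, P = 1000, 5, 4, 8            # frames, syllables, latent dims, keypoints
    stick = 0.95                          # self-transition probability
    trans = np.full((K, K), (1 - stick) / (K - 1))
    np.fill_diagonal(trans, stick)        # sticky transition matrix (rows sum to 1)
    A = [np.eye(D) * 0.9 + rng.normal(0, 0.05, (D, D)) for _ in range(K)]
    C = rng.normal(0, 1, (2 * P, D))      # map from latent pose to flattened keypoints

    z = np.zeros(T, dtype=int)            # syllable sequence
    x = np.zeros((T, D))                  # latent pose trajectory
    y = np.zeros((T, 2 * P))              # observed keypoints
    for t in range(1, T):
        z[t] = rng.choice(K, p=trans[z[t - 1]])
        x[t] = A[z[t]] @ x[t - 1] + rng.normal(0, 0.1, D)
        noise_scale = rng.gamma(2.0, 0.05)        # discounts detection errors
        y[t] = C @ x[t] + rng.normal(0, noise_scale, 2 * P)
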
Figure 4: Keypoint-MoSeq captures the temporal structure of behavior.
a) Example behavioral segmentations from four methods applied to the same 2D keypoint dataset. Keypoint-MoSeq transitions (fourth row) are sparser than those from other methods and more closely aligned to peaks in keypoint change scores (bottom row). b) Distribution of state durations for each method in (a). c) Average keypoint change scores (z-scored) relative to transitions identified by the indicated method (“MMper” refers to MotionMapper). d) Median mouse height (measured by depth camera) for each unsupervised behavior state. Rear-specific states (shaded bars) are defined as those with median height > 6 cm. e) Accuracy of models designed to decode mouse height, each of which was fit to state sequences from one of the indicated methods. f) Bottom: state sequences from keypoint-MoSeq and B-SOiD during a pair of example rears. States are colored as in (d). Top: mouse height over time with rears shaded gray. Callouts show depth- and IR-views of the mouse during two example frames. g) Average mouse height aligned to the onsets (solid line) or offsets (dashed line) of rear-specific states defined in (d). h) Signals captured from a head-mounted inertial measurement unit (IMU), including absolute 3D head-orientation (top) and relative linear acceleration (bottom). Each signal and its rate of change, including angular velocity (ang. vel.) and jerk (the derivative of acceleration), is plotted during a five-second interval. i) IMU signals aligned to the onsets of each behavioral state. Each heatmap row represents a state. Line plots show the median across states for angular velocity and jerk.
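
Panels (d-e) hinge on a simple per-state summary: a state is called rear-specific when the median depth-camera height across its frames exceeds 6 cm. A minimal sketch follows, with hypothetical frame-wise inputs.

    # Sketch of the Fig 4d criterion: states whose median mouse height (from the
    # depth camera) exceeds 6 cm are labeled rear-specific. Inputs are
    # hypothetical per-frame arrays of equal length.
    import numpy as np

    def rear_specific_states(state_labels, height_cm, threshold=6.0):
        return [s for s in np.unique(state_labels)
                if np.median(height_cm[state_labels == s]) > threshold]
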
Figure 5: Keypoint-MoSeq syllable transitions align with fluctuations in striatal dopamine.
a) Illustration depicting simultaneous recordings of dopamine fluctuations in dorsolateral striatum (DLS) obtained from fiber photometry (top) and unsupervised behavioral segmentation of 2D keypoint data (bottom). b) Derivative of the dopamine signal aligned to state transitions from each method. c) Average dopamine signal (z-scored ΔF/F) aligned to the onset of example states identified by keypoint-MoSeq and VAME. Shading marks the 95% confidence interval around the mean. d) Distributions capturing the magnitude of state-associated dopamine fluctuations across states from each method, where magnitude is defined as the mean total absolute value in a one-second window centered on state onset. e) Distributions capturing the temporal asymmetry of state-associated dopamine fluctuations, where asymmetry is defined as the difference in mean dopamine signal during the 500 ms after versus the 500 ms before state onset. f) Temporal randomization affects the neuro-behavioral correlations identified by keypoint-MoSeq, but not those identified by other methods. Top: schematic of randomization. The dopamine signal was either aligned to the onsets of each state, as in (c), or to random frames throughout the execution of each state. Bottom: distributions capturing the correlation of state-associated dopamine fluctuations before vs. after randomization.
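
Panels (d) and (e) reduce each state-triggered dopamine trace to two scalars: a magnitude (mean total absolute value in a one-second window centered on onset) and an asymmetry (mean signal in the 500 ms after onset minus the 500 ms before). The sketch below assumes a z-scored ΔF/F trace sampled at 30 Hz and reads "total absolute value" as a mean over the window; both are assumptions made for illustration.

    # Sketch of the Fig 5d-e metrics for a single behavior state. dff is a
    # z-scored dopamine trace and onsets holds the state's onset frames; the
    # 30 Hz rate and the mean-over-window reading of "total" are assumptions.
    import numpy as np

    def dopamine_metrics(dff, onsets, fps=30):
        half = fps // 2                                  # 500 ms in frames
        snippets = np.array([dff[t - half: t + half] for t in onsets
                             if half <= t <= len(dff) - half])
        magnitude = np.abs(snippets).mean()              # 1 s window around onset
        asymmetry = snippets[:, half:].mean() - snippets[:, :half].mean()
        return magnitude, asymmetry
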
Figure 6: Keypoint-MoSeq generalizes across pose representations, behaviors, and rodent species.
a) Example frame from a benchmark open field dataset (Bohnslav, 2019). b) Overall frequency of each human-annotated behavior (as %) and conditional frequencies across states inferred from unsupervised analysis of 2D keypoints. c) Normalized mutual information (NMI, see Methods) between human annotations and unsupervised behavior labels from each method. d) Example frame from the CalMS21 social behavior benchmark dataset, showing 2D keypoint annotations for the resident mouse. e-f) Overlap between human annotations and unsupervised behavior states inferred from 2D keypoint tracking of the resident mouse, as in (b-c). g) Multi-camera arena for simultaneous recording of 3D keypoints (3D kps), 2D keypoints (2D kps) and depth videos. h) Comparison of model outputs across tracking modalities. 2D and 3D keypoint data were modeled using keypoint-MoSeq, and depth data were modeled using original MoSeq. Left: cross-correlation of transition rates, comparing 3D keypoints to 2D keypoints and depth respectively. Middle: distribution of syllable durations. Right: number of states with frequency > 0.5%. Boxplots represent the distribution of state counts across 20 independent runs of each model. i) Probability of syllables inferred from 2D keypoints (left) or depth (right) during each 3D keypoint-based syllable. j-l) Average pose trajectories for the syllables marked in (i). k) 3D trajectories are plotted in side view (first row) and top-down view (second row). l) Average pose (as depth image) 100 ms after syllable onset. m) Location of markers for rat motion capture. n) Left: Average keypoint change score (z-scored) aligned to keypoint-MoSeq transitions. Right: Duration distributions for keypoint-MoSeq states and inter-changepoint intervals. o) Average pose trajectories for example syllables learned from rat motion capture data.
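
The NMI comparison in panel (c) can be reproduced in outline with scikit-learn's standard implementation applied to frame-wise label vectors; the paper's exact normalization (see Methods) may differ from this default.

    # Sketch of the Fig 6c comparison: normalized mutual information between
    # frame-wise human annotations and unsupervised state labels, using
    # scikit-learn's default NMI (the paper's exact variant may differ).
    from sklearn.metrics import normalized_mutual_info_score

    def annotation_agreement(human_labels, state_labels):
        return normalized_mutual_info_score(human_labels, state_labels)
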
