eLife. 2021 Nov 30;10:e63720. doi: 10.7554/eLife.63720.

The Mouse Action Recognition System (MARS) software pipeline for automated analysis of social behaviors in mice

Cristina Segalin et al. eLife.

Abstract

The study of naturalistic social behavior requires quantification of animals' interactions. This is generally done through manual annotation, a highly time-consuming and tedious process. Recent advances in computer vision enable tracking the pose (posture) of freely behaving animals. However, automatically and accurately classifying complex social behaviors remains technically challenging. We introduce the Mouse Action Recognition System (MARS), an automated pipeline for pose estimation and behavior quantification in pairs of freely interacting mice. We compare MARS's annotations to human annotations and find that MARS's pose estimation and behavior classification achieve human-level performance. We also release the pose and annotation datasets used to train MARS to serve as community benchmarks and resources. Finally, we introduce the Behavior Ensemble and Neural Trajectory Observatory (BENTO), a graphical user interface for analysis of multimodal neuroscience datasets. Together, MARS and BENTO provide an end-to-end pipeline for behavior data extraction and analysis in a package that is user-friendly and easily modifiable.

Keywords: computer vision; machine learning; microendoscopic imaging; mouse; neuroscience; pose estimation; social behavior; software.

Conflict of interest statement

CS, JW, TK, MH, MZ, JS, PP, DA, AK: No competing interests declared.

Figures

Figure 1. The Mouse Action Recognition System (MARS) data pipeline.
(A) Sample use strategies of MARS, including either out-of-the-box application or fine-tuning to custom arenas or behaviors of interest. (B) Overview of data extraction and analysis steps in a typical neuroscience experiment, indicating contributions to this process by MARS and Behavior Ensemble and Neural Trajectory Observatory (BENTO). (C) Illustration of the four stages of data processing included in MARS.
Figure 1—figure supplement 1. Mouse Action Recognition System (MARS) camera positioning and sample frames.
(A) Contents of the home cage and positioning of cameras for data collection. (B) Sample top- and front-view frames from mice with and without head-attached cables, including representative examples of occlusion and motion blur in the dataset (bottom row of images).
Figure 1—figure supplement 2. The Mouse Action Recognition System (MARS) annotation dataset.
Number of hours scored for each behavior in the 14.2 hr MARS dataset, broken down by training, validation, and test sets.
Figure 1—figure supplement 3. Mouse Action Recognition System (MARS) graphical user interface.
(1) File navigator, supporting queueing of multiple jobs while tracking is running. (2) User options: specify video source (top-/front-view camera), type of features to extract, and analyses to perform (pose estimation, feature extraction, behavior classification, video output). (3) Display of status updates during analysis. (4) Progress bars for current video and for all jobs in the queue.
Figure 2. Quantifying human annotation variability in top- and front-view pose estimates.
(A, B) Anatomical keypoints labeled by human annotators in (A) top-view and (B) front-view movie frames. (C, D) Comparison of annotator labels in (C) top-view and (D) front-view frames. Top row: left, crop of the original image shown to annotators (annotators were always provided with the full video frame); right, approximate figure of the mouse (traced for clarity). Middle-bottom rows: keypoint locations provided by three example annotators, and the extracted ‘ground truth’ from the median of all annotations. (E, F) Ellipses showing variability of human annotations of each keypoint in one example frame from (E) top view and (F) front view (N = 5 annotators, 1 standard deviation ellipse radius). (G, H) Variability in human annotations of mouse pose, plotted as the percentage of human annotations falling within radius X of ground truth for (G) top-view and (H) front-view frames.
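The within-radius accuracy in panels (G, H), with ground truth taken as the per-keypoint median of all annotators' labels, can be sketched as follows (a minimal illustration; the function names are ours, not MARS code):

```python
import math
from statistics import median

def ground_truth(annotations):
    """Median of annotators' (x, y) labels for one keypoint.
    annotations: list of (x, y) tuples, one per annotator."""
    xs, ys = zip(*annotations)
    return (median(xs), median(ys))

def frac_within_radius(annotations, gt, radius):
    """Fraction of annotator labels falling within `radius` of ground truth."""
    hits = sum(1 for (x, y) in annotations
               if math.hypot(x - gt[0], y - gt[1]) <= radius)
    return hits / len(annotations)

# Five hypothetical annotators labeling the nose keypoint (pixel units):
labels = [(10.0, 10.0), (10.5, 9.8), (9.7, 10.2), (10.1, 10.1), (14.0, 10.0)]
gt = ground_truth(labels)                   # median label: (10.1, 10.0)
acc = frac_within_radius(labels, gt, 1.0)   # 4 of 5 within 1 px -> 0.8
```

Sweeping the radius from zero upward traces out curves like those in (G, H).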
Figure 3. Performance of the mouse detection network.
(A) Processing stages of mouse detection pipeline. (B) Illustration of intersection over union (IoU) metric for the top-view video. (C) Precision-recall (PR) curves for multiple IoU thresholds for detection of the two mice in the top-view video. (D) Illustration of IoU for the front-view video. (E) PR curves for multiple IoU thresholds in the front-view video.
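The intersection over union metric illustrated in (B, D) is the area shared by predicted and ground-truth bounding boxes divided by the area of their union. A minimal sketch for axis-aligned boxes (function name ours):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping by half along x: IoU = 50 / 150, roughly 0.33
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A detection is counted as correct when its IoU with the ground-truth box exceeds a chosen threshold, which is why panels (C, E) show one precision-recall curve per threshold.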
Figure 4. Performance of the stacked hourglass network for pose estimation.
(A) Processing stages of the pose estimation pipeline. (B) Mouse Action Recognition System (MARS) accuracy for individual body parts, showing performance for videos with vs. without a head-mounted microendoscope or fiber photometry cable on the black mouse. Gray envelope shows the accuracy of the best vs. worst human annotations; dashed black line is median human accuracy. (C) Histogram of object keypoint similarity (OKS) scores across frames in the test set. Blue bars: normalized by human annotation variability; orange bars: normalized using a fixed variability of 0.025 (see Materials and methods). (D) MARS accuracy for individual body parts in front-view videos with vs. without microendoscope or fiber photometry cables. (E) Histogram of OKS scores for the front-view camera. (F) Sample video frames (above) and MARS pose estimates (below) in cases of occlusion and motion blur.
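The OKS scores in (C, E) follow the COCO-style formulation: each keypoint's prediction error is passed through a Gaussian whose width is set by the object scale and a per-keypoint variability constant (here, measured human variability or the fixed 0.025), then averaged across keypoints. A hedged sketch (names ours; the exact MARS normalization may differ in detail):

```python
import math

def oks(pred, gt, sigmas, scale):
    """COCO-style object keypoint similarity, averaged over keypoints.
    pred, gt: lists of (x, y) keypoints; sigmas: per-keypoint
    variability constants; scale: object scale (e.g., sqrt of the
    animal's bounding-box area)."""
    total = 0.0
    for (px, py), (gx, gy), sigma in zip(pred, gt, sigmas):
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma  # COCO convention: k_i = 2 * sigma_i
        total += math.exp(-d2 / (2.0 * (scale ** 2) * (k ** 2)))
    return total / len(gt)

# A perfect prediction scores 1.0; large errors decay toward 0:
truth = [(0.0, 0.0), (3.0, 4.0)]
perfect = oks(truth, truth, [0.025, 0.025], scale=50.0)
```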
Figure 4—figure supplement 1. Breakdown of Mouse Action Recognition System (MARS) keypoint errors for top- and front-view pose models.
Left: precision/recall curves as a function of object keypoint similarity (OKS) cutoff; area under the curve indicated in legend. Right: breakdown of error sources and their effect on the precision/recall curve at an OKS cutoff of 0.85. Error types are as defined in Ruggero Ronchi and Perona, 2017. Classes of keypoint position errors: Miss: large localization error; Swap: confusion between similar parts of different instances (animals); Inversion: confusion between semantically similar parts of the same instance (e.g., left/right ears); Jitter: small localization errors; Opt Score: mis-ranking of predictions by confidence (not relevant); Bkg FP: performance after removing background false positives; FN: performance after removing false negatives.
Figure 5. Quantifying inter-annotator variability in behavior annotations.
(A) Example annotation for attack, mounting, and close investigation behaviors by six trained annotators on segments of male-female (top) and male-male (bottom) interactions. (B) Inter-annotator variability in the total reported time mice spent engaging in each behavior. (C) Inter-annotator variability in the number of reported bouts (contiguous sequences of frames) scored for each behavior. (D) Precision and recall of annotators (humans) 2–6 with respect to annotations by human 1.
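The precision and recall in (D) are computed frame-wise: treating one annotator's binary labels as reference, precision is the fraction of a second annotator's positive frames that the reference also marks, and recall the fraction of the reference's positive frames the second annotator recovers. A minimal sketch (names and labels ours):

```python
def precision_recall(pred, ref):
    """Frame-wise precision/recall of binary behavior labels `pred`
    against reference labels `ref` (1 = behavior occurring)."""
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if r and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical 10-frame attack labels from two annotators:
annot_2 = [0, 1, 1, 1, 0, 0, 1, 0, 0, 0]
annot_1 = [0, 0, 1, 1, 1, 0, 1, 0, 0, 0]  # reference (human 1)
p, r = precision_recall(annot_2, annot_1)  # 3/4 and 3/4
```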
Figure 5—figure supplement 1. Expanded set of human annotations.
All panels as in Figure 5, but with the two omitted annotators (humans 7 and 8) included. (A) Example annotation for attack, mounting, and close investigation behaviors by eight trained annotators on segments of male-female (top) and male-male (bottom) interactions. (B) Inter-annotator variability in the total reported time mice spent engaging in each behavior. (C) Inter-annotator variability in the number of reported bouts (contiguous sequences of frames) scored for each behavior. (D) Precision and recall of annotators (humans) 2–8 with respect to annotations by human 1 (source of Mouse Action Recognition System [MARS] behavior classifier training annotations).
Figure 5—figure supplement 2. Within-annotator bias and variance in annotation of attack start time.
Annotations of all attack bouts in the 10-video dataset by six human annotators. All attack bouts are aligned to the first frame on which at least three human annotators scored attack as occurring. Colored dots then reflect the time when each annotator scored each bout as starting, relative to this aligned time (the group median). Each annotator shows a characteristic bias (a shift in their mean annotation start time before or after the group median) and variance (the spread of annotation start times around this mean) in their annotation style. Some annotators did not score any attack initiated within a ±1 s window of the group median for a given bout: these points are plotted at time –1 s. Note that the average attack bout in the dataset is 1.65 s long (using annotations from human 1).
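The alignment rule described above (first frame on which at least three annotators score the behavior) can be sketched as a simple majority-onset function (name ours):

```python
def group_onset(bout_labels, min_votes=3):
    """First frame index at which at least `min_votes` annotators mark
    the behavior as occurring. bout_labels: one list of frame-wise
    binary labels per annotator. Returns None if never reached."""
    for t, votes in enumerate(zip(*bout_labels)):
        if sum(votes) >= min_votes:
            return t
    return None

# Hypothetical attack labels from three annotators over five frames;
# all three first agree on frame 2:
labels = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
]
onset = group_onset(labels, min_votes=3)  # -> 2
```

Each annotator's per-bout start time minus this group onset gives the colored dots plotted in the figure.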
Figure 5—figure supplement 3. Inter-annotator accuracy on individual videos.
(A) Mean precision and recall of annotators 1–6, computed relative to the median of the other five annotators (mean ± SEM). Each plotted point is one video. (B) Mean annotator F1 score (harmonic mean of precision and recall) plotted against the mean bout duration for each behavior in each video. Plot suggests a close positive correlation between the average duration of behavior bouts in a video (or dataset) and the accuracy of annotators as computed by precision and recall. (C) Mean annotator F1 score plotted against the total number of frames annotated for a given behavior in each video. Correlation is weaker than in (B).
Figure 5—figure supplement 4. Inter- and intra-annotator variability.
We asked eight individuals to all annotate a pair of 10 min videos twice, with at least 10 months between annotation sessions. Box plots in (B) and (D) show median (line), 25th to 75th percentiles (box), and minimum/maximum values (whiskers). *p<0.05, **p<0.01, ***p<0.001; effect sizes computed as U/(n1 * n2), where n1 and n2 are category sample sizes. (A) F1 score within and between annotators: we treated a given annotator (X axis) as ground truth and computed the F1 score of each annotator with respect to these labels (for self-comparison, we used the first annotation session as ground truth and the second as ‘prediction’). (B) Summary of F1 score values in (A), showing mean F1 score vs. self and vs. other across annotators (attack self vs. other: p=0.00623, effect size = 0.00623, Wilcoxon rank-sum test, N = 6 self vs. 15 other; close investigation self vs. other: p=0.0292, effect size = 0.811, Wilcoxon rank-sum test, N = 6 self vs. 15 other). (C, D) Same as in (A), but including two additional annotators who were more variable (attack self vs. other: p=0.000498, effect size = 0.911, Wilcoxon rank-sum test, N = 8 self vs. 28 other; close investigation self vs. other: p=0.00219, effect size = 0.862, Wilcoxon rank-sum test, N = 8 self vs. 28 other). (E) Same data as in (C) displayed as a matrix to capture annotator identity.
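The U/(n1 * n2) effect size reported above is the common-language effect size of a rank-sum comparison: the probability that a random draw from one group exceeds one from the other. A minimal sketch (name ours; ties counted as 0.5, one common convention):

```python
def rank_sum_effect_size(x, y):
    """Common-language effect size U/(n1*n2): probability that a
    random value from x exceeds a random value from y."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5  # ties split the credit
    return u / (len(x) * len(y))

# Self-comparison F1 scores that all exceed the other-annotator
# scores give the maximal effect size of 1.0:
es = rank_sum_effect_size([0.90, 0.85], [0.70, 0.60, 0.65])
```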
Figure 6. Performance of behavior classifiers.
(A) Processing stages of estimating behavior from pose of both mice. (B) Example output of the Mouse Action Recognition System (MARS) behavior classifiers on segments of male-female and male-male interactions compared to annotations by human 1 (source of classifier training data) and to the median of the six human annotators analyzed in Figure 5. (C) Precision, recall, and precision-recall (PR) curves of MARS with respect to human 1 for each of the three behaviors. (D) Precision, recall, and PR curves of MARS with respect to the median of the six human annotators (precision/recall for each human annotator was computed with respect to the median of the other five). (E) Mean precision and recall of human annotators vs. MARS, relative to human 1 and relative to the group median (mean ± SEM).
Figure 6—figure supplement 1. Mouse Action Recognition System (MARS) precision and recall is closely correlated with that of annotators on individual videos.
(A) Mean precision and recall of annotators 1–6 for each behavior in each of the 10 tested videos (plotted points; as in Figure 5—figure supplement 3), and MARS precision-recall (PR) curves for those videos. PR curves and points that are the same color correspond to the same video. (B) Mean annotator F1 score plotted against MARS’s F1 score for each behavior in each video. Performance of MARS is well predicted by the inter-human F1 score, which is in turn correlated with mean behavior bout duration (see Figure 5—figure supplement 3).
Figure 6—figure supplement 2. Evaluation of Mouse Action Recognition System (MARS) on a larger test set.
(A) Precision-recall (PR) curves of MARS classifiers for test set 1 (‘no cable’), test set 2 (‘with cable’), and for the two sets combined. (B) F1 score of MARS classifiers for each behavior in each video, plotted against mean behavior bout duration in that video. Plots show no strong difference in performance between videos in which mice are unoperated (‘no cable’) and videos in which mice are implanted with a head-attached device (‘cable’).
Figure 6—figure supplement 3. Training Mouse Action Recognition System (MARS) on new datasets.
(A) Sample frame from the CRIM13 dataset. (B) Performance of the MARS pose estimator fine-tuned to CRIM13 data as a function of fine-tuning training set size. (C) Radius at which 90% of keypoints are correctly localized (90% PCK radius) on CRIM13 data, as a function of training set size. (D) Performance of MARS classifiers for three additional social behaviors as a function of training set size (number of frames annotated for the behavior of interest). (E) Same classifiers as in (D), now showing performance as a function of the number of bouts annotated for the behavior of interest.
Figure 7. Screenshot of the Behavior Ensemble and Neural Trajectory Observatory (BENTO) user interface.
(A, left) The main user interface showing synchronous display of video, pose estimation, neural activity, and pose feature data. (Right) List of data types that can be loaded and synchronously displayed within BENTO. (B) BENTO interface for creating annotations based on thresholded combinations of Mouse Action Recognition System (MARS) pose features.
Figure 8. Application of Mouse Action Recognition System (MARS) in a large-scale behavioral assay.
All plots: mean ± SEM, N = 8–10 mice per genotype per line (83 mice total); *p<0.05, **p<0.01, ***p<0.001. (A) Assay design. (B) Time spent attacking by group-housed (GH) and single-housed (SH) mice from each line compared to controls (Chd8 GH het vs. ctrl: p=0.0367, Cohen’s d = 1.155, two-sample t-test, N = 8 het vs. 8 ctrl; Nlgn3 het GH vs. SH: p=0.000449, Cohen’s d = 1.958, two-sample t-test, N = 10 GH vs. 8 SH). (C) Time spent engaged in close investigation by each condition/line (SH BTBR vs. ctrl: p=0.0186, Cohen’s d = 1.157, two-sample t-test, N = 10 BTBR vs. 10 ctrl). (D) Cartoon showing segmentation of close investigation bouts into face-, body-, and genital-directed investigation. Frames are classified based on the position of the resident’s nose relative to a boundary midway between the intruder mouse’s nose and neck, and a boundary midway between the intruder mouse’s hips and tail base. (E) Average duration of close investigation bouts in BTBR mice for investigation as a whole and broken down by the body part investigated (close investigation, p=0.00023, Cohen’s d = 2.05; face-directed p=0.00120, Cohen’s d = 1.72; genital-directed p=0.0000903, Cohen’s d = 2.24; two-sample t-test, N = 10 BTBR vs. 10 ctrl for all).
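The Cohen's d values reported above are standardized mean differences for two independent samples. A minimal sketch using the pooled standard deviation (one common convention; function name and data ours):

```python
import math

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled
    standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)  # sample variances
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Hypothetical per-mouse attack times (s) for mutant vs. control groups:
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])  # -> 0.5
```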
Figure 9. Analysis of a microendoscopic imaging dataset using Mouse Action Recognition System (MARS) and Behavior Ensemble and Neural Trajectory Observatory (BENTO).
(A) Schematic of the imaging setup, showing head-mounted microendoscope. (B) Sample video frame with MARS pose estimate, showing appearance of the microendoscope and cable during recording. (C) Sample behavior-triggered average figure produced by BENTO. (Top) Mount-triggered average response of one example neuron within a 30 s window (mean ± SEM). (Bottom) Individual trials contributing to mount-triggered average, showing animal behavior (colored patches) and neuron response (black lines) on each trial. The behavior-triggered average interface allows the user to specify the window considered during averaging (here 10 s before to 20 s after mount initiation), whether to merge behavior bouts occurring less than X seconds apart, whether to trigger on behavior start or end, and whether to normalize individual trials before averaging; results can be saved as a pdf or exported to the MATLAB workspace. (D) Normalized mount-triggered average responses of 28 example neurons in the medial preoptic area (MPOA), identified using BENTO. Grouping of neurons reveals diverse subpopulations of cells responding at different times relative to the onset of mounting (pink dot = neuron shown in panel C).
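The behavior-triggered average in (C, D) is, at its core, an average of a neuron's activity trace over fixed windows around each behavior onset. A minimal sketch of that core computation (names ours; BENTO's actual implementation adds the merging, normalization, and export options described above):

```python
def behavior_triggered_average(trace, onsets, pre, post):
    """Average an activity trace in a [t - pre, t + post) window
    around each behavior-onset frame t; onsets whose window extends
    past the recording edges are skipped.
    trace: list of samples; onsets: frame indices of behavior starts;
    pre/post: number of frames before/after onset to include."""
    windows = [trace[t - pre:t + post] for t in onsets
               if t - pre >= 0 and t + post <= len(trace)]
    n = len(windows)
    return [sum(w[i] for w in windows) / n for i in range(pre + post)]

# A toy trace with a transient at each of two mount onsets:
trace = [0, 0, 1, 0, 0, 0, 1, 0, 0]
bta = behavior_triggered_average(trace, onsets=[2, 6], pre=1, post=2)
```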
