Sci Rep. 2024 Sep 8;14(1):20914.
doi: 10.1038/s41598-024-71665-z.

An open-source tool for automated human-level circling behavior detection


O R Stanley et al. Sci Rep. 2024.

Abstract

Quantitatively relating behavior to underlying biology is crucial in life science. Although progress in keypoint tracking tools has reduced barriers to recording postural data, identifying specific behaviors from this data remains challenging. Manual behavior coding is labor-intensive and inconsistent, while automatic methods struggle to explicitly define complex behaviors, even when they seem obvious to the human eye. Here, we demonstrate an effective technique for detecting circling in mice, a form of locomotion characterized by stereotyped spinning. Despite circling's extensive history as a behavioral marker, there currently exists no standard automated detection method. We developed a circling detection technique using simple postprocessing of keypoint data obtained from videos of freely-exploring (Cib2-/-;Cib3-/-) mutant mice, a strain previously found to exhibit circling behavior. Our technique achieves statistical parity with independent human observers in matching occurrence times based on human consensus, and it accurately distinguishes between videos of wild-type mice and mutants. Our pipeline provides a convenient, noninvasive, quantitative tool for analyzing circling mouse models without the need for software engineering experience. Additionally, as the concepts underlying our approach are agnostic to the behavior being analyzed, and indeed to the modality of the recorded data, our results support the feasibility of algorithmically detecting specific research-relevant behaviors using readily-interpretable parameters tuned on the basis of human consensus.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Data collection conditions and analysis pipeline. We collected videos of five wild-type and five (Cib2-/-;Cib3-/-) dual knockout mice exploring a 30 cm-diameter cylindrical arena. Each of 6 combinations of light and distance conditions was repeated 4 times for each mouse; 4 of the planned 240 videos became corrupted, leaving a total of 236. After the behavior videos were recorded, all videos of one mutant mouse and one wild-type mouse were set aside as a test set for human behavioral labeling. For each of these held-out videos, three observers independently marked occurrences of circling behavior. These behavioral labels were compared to produce a set of consensus labels on which all observers agreed. A separate training set of human behavior labels was constructed by randomly selecting 24 mutant and 24 wild-type videos from among the remaining 188 videos. Additionally, positions of the snout and tailbase were manually labeled in 20 random frames from each of these 188 videos. The manually labeled body-part locations were used to train a computer vision model using DeepLabCut. This trained model was then used to track animals in the human-scored videos, and the resulting paths were analyzed by three candidate circling detection algorithms. After the parameters of these algorithms were optimized for F1 score on the training set, they were applied to the test set for evaluation.
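The held-out-mice and random-training-subset split described above can be sketched as follows. The record format, mouse IDs, and helper name are illustrative assumptions, not the authors' code.

```python
import random

def split_dataset(videos, held_out_mice, n_per_genotype=24, seed=0):
    """Hold out all videos of the given mice as a test set, then draw
    n_per_genotype mutant and wild-type videos from the rest as a
    human-labeling training set. `videos` is a list of dicts with
    'mouse' and 'genotype' keys (an assumed record format)."""
    rng = random.Random(seed)
    test_set = [v for v in videos if v["mouse"] in held_out_mice]
    remaining = [v for v in videos if v["mouse"] not in held_out_mice]
    train_set = []
    for genotype in ("mutant", "wildtype"):
        pool = [v for v in remaining if v["genotype"] == genotype]
        # Sample without replacement within each genotype.
        train_set += rng.sample(pool, min(n_per_genotype, len(pool)))
    return train_set, test_set
```

Splitting by mouse rather than by video keeps all footage of a held-out animal out of training, avoiding leakage between subsets.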
Fig. 2
Human F1 scores. (A) Treating one independent observer as the gold standard for another reveals that humans show substantial variability in labeling circling behavior. In particular, although average F1 scores for each pair (AB, BC, CA) are similar (0.53, 0.52, 0.49), the distribution of scores across videos for pair CA differs significantly from the other two (p = 3.5E−2 and 1.4E−4 versus pairs AB and BC, respectively), while pairs AB and BC do not differ significantly from each other (p = 0.28). (B) Scoring independent observers' labels against another observer (left columns) or against consensus labels (agreement among all 3 observers, right columns) produces similar results (p = 0.2), as does comparing between our two human data subsets (train versus test subset, p = 0.65 and 0.75). Pooled pairwise F1 scores averaged 0.51 (95% CI 0.47–0.55) in the training set and 0.53 (0.41–0.62) in the testing set. Scoring against consensus occurrences, in which all observers mark a complete circle within 0.1 s of one another, produced similar scores of 0.51 (0.44–0.57) in the training set and 0.53 (0.38–0.65) in the testing set. Each point in a column represents a single video. Labeler-video combinations for which the F1 score is undefined (i.e., both scorer and ground truth marked no circling instances) are not displayed for either paired or consensus scoring but were included in bootstrapping for purposes of calculating confidence intervals.
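The consensus criterion above (all observers marking a complete circle within 0.1 s of one another) implies event-level F1 scoring with a matching tolerance. A minimal sketch follows, using a greedy one-to-one pairing of sorted timestamps; the function names and matching strategy are our illustrative choices, not necessarily the paper's implementation.

```python
def match_events(times_a, times_b, tol=0.1):
    """Greedily pair events from two timestamp lists (seconds) that fall
    within `tol` seconds of each other; returns the number of matches."""
    times_b = sorted(times_b)
    matched = 0
    j = 0
    for t in sorted(times_a):
        # Skip ground-truth events that are already too far in the past.
        while j < len(times_b) and times_b[j] < t - tol:
            j += 1
        if j < len(times_b) and abs(times_b[j] - t) <= tol:
            matched += 1
            j += 1  # each ground-truth event may be matched at most once
    return matched

def f1_score(scorer, truth, tol=0.1):
    """Event-level F1: matches are true positives, unmatched scorer events
    are false positives, unmatched truth events are false negatives."""
    tp = match_events(scorer, truth, tol)
    fp = len(scorer) - tp
    fn = len(truth) - tp
    if tp == 0:
        # Undefined (NaN) when neither side marked any events, as in the caption.
        return 0.0 if (fp or fn) else float("nan")
    return 2 * tp / (2 * tp + fp + fn)
```

With this convention, videos where both scorer and ground truth marked nothing yield an undefined score, matching the caption's handling of such labeler-video combinations.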
Fig. 3
Method parameters and performance levels. (A) Timelapse of keypoint-labeled frames of a mouse engaged in circling behavior. (B) Parameter distributions and associated exponential and Gaussian fits from two sample videos. To accommodate the substantial variability observed across videos, we relied on a two-step process of Gaussian kernel estimation followed by fitting to a weighted sum of an exponential and a normal distribution. This allowed the same technique to account for differences in, e.g., average duration (left column, compare blue Gaussian fits) or greater numbers of small collisions likely to be false positives (right column, compare pink exponential fits). (C) Illustration of circle detection using each of the described methods. Duration-Only considers only the time taken to complete the putative circle; Time-Angle additionally calculates the angle of the tail-to-snout vector for each frame and considers its total net change; and Box-Angle removes duration requirements and instead constrains the geometry of the circle based on the axes of a rectangle bounding the candidate circling instance. (D) Examples of false-positive detections using each method. Clear features indicate when an instance should be filtered out for the Duration-Only (minimal head movement relative to the tail) and Time-Angle (oblong or mis-sized snout path geometry) methods.
Fig. 4
Method performance comparison. After optimizing behavior detection algorithms on the human-labeled training set, each was scored on the human consensus circling labels of the test set. Each column represents one algorithm, with one dot for each test set video with a defined score. Videos for which F1 score is undefined (i.e., the automated method and human consensus both marked no circling instances) were included in confidence interval calculations but not displayed as individual datapoints. The Duration-Only and Time-Angle methods significantly underperformed independent human observers (mean and 95% CI 0.1 (0.02–0.17) and 0.22 (0.03–0.47), p = 1.1E−11 and 4.7E−6, respectively). Only the Box-Angle method reaches statistical parity (mean F1 0.43 (0.21–0.57), p = 0.51).
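The confidence intervals reported here (and in Fig. 2) are computed by bootstrapping over videos. A minimal percentile-bootstrap sketch for the mean follows; the resample count, seed, and function name are arbitrary illustrative choices.

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`: resample with
    replacement n_boot times, take the alpha/2 and 1 - alpha/2 quantiles
    of the resampled means."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(values) for _ in values) / len(values)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling at the video level (rather than the event level) respects the fact that circling instances within one video are not independent observations.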
Fig. 5
Dataset size performance comparison. (A) Labeling performance (error, in pixels) for each of 10 trained networks on datasets of progressively smaller sizes. All reduced dataset sizes resulted in greater labeling error than the Full Dataset model (dashed horizontal line), particularly for frames not seen during training (test frames). Notably, this trend was not monotonic: the set of quarter-dataset models performed better on test frames, on average, than the set of half-dataset models. Root-mean-squared errors on training set frames were (mean and 95% CI) 9.29 (8.13–10.73), 9.84 (8.53–11.7), and 11.02 (9.11–12.91) pixels, respectively. For unseen frames, these errors increased to 19.37 (16.92–22.28), 12.3 (10.51–14.4), and 14.34 (12.66–15.98). The dashed horizontal line represents the Full Dataset model's training frame error (7.82 pixels). (B) To determine whether these changes in labeling quality impacted behavior detection, we applied the optimized Box-Angle method to the keypoint tracking produced by each network at each dataset size. Within a dataset, the true-positive, false-positive, and false-negative counts for each video were summed to calculate a representative F1 score, plotted here as individual dots for the half-, quarter-, and eighth-sized datasets. The resulting distributions are compared to scores from the Full Dataset network (left column) and to independent human scores (right). As elsewhere, video-network combinations for which the F1 score is undefined are included in confidence interval calculations but not displayed as individual datapoints. The smaller datasets underperformed the Full Dataset network (p = 0.03, 0.03, 0.02) as well as human labels (p = 1.7E−4, 1.4E−4, 3.9E−5), indicating that even small reductions in keypoint tracking quality can impact behavioral detection. *p < 0.05, ***p < 0.001.
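Summing per-video true-positive, false-positive, and false-negative counts before computing a single F1, as described in panel (B), lets videos with no marked events contribute zeros instead of producing undefined per-video scores. A minimal sketch (the NaN convention for the all-zero case is our assumption):

```python
def pooled_f1(counts):
    """counts: iterable of (tp, fp, fn) tuples, one per video.
    Sums the counts across videos, then computes one F1 score."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == fp == fn == 0:
        return float("nan")  # undefined: no events marked by either side
    return 2 * tp / (2 * tp + fp + fn)
```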


References

    1. Segalin, C. et al. The Mouse Action Recognition System (MARS) software pipeline for automated analysis of social behaviors in mice. eLife 10, e63720 (2021). https://doi.org/10.7554/eLife.63720
    2. van den Boom, B. J. G., Pavlidi, P., Wolf, C. J. H., Mooij, A. H. & Willuhn, I. Automated classification of self-grooming in mice using open-source software. J. Neurosci. Methods 289, 48–56 (2017). https://doi.org/10.1016/j.jneumeth.2017.05.026
    3. von Ziegler, L., Sturman, O. & Bohacek, J. Big behavior: Challenges and opportunities in a new era of deep behavior profiling. Neuropsychopharmacology 46, 33–44 (2021). https://doi.org/10.1038/s41386-020-0751-7
    4. Mathis, A. et al. DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21, 1281–1289 (2018). https://doi.org/10.1038/s41593-018-0209-y
    5. Ono, K. et al. Retinoic acid degradation shapes zonal development of vestibular organs and sensitivity to transient linear accelerations. Nat. Commun. 11, 63 (2020). https://doi.org/10.1038/s41467-019-13710-4
