bioRxiv [Preprint]. 2025 Jun 3:2025.05.31.657183. doi: 10.1101/2025.05.31.657183.

Assessing Attentiveness and Cognitive Engagement across Tasks using Video-based Action Understanding in Non-Human Primates

Sin-Man Cheung et al. bioRxiv.

Abstract

Background: Distractibility and attentiveness are cognitive states that are expressed through observable behavior. How behavior observed in video can be used effectively to diagnose periods of distractibility and attentiveness remains poorly understood. Video-based tools that classify cognitive states from behavior therefore have high potential to serve as versatile diagnostic indicators of maladaptive cognition.

New method: We describe an analysis pipeline that classifies cognitive states from video, using a 2-camera set-up to estimate attentiveness and screen engagement in nonhuman primates performing cognitive tasks. The procedure reconstructs 3D poses from 2D DeepLabCut-labeled videos, estimates head/yaw orientation relative to a task screen and arm/hand/wrist engagement with task objects, and segments behavior into time-resolved attentiveness and screen-engagement scores.
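To make the 3D reconstruction step concrete, the sketch below triangulates matched 2D keypoints from the left and right cameras into 3D points with OpenCV. This is an illustrative stand-in for the MATLAB triangulation step described above, not the authors' code; the projection matrices and keypoint arrays are placeholders.

```python
# Minimal sketch (not the authors' code): triangulate paired 2D keypoints
# from a calibrated left/right camera pair into 3D points.
import numpy as np
import cv2

def triangulate_keypoints(pts_left, pts_right, P_left, P_right):
    """pts_*: (N, 2) arrays of 2D keypoints for one frame (same body-part order);
    P_*: (3, 4) camera projection matrices from stereo calibration."""
    pts_l = np.asarray(pts_left, dtype=np.float64).T   # -> (2, N)
    pts_r = np.asarray(pts_right, dtype=np.float64).T  # -> (2, N)
    X_h = cv2.triangulatePoints(P_left, P_right, pts_l, pts_r)  # (4, N) homogeneous
    return (X_h[:3] / X_h[3]).T                         # (N, 3) 3D coordinates

# Example with placeholder projection matrices (from a stereo calibration):
# P_left = np.hstack([np.eye(3), np.zeros((3, 1))])
# P_right = K_right @ np.hstack([R, t])
# xyz = triangulate_keypoints(kp_left, kp_right, P_left, P_right)
```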

Results: Performance of different cognitive tasks was robustly classified from video within a few frames, reaching >90% decoding accuracy with time segments of ≤3 min. The analysis procedure allows setting subject-specific thresholds for segmenting movements, providing time-resolved scoring of attentiveness and screen engagement.

Comparison with existing methods: Existing methods also extract poses and segment action units; however, they have not been combined into a framework that enables subject-adjusted thresholding for specific task contexts. This integration is needed to infer cognitive state variables and to differentiate performance across tasks.

Conclusion: The proposed method integrates video segmentation, scoring of attentiveness and screen engagement, and classification of task performance at high temporal resolution. This integrated framework provides a tool for assessing attention functions from video.

Figures

Figure 1. Procedural Pipelines.
(A) Workflow for 3D pose estimation and classification using DeepLabCut and MATLAB/Python. Pose estimation from left and right cameras is processed via DeepLabCut, followed by 2D data extraction, triangulation, and 3D data post-processing in MATLAB. Attentiveness and screen engagement are classified, with results visualized and analyzed using Python/MATLAB. (B) Pipeline for pose estimation of rhesus macaques using DeepLabCut. Video frames (n=4 subjects) are extracted using k-means clustering (20 frames/video, 5 videos/camera, 360 frames/side). Eleven body parts are labeled, and a training dataset is created. The ResNet50 network is trained (30,000 iterations, learning rate 0.005/0.002) until loss plateaus. Outlier frames are extracted, relabeled, and used to export pose estimation for video analysis.
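For orientation, the commands below sketch the standard DeepLabCut workflow summarized in panel B, using the parameters quoted in the caption (k-means frame extraction, ResNet-50, 30,000 iterations). The project name and video paths are placeholders, not the authors' files.

```python
# Minimal sketch of the standard DeepLabCut workflow described in Figure 1B
# (project name and paths are placeholders, not the authors' files).
import deeplabcut

config = deeplabcut.create_new_project(
    "nhp-attentiveness", "lab", ["videos/left_cam.mp4", "videos/right_cam.mp4"]
)
deeplabcut.extract_frames(config, mode="automatic", algo="kmeans")  # k-means frame selection
deeplabcut.label_frames(config)                   # label the 11 body parts in the GUI
deeplabcut.create_training_dataset(config, net_type="resnet_50")
deeplabcut.train_network(config, maxiters=30000)  # train until the loss plateaus
deeplabcut.evaluate_network(config)
deeplabcut.analyze_videos(config, ["videos/left_cam.mp4", "videos/right_cam.mp4"])

# Refinement loop: pull outlier frames, relabel, merge, and retrain.
deeplabcut.extract_outlier_frames(config, ["videos/left_cam.mp4"])
deeplabcut.refine_labels(config)
deeplabcut.merge_datasets(config)
```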
Figure 2. Example Classification of Attentiveness and Screen Engagement.
(A) Analysis of NHP attentiveness over a 2-minute video segment using yaw angle as the primary feature. The black line represents the yaw angle, with red horizontal lines indicating left and right thresholds for attentiveness. Green bars denote attentive periods (yaw within thresholds), while grey bars indicate inattentive periods (yaw beyond thresholds). Image sequences at frames 640–660 and 2680–2700 show transitions from attentive to inattentive states, while frames 740–760 capture the NHP in an inattentive state. (B) Screen engagement analysis over a 30-second segment, based on the right wrist’s proximity to the touchscreen. The red horizontal line denotes the engagement threshold. Green bars highlight active engagement (distance below threshold), while distances above the threshold indicate disengagement. Frames 35–40 show the NHP initiating engagement, frames 60–69 depict disengagement, and frames 248–252 capture active screen engagement.
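The per-frame threshold rule illustrated here can be written compactly. The snippet below is a minimal sketch that converts a yaw-angle trace and a wrist-to-screen distance trace into binary attentiveness and engagement labels; the threshold values are made-up placeholders standing in for the subject-specific thresholds described in the text.

```python
# Sketch of the per-frame threshold rule (illustrative thresholds, not the authors' values).
import numpy as np

def score_attentiveness(yaw_deg, left_thresh=-25.0, right_thresh=25.0):
    """1 = attentive (yaw within the left/right thresholds), 0 = inattentive."""
    yaw = np.asarray(yaw_deg, dtype=float)
    return ((yaw >= left_thresh) & (yaw <= right_thresh)).astype(int)

def score_engagement(wrist_screen_dist, dist_thresh=0.10):
    """1 = engaged (wrist closer to the touchscreen than the threshold), 0 = disengaged."""
    dist = np.asarray(wrist_screen_dist, dtype=float)
    return (dist < dist_thresh).astype(int)

# attentive = score_attentiveness(yaw_per_frame)        # per-frame 0/1 trace
# engaged = score_engagement(right_wrist_distance_m)    # per-frame 0/1 trace
```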
Figure 3. Visualization of Attentiveness and Screen Engagement Scores.
(A) Schematic of the data processing pipeline for cross-trial analysis, summarizing attentiveness and screen engagement scores (0s and 1s) across frames, either by window size or by task (WM1, M1, EC, M2, WM2). (B) Example attentiveness score over a 5000-frame video segment, showing binary classification (attentive vs. inattentive). (C) Screen engagement score over a 90-minute session, averaged over 300-second windows, with red lines denoting transitions between tasks. (D) Mean attentiveness scores per task (n = 31 sessions), with standard error bars representing variability across trials. (E) Mean screen engagement scores per task (n = 31 sessions), with standard error bars representing variability across trials.
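As a minimal sketch of the summarization in panels B–E, the snippet below averages the binary per-frame scores over fixed, non-overlapping windows and per task block; the window size and task labels are illustrative assumptions.

```python
# Sketch of summarizing binary per-frame scores by window and by task block
# (window size and task boundaries are illustrative).
import numpy as np

def windowed_mean(binary_scores, window):
    """Average a 0/1 per-frame score in non-overlapping windows of `window` frames."""
    scores = np.asarray(binary_scores, dtype=float)
    n = (len(scores) // window) * window
    return scores[:n].reshape(-1, window).mean(axis=1)

def per_task_mean(binary_scores, task_labels):
    """Mean score for each per-frame task label (e.g. 'WM1', 'M1', 'EC', 'M2', 'WM2')."""
    scores = np.asarray(binary_scores, dtype=float)
    labels = np.asarray(task_labels)
    return {task: scores[labels == task].mean() for task in np.unique(labels)}
```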
Figure 4. Task Classification Using Attentiveness and Screen Engagement Metrics.
(A) Pipeline for task classification (Working Memory [WM], Maze [M], Effort Control [EC]) using attentiveness and screen engagement scores via supervised and unsupervised classifiers. (B) Classification accuracy versus window size, with Random Forest (black), K-Means (green), and chance (red dashed line); red circles mark peak accuracies. (C) Confusion matrix for K-Means at 450-frame window size, with accuracy of 0.575. (D) Feature importance for Random Forest at 540-frame window size, with attentiveness (0.222 ±0.007) and screen engagement (0.778 ±0.007). (E) Confusion matrix for Random Forest at 540-frame window size, with accuracy of 0.910.
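The supervised/unsupervised comparison in this figure can be sketched with scikit-learn as below. The two-column feature matrix (windowed attentiveness and screen engagement scores), the synthetic placeholder data, and the hyperparameters are assumptions for illustration, not the authors' exact settings.

```python
# Sketch of task classification from windowed attentiveness/engagement features
# (synthetic placeholder data; feature layout and hyperparameters are assumptions).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# X: one row per window, columns = [mean attentiveness, mean screen engagement]
X = rng.random((300, 2))
# y: task label per window, e.g. 0 = WM, 1 = M, 2 = EC
y = rng.integers(0, 3, size=300)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
print("RF cross-validated accuracy:", cross_val_score(rf, X, y, cv=5).mean())
rf.fit(X, y)
print("Feature importances:", rf.feature_importances_)  # cf. Fig. 4D

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Cluster labels are arbitrary; map them to tasks (e.g. by majority vote within
# each cluster) before computing an accuracy comparable to the supervised classifier.
```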
