Sci Rep. 2025 Oct 6;15(1):34723. doi: 10.1038/s41598-025-18341-y.

EgoVision: a YOLO-ViT hybrid for robust egocentric object recognition

Umm E Sadima et al. Sci Rep. 2025.

Abstract

The rapid advancement of egocentric vision has opened new frontiers in computer vision, particularly in assistive technologies, augmented reality, and human-computer interaction. Despite this potential, object recognition from first-person perspectives remains challenging due to occlusion, motion blur, and frequent viewpoint changes. This paper introduces EgoVision, a lightweight hybrid deep learning framework that fuses the spatial precision of YOLOv8 with the global contextual reasoning of Vision Transformers (ViT) for object classification in static egocentric frames drawn from the HOI4D dataset. To the best of our knowledge, this is the first time a fused architecture has been applied to static object recognition on HOI4D, specifically targeting real-time use in robotics and augmented reality. The framework employs a key-frame extraction strategy and a feature pyramid network to handle multiscale spatial-temporal features efficiently, significantly reducing computational overhead for real-time applications. Extensive experiments demonstrate that EgoVision outperforms existing models across multiple metrics, achieving up to 99% accuracy on complex object classes such as 'Kettle' and 'Chair' while remaining efficient enough for deployment on wearable and edge devices. These results establish EgoVision as a robust foundation for next-generation egocentric AI systems.
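
The abstract's fusion of local CNN features with global ViT embeddings can be illustrated with a minimal sketch, assuming PyTorch and timm; the small CNN below is only a stand-in for the YOLOv8 backbone, and all names, layer sizes, and the 20-class head are hypothetical rather than taken from the paper.

```python
# Minimal sketch of the YOLO+ViT fusion idea (not the authors' code).
import torch
import torch.nn as nn
import timm

class EgoVisionSketch(nn.Module):
    def __init__(self, num_classes: int = 20):
        super().__init__()
        # Stand-in for the YOLOv8 backbone: any CNN producing local features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # -> (B, 128)
        )
        # ViT as a global feature extractor (num_classes=0 -> pooled embedding).
        self.vit = timm.create_model("vit_base_patch16_224",
                                     pretrained=False, num_classes=0)
        self.head = nn.Linear(128 + self.vit.num_features, num_classes)

    def forward(self, x):                                # x: (B, 3, 224, 224)
        local_feats = self.cnn(x)                        # spatially precise CNN features
        global_feats = self.vit(x)                       # global contextual ViT features
        return self.head(torch.cat([local_feats, global_feats], dim=1))

logits = EgoVisionSketch()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 20])
```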

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
EgoVision: Proposed framework for object recognition integrating YOLOv8 and ViT.
Fig. 2
Sample frames from HOI4D showing diverse egocentric human-object interactions.
Fig. 3
Distribution of samples per object category in the HOI4D dataset.
Fig. 4
Key-frame selection capturing pre-, during-, and post-interactions from input videos.
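
The paper's exact pre-, during-, and post-interaction criterion is not given on this page; as an assumed illustration only, a simple frame-difference heuristic in OpenCV could select candidate key frames like this (diff_thresh is a made-up parameter):

```python
# A minimal key-frame extraction sketch (assumed approach, not the
# paper's criterion): keep frames that differ enough from the last
# kept frame, a rough proxy for interaction changes.
import cv2
import numpy as np

def extract_key_frames(video_path: str, diff_thresh: float = 30.0):
    cap = cv2.VideoCapture(video_path)
    key_frames, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is None or np.mean(cv2.absdiff(gray, prev)) > diff_thresh:
            key_frames.append(frame)  # frame changed enough: keep it
            prev = gray
    cap.release()
    return key_frames
```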
Fig. 5
(a) Dataset distribution for YOLO fine-tuning; (b) YAML file configuration listing the dataset path and class labels.
Fig. 6
Backbone architecture of YOLOv8 for multiscale feature extraction.
Fig. 7
Feature Pyramid Network (FPN) structure for aligning multidimensional YOLOv8 features.
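
A feature pyramid network of the kind named in Fig. 7 can be sketched as follows, assuming PyTorch; the channel counts and top-down fusion below are illustrative, not the paper's configuration:

```python
# Minimal FPN sketch: project multiscale feature maps to a common
# channel width, then fuse coarse maps into finer ones top-down.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(128, 256, 512), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions align every scale to out_channels.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1)
                                     for c in in_channels)

    def forward(self, feats):  # feats: list of maps, high-res -> low-res
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):
            # Upsample the coarser map and add it to the finer one.
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return laterals

maps = [torch.randn(1, c, s, s) for c, s in [(128, 32), (256, 16), (512, 8)]]
aligned = TinyFPN()(maps)
print([m.shape for m in aligned])  # all 256-channel, original resolutions
```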
Fig. 8
Vision Transformer pipeline for global feature extraction via patch splitting, embedding, and encoding.
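
The patch splitting, embedding, and encoding steps of Fig. 8 follow the standard ViT recipe; a minimal PyTorch sketch, where patch size 16, embedding width 768, and the truncated two-layer encoder are assumptions:

```python
# Standard ViT front end: patchify, linearly embed, add class token
# and positional embeddings, then run a (truncated) encoder stack.
import torch
import torch.nn as nn

patch, dim = 16, 768
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify + embed

img = torch.randn(1, 3, 224, 224)
tokens = to_patches(img).flatten(2).transpose(1, 2)   # (1, 196, 768) patch tokens
cls = nn.Parameter(torch.zeros(1, 1, dim))            # learnable class token
pos = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, dim))  # positional embedding
x = torch.cat([cls.expand(1, -1, -1), tokens], dim=1) + pos
encoded = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2)(x)                                  # depth truncated for brevity
print(encoded.shape)  # torch.Size([1, 197, 768])
```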
Fig. 9
5-fold cross-validation for evaluating classification performance with Random Forest.
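
Fig. 9's protocol, 5-fold cross-validation of a Random Forest, maps directly onto scikit-learn; in this sketch X and y are placeholders for the fused image embeddings and labels, and the feature width of 896 is hypothetical:

```python
# 5-fold stratified cross-validation of a Random Forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.random.rand(200, 896)      # placeholder fused CNN+ViT features
y = np.repeat(np.arange(20), 10)  # placeholder labels: 20 classes x 10 samples

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=cv)
print(f"fold accuracies: {scores.round(3)}, mean: {scores.mean():.3f}")
```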
Fig. 10
Confusion matrix for object recognition performance.
Fig. 11
Bar plot of class-wise recognition accuracy.
Fig. 12
Object-wise F1-score for the recognition task.
Fig. 13
Precision-recall curve evaluating model performance across different thresholds.
Fig. 14
ROC curve evaluating the trade-off between true positive rate and false positive rate.
Fig. 15
Mean average precision (mAP) plot illustrating recognition accuracy across all object classes.
