Sci Rep. 2025 Oct 6;15(1):34723. doi: 10.1038/s41598-025-18341-y.

EgoVision: a YOLO-ViT hybrid for robust egocentric object recognition

Umm E Sadima et al. Sci Rep. 2025.

Abstract

The rapid advancement of egocentric vision has opened new frontiers in computer vision, particularly in assistive technologies, augmented reality, and human-computer interaction. Despite this potential, object recognition from first-person perspectives remains challenging due to occlusion, motion blur, and frequent viewpoint changes. This paper introduces EgoVision, a lightweight hybrid deep learning framework that fuses the spatial precision of YOLOv8 with the global contextual reasoning of Vision Transformers (ViT) for object classification in static egocentric frames drawn from the HOI4D dataset. To the best of our knowledge, this is the first time a fused architecture has been applied to static object recognition on HOI4D, specifically targeting real-time use in robotics and augmented reality. The framework employs a key-frame extraction strategy and a feature pyramid network to handle multiscale spatial-temporal features efficiently, significantly reducing computational overhead for real-time applications. Extensive experiments demonstrate that EgoVision outperforms existing models across multiple metrics, achieving up to 99% accuracy on complex object classes such as 'Kettle' and 'Chair' while remaining efficient enough for deployment on wearable and edge devices. These results establish EgoVision as a robust foundation for next-generation egocentric AI systems.
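
The abstract's fusion of local CNN features with global ViT embeddings can be illustrated with a minimal sketch, assuming PyTorch and timm; the small CNN below is only a stand-in for the YOLOv8 backbone, and all names, layer sizes, and the 20-class head are hypothetical rather than taken from the paper.

```python
# Minimal sketch of the YOLO+ViT fusion idea (not the authors' code).
import torch
import torch.nn as nn
import timm

class EgoVisionSketch(nn.Module):
    def __init__(self, num_classes: int = 20):
        super().__init__()
        # Stand-in for the YOLOv8 backbone: any CNN producing local features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # -> (B, 128)
        )
        # ViT as a global feature extractor (num_classes=0 -> pooled embedding).
        self.vit = timm.create_model("vit_base_patch16_224",
                                     pretrained=False, num_classes=0)
        self.head = nn.Linear(128 + self.vit.num_features, num_classes)

    def forward(self, x):                                # x: (B, 3, 224, 224)
        local_feats = self.cnn(x)                        # spatially precise CNN features
        global_feats = self.vit(x)                       # global contextual ViT features
        return self.head(torch.cat([local_feats, global_feats], dim=1))

logits = EgoVisionSketch()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 20])
```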

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
EgoVision: Proposed framework for object recognition integrating YOLOv8 and ViT.
Fig. 2
Sample frames from HOI4D showing diverse egocentric human-object interactions.
Fig. 3
Distribution of samples per object category in the HOI4D dataset.
Fig. 4
Key-frame selection capturing pre-, during-, and post-interactions from input videos.
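
The paper's exact pre-, during-, and post-interaction criterion is not given on this page; as an assumed illustration only, a simple frame-difference heuristic in OpenCV could select candidate key frames like this (diff_thresh is a made-up parameter):

```python
# A minimal key-frame extraction sketch (assumed approach, not the
# paper's criterion): keep frames that differ enough from the last
# kept frame, a rough proxy for interaction changes.
import cv2
import numpy as np

def extract_key_frames(video_path: str, diff_thresh: float = 30.0):
    cap = cv2.VideoCapture(video_path)
    key_frames, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is None or np.mean(cv2.absdiff(gray, prev)) > diff_thresh:
            key_frames.append(frame)  # frame changed enough: keep it
            prev = gray
    cap.release()
    return key_frames
```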
Fig. 5
(a) Dataset distribution for YOLO fine-tuning; (b) YAML file configuration listing the dataset path and class labels.
Fig. 6
Backbone architecture of YOLOv8 for multiscale feature extraction.
Fig. 7
Feature Pyramid Network (FPN) structure for aligning multidimensional YOLOv8 features.
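
A feature pyramid network of the kind named in Fig. 7 can be sketched as follows, assuming PyTorch; the channel counts and top-down fusion below are illustrative, not the paper's configuration:

```python
# Minimal FPN sketch: project multiscale feature maps to a common
# channel width, then fuse coarse maps into finer ones top-down.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(128, 256, 512), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions align every scale to out_channels.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1)
                                     for c in in_channels)

    def forward(self, feats):  # feats: list of maps, high-res -> low-res
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):
            # Upsample the coarser map and add it to the finer one.
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return laterals

maps = [torch.randn(1, c, s, s) for c, s in [(128, 32), (256, 16), (512, 8)]]
aligned = TinyFPN()(maps)
print([m.shape for m in aligned])  # all 256-channel, original resolutions
```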
Fig. 8
Vision Transformer pipeline for global feature extraction via patch splitting, embedding, and encoding.
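
The patch splitting, embedding, and encoding steps of Fig. 8 follow the standard ViT recipe; a minimal PyTorch sketch, where patch size 16, embedding width 768, and the truncated two-layer encoder are assumptions:

```python
# Standard ViT front end: patchify, linearly embed, add class token
# and positional embeddings, then run a (truncated) encoder stack.
import torch
import torch.nn as nn

patch, dim = 16, 768
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify + embed

img = torch.randn(1, 3, 224, 224)
tokens = to_patches(img).flatten(2).transpose(1, 2)   # (1, 196, 768) patch tokens
cls = nn.Parameter(torch.zeros(1, 1, dim))            # learnable class token
pos = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, dim))  # positional embedding
x = torch.cat([cls.expand(1, -1, -1), tokens], dim=1) + pos
encoded = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2)(x)                                  # depth truncated for brevity
print(encoded.shape)  # torch.Size([1, 197, 768])
```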
Fig. 9
5-fold cross-validation for evaluating classification performance with Random Forest.
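
Fig. 9's protocol, 5-fold cross-validation of a Random Forest, maps directly onto scikit-learn; in this sketch X and y are placeholders for the fused image embeddings and labels, and the feature width of 896 is hypothetical:

```python
# 5-fold stratified cross-validation of a Random Forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.random.rand(200, 896)      # placeholder fused CNN+ViT features
y = np.repeat(np.arange(20), 10)  # placeholder labels: 20 classes x 10 samples

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=cv)
print(f"fold accuracies: {scores.round(3)}, mean: {scores.mean():.3f}")
```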
Fig. 10
Confusion matrix for object recognition performance.
Fig. 11
Bar plot of class-wise recognition accuracy.
Fig. 12
Object-wise F1-score for the recognition task.
Fig. 13
Precision-recall curve evaluating model performance across different thresholds.
Fig. 14
ROC curve evaluating the trade-off between true positive rate and false positive rate.
Fig. 15
Mean average precision (mAP) plot illustrating recognition accuracy across all object classes.
