Sensors. 2024 Apr 30;24(9):2889. doi: 10.3390/s24092889.

FusionVision: A Comprehensive Approach of 3D Object Reconstruction and Segmentation from RGB-D Cameras Using YOLO and Fast Segment Anything

Safouane El Ghazouali et al. Sensors (Basel).

Abstract

In the realm of computer vision, integrating advanced techniques into the pre-processing of RGB-D camera inputs poses a significant challenge, given the inherent complexities arising from diverse environmental conditions and varying object appearances. This paper therefore introduces FusionVision, an exhaustive pipeline adapted for the robust 3D segmentation of objects in RGB-D imagery. Traditional computer vision systems, designed mainly for RGB cameras, struggle to simultaneously capture precise object boundaries and achieve high-precision object detection on depth maps. To address this challenge, FusionVision merges state-of-the-art object detection techniques with advanced instance segmentation methods. Integrating these components enables a holistic interpretation of RGB-D data, unifying the analysis of information from both the color (RGB) and depth (D) channels and facilitating the extraction of comprehensive and accurate object information, which in turn improves downstream tasks such as object 6D pose estimation, Simultaneous Localization and Mapping (SLAM), and accurate 3D dataset extraction. The proposed FusionVision pipeline employs YOLO to identify objects within the RGB image domain. Subsequently, FastSAM, an innovative semantic segmentation model, is applied to delineate object boundaries, yielding refined segmentation masks. The synergy between these components and their integration into 3D scene understanding ensures a cohesive fusion of object detection and segmentation, enhancing overall precision in 3D object segmentation.
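The detect-then-segment fusion described above can be sketched in plain Python: a YOLO-style bounding box restricts a FastSAM-style binary mask to the detected region, yielding a per-object mask. The `restrict_mask_to_box` helper and the toy 4x4 mask below are illustrative assumptions, not the authors' implementation (which runs the actual YOLO and FastSAM models on RGB frames).

```python
def restrict_mask_to_box(mask, box):
    """Zero out mask pixels outside a detection box.

    mask: 2D list of 0/1 values (a stand-in for a FastSAM output).
    box:  (x1, y1, x2, y2) pixel coordinates from a YOLO detection,
          with (x2, y2) exclusive.
    Returns a new mask kept only where it overlaps the box.
    """
    x1, y1, x2, y2 = box
    return [
        [v if (x1 <= x < x2 and y1 <= y < y2) else 0
         for x, v in enumerate(row)]
        for y, row in enumerate(mask)
    ]

# Toy 4x4 mask: one object upper-left, a spurious blob lower-right.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
# A detection box covering only the upper-left 2x2 region
# suppresses the blob outside it.
refined = restrict_mask_to_box(mask, (0, 0, 2, 2))
```

In the full pipeline this per-object mask would then be aligned with the depth frame to isolate the object's points in 3D.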

Keywords: 3D localization; 3D object detection; 3D reconstruction; RGB-D; SAM; point-cloud.


Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
Example of RGB-D camera scene capture and 3D reconstruction. (a) 3D reconstruction from the RGB-D depth channel. (b) RGB stream captured by the RGB sensor. (c) Visual estimation of depth with the JET colormap (closer objects are shown in green; distant ones appear as dark blue regions).
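The depth-channel reconstruction in panel (a) rests on the standard pinhole back-projection of each valid depth pixel into a 3D point. A minimal sketch follows; the intrinsics `fx`, `fy`, `cx`, `cy` and the 2x2 depth map are illustrative values, not those of any specific RGB-D camera.

```python
def backproject(depth, fx, fy, cx, cy, scale=0.001):
    """Back-project a depth image into a 3D point cloud (pinhole model).

    depth: 2D list of raw depth readings (e.g. millimetres); 0 = no return.
    fx, fy: focal lengths in pixels; cx, cy: principal point.
    scale: multiplier converting raw units to metres.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d == 0:               # skip invalid pixels
                continue
            z = d * scale            # depth in metres
            x = (u - cx) * z / fx    # pinhole model, x axis
            y = (v - cy) * z / fy    # pinhole model, y axis
            points.append((x, y, z))
    return points

# Toy 2x2 depth map: 1000 mm everywhere except one invalid pixel.
cloud = backproject([[1000, 1000], [0, 1000]],
                    fx=600.0, fy=600.0, cx=0.5, cy=0.5)
```

RGB-D SDKs typically expose an equivalent deprojection routine; the sketch only shows the geometry behind it.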
Figure 2
Complex YOLO framework for 3D object reconstruction and localization [47].
Figure 3
Proposed FusionVision pipeline for real-time 3D object segmentation and localization using fused YOLO and FastSAM applied on RGB-D sensor.
Figure 4
Visual representation of RGB camera alignment with the depth sensor.
Figure 5
Example of acquired images for YOLO training: the top two images are originals; the bottom two are augmented images.
Figure 6
YOLO training curves: (a) bbox loss, (b) cls loss, (c) precision and recall, and (d) mAP50 and mAP50-95.
Figure 7
Visuals of the YOLO detection, FastSAM mask extraction, and binary mask estimation: (a) using the pre-trained YOLO model; (b) using the custom trained YOLO model.
Figure 8
Overall evaluation metrics of FastSAM applied to YOLO-extracted bounding boxes, compared against ground-truth annotations. Blue points denote metric values; black segments denote standard deviations.
Figure 9
Example of FastSAM misestimation of the segmentation mask: (a) original image, (b) ground truth annotation mask, and (c) FastSAM estimated mask.
Figure 10
Three-dimensional object reconstruction from the aligned FastSAM mask: (a) raw point cloud and (b) point cloud post-processed by voxel downsampling and statistical denoising. The left images visualize the YOLO detection, FastSAM mask extraction, and binary mask estimation at specific positions of the physical objects within the frame.
Figure 11
Post-processing impact on 3D object reconstruction: (a) raw point clouds, (b) downsampled point clouds, and (c) downsampled and denoised point clouds.
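The two post-processing stages compared in Figures 10 and 11 can be sketched in plain Python. `voxel_downsample` and `remove_statistical_outliers` below are simplified stand-ins for the equivalent point-cloud library routines (e.g. Open3D's), shown on toy data; parameter values are illustrative.

```python
import math
from collections import defaultdict

def voxel_downsample(points, voxel):
    """Replace all points falling in the same voxel by their centroid."""
    bins = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / voxel) for c in p)
        bins[key].append(p)
    return [tuple(sum(c) / len(ps) for c in zip(*ps))
            for ps in bins.values()]

def remove_statistical_outliers(points, k=3, std_ratio=1.0):
    """Drop points whose mean distance to their k nearest neighbours
    exceeds the global mean by std_ratio standard deviations."""
    mean_d = []
    for p in points:
        ds = sorted(math.dist(p, q) for q in points if q is not p)[:k]
        mean_d.append(sum(ds) / len(ds))
    mu = sum(mean_d) / len(mean_d)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in mean_d) / len(mean_d))
    thresh = mu + std_ratio * sigma
    return [p for p, d in zip(points, mean_d) if d <= thresh]

# Toy cloud: a tight cluster near the origin plus one stray point.
raw = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.0, 0.01, 0.0),
       (0.01, 0.01, 0.0), (5.0, 5.0, 5.0)]
down = voxel_downsample(raw, 0.05)            # cluster collapses to one centroid
clean = remove_statistical_outliers(raw, k=3)  # drops the stray point
```

The brute-force neighbour search is O(n^2) and only meant to show the statistic; real implementations use a k-d tree.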

References

    1. Liu M. Robotic Online Path Planning on Point Cloud. IEEE Trans. Cybern. 2016;46:1217–1228. doi: 10.1109/TCYB.2015.2430526.
    2. Ding Z., Sun Y., Xu S., Pan Y., Peng Y., Mao Z. Recent Advances and Perspectives in Deep Learning Techniques for 3D Point Cloud Data Processing. Robotics. 2023;12:100. doi: 10.3390/robotics12040100.
    3. Krawczyk D., Sitnik R. Segmentation of 3D Point Cloud Data Representing Full Human Body Geometry: A Review. Pattern Recognit. 2023;139:109444. doi: 10.1016/j.patcog.2023.109444.
    4. Wu F., Qian Y., Zheng H., Zhang Y., Zheng X. A Novel Neighbor Aggregation Function for Medical Point Cloud Analysis; Proceedings of the Computer Graphics International Conference; Shanghai, China, 28 August–1 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 301–312.
    5. Xie X., Wei H., Yang Y. Real-Time LiDAR Point-Cloud Moving Object Segmentation for Autonomous Driving. Sensors. 2023;23:547. doi: 10.3390/s23010547.
