A Manufacturing-Oriented Intelligent Vision System Based on Deep Neural Network for Object Recognition and 6D Pose Estimation

Guoyuan Liang et al. Front Neurorobot. 2021 Jan 7;14:616775. doi: 10.3389/fnbot.2020.616775. eCollection 2020.

Abstract

Nowadays, intelligent robots are widely applied in the manufacturing industry, in various workplaces and assembly lines. In most manufacturing tasks, determining the category and pose of parts is important yet challenging due to complex environments. This paper presents a new two-stage intelligent vision system, based on a deep neural network with RGB-D image inputs, for object recognition and 6D pose estimation. A densely connected network fusing multi-scale features is first built to segment the objects from the background. The 2D pixels and 3D points in the cropped object regions are then fed into a pose estimation network, which predicts object poses based on the fusion of color and geometry features. By introducing channel and position attention modules, the pose estimation network extracts features effectively, emphasizing important features while suppressing unnecessary ones. Comparative experiments with several state-of-the-art networks, conducted on two well-known benchmark datasets, YCB-Video and LineMOD, verified the effectiveness and superior performance of the proposed method. Moreover, we built a vision-guided robotic grasping system based on the proposed method, using a Kinova Jaco2 manipulator with an RGB-D camera installed. Grasping experiments showed that the robot system can effectively perform common operations such as picking up and moving objects, demonstrating its potential for a wide range of real-time manufacturing applications.

Keywords: 6D pose estimation; deep neural network; intelligent manufacturing; object recognition; semantic segmentation.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
The overall framework of the two-stage network for object recognition and pose estimation.
Figure 2
Four coordinate systems in the pinhole camera model: the world coordinate system, camera coordinate system, image coordinate system, and pixel coordinate system. P(Xw, Yw, Zw) is a 3D point and p(x, y) is its projection onto the image plane. f is the focal length, i.e., the distance between the origin of the camera coordinate system and the origin of the image coordinate system.
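For reference, the standard pinhole relations this figure depicts can be written compactly as below; here R and t denote the world-to-camera extrinsics, (u0, v0) the principal point, and dx, dy the physical pixel sizes, which are standard symbols not defined in the caption.

    \begin{aligned}
    \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} &= R \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + t,\\
    x &= f \, \frac{X_c}{Z_c}, \qquad y = f \, \frac{Y_c}{Z_c},\\
    u &= \frac{x}{dx} + u_0, \qquad v = \frac{y}{dy} + v_0,
    \end{aligned}

where (Xc, Yc, Zc) are camera coordinates, (x, y) image coordinates, and (u, v) pixel coordinates.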
Figure 3
The framework of the semantic segmentation network. (A) Network architecture: VGG16 is utilized to extract features from the image, while the MFFM is applied to aggregate feature maps from different layers. (B) The structure of the MFFM. (C) Legend for (A,B).
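The caption does not specify the internal design of the MFFM, so purely as an illustration of multi-scale feature aggregation, the sketch below reduces feature maps from several backbone layers to a common channel width, upsamples them to the finest resolution, and concatenates them. The module name MultiScaleFusion and all layer sizes are assumptions of this sketch, not details from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiScaleFusion(nn.Module):
        """A generic stand-in for a multi-scale feature fusion module (MFFM-like).

        Takes a list of backbone feature maps at different resolutions (coarse to
        fine), projects each to `out_channels` with a 1x1 convolution, upsamples
        to the finest resolution, and concatenates along the channel axis.
        """
        def __init__(self, in_channels=(512, 512, 256), out_channels=64):
            super().__init__()
            self.reduce = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

        def forward(self, feats):            # feats: list of (B, C_i, H_i, W_i)
            target = feats[-1].shape[-2:]    # spatial size of the finest map
            fused = [F.interpolate(conv(f), size=target, mode="bilinear", align_corners=False)
                     for conv, f in zip(self.reduce, feats)]
            return torch.cat(fused, dim=1)   # (B, len(feats) * out_channels, H, W)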
Figure 4
Object 6D pose estimation. The pose transformation from the object coordinate system to the camera coordinate system is determined by the 3D rotation matrix R and the 3D translation vector t.
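In equation form, the relation the figure depicts is the standard rigid-body transform (notation follows the caption):

    P_c = R \, P_o + t, \qquad R \in SO(3), \; t \in \mathbb{R}^3,

where P_o is a point expressed in the object coordinate system and P_c is the same point in the camera coordinate system; estimating the 6D pose amounts to recovering (R, t).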
Figure 5
The framework of the three-stage 6D pose estimation network. (A) Feature extraction stage: the color feature embedding is extracted by a fully convolutional network and the geometric feature embedding is extracted by a PointNet-based network. (B) Feature fusion stage: the two feature embeddings are fused together and then passed through the channel attention module, the position attention module, and the global feature extraction module to generate three types of features, all of which are fused and fed to the pose predictor. (C) Pose regression stage: the pose predictor, consisting of several 1D convolutions, regresses the 6D pose parameters and confidence scores.
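As an illustration of stage (C), the following is a minimal, DenseFusion-style sketch of a per-point pose regression head built from 1D convolutions. The layer widths, the quaternion parameterization of rotation, and the name PosePredictor are assumptions of this sketch rather than details taken from the paper.

    import torch
    import torch.nn as nn

    class PosePredictor(nn.Module):
        """Per-point pose regression head operating on fused color/geometry features.

        Input:  fused feature map of shape (B, C, N) -- C channels, N object points.
        Output: per-point rotation as a unit quaternion (B, N, 4),
                translation (B, N, 3), and confidence score (B, N, 1).
        """
        def __init__(self, in_channels=1408):
            super().__init__()
            def head(out_dim):
                return nn.Sequential(
                    nn.Conv1d(in_channels, 640, 1), nn.ReLU(),
                    nn.Conv1d(640, 256, 1), nn.ReLU(),
                    nn.Conv1d(256, 128, 1), nn.ReLU(),
                    nn.Conv1d(128, out_dim, 1),
                )
            self.rot_head = head(4)    # quaternion (qw, qx, qy, qz)
            self.trans_head = head(3)  # translation (tx, ty, tz)
            self.conf_head = head(1)   # per-point confidence

        def forward(self, fused):                                     # fused: (B, C, N)
            q = self.rot_head(fused).transpose(1, 2)                  # (B, N, 4)
            q = q / (q.norm(dim=2, keepdim=True) + 1e-8)              # normalize to unit quaternions
            t = self.trans_head(fused).transpose(1, 2)                # (B, N, 3)
            c = torch.sigmoid(self.conf_head(fused)).transpose(1, 2)  # (B, N, 1)
            return q, t, c

The prediction of the most confident point can then be taken as the final pose, e.g. best = c.squeeze(-1).argmax(dim=1).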
Figure 6
Schematic diagram of the position attention module. N is the number of features and C is the feature dimension.
Figure 7
Schematic diagram of the channel attention module. N is the number of features and C is the feature dimension.
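For readers unfamiliar with these modules, the sketch below shows one common way to implement position and channel attention over an N x C feature map (in the spirit of DANet-style self-attention); the exact design used in the paper may differ, so treat this as an illustrative assumption rather than the paper's implementation.

    import torch
    import torch.nn as nn

    class PositionAttention(nn.Module):
        """Self-attention across the N positions of a (B, C, N) feature map."""
        def __init__(self, channels):
            super().__init__()
            self.query = nn.Conv1d(channels, channels // 8, 1)
            self.key = nn.Conv1d(channels, channels // 8, 1)
            self.value = nn.Conv1d(channels, channels, 1)
            self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

        def forward(self, x):                                      # x: (B, C, N)
            q = self.query(x).transpose(1, 2)                      # (B, N, C/8)
            k = self.key(x)                                        # (B, C/8, N)
            attn = torch.softmax(torch.bmm(q, k), dim=-1)          # (B, N, N) position-to-position weights
            out = torch.bmm(self.value(x), attn.transpose(1, 2))   # (B, C, N)
            return self.gamma * out + x

    class ChannelAttention(nn.Module):
        """Self-attention across the C channels of a (B, C, N) feature map."""
        def __init__(self):
            super().__init__()
            self.gamma = nn.Parameter(torch.zeros(1))

        def forward(self, x):                                              # x: (B, C, N)
            attn = torch.softmax(torch.bmm(x, x.transpose(1, 2)), dim=-1)  # (B, C, C) channel-to-channel weights
            out = torch.bmm(attn, x)                                       # (B, C, N)
            return self.gamma * out + x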
Figure 8
The accuracy-threshold curves of the pose parameter errors. (A) The accuracy-threshold curve of the rotation angle error; (B) the accuracy-threshold curve of the translation error.
Figure 9
Some qualitative experimental results on the YCB-Video dataset. (A) The original images in the dataset; (B) segmentation results of DenseFusion; (C) pose estimation results of DenseFusion; (D) segmentation results of our method; (E) pose estimation results of our method.
Figure 10
Pose estimation results of our method for some images with cluttered backgrounds in the LineMOD dataset. The red and green boxes are the 2D projections of the objects' 3D bounding boxes transformed by the ground-truth and the predicted pose parameters, respectively.
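A minimal sketch of how such projected boxes can be computed from a pose estimate, given the camera intrinsic matrix K; the function name project_bbox and its interface are illustrative assumptions, not taken from the paper's code.

    import numpy as np

    def project_bbox(corners_obj, R, t, K):
        """Project the 8 corners of an object's 3D bounding box into the image.

        corners_obj: (8, 3) corners in the object coordinate system.
        R: (3, 3) rotation matrix, t: (3,) translation (object -> camera frame).
        K: (3, 3) camera intrinsic matrix.
        Returns (8, 2) pixel coordinates.
        """
        pts_cam = corners_obj @ R.T + t          # transform into the camera frame
        pts_img = pts_cam @ K.T                  # apply the intrinsics
        return pts_img[:, :2] / pts_img[:, 2:3]  # perspective division -> pixels

Drawing the box with the ground-truth pose (red) and with the predicted pose (green), as in the figure, repeats this call with the two sets of pose parameters.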
Figure 11
The framework of a vision-guided robotic grasping system.
Figure 12
Equipment and target objects used in the grasping experiments. (A) Some building blocks used as target objects; (B) the Kinova Jaco2 manipulator with a Percipio RGB-D camera installed on the side of the gripper.
Figure 13
Some experimental results of the robot vision system. Panel (A) shows the segmentation results, where different colors represent different objects. Panel (B) shows the pose estimation results, where the colored points are the 2D projections of the target object's point cloud after the pose transformation.
Figure 14
The complete process of the manipulator picking up an object and moving it to the target area.
