Superb Monocular Depth Estimation Based on Transfer Learning and Surface Normal Guidance

Kang Huang et al. Sensors (Basel). 2020 Aug 27;20(17):4856. doi: 10.3390/s20174856.
Abstract

Accurately sensing the surrounding 3D scene is indispensable for drones or robots to execute path planning and navigation. In this paper, a novel monocular depth estimation method is proposed that first uses a lightweight Convolutional Neural Network (CNN) for coarse depth prediction and then refines the coarse depth images using surface normal guidance. Specifically, the coarse depth prediction network is designed as a pre-trained encoder-decoder architecture for describing the 3D structure. For surface normal estimation, the network is designed as a two-stream encoder-decoder structure that hierarchically merges red-green-blue-depth (RGB-D) images to capture more accurate geometric boundaries. Relying on fewer network parameters and a simpler learning structure, our method produces more detailed depth maps than existing approaches. Moreover, 3D point cloud maps reconstructed from the predicted depth images confirm that our framework can be conveniently adopted as a component of a monocular simultaneous localization and mapping (SLAM) paradigm.
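
To make the transfer-learning setup described above more concrete, the following is a minimal PyTorch sketch of a coarse depth network that reuses an ImageNet-pretrained DenseNet-121 encoder and decodes with simple bilinear up-sampling and skip connections. It is not the authors' released code: the tap points, channel widths, skip wiring, and output resolution are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class UpBlock(nn.Module):
    # 2x bilinear up-sampling followed by a convolution that fuses an encoder skip feature.
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

class CoarseDepthNet(nn.Module):
    # Encoder: ImageNet-pretrained DenseNet-121 (transfer learning).
    # Decoder: four UpBlocks; predicts depth at half the input resolution (an assumption).
    def __init__(self):
        super().__init__()
        self.encoder = torchvision.models.densenet121(weights="IMAGENET1K_V1").features
        self.up1 = UpBlock(1024, 1024, 512)   # fuse denseblock3 features
        self.up2 = UpBlock(512, 512, 256)     # fuse denseblock2 features
        self.up3 = UpBlock(256, 256, 128)     # fuse denseblock1 features
        self.up4 = UpBlock(128, 64, 64)       # fuse relu0 features
        self.head = nn.Conv2d(64, 1, 3, padding=1)  # single-channel depth map

    def forward(self, rgb):
        skips, x = [], rgb
        for name, layer in self.encoder.named_children():
            x = layer(x)
            if name in ("relu0", "denseblock1", "denseblock2", "denseblock3"):
                skips.append(x)
        x = self.up1(x, skips[3])
        x = self.up2(x, skips[2])
        x = self.up3(x, skips[1])
        x = self.up4(x, skips[0])
        return self.head(x)

# Example: a 480x640 RGB image yields a 240x320 coarse depth map in this sketch.
depth = CoarseDepthNet()(torch.randn(1, 3, 480, 640))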

Keywords: SFM; SLAM; monocular depth estimation; multi-task learning; supervised deep learning; surface normal estimation; transfer learning.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure A1. Results of different methods on NYU Depth V2. (a) Red-green-blue (RGB) images, (b) ground truth (GT), (c) Ranftl [29], (d) Hu [30], (e) Alhashim [3], (f) ours.
Figure A2. Comparison of three-dimensional point cloud maps on NYU Depth V2.
Figure A3. Point cloud maps reconstructed from custom images.
Figure 1. Comparison of depth maps produced by different methods. (a) Raw red-green-blue (RGB) images, (b) ground truth (GT) depth maps [14], (c) depth maps from the state-of-the-art (SOTA) practice [7], (d) depth maps from our depth prediction network.
Figure 2. Comparison of surface normal maps. (a) From left to right: RGB images, ground truth (GT), surface normal maps produced by Qi et al. [10], ours. (b) Color-map definition: red represents left, green represents up, and blue represents outward.
Figure 3. General estimation framework. (a) Coarse depth estimation network; (b) red-green-blue-depth (RGB-D) surface normal network; (c) refinement network.
Figure 4. Encoder-decoder coarse depth network.
Figure 5. Generating the coarse surface normal image. (a) Coarse depth (D*); (b) coarse surface normal (N*).
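
Figure 5's step of turning the coarse depth D* into a coarse surface normal image N* can be illustrated with a short sketch: back-project each pixel to a 3D point using assumed pinhole intrinsics, take finite-difference tangent vectors, and normalize their cross product. The intrinsics, differencing scheme, and border handling here are assumptions for illustration rather than the paper's exact formulation.

import numpy as np

def depth_to_normals(depth, fx, fy, cx, cy):
    # depth: (H, W) array of depth values; fx, fy, cx, cy: assumed pinhole intrinsics.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each pixel to a 3D point in the camera frame.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)           # (H, W, 3)
    # Tangent vectors along image rows/columns via central differences (borders left zero).
    dx = np.zeros_like(points)
    dy = np.zeros_like(points)
    dx[:, 1:-1] = points[:, 2:] - points[:, :-2]
    dy[1:-1, :] = points[2:, :] - points[:-2, :]
    # Surface normal as the normalized cross product of the two tangents.
    normals = np.cross(dx, dy)                           # (H, W, 3)
    norm = np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals / np.clip(norm, 1e-8, None)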
Figure 6. Surface normal adjustment network (DenseNet-121 based). (a) The general structure of the RGB-D surface normal (RSN) network; (b) the architectures of the up-projection units, fusion module, and convolution blocks.
Figure 7. Comparison of point clouds from estimated depth maps between Alhashim [6] and ours. (a) Alhashim [6], (b) GT, (c) ours. GT stands for 3D point cloud maps from ground truth images.
Figure 8. Qualitative results for depth estimation. (a) RGB images, (b) ground truth (GT), (c) Alhashim [2], (d) our coarse depth, (e) refined depth.
Figure 9. Qualitative results for surface normal estimation. (a) RGB images, (b) ground truth normal maps, (c) GeoNet [10], (d) reconstructed from in-painted ground truth depth, (e) ours, and (f) reconstructed from refined depth. All images are equally scaled for better visualization.
Figure 10. Comparison of time consumption: runtimes of different up-sampling methods, including 2× bilinear interpolation [6], up-and-down projection [48], and up-projection [17].
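
As a rough companion to the runtime comparison in Figure 10, the micro-benchmark below times plain 2x bilinear interpolation against a simplified up-projection-style block (a learned 2x expansion with a residual projection branch, loosely in the spirit of Laina et al. [17]). The layer sizes, input resolution, and timing loop are assumptions and do not reproduce the paper's measurement protocol.

import time
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpProjection(nn.Module):
    # Simplified up-projection-style block: nearest-neighbor 2x expansion,
    # a two-convolution main branch, and a residual projection branch.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.proj = nn.Conv2d(in_ch, out_ch, 5, padding=2)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return F.relu(self.main(x) + self.proj(x))

def bench(fn, x, iters=20):
    fn(x)  # warm-up pass
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

x = torch.randn(1, 64, 120, 160)  # assumed feature-map size for illustration
up_proj = UpProjection(64, 32).eval()
with torch.no_grad():
    t_bilinear = bench(lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                               align_corners=False), x)
    t_upproj = bench(up_proj, x)
print(f"bilinear 2x: {t_bilinear * 1e3:.2f} ms, up-projection: {t_upproj * 1e3:.2f} ms")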
Figure 11. Refined depth images generated from custom images (DenseNet-161 model).

References

    1. Guizilini V., Ambrus R., Pillai S., Gaidon A. PackNet-SfM: 3D packing for self-supervised monocular depth estimation. arXiv. 2019. arXiv:1905.02693.
    2. Ummenhofer B., Zhou H., Uhrig J., Mayer N., Ilg E., Dosovitskiy A., Brox T. DeMoN: Depth and motion network for learning monocular stereo; Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 5622–5631.
    3. Zhou H., Ummenhofer B., Brox T. DeepTAM: Deep tracking and mapping; Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany. 8–14 September 2018; pp. 822–838.
    4. Yu C., Liu Z., Liu X., Xie F., Yang Y., Wei Q., Fei Q. DS-SLAM: A semantic visual SLAM towards dynamic environments; Proceedings of the 2018 IEEE International Conference on Intelligent Robots and Systems (IROS); Madrid, Spain. 1–5 October 2018; pp. 1168–1174.
    5. Huang G., Liu Z., Maaten L., Weinberger K.Q. Densely connected convolutional networks; Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 2261–2269.
