Superb Monocular Depth Estimation Based on Transfer Learning and Surface Normal Guidance

Kang Huang et al. Sensors (Basel). 2020 Aug 27;20(17):4856. doi: 10.3390/s20174856.
Abstract

Accurately sensing the surrounding 3D scene is indispensable for drones or robots to execute path planning and navigation. In this paper, a novel monocular depth estimation method is proposed that first uses a lightweight Convolutional Neural Network (CNN) for coarse depth prediction and then refines the coarse depth images using surface normal guidance. Specifically, the coarse depth prediction network is designed as a pre-trained encoder-decoder architecture for describing the 3D structure. For surface normal estimation, the network is designed as a two-stream encoder-decoder structure that hierarchically merges red-green-blue-depth (RGB-D) images to capture more accurate geometric boundaries. Relying on fewer network parameters and a simpler learning structure, our method produces more detailed depth maps than existing approaches. Moreover, 3D point cloud maps reconstructed from the predicted depth images confirm that our framework can be conveniently adopted as a component of a monocular simultaneous localization and mapping (SLAM) paradigm.
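
To make the transfer-learning setup described above more concrete, the following is a minimal PyTorch sketch of a coarse depth network that reuses an ImageNet-pretrained DenseNet-121 encoder and decodes with simple bilinear up-sampling and skip connections. It is not the authors' released code: the tap points, channel widths, skip wiring, and output resolution are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class UpBlock(nn.Module):
    # 2x bilinear up-sampling followed by a convolution that fuses an encoder skip feature.
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

class CoarseDepthNet(nn.Module):
    # Encoder: ImageNet-pretrained DenseNet-121 (transfer learning).
    # Decoder: four UpBlocks; predicts depth at half the input resolution (an assumption).
    def __init__(self):
        super().__init__()
        self.encoder = torchvision.models.densenet121(weights="IMAGENET1K_V1").features
        self.up1 = UpBlock(1024, 1024, 512)   # fuse denseblock3 features
        self.up2 = UpBlock(512, 512, 256)     # fuse denseblock2 features
        self.up3 = UpBlock(256, 256, 128)     # fuse denseblock1 features
        self.up4 = UpBlock(128, 64, 64)       # fuse relu0 features
        self.head = nn.Conv2d(64, 1, 3, padding=1)  # single-channel depth map

    def forward(self, rgb):
        skips, x = [], rgb
        for name, layer in self.encoder.named_children():
            x = layer(x)
            if name in ("relu0", "denseblock1", "denseblock2", "denseblock3"):
                skips.append(x)
        x = self.up1(x, skips[3])
        x = self.up2(x, skips[2])
        x = self.up3(x, skips[1])
        x = self.up4(x, skips[0])
        return self.head(x)

# Example: a 480x640 RGB image yields a 240x320 coarse depth map in this sketch.
depth = CoarseDepthNet()(torch.randn(1, 3, 480, 640))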

Keywords: SFM; SLAM; monocular depth estimation; multi-task learning; supervised deep learning; surface normal estimation; transfer learning.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure A1. Results of different methods on NYU Depth V2. (a) Red-green-blue (RGB) images, (b) ground truth (GT), (c) Ranftl [29], (d) Hu [30], (e) Alhashim [3], (f) ours.
Figure A2. Comparison of three-dimensional point cloud maps on NYU Depth V2.
Figure A3. Point cloud maps reconstructed from custom images.
Figure 1. Comparison of depth maps produced by different methods. (a) Raw red-green-blue (RGB) images, (b) ground truth (GT) depth maps [14], (c) depth maps from the state-of-the-art (SOTA) practice [7], (d) depth maps from our depth prediction network.
Figure 2. Comparison of surface normal maps. (a) From left to right: RGB images, ground truth (GT), surface normal maps produced by Qi et al. [10], ours. (b) Color-map definition: red represents left, green represents up, and blue represents outward.
Figure 3. General estimation framework. (a) Coarse depth estimation network; (b) red-green-blue-depth (RGB-D) surface normal network; (c) refinement network.
Figure 4. Encoder-decoder coarse depth network.
Figure 5. Generating the coarse surface normal image. (a) Coarse depth (D*); (b) coarse surface normal (N*).
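
Figure 5's step of turning the coarse depth D* into a coarse surface normal image N* can be illustrated with a short sketch: back-project each pixel to a 3D point using assumed pinhole intrinsics, take finite-difference tangent vectors, and normalize their cross product. The intrinsics, differencing scheme, and border handling here are assumptions for illustration rather than the paper's exact formulation.

import numpy as np

def depth_to_normals(depth, fx, fy, cx, cy):
    # depth: (H, W) array of depth values; fx, fy, cx, cy: assumed pinhole intrinsics.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each pixel to a 3D point in the camera frame.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)           # (H, W, 3)
    # Tangent vectors along image rows/columns via central differences (borders left zero).
    dx = np.zeros_like(points)
    dy = np.zeros_like(points)
    dx[:, 1:-1] = points[:, 2:] - points[:, :-2]
    dy[1:-1, :] = points[2:, :] - points[:-2, :]
    # Surface normal as the normalized cross product of the two tangents.
    normals = np.cross(dx, dy)                           # (H, W, 3)
    norm = np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals / np.clip(norm, 1e-8, None)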
Figure 6. Surface normal adjustment network (DenseNet-121 based). (a) The general structure of the RGB-D surface normal (RSN) network; (b) the architectures of the up-projection units, fusion module, and convolution blocks.
Figure 7. Comparison of point clouds from estimated depth maps between Alhashim [6] and ours. (a) Alhashim [6], (b) GT, (c) ours. GT stands for 3D point cloud maps from ground truth images.
Figure 8. Qualitative results for depth estimation. (a) RGB images, (b) ground truth (GT), (c) Alhashim [2], (d) our coarse depth, (e) refined depth.
Figure 9. Qualitative results for surface normal estimation. (a) RGB images, (b) ground truth normal maps, (c) GeoNet [10], (d) reconstructed from in-painted ground truth depth, (e) ours, and (f) reconstructed from refined depth. All images are equally scaled for better visualization.
Figure 10. Comparison of time consumption: runtimes of different up-sampling methods, including 2× bilinear interpolation [6], up-and-down projection [48], and up-projection [17].
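
As a rough companion to the runtime comparison in Figure 10, the micro-benchmark below times plain 2x bilinear interpolation against a simplified up-projection-style block (a learned 2x expansion with a residual projection branch, loosely in the spirit of Laina et al. [17]). The layer sizes, input resolution, and timing loop are assumptions and do not reproduce the paper's measurement protocol.

import time
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpProjection(nn.Module):
    # Simplified up-projection-style block: nearest-neighbor 2x expansion,
    # a two-convolution main branch, and a residual projection branch.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.proj = nn.Conv2d(in_ch, out_ch, 5, padding=2)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return F.relu(self.main(x) + self.proj(x))

def bench(fn, x, iters=20):
    fn(x)  # warm-up pass
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

x = torch.randn(1, 64, 120, 160)  # assumed feature-map size for illustration
up_proj = UpProjection(64, 32).eval()
with torch.no_grad():
    t_bilinear = bench(lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                               align_corners=False), x)
    t_upproj = bench(up_proj, x)
print(f"bilinear 2x: {t_bilinear * 1e3:.2f} ms, up-projection: {t_upproj * 1e3:.2f} ms")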
Figure 11. Refined depth images generated from custom images (DenseNet-161 model).

References

    1. Guizilini V., Ambrus R., Pillai S., Gaidon A. PackNet-SfM: 3D packing for self-supervised monocular depth estimation. arXiv. 2019. arXiv:1905.02693.
    2. Ummenhofer B., Zhou H., Uhrig J., Mayer N., Ilg E., Dosovitskiy A., Brox T. DeMoN: Depth and motion network for learning monocular stereo; Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 5622–5631.
    3. Zhou H., Ummenhofer B., Brox T. DeepTAM: Deep tracking and mapping; Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany. 8–14 September 2018; pp. 822–838.
    4. Yu C., Liu Z., Liu X., Xie F., Yang Y., Wei Q., Fei Q. DS-SLAM: A semantic visual SLAM towards dynamic environments; Proceedings of the 2018 IEEE International Conference on Intelligent Robots and Systems (IROS); Madrid, Spain. 1–5 October 2018; pp. 1168–1174.
    5. Huang G., Liu Z., Maaten L., Weinberger K.Q. Densely connected convolutional networks; Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 2261–2269.
