Sensors (Basel). 2023 Apr 21;23(8):4167.
doi: 10.3390/s23084167.

A Preliminary Study of Deep Learning Sensor Fusion for Pedestrian Detection

Alfredo Chávez Plascencia et al. Sensors (Basel). 2023.

Abstract

Most pedestrian detection methods focus on bounding boxes and are based on fusing RGB with lidar. These methods do not reflect how the human eye perceives objects in the real world. Furthermore, lidar and vision can have difficulty detecting pedestrians in cluttered environments, and radar can be used to overcome this problem. Therefore, the motivation of this work is to explore, as a preliminary step, the feasibility of fusing lidar, radar, and RGB for pedestrian detection, with potential use in autonomous driving, using a fully connected convolutional neural network architecture for multimodal sensors. The core of the network is based on SegNet, a pixel-wise semantic segmentation network. In this context, lidar and radar were incorporated by transforming their 3D pointclouds into 2D grayscale images with 16-bit depth, and RGB images were incorporated with their three channels. The proposed architecture uses a single SegNet for each sensor reading, and the outputs are then fed to a fully connected neural network that fuses the three sensor modalities. Afterwards, an up-sampling network is applied to recover the fused data. Additionally, a custom dataset of 60 images was proposed for training the architecture, with an additional 10 for evaluation and 10 for testing, giving a total of 80 images. The experimental results show a training mean pixel accuracy of 99.7% and a training mean intersection over union (IoU) of 99.5%, while the testing mean IoU was 94.4% and the testing pixel accuracy was 96.2%. These results demonstrate the effectiveness of using semantic segmentation for pedestrian detection under the three sensor modalities. Despite some overfitting during experimentation, the model performed well in detecting people in test mode. It is therefore worth emphasizing that the focus of this work is to show that the method is feasible, as it works regardless of the dataset size; a larger dataset would nonetheless be necessary for more appropriate training. The method offers the advantage of detecting pedestrians as the human eye does, thereby resulting in less ambiguity. This work also proposes an extrinsic calibration matrix method for sensor alignment between radar and lidar based on singular value decomposition.
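
The radar-to-lidar extrinsic calibration mentioned above amounts to estimating a rigid transform (a rotation R and translation t, as in Figure 2) between two sets of matched points via singular value decomposition. The sketch below shows the classic SVD-based (Kabsch/Procrustes) solution to that problem; the function and array names and the use of NumPy are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def svd_rigid_alignment(radar_pts: np.ndarray, lidar_pts: np.ndarray):
    """Estimate R, t such that R @ radar_pts[i] + t ≈ lidar_pts[i].

    radar_pts, lidar_pts: (N, 3) arrays of matched points, e.g. the
    corner-reflector centers detected by each sensor (hypothetical input).
    """
    # Center both point sets on their centroids.
    mu_r = radar_pts.mean(axis=0)
    mu_l = lidar_pts.mean(axis=0)
    P = radar_pts - mu_r
    Q = lidar_pts - mu_l

    # Cross-covariance matrix and its SVD.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)

    # Rotation, with a reflection fix so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T

    # Translation from the radar frame to the lidar frame.
    t = mu_l - R @ mu_r
    return R, t

# Hypothetical usage: express radar detections in the lidar frame.
# R, t = svd_rigid_alignment(radar_pts, lidar_pts)
# radar_in_lidar = (R @ radar_pts.T).T + t
```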

Keywords: autonomous driving; convolutional neural networks; sensor calibration; sensor fusion.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
The sensors used on the testbench: (a) an L3CAM sensor consisting of a lidar and an RGB camera on the top and (b) a UMRR-96 Type 153 radar sensor at the bottom.
Figure 2
A schematic overview of the three frames: lidar, radar, and camera. The rotation matrix R and translation vector t from the radar frame to the lidar frame are also shown.
Figure 3
The styrofoam calibration board, with black dashed lines indicating the location of the corner reflector, which was placed at the center of the back of the board.
Figure 4
A custom trihedral corner reflector made of copper plates, showing the side-length edge of the three isosceles triangles (a) and the base of the triangles (L).
Figure 5
The lidar and radar frames detecting the board center and the corner reflector, respectively.
Figure 6
The centers of the four circles, the two lines (l1, l2), and the middle point (x, y).
Figure 7
The pixel-wise semantic segmentation SegNet CNN used in the article. The encoder is on the left, and the decoder is on the right.
Figure 8
The architecture model consists of three encoder sub-networks, one for each sensor, a fully connected neural network, and a decoder (a simplified sketch is given after the figure list).
Figure 9
The RGB image of the calibration board.
Figure 10
The lidar pointcloud is white, whereas the sparser radar pointcloud is shown as colored cubes.
Figure 11
The blue sphere represents the lidar center position point, the brown cube represents the radar center position point, and the red sphere represents the aligned radar point. The colored pointcloud is the board’s parallel plane model.
Figure 12
Lidar and radar pointsets before correction.
Figure 13
Lidar and radar pointsets after correction.
Figure 14
The norms of both the lidar and radar datasets before and after correction.
Figure 15
A lidar pointcloud of a parking lot, with white spheres representing the corrected outdoor radar dataset and colored spheres representing the radar dataset before correction.
Figure 16
An RGB image of a parking lot with walking pedestrians.
Figure 17
The ground truth, with the pedestrians shown in red.
Figure 18
The 3D lidar pointcloud projected into a 2D grayscale image with 16-bit depth.
Figure 19
The 3D radar pointcloud projected into 2D grayscale lines with 16-bit depth.
Figure 20
The fusion of the three images corresponding to lidar, radar, and RGB is shown in red.
Figure 21
The IoU and pixel accuracy for the training mode.
Figure 22
The entropy loss.
Figure 23
The blue line represents the loss in training mode, and the orange line indicates the loss in validation mode.
Figure 24
The IoU and pixel accuracy for the testing mode.
Figure 25
The overlapping area between the ground truth and the model’s output is displayed in red. The ground truth is depicted in blue, and the white spots are parts of the model’s output that do not overlap. The faint blue over the pedestrians indicates very weak detection by the model due to its overfitting.
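
As noted in the caption of Figure 8, the fusion model can be pictured as three SegNet-style encoder branches (one per sensor modality), a fusion stage, and an up-sampling decoder that recovers the pedestrian segmentation map. The PyTorch sketch below is a simplified illustration under assumed layer widths and input sizes; it is not the authors' network, and the 1x1 convolution used here merely stands in for the paper's fully connected fusion stage.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """A very small SegNet-like encoder branch (conv + batch norm + pooling)."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)

class FusionSegmenter(nn.Module):
    """Three encoder branches (RGB, lidar, radar), a per-pixel fusion layer,
    and an up-sampling decoder producing a pedestrian segmentation map."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.rgb_enc = TinyEncoder(3)    # 3-channel RGB image
        self.lidar_enc = TinyEncoder(1)  # lidar projected to a 1-channel depth image
        self.radar_enc = TinyEncoder(1)  # radar projected to a 1-channel image
        # 1x1 convolution: a per-pixel stand-in for the fully connected fusion stage
        self.fuse = nn.Conv2d(32 * 3, 64, kernel_size=1)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, num_classes, 3, padding=1),
        )

    def forward(self, rgb, lidar, radar):
        # Concatenate the three encoded modalities along the channel axis.
        f = torch.cat([self.rgb_enc(rgb),
                       self.lidar_enc(lidar),
                       self.radar_enc(radar)], dim=1)
        return self.decoder(torch.relu(self.fuse(f)))

# Hypothetical usage with 480x480 inputs:
# model = FusionSegmenter()
# out = model(torch.rand(1, 3, 480, 480),
#             torch.rand(1, 1, 480, 480),
#             torch.rand(1, 1, 480, 480))  # -> (1, 2, 480, 480)
```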
