Sensors (Basel). 2023 Apr 21;23(8):4167.
doi: 10.3390/s23084167.

A Preliminary Study of Deep Learning Sensor Fusion for Pedestrian Detection

Alfredo Chávez Plascencia et al. Sensors (Basel). 2023.

Abstract

Most pedestrian detection methods focus on bounding boxes and are based on fusing RGB with lidar. These methods do not reflect how the human eye perceives objects in the real world. Furthermore, lidar and vision can have difficulty detecting pedestrians in cluttered environments, and radar can be used to overcome this problem. Therefore, the motivation of this work is to explore, as a preliminary step, the feasibility of fusing lidar, radar, and RGB for pedestrian detection, with potential use in autonomous driving, using a fully connected convolutional neural network architecture for multimodal sensors. The core of the network is based on SegNet, a pixel-wise semantic segmentation network. In this context, lidar and radar were incorporated by transforming their 3D pointclouds into 2D grayscale images with 16-bit depth, and RGB images were incorporated with their three channels. The proposed architecture uses a single SegNet for each sensor reading, and the outputs are then fed to a fully connected neural network that fuses the three sensor modalities. Afterwards, an up-sampling network is applied to recover the fused data. Additionally, a custom dataset of 60 images was proposed for training the architecture, with an additional 10 for evaluation and 10 for testing, giving a total of 80 images. The experimental results show a training mean pixel accuracy of 99.7% and a training mean intersection over union (IoU) of 99.5%, while the testing mean IoU was 94.4% and the testing pixel accuracy was 96.2%. These results demonstrate the effectiveness of using semantic segmentation for pedestrian detection under the three sensor modalities. Despite some overfitting during experimentation, the model performed well in detecting people in test mode. It is therefore worth emphasizing that the focus of this work is to show that the method is feasible, as it works regardless of the dataset size; a larger dataset would nonetheless be necessary for more appropriate training. The method offers the advantage of detecting pedestrians as the human eye does, thereby resulting in less ambiguity. This work also proposes an extrinsic calibration matrix method for sensor alignment between radar and lidar based on singular value decomposition.
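
The radar-to-lidar extrinsic calibration mentioned above amounts to estimating a rigid transform (a rotation R and translation t, as in Figure 2) between two sets of matched points via singular value decomposition. The sketch below shows the classic SVD-based (Kabsch/Procrustes) solution to that problem; the function and array names and the use of NumPy are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def svd_rigid_alignment(radar_pts: np.ndarray, lidar_pts: np.ndarray):
    """Estimate R, t such that R @ radar_pts[i] + t ≈ lidar_pts[i].

    radar_pts, lidar_pts: (N, 3) arrays of matched points, e.g. the
    corner-reflector centers detected by each sensor (hypothetical input).
    """
    # Center both point sets on their centroids.
    mu_r = radar_pts.mean(axis=0)
    mu_l = lidar_pts.mean(axis=0)
    P = radar_pts - mu_r
    Q = lidar_pts - mu_l

    # Cross-covariance matrix and its SVD.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)

    # Rotation, with a reflection fix so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T

    # Translation from the radar frame to the lidar frame.
    t = mu_l - R @ mu_r
    return R, t

# Hypothetical usage: express radar detections in the lidar frame.
# R, t = svd_rigid_alignment(radar_pts, lidar_pts)
# radar_in_lidar = (R @ radar_pts.T).T + t
```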

Keywords: autonomous driving; convolutional neural networks; sensor calibration; sensor fusion.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
The sensors used on the testbench: (a) an L3CAM sensor consisting of a lidar and an RGB camera on the top and (b) a UMRR-96 Type 153 radar sensor at the bottom.
Figure 2
A schematic overview of the three frames: lidar, radar, and camera. The rotation matrix R and translation vector t from the radar frame to the lidar frame are also shown.
Figure 3
The styrofoam calibration board, with black dashed lines indicating the location of the corner reflector, which was placed at the center of the back of the board.
Figure 4
A custom trihedral corner reflector made of copper plates, showing the side-length edge of the three isosceles triangles (a) and the base of the triangles (L).
Figure 5
The lidar and radar frames detecting the board center and the corner reflector, respectively.
Figure 6
The centers of the four circles, the two lines (l1, l2), and the middle point (x, y).
Figure 7
The pixel-wise semantic segmentation SegNet CNN used in the article. The encoder is on the left, and the decoder is on the right.
Figure 8
The architecture model consists of three encoder sub-networks, one for each sensor, a fully connected neural network, and a decoder (a simplified sketch is given after the figure list).
Figure 9
The RGB image of the calibration board.
Figure 10
The lidar pointcloud is white, whereas the sparser radar pointcloud is shown as colored cubes.
Figure 11
The blue sphere represents the lidar center position point, the brown cube represents the radar center position point, and the red sphere represents the aligned radar point. The colored pointcloud is the board’s parallel plane model.
Figure 12
Lidar and radar pointsets before correction.
Figure 13
Lidar and radar pointsets after correction.
Figure 14
The norms of both the lidar and radar datasets before and after correction.
Figure 15
A lidar pointcloud of a parking lot, with white spheres representing the corrected outdoor radar dataset and colored spheres representing the radar dataset before correction.
Figure 16
An RGB image of a parking lot with walking pedestrians.
Figure 17
The ground truth, with the pedestrians shown in red.
Figure 18
The 3D lidar pointcloud projected into a 2D grayscale image with 16-bit depth.
Figure 19
The 3D radar pointcloud projected into 2D grayscale lines with 16-bit depth.
Figure 20
The fusion of the three images corresponding to lidar, radar, and RGB is shown in red.
Figure 21
The IoU and pixel accuracy for the training mode.
Figure 22
The entropy loss.
Figure 23
The blue line represents the loss in training mode, and the orange line indicates the loss in validation mode.
Figure 24
The IoU and pixel accuracy for the testing mode.
Figure 25
The overlapping area between the ground truth and the model’s output is displayed in red. The ground truth is depicted in blue, and the white spots are parts of the model’s output that do not overlap. The faint blue over the pedestrians indicates very weak detection by the model due to its overfitting.
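
As noted in the caption of Figure 8, the fusion model can be pictured as three SegNet-style encoder branches (one per sensor modality), a fusion stage, and an up-sampling decoder that recovers the pedestrian segmentation map. The PyTorch sketch below is a simplified illustration under assumed layer widths and input sizes; it is not the authors' network, and the 1x1 convolution used here merely stands in for the paper's fully connected fusion stage.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """A very small SegNet-like encoder branch (conv + batch norm + pooling)."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)

class FusionSegmenter(nn.Module):
    """Three encoder branches (RGB, lidar, radar), a per-pixel fusion layer,
    and an up-sampling decoder producing a pedestrian segmentation map."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.rgb_enc = TinyEncoder(3)    # 3-channel RGB image
        self.lidar_enc = TinyEncoder(1)  # lidar projected to a 1-channel depth image
        self.radar_enc = TinyEncoder(1)  # radar projected to a 1-channel image
        # 1x1 convolution: a per-pixel stand-in for the fully connected fusion stage
        self.fuse = nn.Conv2d(32 * 3, 64, kernel_size=1)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, num_classes, 3, padding=1),
        )

    def forward(self, rgb, lidar, radar):
        # Concatenate the three encoded modalities along the channel axis.
        f = torch.cat([self.rgb_enc(rgb),
                       self.lidar_enc(lidar),
                       self.radar_enc(radar)], dim=1)
        return self.decoder(torch.relu(self.fuse(f)))

# Hypothetical usage with 480x480 inputs:
# model = FusionSegmenter()
# out = model(torch.rand(1, 3, 480, 480),
#             torch.rand(1, 1, 480, 480),
#             torch.rand(1, 1, 480, 480))  # -> (1, 2, 480, 480)
```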
