Improved object detection method for unmanned driving based on Transformers
- PMID: 38752022
- PMCID: PMC11094364
- DOI: 10.3389/fnbot.2024.1342126
Abstract
Object detection is a core technology in the perception module of unmanned driving systems, widely used to detect vehicles, pedestrians, traffic signs, and other objects. However, existing object detection methods still face three challenges in complex unmanned driving scenarios: unsatisfactory performance in multi-scale object detection, inadequate accuracy in detecting small objects, and false positives and missed detections in densely occluded environments. Therefore, this study proposes an improved Transformer-based object detection method for unmanned driving to address these challenges. First, a multi-scale Transformer feature extraction method integrated with channel attention enhances the network's ability to extract features across different scales. Second, a training method incorporating query denoising with Gaussian decay improves the network's ability to learn representations of small objects. Third, a hybrid matching method combining the Optimal Transport and Hungarian algorithms matches predictions to ground truth, providing the network with more informative positive-sample features. Experimental evaluations on datasets including KITTI demonstrate that the proposed method achieves a mean Average Precision (mAP) 3% higher than that of existing methods.
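As a rough illustration of the first two components, the sketch below (PyTorch) shows a squeeze-and-excitation style channel attention block and a Gaussian-profile noise-decay schedule of the kind used in query denoising. All names, the module design, and the decay schedule are hypothetical assumptions; the paper's exact formulations are not given in this abstract.

```python
import math
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: a common design,
    not necessarily the exact module used in the paper."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # reweight feature channels

def gaussian_decay_sigma(step: int, total_steps: int, sigma0: float = 0.4) -> float:
    """Hypothetical noise scale for query denoising that decays with a
    Gaussian profile over training (the paper's schedule may differ)."""
    return sigma0 * math.exp(-0.5 * (step / total_steps) ** 2)

# Noisy ground-truth boxes as denoising queries (cx, cy, w, h in [0, 1]):
boxes = torch.rand(8, 4)
noisy = boxes + gaussian_decay_sigma(step=500, total_steps=10_000) * torch.randn_like(boxes)
```

For the third component, the Hungarian half of the proposed hybrid matcher corresponds to a standard minimum-cost bipartite assignment, which SciPy's `linear_sum_assignment` solves directly; the Optimal Transport half (e.g., a Sinkhorn-style solver for one-to-many positives) is omitted here. The cost matrix below is a toy example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows are predicted queries, columns are ground-truth boxes
# (in practice the cost combines classification and box-regression terms).
cost = np.array([[0.9, 0.1, 0.5],
                 [0.4, 0.8, 0.2]])
rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one matching
print([(int(r), int(c)) for r, c in zip(rows, cols)])  # -> [(0, 1), (1, 2)]
```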
Keywords: Transformer; feature extraction; object detection; optimal transport; query denoising.
Copyright © 2024 Zhao, Peng, Wang, Li, Pan, Su and Liu.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.