Improved object detection method for unmanned driving based on Transformers

Huaqi Zhao et al. Front Neurorobot. 2024 May 1;18:1342126. doi: 10.3389/fnbot.2024.1342126. eCollection 2024.
Abstract

Object detection is the core technology of the perception module in unmanned driving and is widely used to detect vehicles, pedestrians, traffic signs, and other objects. However, existing object detection methods still face three challenges in complex unmanned driving scenarios: unsatisfactory multi-scale object detection, inadequate accuracy on small objects, and false positives and missed detections in densely occluded scenes. This study therefore proposes an improved Transformer-based object detection method for unmanned driving that addresses these challenges. First, a multi-scale Transformer feature extraction method fused with channel attention strengthens the network's ability to extract features across different scales. Second, a query-denoising training method with Gaussian decay improves the network's ability to learn representations of small objects. Third, a hybrid matching method combining optimal transport and the Hungarian algorithm matches predictions to ground truth, supplying the network with more informative positive-sample features. Experiments on datasets including KITTI show that the proposed method achieves a mean Average Precision (mAP) 3% higher than that of existing methods.
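
To make the channel-attention component concrete, below is a minimal PyTorch sketch of a squeeze-and-excitation style channel attention block of the kind commonly fused with multi-scale features. The module name, reduction ratio, and how it would be wired into the paper's Transformer backbone are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Illustrative squeeze-and-excitation style channel attention.

        Re-weights each channel of a feature map by a learned scalar in [0, 1];
        hyper-parameters are placeholders, not the paper's settings.
        """

        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)          # global spatial average
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),                            # per-channel weights
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = x.shape
            w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
            return x * w                                 # channel-wise re-weighting

In a multi-scale setting, one such block could be applied to each feature level before the features are flattened and passed to the Transformer encoder.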
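The query-denoising idea, perturbing ground-truth boxes during training and asking the decoder to reconstruct them, can be sketched as follows. How the Gaussian decay is actually applied is defined in the paper (see Figures 3-4); the sketch below simply assumes it modulates the noise amplitude as a function of box scale, and all function names and hyper-parameters are hypothetical.

    import torch

    def gaussian_decay_noise(boxes: torch.Tensor,
                             sigma: float = 0.2,
                             base_amplitude: float = 0.4) -> torch.Tensor:
        """Hypothetical sketch of Gaussian-decayed box noising for query denoising.

        boxes: (N, 4) ground-truth boxes in normalized (cx, cy, w, h) format.
        The noise amplitude is damped by a Gaussian of the box scale, so the
        perturbation applied to each box depends on its size. The direction of
        the decay and all hyper-parameters are illustrative assumptions.
        """
        scale = (boxes[:, 2] * boxes[:, 3]).sqrt()               # box-size proxy
        decay = torch.exp(-(scale ** 2) / (2 * sigma ** 2))      # Gaussian decay term
        amplitude = (base_amplitude * decay).unsqueeze(-1)       # per-box noise amplitude
        jitter = (torch.rand_like(boxes) * 2 - 1) * amplitude * boxes
        return (boxes + jitter).clamp(0.0, 1.0)                  # keep boxes valid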
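The hybrid matching step can likewise be sketched on a query-to-ground-truth cost matrix: a Hungarian assignment provides the usual one-to-one pairs, while an optimal-transport-style step, approximated here by a simple per-ground-truth top-k selection, supplies extra positive samples. The top-k stand-in and the `k` parameter are illustrative assumptions; the paper's exact transport formulation and the weighting between the two branches are not reproduced.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def hybrid_match(cost: np.ndarray, k: int = 3):
        """Illustrative hybrid assignment over a (num_queries, num_gt) cost matrix.

        Returns the one-to-one Hungarian pairs plus additional (query, gt) pairs
        chosen by a per-ground-truth top-k rule standing in for the
        optimal-transport branch. `k` is a made-up knob, not the paper's setting.
        """
        rows, cols = linear_sum_assignment(cost)              # Hungarian, one-to-one
        one_to_one = set(zip(rows.tolist(), cols.tolist()))

        extra = set()
        for j in range(cost.shape[1]):                        # for each ground-truth box
            topk = np.argsort(cost[:, j])[:k]                 # k cheapest queries
            extra.update((int(i), j) for i in topk)

        return one_to_one, extra - one_to_one                 # base pairs, extra positives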

Keywords: Transformer; feature extraction; object detection; optimal transport; query denoising.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1. Improved object detection method for unmanned driving based on Transformers.
Figure 2. Multi-scale Transformer feature extraction method fused with channel attention.
Figure 3. Training method for query denoising with Gaussian decay.
Figure 4. Performance of adding noise to objects of different scales.
Figure 5. Hybrid matching method based on optimal transport and the Hungarian algorithm.
Figure 6. Comparison of matching results between Hungarian matching and optimal transport matching.
Figure 7. Parameter analysis of the number of channel attention modules.
Figure 8. Parameter analysis of the training method for query denoising with Gaussian decay.
Figure 9. Weight parameter analysis of the hybrid matching method.
Figure 10. AP convergence of the proposed method on the COCO-driving dataset. (A) mAP of all objects. (B) mAP of small objects.
Figure 11. Loss curves of the proposed method and the original method on the COCO-driving dataset.
Figure 12. mAP convergence curves of the proposed method and other object detection methods. (A) mAP-all convergence curves. (B) mAP-small convergence curves.
Figure 13. Comparison of detection performance on small objects in the COCO-driving dataset. (A) Original image; (B) DN-DAB-DETR; (C) Deformable-DETR; (D) Sparse-RCNN; (E) YOLOX; (F) YOLOv7; (G) DINO; (H) the proposed method.
Figure 14. Comparison of detection performance on densely occluded objects in the COCO-driving dataset. (A) Original image; (B) DN-DAB-DETR; (C) Deformable-DETR; (D) Sparse-RCNN; (E) YOLOX; (F) YOLOv7; (G) DINO; (H) the proposed method.
Figure 15. Comparison of detection performance on small and densely occluded objects in the WiderPerson dataset. (A) Original image; (B) DN-DAB-DETR; (C) Faster-RCNN; (D) Sparse-RCNN; (E) YOLOX; (F) YOLOv7; (G) DINO; (H) the proposed method.
Figure 16. Comparison of detection performance on the Waymo Open dataset. (A) Original image; (B) DN-DAB-DETR; (C) Deformable-DETR; (D) Sparse-RCNN; (E) YOLOX; (F) YOLOv7; (G) DINO; (H) the proposed method.
Figure 17. Comparison of detection performance on the KITTI dataset. (A) Original image; (B) DN-DAB-DETR; (C) Faster-RCNN; (D) Sparse-RCNN; (E) YOLOX; (F) YOLOv7; (G) DINO; (H) the proposed method.
