Sensors (Basel). 2022 Mar 29;22(7):2623. doi: 10.3390/s22072623.

Semantic Segmentation Using Pixel-Wise Adaptive Label Smoothing via Self-Knowledge Distillation for Limited Labeling Data

Sangyong Park et al. Sensors (Basel). 2022.

Abstract

To achieve high performance, most deep convolutional neural networks (DCNNs) require a large amount of training data with ground-truth labels. However, creating ground-truth labels for semantic segmentation takes more time, human effort, and cost than for tasks such as classification and object detection, because a label is required for every pixel in an image. Hence, there is a practical demand for training semantic segmentation DCNNs with a limited amount of data. In general, training DCNNs on limited data is problematic because the networks easily overfit the training data, which reduces their accuracy. Here, we propose a new regularization method called pixel-wise adaptive label smoothing (PALS) via self-knowledge distillation to stably train semantic segmentation networks in the practical situation in which only a limited amount of training data is available. To mitigate the problem caused by limited training data, our method fully exploits the internal statistics of pixels within an input image: it generates a pixel-wise aggregated probability distribution using a similarity matrix that encodes the affinities between all pairs of pixels. To further increase accuracy, we add one-hot-encoded distributions of the ground-truth labels to these aggregated distributions to obtain the final soft labels. We demonstrate the effectiveness of our method on the Cityscapes and Pascal VOC2012 datasets using 10%, 30%, 50%, and 100% of the training data. Across various quantitative and qualitative comparisons, our method yields more accurate results than previous methods. Specifically, on the Cityscapes test set, our method achieved mIoU improvements of 0.076%, 1.848%, 1.137%, and 1.063% for 10%, 30%, 50%, and 100% of the training data, respectively, compared with cross-entropy loss using one-hot-encoded ground-truth labels.
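The soft-label construction described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `pals_soft_labels`, the cosine-similarity affinity, the softmax normalization of the similarity matrix, and the fixed mixing weight `alpha` are all assumptions made for illustration (the paper varies the weighting over training iterations, and real feature maps would be far larger than this flattened toy example).

```python
import numpy as np

def pals_soft_labels(features, one_hot, alpha):
    """Sketch of pixel-wise adaptive label smoothing.

    features: (N, D) per-pixel feature vectors (pixels flattened to N rows)
    one_hot:  (N, C) one-hot-encoded ground-truth labels
    alpha:    mixing weight in [0, 1] for the ground-truth term
    Returns an (N, C) array of soft labels.
    """
    # Normalize features and build the pairwise (cosine) similarity matrix.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T  # (N, N) affinities between all pairs of pixels

    # Turn similarities into row-stochastic aggregation weights (softmax).
    weights = np.exp(sim - sim.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)

    # Aggregate label distributions over similar pixels.
    aggregated = weights @ one_hot  # (N, C)

    # Weighted sum of the one-hot ground truth and the aggregated distribution.
    return alpha * one_hot + (1.0 - alpha) * aggregated
```

Because each row of `weights` and each row of `one_hot` sums to one, every resulting soft label is itself a valid probability distribution, so it can be used directly as the target of a cross-entropy or KL-divergence loss.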

Keywords: limited training data; regularization; self-knowledge distillation; semantic segmentation.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
A schematic flowchart of our method. Our method aggregates distributions based on pair-wise feature similarity and generates a pixel-wise soft label as a weighted sum of the one-hot-encoded ground-truth label and the aggregated distribution for each pixel, with weights that vary over training iterations.
Figure 2
Comparative results of methods trained using various ratios of limited training data (10%, 30%, 50%, and 100%). The value below each result is the mIoU.
Figure 3
Overview of the proposed method, which is categorized into training and test paths. Blue and red arrows represent training and test paths, respectively.
Figure 4
Process of our PALS module.
Figure 5
Process of PA module, where (·) denotes the downsampling operation.
Figure 6
Results of the comparison of various methods using limited training data for DeepLab-V3+ [10] with the Xception65 [76] network on the Cityscapes dataset. (a) Input image. (b) Ground-truth image. (c) CE [10] result. (d) CP [22] result. (e) LS [20] result. (f) Our result.
Figure 7
Results of the comparison of various methods using limited training data for DeepLab-V3+ [10] with the ResNet18 [77] network on the Cityscapes dataset. (a) Input image. (b) Ground-truth image. (c) CE [10] result. (d) CP [22] result. (e) LS [20] result. (f) Our result.
Figure 8
Results of the comparison of various methods using limited training data for DeepLab-V3+ [10] with the Xception65 [76] network on the Pascal VOC2012 dataset. (a) Input image. (b) Ground-truth image. (c) CE [10] result. (d) CP [22] result. (e) LS [20] result. (f) Our result.

References

    1. Zeng W., Luo W., Suo S., Sadat A., Yang B., Casas S., Urtasun R. End-To-End Interpretable Neural Motion Planner; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Long Beach, CA, USA. 15–20 June 2019; pp. 8652–8661.
    2. Philion J., Fidler S. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D; Proceedings of the European Conference on Computer Vision (ECCV); Glasgow, UK. 23–28 August 2020.
    3. Cherabier I.F., Schönberger J.L., Oswald M.R., Pollefeys M., Geiger A. Learning Priors for Semantic 3D Reconstruction; Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany. 8–14 September 2018; pp. 314–330.
    4. Ronneberger O., Fischer P., Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation; Proceedings of Medical Image Computing and Computer-Assisted Intervention (MICCAI); Munich, Germany. 5–9 October 2015; pp. 234–241.
    5. Srivastava A., Jha D., Chanda S., Pal U., Johansen H.D., Johansen D., Riegler M.A., Ali S., Halvorsen P. MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation. arXiv. 2021. arXiv:2105.07451. doi: 10.1109/JBHI.2021.3138024.
