Sensors (Basel). 2023 Dec 24;24(1):95.
doi: 10.3390/s24010095.

A Fast Attention-Guided Hierarchical Decoding Network for Real-Time Semantic Segmentation

Xuegang Hu et al. Sensors (Basel).

Abstract

Semantic segmentation provides accurate scene understanding and decision support for many applications. However, many models pursue high accuracy through complex structures, which lowers inference speed and makes it difficult to meet real-time requirements. To address this issue, we propose a fast attention-guided hierarchical decoding network for real-time semantic segmentation (FAHDNet), built on an asymmetric U-shaped structure. In the encoder, we design a multi-scale bottleneck residual unit (MBRU), which combines an attention mechanism with decomposed convolutions in a parallel structure to aggregate multi-scale information, allowing the network to handle information at different scales more effectively. In addition, we propose a spatial information compensation (SIC) module that exploits the original input to recover the spatial texture information lost during downsampling. In the decoder, a global attention (GA) module processes the encoder's feature maps, strengthening feature interaction across the channel and spatial dimensions and improving the network's ability to mine feature information. Meanwhile, a lightweight hierarchical decoder fuses multi-scale features to better adapt to targets of different scales and accurately segment objects of different sizes. Experiments show that FAHDNet performs outstandingly on two public datasets, Cityscapes and CamVid: it achieves 70.6% mean intersection over union (mIoU) at 135 frames per second (FPS) on Cityscapes and 67.2% mIoU at 335 FPS on CamVid. Compared with existing networks, our model maintains accuracy while achieving faster inference, enhancing its practical usability.
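The paper does not include source code, so as a hedged illustration of the "decomposition convolution" idea used in the MBRU (factorizing a k×k convolution into a k×1 pass followed by a 1×k pass to cut parameters from k² to 2k), here is a minimal NumPy sketch. The kernels, input, and helper names are illustrative, not from the paper; the factorization is exact only for rank-1 kernels, which is why decomposed convolutions are an approximation in general.

```python
import numpy as np

def conv2d(x, kernel):
    """Valid-mode 2D cross-correlation on a single channel."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

# A rank-1 k x k kernel factorizes exactly into a k x 1 and a 1 x k kernel.
col = np.array([[1.0], [2.0], [1.0]])   # k x 1 (vertical pass)
row = np.array([[1.0, 0.0, -1.0]])      # 1 x k (horizontal pass)
full = col @ row                        # the equivalent 3 x 3 kernel

x = np.random.default_rng(0).standard_normal((8, 8))
direct = conv2d(x, full)                    # one 3x3 conv: 9 weights
factored = conv2d(conv2d(x, col), row)      # 3x1 then 1x3: 6 weights

assert np.allclose(direct, factored)
```

The same parameter saving is what makes decomposed convolutions attractive in real-time networks: for k = 3 the pair uses 6 weights instead of 9, and the gap widens for larger kernels.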

Keywords: attention mechanism; encoder–decoder network; feature fusion; real-time semantic segmentation.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1
(a) A double-branch structure; (b) an asymmetric encoder–decoder structure; (c) a U-shaped structure. Yellow denotes the encoding part and blue the decoding part.
Figure 2
Model inference speed and segmentation accuracy on the Cityscapes dataset. In the diagram, the red triangle represents our method, the blue dots represent other methods, and the red dashed line represents the minimum requirements for real-time semantic segmentation.
Figure 3
The overall architecture of our proposed FAHDNet. 'C' denotes concatenation and '+' denotes pixel-wise addition. The blue box represents the downsampling backbone of the network, while the black dashed arrow indicates a path that exists only during training and is discarded at inference.
Figure 4
(a) The DAB module in DABNet, (b) the SS-nbt module in LEDNet, (c) the EAR module in MSCFNet, and (d) our MBRU module. 'C' denotes the number of channels, and 'R' denotes a dilated convolution with dilation rate R.
Figure 5
SCA module. 'Avgpool' denotes average pooling and 'Sigmoid' the sigmoid activation function. 'Compress' refers to compressing features, while 'extend' means extending features.
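The 'compress'/'extend' bottleneck combined with average pooling and a sigmoid gate is the classic squeeze-and-excitation pattern for channel attention. The paper's exact SCA wiring is not reproduced here, so the following is a minimal NumPy sketch under that assumption; all weight shapes, the reduction ratio, and the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w_compress, w_extend):
    """SE-style channel attention gate (illustrative, not the paper's code).
    x: (C, H, W); w_compress: (C//r, C); w_extend: (C, C//r)."""
    squeezed = x.mean(axis=(1, 2))       # 'Avgpool': one scalar per channel
    hidden = np.maximum(w_compress @ squeezed, 0.0)   # 'compress' + ReLU
    gate = sigmoid(w_extend @ hidden)                 # 'extend' + 'Sigmoid'
    return x * gate[:, None, None]       # rescale each channel by its gate

rng = np.random.default_rng(1)
C, r = 8, 2
x = rng.standard_normal((C, 6, 6))
out = channel_attention(x,
                        rng.standard_normal((C // r, C)),
                        rng.standard_normal((C, C // r)))
assert out.shape == x.shape
```

Because the sigmoid gate lies in (0, 1), the module can only attenuate channels, never amplify them, which is what makes it a cheap recalibration step rather than a full transformation.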
Figure 6
SIC module. Here, X_ori and X_down denote the original input feature and the downsampled feature, respectively.
Figure 7
GA module. 'Avgpool' denotes average pooling, 'Softmax' the softmax activation function, and 'View' a reshaping of the feature dimensions.
Figure 8
(a) The FW module and (b) the segmentation head.
Figure 9
Visual comparisons on the Cityscapes validation set. From left to right: original image, ground truth, and segmentation outputs from ESPNet, LARFNet, and our FAHDNet. White dashed boxes highlight the important contrasting regions.
Figure 10
Visual comparisons on the CamVid validation set. From left to right: original image, ground truth, and segmentation outputs from ESPNet, LARFNet, and our FAHDNet. White dashed boxes highlight the important contrasting regions.

