Sensors (Basel). 2023 Dec 24;24(1):95.
doi: 10.3390/s24010095.

A Fast Attention-Guided Hierarchical Decoding Network for Real-Time Semantic Segmentation

Xuegang Hu et al. Sensors (Basel).

Abstract

Semantic segmentation provides accurate scene understanding and decision support for many applications. However, many models pursue high accuracy through complex structures, which lowers inference speed and makes it difficult to meet real-time requirements. To address this issue, we propose a fast attention-guided hierarchical decoding network for real-time semantic segmentation (FAHDNet), built on an asymmetric U-shaped structure. In the encoder, we design a multi-scale bottleneck residual unit (MBRU), which combines an attention mechanism with decomposed convolutions in a parallel structure to aggregate multi-scale information, allowing the network to handle information at different scales more effectively. In addition, we propose a spatial information compensation (SIC) module that exploits the original input to recover the spatial texture information lost during downsampling. In the decoder, a global attention (GA) module processes the encoder's feature maps, strengthening feature interaction across the channel and spatial dimensions and improving the network's ability to mine feature information. Meanwhile, a lightweight hierarchical decoder fuses multi-scale features to better adapt to targets of different scales and accurately segment objects of different sizes. Experiments show that FAHDNet performs outstandingly on two public datasets, Cityscapes and CamVid: it achieves 70.6% mean intersection over union (mIoU) at 135 frames per second (FPS) on Cityscapes and 67.2% mIoU at 335 FPS on CamVid. Compared with existing networks, our model maintains accuracy while achieving faster inference, enhancing its practical usability.
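The paper does not include source code, so as a hedged illustration of the "decomposition convolution" idea used in the MBRU (factorizing a k×k convolution into a k×1 pass followed by a 1×k pass to cut parameters from k² to 2k), here is a minimal NumPy sketch. The kernels, input, and helper names are illustrative, not from the paper; the factorization is exact only for rank-1 kernels, which is why decomposed convolutions are an approximation in general.

```python
import numpy as np

def conv2d(x, kernel):
    """Valid-mode 2D cross-correlation on a single channel."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

# A rank-1 k x k kernel factorizes exactly into a k x 1 and a 1 x k kernel.
col = np.array([[1.0], [2.0], [1.0]])   # k x 1 (vertical pass)
row = np.array([[1.0, 0.0, -1.0]])      # 1 x k (horizontal pass)
full = col @ row                        # the equivalent 3 x 3 kernel

x = np.random.default_rng(0).standard_normal((8, 8))
direct = conv2d(x, full)                    # one 3x3 conv: 9 weights
factored = conv2d(conv2d(x, col), row)      # 3x1 then 1x3: 6 weights

assert np.allclose(direct, factored)
```

The same parameter saving is what makes decomposed convolutions attractive in real-time networks: for k = 3 the pair uses 6 weights instead of 9, and the gap widens for larger kernels.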

Keywords: attention mechanism; encoder–decoder network; feature fusion; real-time semantic segmentation.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1
(a) A double-branch structure; (b) an asymmetric encoder–decoder structure; (c) a U-shaped structure. Yellow denotes the encoding part and blue the decoding part.
Figure 2
Model inference speed and segmentation accuracy on the Cityscapes dataset. In the diagram, the red triangle represents our method, the blue dots represent other methods, and the red dashed line represents the minimum requirements for real-time semantic segmentation.
Figure 3
The overall architecture of our proposed FAHDNet. 'C' denotes concatenation and '+' denotes pixel-wise addition. The blue box represents the downsampling backbone of the network, while the black dashed arrow indicates a path that exists only during training and is discarded at inference.
Figure 4
(a) The DAB module in DABNet, (b) the SS-nbt module in LEDNet, (c) the EAR module in MSCFNet, and (d) our MBRU module. 'C' denotes the number of channels, and 'R' denotes a dilated convolution with dilation rate R.
Figure 5
SCA module. 'Avgpool' denotes average pooling and 'Sigmoid' the sigmoid activation function. 'Compress' refers to compressing features, while 'extend' means extending features.
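The 'compress'/'extend' bottleneck combined with average pooling and a sigmoid gate is the classic squeeze-and-excitation pattern for channel attention. The paper's exact SCA wiring is not reproduced here, so the following is a minimal NumPy sketch under that assumption; all weight shapes, the reduction ratio, and the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w_compress, w_extend):
    """SE-style channel attention gate (illustrative, not the paper's code).
    x: (C, H, W); w_compress: (C//r, C); w_extend: (C, C//r)."""
    squeezed = x.mean(axis=(1, 2))       # 'Avgpool': one scalar per channel
    hidden = np.maximum(w_compress @ squeezed, 0.0)   # 'compress' + ReLU
    gate = sigmoid(w_extend @ hidden)                 # 'extend' + 'Sigmoid'
    return x * gate[:, None, None]       # rescale each channel by its gate

rng = np.random.default_rng(1)
C, r = 8, 2
x = rng.standard_normal((C, 6, 6))
out = channel_attention(x,
                        rng.standard_normal((C // r, C)),
                        rng.standard_normal((C, C // r)))
assert out.shape == x.shape
```

Because the sigmoid gate lies in (0, 1), the module can only attenuate channels, never amplify them, which is what makes it a cheap recalibration step rather than a full transformation.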
Figure 6
SIC module. Here, X_ori and X_down denote the original input feature and the downsampled feature, respectively.
Figure 7
GA module. 'Avgpool' denotes average pooling, 'Softmax' the softmax activation function, and 'View' a reshaping of the feature dimensions.
Figure 8
(a) The FW module and (b) the segmentation head.
Figure 9
Visual comparisons on the Cityscapes validation set. From left to right: original image, ground truth, and segmentation outputs from ESPNet, LARFNet, and our FAHDNet. White dashed boxes highlight the important contrasting regions.
Figure 10
Visual comparisons on the CamVid validation set. From left to right: original image, ground truth, and segmentation outputs from ESPNet, LARFNet, and our FAHDNet. White dashed boxes highlight the important contrasting regions.

