Sci Rep. 2022 Jul 18;12(1):12229.
doi: 10.1038/s41598-022-16415-9.

Offset-decoupled deformable convolution for efficient crowd counting


Xin Zhong et al.

Abstract

Crowd counting is a challenging problem in computer vision, and one of its most critical difficulties is handling scale variation. CNN-based methods achieve better performance than other approaches; however, limited by fixed geometric structures, they cannot fully capture head-scale features. Deformable convolution, which augments convolution with learned sampling offsets, is widely used in image classification and pattern recognition because it successfully exploits spatial information. However, because the offset parameters are randomly generated at network initialization, the sampling points of a deformable convolution are stacked in disorder, weakening feature extraction. To address this ineffective learning of offsets and inefficient use of deformable convolution, an offset-decoupled deformable convolution (ODConv) is proposed in this paper. It fully captures information within the effective region of the sampling points, leading to better performance. In extensive experiments, our method achieves average MAEs of 62.3, 8.3, 91.9, and 159.3 on the ShanghaiTech A, ShanghaiTech B, UCF-QNRF, and UCF_CC_50 datasets, respectively, outperforming state-of-the-art methods and validating the effectiveness of the proposed ODConv.
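As the Figure 3 caption below describes, ODConv obtains the offsets as the product of a pre_offset map and a scale map rather than predicting them in one entangled tensor. The following is a minimal PyTorch sketch of that decoupling, assuming torchvision's DeformConv2d as the deformable operator; the layer names and the sigmoid bound on the scale map are illustrative assumptions, not the authors' released code.

# Minimal sketch of offset decoupling, assuming torchvision's DeformConv2d.
# The sigmoid bound on the scale map is an illustrative assumption: it keeps
# offset magnitudes small so sampling points do not scatter at initialization.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ODConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # Direction of each sampling-point shift: 2 channels (x, y) per
        # kernel position, predicted from the input feature.
        self.pre_offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        # Magnitude of each shift, bounded to (0, 1) by a sigmoid.
        self.scale = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        # Offsets are the product of the pre_offset map and the scale map,
        # as in Figure 3, instead of a single randomly initialized prediction.
        offset = self.pre_offset(x) * torch.sigmoid(self.scale(x))
        return self.deform(x, offset)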


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Visualization of density maps predicted from models trained with DConv. (a) is one of the input images from the ShanghaiTech B dataset, (b) is the ground truth, and (c) is the estimated density map, which shows less regular Gaussian blobs.
Figure 2
An illustration of a conventional DConv, in which the offsets are obtained directly from the input feature.
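For contrast with ODConv, a conventional DConv predicts the full offset tensor directly from the input feature with a single convolution. A minimal sketch, again assuming torchvision's DeformConv2d (the channel width 512 is an arbitrary placeholder, not a value from the paper):

import torch.nn as nn
from torchvision.ops import DeformConv2d

k = 3
# One convolution predicts all 2*k*k offset channels straight from the input,
# so direction and magnitude are entangled in a single randomly initialized map.
offset_conv = nn.Conv2d(512, 2 * k * k, k, padding=k // 2)
deform = DeformConv2d(512, 512, k, padding=k // 2)
# forward pass: y = deform(x, offset_conv(x))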
Figure 3
Illustration of our ODConv. The scale map and the pre_offset map are represented by blue and orange parallelograms, respectively. The offsets are obtained from the product of the pre_offset map and the scale map.
Figure 4
Conventional offset-based deformable convolution is presented in (a), and the learning process of offsets in offset-decoupled deformable convolution is illustrated in (b) and (c). Sampling points are represented by balls: typical convolution sampling points (a–i) are pink, and actual sampling points (A–I) are red. Offsets are indicated by dark red arrows.
Figure 5
The architecture of the ODConv network. The backbone of CSRNet is replaced with VGG16-BN, inserting a batch normalization layer after each dilated convolution. The last dilated convolution layer is then replaced with offset-decoupled deformable convolution, and the resulting network is defined as our ODConv.
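A hedged sketch of the backend this caption describes, reusing the ODConvSketch layer from the sketch after the abstract. The channel widths (512-512-512-256-128-64) follow the published CSRNet backend and are an assumption about this network's exact configuration, not a detail stated here:

import torch.nn as nn
# ODConvSketch: the illustrative layer defined in the sketch after the abstract.

def dilated_bn_block(in_ch, out_ch):
    # Dilated 3x3 convolution followed by the inserted batch normalization.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

backend = nn.Sequential(
    dilated_bn_block(512, 512),
    dilated_bn_block(512, 512),
    dilated_bn_block(512, 512),
    dilated_bn_block(512, 256),
    dilated_bn_block(256, 128),
    ODConvSketch(128, 64),  # last dilated conv replaced by the ODConv layer
    nn.Conv2d(64, 1, 1),    # 1x1 head producing the single-channel density map
)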
Figure 6
Visualization of an image from the ShanghaiTech B dataset. The first column shows one of the samples and its ground truth, denoted as (a) and (b), respectively. The density maps predicted by DConv and ODConv are shown in (c) and (e), and the visualizations of the offsets of DConv and ODConv are presented in (d) and (f).
Figure 7
Training curves of ODConv and DConv on UCF-QNRF. The solid orange line indicates training with Ls and Lp, and the dotted gray line indicates training without Ls and Lp.
Figure 8
Comparisons of DConv and our ODConv with different scale-map weights and decay rates.
Figure 9
Comparisons of ODConv and DConv on ResNet-50 and CSRNet are shown in (a) and (b), respectively, and a comparison of ODConv on CSRNet versus ResNet-50 is shown in (c). Significance levels are indicated by the crimson characters at the top of each panel.

