Sci Rep. 2024 Sep 27;14(1):22100. doi: 10.1038/s41598-024-73853-3.

DINO-Mix enhancing visual place recognition with foundational vision model and feature mixing


Gaoshuang Huang et al.

Abstract

Using visual place recognition (VPR) technology to ascertain the geographical location of publicly available images is a pressing issue. Although most current VPR methods achieve favorable results under ideal conditions, their performance in complex environments characterized by lighting variations, seasonal changes, and occlusions is generally unsatisfactory; obtaining efficient and robust image feature descriptors in such environments therefore remains an open challenge. In this study, we trimmed and fine-tuned the DINOv2 model as a backbone to extract robust image features and employed a feature-mix module to aggregate them, yielding globally robust and generalizable descriptors that enable high-precision VPR. We experimentally demonstrate that the proposed DINO-Mix outperforms current state-of-the-art (SOTA) methods. On test sets with lighting variations, seasonal changes, and occlusions (Tokyo24/7, Nordland, and SF-XL-Testv1), our architecture achieved Top-1 accuracy rates of 91.75%, 80.18%, and 82%, respectively, an average accuracy improvement of 5.14%. In addition, we compared it with other SOTA methods using representative image retrieval case studies, and our architecture outperformed its competitors in VPR performance. Furthermore, we visualized the attention maps of DINO-Mix and other methods to provide a more intuitive understanding of their respective strengths. These visualizations serve as compelling evidence of the superiority of the DINO-Mix framework in this domain.
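For readers unfamiliar with the evaluation protocol, the following is a minimal sketch of how Top-1 accuracy is commonly computed in VPR: retrieve each query's nearest database descriptor and count a hit when the match lies within a ground-truth distance threshold. The function name, the 25 m default, and the NumPy formulation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def top1_accuracy(query_desc, db_desc, query_xy, db_xy, threshold_m=25.0):
    """query_desc: (Q, D) and db_desc: (M, D) L2-normalized descriptors;
    query_xy, db_xy: ground-truth coordinates in meters.
    threshold_m is a hypothetical localization threshold."""
    # For L2-normalized vectors, cosine similarity is a plain dot product.
    sims = query_desc @ db_desc.T          # (Q, M) similarity matrix
    top1 = sims.argmax(axis=1)             # best database match per query
    dists = np.linalg.norm(query_xy - db_xy[top1], axis=1)
    # A query counts as correct if its Top-1 match lies within the threshold.
    return float((dists <= threshold_m).mean())
```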

Keywords: DINOv2; Feature mixer; Foundational vision model; Image retrieval; Visual place recognition.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Visual place recognition (VPR) framework of DINO-Mix. The framework consists of two parts: first, the layer norm and head of the foundational vision model DINOv2 are pruned away and the remainder is used as a backbone to extract image feature vectors; the resulting N × D feature vectors are then reshaped into h × w × s feature maps, from which robust global feature vectors are obtained by feature aggregation in the Mix module. During training, the front blocks of the backbone are frozen, and only the parameters of the last few blocks and the Mix module are updated.
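As a rough illustration of this pipeline, here is a minimal PyTorch sketch assuming the publicly released DINOv2 ViT-B/14 checkpoint. The class names (DinoMix, MixModule), the number of unfrozen blocks, and the use of the normalized patch tokens exposed by the public DINOv2 API (the paper additionally prunes the final layer norm) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MixModule(nn.Module):
    """Placeholder aggregator; a fuller feature-mixer sketch follows Fig. 3."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        return self.proj(self.pool(fmap).flatten(1))

class DinoMix(nn.Module):
    def __init__(self, n_trainable_blocks: int = 4):
        super().__init__()
        # DINOv2 backbone; its head is unused and only patch tokens are
        # taken, matching the pruning described in Fig. 1.
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        # Freeze all blocks, then unfreeze the last few (training setup, Fig. 1).
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        for blk in self.backbone.blocks[-n_trainable_blocks:]:
            for p in blk.parameters():
                p.requires_grad_(True)
        self.mixer = MixModule(in_dim=self.backbone.embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # N x D patch tokens (class token excluded).
        tokens = self.backbone.forward_features(x)["x_norm_patchtokens"]
        b, n, d = tokens.shape
        h = w = int(n ** 0.5)  # e.g. 224x224 input, patch 14 -> 16 x 16 tokens
        # Reshape the token sequence into an h x w feature map (cf. Fig. 2).
        fmap = tokens.permute(0, 2, 1).reshape(b, d, h, w)
        return self.mixer(fmap)  # global descriptor
```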
Fig. 2
Structural diagram of the DINOv2 model and feature transformation.
Fig. 3
Architecture of the mixer.
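A minimal sketch of an MLP-Mixer-style aggregator in the spirit of this figure follows; the block count, hidden ratio, and output dimensions are illustrative assumptions rather than the paper's exact configuration. Each block mixes information across the flattened h × w spatial positions through a residual MLP, after which channel and row projections compress the map into a compact global descriptor.

```python
import torch
import torch.nn as nn

class FeatureMixBlock(nn.Module):
    def __init__(self, n_tokens: int, ratio: float = 0.5):
        super().__init__()
        hidden = int(n_tokens * ratio)
        self.norm = nn.LayerNorm(n_tokens)
        self.mlp = nn.Sequential(
            nn.Linear(n_tokens, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_tokens),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, N) -- mix information across the N = h*w spatial
        # positions of each channel, with a residual connection.
        return x + self.mlp(self.norm(x))

class FeatureMixer(nn.Module):
    def __init__(self, in_channels=768, n_tokens=256,
                 n_blocks=4, out_channels=512, out_rows=4):
        super().__init__()
        self.blocks = nn.Sequential(
            *[FeatureMixBlock(n_tokens) for _ in range(n_blocks)])
        self.channel_proj = nn.Linear(in_channels, out_channels)
        self.row_proj = nn.Linear(n_tokens, out_rows)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (B, C, H, W) -> flatten spatial dims to (B, C, N).
        x = fmap.flatten(2)
        x = self.blocks(x)
        x = self.channel_proj(x.transpose(1, 2)).transpose(1, 2)  # (B, C', N)
        x = self.row_proj(x)                                      # (B, C', R)
        return nn.functional.normalize(x.flatten(1), dim=-1)      # descriptor
```

With the assumed configuration (768 input channels, 256 tokens, out_channels=512, out_rows=4), this produces a 2048-D L2-normalized descriptor, a size typical of global VPR descriptors.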
Fig. 4
Top-1 accuracy of DINO-Mix and other VPR methods for different test sets.
Fig. 5
Ablation on the number of feature mix blocks for each dataset.
Fig. 6
Ablation on the dimensionality levels for each dataset.
Fig. 7
Accuracy of DINO-Mix models with different numbers of weight-updated backbone layers.
Fig. 8
Comparison of Top-1 VPR results of DINO-Mix with other methods in complex cases. (a) Successful VPR cases of DINO-Mix. (b) Failed VPR cases of DINO-Mix. Green and red boxes mark image-retrieval success and failure, respectively; yellow boxes mark retrievals whose image content is correct but whose localization distance exceeds the threshold s.
Fig. 9
Attention map visualization of the query image.
