Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective

Rui Tao et al. Sensors (Basel). 2024 May 14;24(10):3130. doi: 10.3390/s24103130.

Abstract

Fine-grained representation is fundamental to deep-learning-based species classification, and cross-modal contrastive learning is an effective method in this context. The diversity of species, coupled with the inherent contextual ambiguity of natural language, poses the primary challenge in cross-modal representation alignment for conservation-area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment grounded in contextual understanding. However, during contrastive learning, a pair of encoders inevitably learns not only the differences in the data itself but also the differences caused by encoder fluctuations. The latter leads to convergence shortcuts, yielding poor representation quality: the shared feature space then fails to accurately reflect the similarity relationships between samples in the original dataset. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network that enhances consistency during momentum updates in the cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information and representation quality and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum-encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural-language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness.
Experimental validation shows that our multi-task cross-modal momentum encoders outperform similar models on standardized image classification tasks and image-text cross-modal retrieval tasks on public datasets by up to 8% on the leaderboard, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation-area image-text paired dataset show that the proposed method accurately performs cross-modal retrieval and generation tasks across 8142 species, proving its effectiveness on fine-grained cross-modal image-text conservation-area datasets.
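The momentum encoding described in the abstract builds on the widely used exponential-moving-average (EMA) scheme for contrastive learning, in which a slowly updated key encoder and a queue of past encodings provide stable targets. The sketch below illustrates only that general mechanism, not the paper's residual attention network or multi-task design; the names `momentum_update`, `MomentumQueue`, and the coefficient `m` are hypothetical and chosen for this sketch.

```python
def momentum_update(key_params, query_params, m=0.999):
    """EMA update: the key encoder slowly tracks the query encoder.

    A large m suppresses per-step encoder fluctuations, which is the
    consistency property the abstract's momentum updates rely on.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class MomentumQueue:
    """Fixed-size FIFO queue of momentum-encoded features.

    Older encodings are overwritten in place, so contrastive loss can
    draw negatives from many past batches at constant memory cost.
    """

    def __init__(self, size):
        self.feats = [None] * size  # queue slots for encoded features
        self.ptr = 0                # next slot to overwrite
        self.size = size

    def enqueue(self, batch):
        for f in batch:
            self.feats[self.ptr] = f
            self.ptr = (self.ptr + 1) % self.size  # wrap around
```

For example, `momentum_update([1.0], [0.0], m=0.9)` moves the key parameter only 10% of the way toward the query parameter, and a `MomentumQueue(4)` fed two batches of three features overwrites the oldest entries first.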

Keywords: cross-modal; cross-modal alignment; cross-modal retrieval; image captioning; multi-task.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

Figures

Figure A1. An example of knowledge distillation from a pre-trained model.
Figure A2. The structural details of the distilled encoder module.
Figure 1. An application instance of the ReCap model.
Figure 2. An overview of training ReCap for cross-modal semantic consistency.
Figure 3. Residual attention network architecture.
Figure 4. Initial visual encoding.
Figure 5. Redundant disambiguation momentum encoding.
Figure 6. Unimodal encoding momentum update.
Figure 7. Examples of text-to-image retrieval on the validation dataset.
