ModuCLIP: multi-scale CLIP framework for predicting foundation pit deformation in multi-modal robotic systems
- PMID: 40236467
- PMCID: PMC11996866
- DOI: 10.3389/fnbot.2025.1544694
Abstract
Introduction: Foundation pit deformation prediction is a critical aspect of underground engineering safety assessment, influencing construction quality and personnel safety. However, due to complex geological conditions and numerous environmental interference factors, traditional prediction methods struggle to achieve precise modeling. Conventional approaches, including numerical simulations, empirical formulas, and machine learning models, suffer from limitations such as high computational costs, poor generalization, or excessive dependence on specific data distributions. Recently, deep learning models, particularly cross-modal architectures, have demonstrated great potential in engineering applications. However, effectively integrating multi-modal data for improved prediction accuracy remains a significant challenge.
Methods: This study proposes a Multi-Scale Contrastive Language-Image Pretraining (CLIP) framework, ModuCLIP, designed for foundation pit deformation prediction in multi-modal robotic systems. The framework leverages a self-supervised contrastive learning mechanism to integrate multi-source information, including images, textual descriptions, and sensor data, while employing a multi-scale feature learning approach to enhance adaptability to complex conditions. Experiments on multiple foundation pit engineering datasets demonstrate that ModuCLIP outperforms existing methods in prediction accuracy, generalization, and robustness.
Results and discussion: The findings suggest that this framework provides an efficient and precise solution for foundation pit deformation prediction while offering new insights into multi-modal robotic perception and engineering monitoring applications.
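
The implementation itself is not reproduced on this record. As a minimal, hypothetical sketch of the CLIP-style alignment the Methods describe, the following PyTorch fragment projects image, text, and sensor features into a shared embedding space and ties the three modalities together with pairwise InfoNCE losses. All module names, feature dimensions, and the tri-modal pairing scheme are illustrative assumptions, not the authors' code; the stub MLP projections stand in for the multi-scale encoders the abstract mentions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TriModalContrastive(nn.Module):
    """Hypothetical sketch: align image, text, and sensor embeddings
    with a CLIP-style symmetric InfoNCE objective. Encoders are stubbed
    as small MLPs purely for illustration."""

    def __init__(self, img_dim=512, txt_dim=256, sen_dim=64, embed_dim=128):
        super().__init__()
        self.img_proj = nn.Sequential(
            nn.Linear(img_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.sen_proj = nn.Sequential(
            nn.Linear(sen_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        # Learnable log-temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def _nce(self, a, b):
        # Symmetric InfoNCE: co-registered pairs lie on the diagonal.
        logits = self.logit_scale.exp() * a @ b.t()
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def forward(self, img, txt, sen):
        zi = F.normalize(self.img_proj(img), dim=-1)
        zt = F.normalize(self.txt_proj(txt), dim=-1)
        zs = F.normalize(self.sen_proj(sen), dim=-1)
        # Align every pair of modalities; average the three pairwise losses.
        return (self._nce(zi, zt) + self._nce(zi, zs) + self._nce(zt, zs)) / 3.0

# Toy batch of 8 co-registered samples (random features as placeholders).
model = TriModalContrastive()
loss = model(torch.randn(8, 512), torch.randn(8, 256), torch.randn(8, 64))
print(f"tri-modal contrastive loss: {loss.item():.4f}")

In a full system along the lines sketched in the abstract, the learned joint embedding would then feed a downstream regression head that predicts deformation values from the fused representation.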
Keywords: contrastive learning; deep learning; foundation pit deformation prediction; multi-modal robotics; multi-scale features.
Copyright © 2025 Wenbo, Tingting and Xiao.
Conflict of interest statement
LX was employed by Guangdong Nonferrous Industry Building Quality Inspection Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Similar articles
- Weakly supervised multi-modal contrastive learning framework for predicting the HER2 scores in breast cancer. Comput Med Imaging Graph. 2025 Apr;121:102502. doi: 10.1016/j.compmedimag.2025.102502. Epub 2025 Feb 3. PMID: 39919535
- Multi-modal contrastive mutual learning and pseudo-label re-learning for semi-supervised medical image segmentation. Med Image Anal. 2023 Jan;83:102656. doi: 10.1016/j.media.2022.102656. Epub 2022 Oct 17. PMID: 36327656
- A spatiotemporal correlation and attention-based model for pipeline deformation prediction in foundation pit engineering. Sci Rep. 2024 Nov 2;14(1):26387. doi: 10.1038/s41598-024-77601-5. PMID: 39488572. Free PMC article.
- ADFound: A Foundation Model for Diagnosis and Prognosis of Alzheimer's Disease. IEEE J Biomed Health Inform. 2025 Jun 3. doi: 10.1109/JBHI.2025.3576436. Online ahead of print. PMID: 40460008
- Hydrocarbon migration and accumulation simulation: A review and a novel multi-scale quantitative numerical simulation method. Adv Colloid Interface Sci. 2025 Aug;342:103523. doi: 10.1016/j.cis.2025.103523. Epub 2025 Apr 27. PMID: 40318382. Review.