ModuCLIP: multi-scale CLIP framework for predicting foundation pit deformation in multi-modal robotic systems

Lin Wenbo et al. Front Neurorobot. 2025 Apr 1;19:1544694. doi: 10.3389/fnbot.2025.1544694. eCollection 2025.

Abstract

Introduction: Foundation pit deformation prediction is a critical aspect of underground engineering safety assessment, influencing construction quality and personnel safety. However, due to complex geological conditions and numerous environmental interference factors, traditional prediction methods struggle to achieve precise modeling. Conventional approaches, including numerical simulations, empirical formulas, and machine learning models, suffer from limitations such as high computational costs, poor generalization, or excessive dependence on specific data distributions. Recently, deep learning models, particularly cross-modal architectures, have demonstrated great potential in engineering applications. However, effectively integrating multi-modal data for improved prediction accuracy remains a significant challenge.

Methods: This study proposes ModuCLIP, a Multi-Scale Contrastive Language-Image Pretraining (CLIP) framework designed for foundation pit deformation prediction in multi-modal robotic systems. The framework leverages a self-supervised contrastive learning mechanism to integrate multi-source information, including images, textual descriptions, and sensor data, while employing a multi-scale feature learning approach to enhance adaptability to complex conditions. Experiments conducted on multiple foundation pit engineering datasets demonstrate that ModuCLIP outperforms existing methods in terms of prediction accuracy, generalization, and robustness.
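For readers unfamiliar with CLIP-style training, the following is a minimal sketch of how contrastive alignment across image, text, and sensor embeddings can be set up in PyTorch. It illustrates the general mechanism only; the projection heads, embedding dimension, temperature, and symmetric InfoNCE loss are assumptions for illustration, not the authors' implementation.

    # Sketch of CLIP-style contrastive alignment across image, text, and sensor
    # modalities. Architectures, dimensions, and the loss are illustrative
    # assumptions, not the ModuCLIP implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ModalityProjector(nn.Module):
        """Projects a modality-specific feature vector into a shared embedding space."""
        def __init__(self, in_dim: int, embed_dim: int = 256):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, embed_dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return F.normalize(self.proj(x), dim=-1)  # unit-norm embeddings

    def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss: row i of a and row i of b form a positive pair."""
        logits = a @ b.t() / temperature
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Hypothetical pre-extracted features for a batch of monitoring records.
    batch = 8
    img_feat, txt_feat, sen_feat = torch.randn(batch, 512), torch.randn(batch, 384), torch.randn(batch, 64)

    img_proj, txt_proj, sen_proj = ModalityProjector(512), ModalityProjector(384), ModalityProjector(64)
    zi, zt, zs = img_proj(img_feat), txt_proj(txt_feat), sen_proj(sen_feat)

    # Align every modality pair in the shared space; deformation prediction would
    # then be handled by a regression head on the fused embeddings.
    loss = contrastive_loss(zi, zt) + contrastive_loss(zi, zs) + contrastive_loss(zt, zs)
    print(loss.item())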

Results and discussion: The findings suggest that this framework provides an efficient and precise solution for foundation pit deformation prediction while offering new insights into multi-modal robotic perception and engineering monitoring applications.

Keywords: contrastive learning; deep learning; foundation pit deformation prediction; multi-modal robotics; multi-scale features.


Conflict of interest statement

LX was employed by Guangdong Nonferrous Industry Building Quality Inspection Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1. Conceptual diagram of the foundation pit deformation monitoring and prediction framework.
Figure 2. The dynamic geotechnical learning network (DGL-Net) framework for predicting foundation pit deformation under complex geotechnical and environmental conditions. DGL-Net combines physics-constrained deep learning with a dynamic state representation module to capture spatial-temporal features. Multi-scale behavior is handled by a hierarchical attention mechanism, and predictions are kept accurate and physically consistent by enforcing constraints derived from physical laws and domain knowledge.
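As a loose illustration of the physics-constrained training idea described for DGL-Net, the sketch below adds a penalty term to a standard data-fit loss. The specific constraint (smoothness of the displacement profile with depth) and its weighting are illustrative assumptions, not the constraint set used in the paper.

    # Sketch of a physics-constrained loss: data fit plus a penalty for violating
    # an assumed physical relation (smooth displacement profile with depth).
    import torch

    def physics_constrained_loss(pred: torch.Tensor, target: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
        # pred/target: (batch, depth_points) predicted and measured wall displacements.
        data_term = torch.mean((pred - target) ** 2)
        # Penalize abrupt jumps between adjacent depth points, a simple surrogate
        # for physical consistency of the displacement profile.
        physics_term = torch.mean((pred[:, 1:] - pred[:, :-1]) ** 2)
        return data_term + lam * physics_term

    pred = torch.randn(4, 16, requires_grad=True)   # hypothetical network output
    target = torch.randn(4, 16)                     # hypothetical measurements
    loss = physics_constrained_loss(pred, target)
    loss.backward()
    print(loss.item())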
Figure 3. The hierarchical attention mechanism for multi-scale handling. This mechanism captures geotechnical data variations across different spatial scales by generating features at multiple resolutions. A learnable attention mechanism assigns adaptive weights to each scale, allowing the model to focus on the most relevant features for accurate predictions. The features from different resolutions are fused and refined, enhancing spatial coherence and improving the model's ability to handle heterogeneous geotechnical conditions. Regularization ensures a balanced attention distribution across scales to prevent overfitting.
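A minimal sketch of the scale-weighting idea behind such a hierarchical attention mechanism is given below: features are pooled to several resolutions, each scale receives a learnable attention score, and softmax-normalized weights govern the fusion. The number of scales, pooling choice, and dimensions are assumptions for illustration, not the paper's architecture.

    # Sketch of multi-scale attention fusion: per-scale features get adaptive
    # weights and are fused back at the original resolution.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiScaleAttentionFusion(nn.Module):
        def __init__(self, channels: int, scales=(1, 2, 4)):
            super().__init__()
            self.scales = scales
            # One learnable score per scale, computed from globally pooled features.
            self.score = nn.ModuleList([nn.Linear(channels, 1) for _ in scales])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, height, width) feature map from a backbone.
            feats, scores = [], []
            for s, head in zip(self.scales, self.score):
                pooled = F.avg_pool2d(x, kernel_size=s) if s > 1 else x
                up = F.interpolate(pooled, size=x.shape[-2:], mode="bilinear", align_corners=False)
                feats.append(up)
                scores.append(head(up.mean(dim=(-2, -1))))          # (batch, 1) per scale
            weights = torch.softmax(torch.cat(scores, dim=1), dim=1)  # adaptive scale weights
            fused = sum(w.view(-1, 1, 1, 1) * f for w, f in zip(weights.unbind(dim=1), feats))
            return fused

    fusion = MultiScaleAttentionFusion(channels=32)
    print(fusion(torch.randn(2, 32, 64, 64)).shape)  # torch.Size([2, 32, 64, 64])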
Figure 4. The adaptive multi-scale integration strategy (AMIS) framework for predicting foundation pit deformation under complex geotechnical conditions. AMIS integrates multi-scale feature aggregation, temporal adaptation, and geotechnical constraints to model the dynamic and heterogeneous behavior of foundation pit deformation. By leveraging multi-scale convolutional operations and recurrent layers, the model adapts to spatial and temporal variations in soil properties and excavation processes. The integration of real-time feedback using a Kalman filter allows the model to continuously refine its predictions, ensuring accurate forecasting of displacement and strain over time. The attention mechanism dynamically focuses on critical regions with high spatial variability, enhancing the model's adaptability to real-world applications.
Figure 5. Integration of geotechnical constraints and real-time feedback for foundation pit deformation prediction. The system incorporates physical constraints, such as stress equilibrium and failure conditions, to ensure predictions align with geotechnical principles. An attention mechanism prioritizes regions with significant spatial variability in material properties, improving focus on critical areas. Real-time feedback is integrated through a Kalman filter-based approach, allowing the model to continuously update and refine predictions based on real-world observations and enhancing the adaptability and accuracy of the deformation forecasts over time.
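The Kalman filter-based feedback described here can be illustrated with a very simple one-dimensional update, in which each new field measurement corrects the model's displacement forecast. The scalar state, noise variances, and example values below are assumptions for illustration, not the paper's filter design.

    # Sketch of a 1-D Kalman-filter feedback loop: a model-predicted displacement
    # is corrected with each new sensor observation.
    def kalman_update(x_pred: float, p_pred: float, z_obs: float, r_meas: float):
        """Correct the predicted displacement x_pred (variance p_pred) with observation z_obs."""
        k = p_pred / (p_pred + r_meas)         # Kalman gain
        x_new = x_pred + k * (z_obs - x_pred)  # fuse prediction and measurement
        p_new = (1.0 - k) * p_pred             # reduced uncertainty after the update
        return x_new, p_new

    x, p = 0.0, 1.0                # initial displacement estimate (mm) and variance
    q_process, r_meas = 0.05, 0.2  # assumed process and measurement noise variances

    model_predictions = [1.2, 2.0, 2.9, 3.5]   # hypothetical network forecasts (mm)
    observations      = [1.0, 2.2, 2.7, 3.8]   # hypothetical in-situ readings (mm)

    for pred, obs in zip(model_predictions, observations):
        x, p = pred, p + q_process             # predict step: take the network forecast
        x, p = kalman_update(x, p, obs, r_meas)
        print(f"refined displacement: {x:.2f} mm (variance {p:.3f})")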
Figure 6. Ablation study of our method on the COCO and GeoNet datasets.
Figure 7. Ablation study of our method on the SEN12MS and SEN1-2 datasets.

