Front Artif Intell. 2025 Jun 10;8:1580973. doi: 10.3389/frai.2025.1580973. eCollection 2025.

Mixture of prompts learning for vision-language models


Yu Du et al. Front Artif Intell. .

Abstract

As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have explored adapting VLMs to downstream tasks. Among these approaches, prompt learning has been validated as an effective adaptation method that requires only a small number of trainable parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture-of-prompts learning method incorporating a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text features and the corresponding hard-prompt-encoded text features. This supervision keeps the text features derived from soft prompts close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. Our approach establishes that multi-prompt specialization with knowledge-preserving routing effectively bridges the adaptability-generalization tradeoff in VLM deployment. The code will be available at https://github.com/dyabel/mocoop.
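To make the routing idea concrete, here is a minimal numpy sketch of per-instance prompt selection as the abstract describes it: a router scores each soft prompt against the image feature and keeps the top-k. The function name `route_prompts`, the dot-product scoring, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def route_prompts(image_feat, prompt_keys, top_k=2):
    """Score each soft prompt against an image feature and keep the top-k.

    Returns (indices, weights): indices of the selected prompts and their
    routing probabilities renormalized over the selection.
    """
    logits = prompt_keys @ image_feat           # one score per soft prompt
    probs = softmax(logits)
    idx = np.argsort(probs)[::-1][:top_k]       # top-k prompts by probability
    weights = probs[idx] / probs[idx].sum()     # renormalize over selection
    return idx, weights

# Toy example: 4 candidate soft prompts, 8-dimensional features.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))   # one learnable key per soft prompt (assumed)
img = rng.normal(size=8)         # image feature from the frozen encoder
idx, w = route_prompts(img, keys)
```

In the paper's full method the router is additionally supervised so that its choices track the similarity between the image and hard prompt templates; that gating loss is omitted here.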

Keywords: few-shot classification; mixture-of-experts; multi-modal; prompt learning; vision-language model.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
For the image classification task based on CLIP (Radford et al., 2021), hard templates can be grouped into different sets based on the contexts and patterns they describe in the images (e.g., varying contents within differently colored blocks). Different images usually exhibit different context styles, and a single image may exhibit multiple styles simultaneously. Traditionally, only one soft prompt is used to represent all images, which limits adaptability. In contrast, our method utilizes multiple soft prompts, each representing a distinct context. A routing module dynamically selects the most suitable prompts for each image. By accounting for different styles, this approach more effectively bridges the gap between visual and textual features.
Figure 2
Overview of MoCoOp. The orange lines signify the extra flow used only for training, while the black lines are shared by training and inference. During inference, the two soft prompts with the highest probabilities are selected and combined with the available classes for text encoding. The resulting text features are averaged and used for classification. During training, hard-prompt-guided routing and semantically grouped text-level supervision are introduced to supervise the router and the soft prompts, respectively.
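The inference path in Figure 2 (average the selected prompts' text features, then score classes by similarity) can be sketched as follows. This is a simplified numpy stand-in, assuming unit-normalized features and cosine similarity as in CLIP-style classifiers; `classify` and all shapes are illustrative, not the released code.

```python
import numpy as np

def classify(image_feat, text_feats):
    """Average text features over the selected soft prompts and score
    each class by cosine similarity with the image feature.

    text_feats has shape (top_k, num_classes, dim): one text feature per
    (selected prompt, class) pair, as produced by the frozen text encoder.
    """
    avg = text_feats.mean(axis=0)                           # (num_classes, dim)
    avg /= np.linalg.norm(avg, axis=-1, keepdims=True)      # unit-normalize
    img = image_feat / np.linalg.norm(image_feat)
    return avg @ img                                        # similarity per class

# Toy example: 2 selected prompts, 5 classes, 16-dimensional features.
rng = np.random.default_rng(1)
scores = classify(rng.normal(size=16), rng.normal(size=(2, 5, 16)))
```

The predicted class is simply `scores.argmax()`; a temperature-scaled softmax over `scores` would recover class probabilities, as in CLIP.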
Figure 3
The few-shot learning results on 11 datasets. We plot the results across 1, 2, 4, 8, and 16 shots. Our MoCoOp consistently and significantly surpasses CoOp (Zhou et al., 2022b), ProGrad (Zhu et al., 2023), and the Linear Probe approach across most datasets, as is evident in the average accuracy displayed in the top-left corner.
Figure 4
Ablation study on the sensitivity to hyper-parameters λ1 and λ2 (4-shot average accuracy). With λ2 fixed at 5, accuracy peaks at 71.24% at λ1 = 2; with λ1 fixed at 1, accuracy peaks at 71.45% at λ2 = 1. Accuracy declines beyond each peak.
Figure 5
Visualization of prompt selection across different image samples. The router dynamically selects the most suitable prompts based on the visual content of each image, whereas a traditional single-prompt method such as CoOp predicts the wrong class.


