Enhancing waste recognition with vision-language models: A prompt engineering approach for a scalable solution
- PMID: 40513414
- DOI: 10.1016/j.wasman.2025.114939
Abstract
Conventional unimodal computer vision models, trained on limited bespoke waste datasets, face significant challenges in classifying waste images in material recovery facilities, where waste appears in diverse forms. Maintaining the performance of these models requires frequent fine-tuning with extensive data augmentation, which is resource-intensive, time-consuming, and ultimately unsustainable for scalable applications. This study implements multimodal image classification using state-of-the-art Vision-Language Models (VLMs), showcasing their adaptability in data-scarce scenarios with zero-shot classification and their scalability through few-shot and fully supervised learning. A trade-off between accuracy and inference speed serves as the criterion for selecting the optimal model. It is demonstrated that targeted prompt engineering can significantly enhance VLM adaptability and scalability, as evidenced by an increase in the zero-shot waste image classification accuracy of the optimal model from 82.71% to 90.48% without task-specific fine-tuning. Additionally, fully supervised training of the optimal model achieved a classification accuracy of 97.18%. This research contributes to the waste management literature by revealing the potential of VLMs as a scalable solution for waste image classification and by introducing a targeted prompt engineering method to enhance model performance as a final-stage optimization strategy.
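To make the prompt-engineering idea concrete, the sketch below illustrates zero-shot waste image classification with an off-the-shelf VLM. The abstract does not name the model or the prompt templates used; CLIP (via Hugging Face Transformers), the class list, and the engineered template here are illustrative assumptions, not the authors' exact setup.

```python
# A minimal sketch of zero-shot waste image classification with a
# vision-language model. CLIP, the waste categories, and the prompt
# template below are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical waste categories. Targeted prompt engineering replaces
# the bare label ("glass") with a descriptive phrase grounded in how
# waste actually appears on a sorting line.
classes = ["cardboard", "glass", "metal", "paper", "plastic", "trash"]
prompts = [f"a photo of {c} waste on a conveyor belt at a recycling facility"
           for c in classes]

image = Image.open("waste_item.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores, one per prompt;
# the highest-scoring prompt gives the predicted class.
probs = outputs.logits_per_image.softmax(dim=-1)
pred = probs.argmax().item()
print(f"predicted: {classes[pred]} (p={probs[0, pred]:.3f})")
```

Swapping the bare class name for a context-rich description is the kind of final-stage optimization the abstract reports: it changes only the text inputs, so it requires no fine-tuning of the model weights.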
Keywords: Computer vision; Deep learning; Prompt engineering; Sustainable waste management; Vision-language model; Zero-shot learning.
Copyright © 2025 The Authors. Published by Elsevier Ltd. All rights reserved.
Conflict of interest statement
Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Similar articles
- Leveraging a foundation model zoo for cell similarity search in oncological microscopy across devices. Front Oncol. 2025 Jun 18;15:1480384. doi: 10.3389/fonc.2025.1480384. PMID: 40606969. Free PMC article.
- Mixture of prompts learning for vision-language models. Front Artif Intell. 2025 Jun 10;8:1580973. doi: 10.3389/frai.2025.1580973. PMID: 40556640. Free PMC article.
- VLM-CPL: Consensus Pseudo-Labels from Vision-Language Models for Annotation-Free Pathological Image Classification. IEEE Trans Med Imaging. 2025 Aug 4. doi: 10.1109/TMI.2025.3595111. Online ahead of print. PMID: 40758498.
- Intravenous magnesium sulphate and sotalol for prevention of atrial fibrillation after coronary artery bypass surgery: a systematic review and economic evaluation. Health Technol Assess. 2008 Jun;12(28):iii-iv, ix-95. doi: 10.3310/hta12280. PMID: 18547499.
- Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19. Cochrane Database Syst Rev. 2022 May 20;5(5):CD013665. doi: 10.1002/14651858.CD013665.pub3. PMID: 35593186. Free PMC article.
Cited by
- Autonomous Waste Classification Using Multi-Agent Systems and Blockchain: A Low-Cost Intelligent Approach. Sensors (Basel). 2025 Jul 12;25(14):4364. doi: 10.3390/s25144364. PMID: 40732493. Free PMC article.