- Karlsruhe Institute of Technology (KIT), Institute of Water and Environment, Karlsruhe, Germany
Deep learning-based computer vision methods are widely used to detect and quantify floating macroplastic litter in rivers, enabling accurate assessments of plastic pollution by automatically processing images and videos. However, these methods typically rely on large amounts of annotated data for supervised learning (SL), and manual labeling is costly and time-consuming. This hinders broad model generalization, a key requirement for robust computer vision systems for long-term, large-scale litter monitoring.
To overcome this challenge, we propose a Vision-Language Model (VLM)-based method for detecting floating litter that requires no labeled images for model training. Recent advances in Generative AI, particularly VLMs, have transformed artificial intelligence by enabling rich semantic understanding across modalities. Pre-trained on millions to billions of image-text pairs, VLMs learn visual representations from natural language supervision, enabling robust cross-modal understanding and generalization. This broad pre-training also allows VLMs to achieve remarkable zero-shot performance in many domain-specific applications, even without domain-specific labeled images for SL.
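The zero-shot mechanism described above can be illustrated with a minimal sketch: a CLIP-style model embeds an image and a set of text prompts into a shared space, and the prediction is the prompt whose embedding is most similar to the image embedding. The function name, prompt texts, and toy embedding values below are illustrative assumptions, not the authors' implementation; in practice the embeddings would come from a pre-trained encoder such as OpenCLIP.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is most cosine-similar to the image embedding."""
    # L2-normalize so that dot products equal cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # one cosine similarity per candidate label
    return labels[int(np.argmax(sims))], sims

# Toy embeddings standing in for a pre-trained encoder's output
labels = ["a photo of floating plastic litter", "a photo of a clean water surface"]
text_embs = np.array([[1.0, 0.1],
                      [0.1, 1.0]])
image_emb = np.array([0.9, 0.2])  # closer to the "litter" prompt

pred, sims = zero_shot_classify(image_emb, text_embs, labels)
print(pred)  # → "a photo of floating plastic litter"
```

Because classification reduces to comparing embeddings against natural-language prompts, the candidate classes can be changed at inference time without any retraining, which is what removes the need for labeled training images.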
We demonstrate the effectiveness of our methodology using multiple VLMs (e.g., DeepSeek-VL2 and OpenCLIP) on images collected from canals and waterways in the Netherlands and Southeast Asia. We conduct a comprehensive comparison against conventional SL approaches built on multiple deep learning architectures (e.g., Vision Transformer, ResNet, and DenseNet). The results indicate that our method achieves robust zero-shot generalization performance.
Based on these results, we suggest that stakeholders (e.g., researchers, consultants, and governmental organizations) consider VLM-based methods when developing robust systems for targeted long-term floating litter monitoring, minimizing the cost of collecting labeled data.
How to cite: Zhang, C., Jia, T., J. Franca, M., Lofty, J., Rebai, D., and Ehret, U.: Vision-Language Models for Floating Litter Detection, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-8285, https://doi.org/10.5194/egusphere-egu26-8285, 2026.