- 1 Technische Universität Berlin, Institute for Machine Design and System Technology, Agromechatronik, Germany (t.schuette@tu-berlin.de)
- 2 Universität Osnabrück, Joint Lab Künstliche Intelligenz & Data Science
Vision–language models (VLMs) are increasingly used for the semantic interpretation of visual data, enabling flexible, open-vocabulary analysis of images based on natural language descriptions. These capabilities offer new opportunities for large-scale semantic mapping, particularly in domains where comprehensive labeled training data are scarce or difficult to obtain, such as agricultural and horticultural environments.
Recent research has explored the transfer of semantic information from 2D imagery into three-dimensional representations, a process often called semantic lifting. This approach is attractive for outdoor scene understanding: training native 3D vision–language models that generalize across landscapes and management regimes remains challenging, and tooling for 3D data consequently lags behind the 2D domain. However, most existing studies on semantic lifting focus on indoor environments or urban outdoor scenes, while agricultural landscapes, with their distinct structural characteristics, vegetation dynamics, and management patterns, remain underexplored.
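To make the lifting step concrete, the sketch below illustrates one common projection-and-voting scheme; it is an assumed, simplified formulation rather than the specific pipeline of this contribution. Per-pixel labels predicted by a 2D open-vocabulary segmenter are projected onto a reconstructed point cloud using known camera intrinsics and extrinsics and aggregated across views by majority vote. Occlusion handling and feature-level fusion are omitted for brevity.

```python
# Illustrative sketch of 2D-to-3D semantic lifting (assumed projection/voting
# scheme, not the authors' exact pipeline).
import numpy as np

def lift_labels(points, views, num_classes):
    """points: (N, 3) world coordinates of the reconstructed cloud.
    views: list of dicts with 'K' (3x3 intrinsics), 'R' (3x3 world-to-camera
    rotation), 't' (3,) translation, and 'labels' (H, W) integer label map
    from a 2D open-vocabulary segmenter."""
    votes = np.zeros((points.shape[0], num_classes), dtype=np.int64)
    for view in views:
        cam = points @ view["R"].T + view["t"]        # world -> camera coordinates
        in_front = cam[:, 2] > 0                      # ignore points behind the camera
        pix = cam[in_front] @ view["K"].T
        pix = pix[:, :2] / pix[:, 2:3]                # perspective division -> pixel coords
        col = np.round(pix[:, 0]).astype(int)
        row = np.round(pix[:, 1]).astype(int)
        H, W = view["labels"].shape
        inside = (col >= 0) & (col < W) & (row >= 0) & (row < H)
        idx = np.flatnonzero(in_front)[inside]        # indices of points visible in this view
        votes[idx, view["labels"][row[inside], col[inside]]] += 1
    lifted = votes.argmax(axis=1)                     # majority vote across views
    lifted[votes.sum(axis=1) == 0] = -1               # never observed -> unlabeled
    return lifted
```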
In this contribution, we investigate the applicability of open-vocabulary, VLM-based semantic lifting for large-scale 3D semantic mapping in agricultural settings. Building on insights from urban-scale benchmarks, we analyze how vision–language-driven semantic segmentation transfers to outdoor agricultural and horticultural scenes reconstructed from multi-view UAV imagery. Our results highlight both the potential of these models to generate spatially consistent semantic representations and their limitations, which depend strongly on land cover type and semantic class.
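As a complementary illustration of the open-vocabulary aspect, the following sketch assigns free-form class prompts to lifted per-point VLM features by cosine similarity. The function name, threshold, and the assumption that per-point features and prompt embeddings (e.g. from a CLIP-style text encoder) are already available are illustrative assumptions, not part of the evaluated system.

```python
# Illustrative sketch of open-vocabulary label assignment over lifted features
# (assumed inputs: per-point VLM features and precomputed prompt embeddings).
import numpy as np

def assign_open_vocab_labels(point_feats, prompt_feats, min_sim=0.2):
    """point_feats: (N, D) VLM image features aggregated per 3D point.
    prompt_feats: (C, D) text embeddings of natural-language class prompts,
    e.g. "apple tree", "grassland", "farm track"."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    q = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    sim = p @ q.T                            # cosine similarity, shape (N, C)
    labels = sim.argmax(axis=1)              # best-matching prompt per point
    labels[sim.max(axis=1) < min_sim] = -1   # low similarity -> leave unlabeled
    return labels
```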
We discuss how such preliminary semantic 3D representations can support large-scale agroecosystem mapping and serve as an initial layer for downstream applications, including spatial analysis and the deployment of agricultural robotic systems. The findings provide guidance on the opportunities and current constraints of foundation-model-based semantic mapping for sustainable agricultural monitoring.
How to cite: Schütte, T., Kontetzki, S., and Hänel, T.: Potentials and Limitations of Vision-Language Models for Large-Scale 3D Semantic Mapping in Agricultural Environments, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-21351, https://doi.org/10.5194/egusphere-egu26-21351, 2026.