- King Abdullah University of Science and Technology, Environmental Science and Engineering, Saudi Arabia (omar.lopezcamargo@kaust.edu.sa)
Fractional Vegetation Cover (FVC) is a key ecological variable for monitoring ecosystem health, land degradation, and vegetation dynamics in dryland environments. While satellite and UAV observations enable scalable FVC estimation over large spatial extents, the accuracy and robustness of the resulting models depend strongly on high-quality field-based reference data for calibration and validation. Traditional in-situ methods, such as visual estimates along transect-based surveys, remain widely used but are labor-intensive and inherently subjective. Digital photography has emerged as a practical alternative, typically analyzed with index-based computer vision techniques or deep learning models; however, index-based methods are highly sensitive to background variability, while deep learning models rely on large labeled training datasets. Recent advances in multimodal large language models (MLLMs) suggest a potential paradigm shift, as these models combine visual perception with high-level reasoning and benefit from diverse pre-training that enables conceptual knowledge transfer across tasks. In this study, we evaluate the feasibility of using MLLMs to estimate FVC directly from ground-level photographs without task-specific training. We collected and compiled a dataset of more than 1,100 quadrat photographs from 26 dryland sites in Saudi Arabia, spanning surface conditions from bare soil to sparsely vegetated rangelands. Each photograph corresponds to a 1 m × 1 m quadrat for which FVC was estimated independently by two experts; their average served as the reference for assessing model predictions. Six state-of-the-art MLLMs (Qwen2.5-VL, Mistral-Small-3.2, LLaMA-4-Maverick, LLaMA-4-Scout, and two Gemma-3 variants) were evaluated using four prompt designs that varied in length, ecological context, and methodological detail.
Across all models and prompts, the MLLMs achieved a mean absolute error (MAE) of approximately 7.8%, competitive with traditional image-based methods. The best model-prompt combinations reached MAE values below 5% with low systematic bias. Short, ecologically explicit prompts consistently outperformed more complex designs, reducing MAE by approximately 1.3–1.4 percentage points relative to visually guided or highly structured prompts (MAE ≈ 6.9% versus 8.2–8.4%). Overall performance was more sensitive to model choice than to prompt structure: average MAE ranged from approximately 5.6% to 10.0% across models, compared with a narrower range across prompts. The highest accuracy was obtained with Qwen2.5-VL and an ecologically detailed prompt, which achieved an MAE of 4.9%, near-zero bias, and an RMSE of 8.4%. Across all prompt designs, Qwen2.5-VL and Mistral-Small-3.2 consistently delivered the best overall performance, both maintaining average MAE below 6% and exhibiting stable behavior across prompt variations, indicating robustness to prompt design. These results demonstrate that MLLMs can provide accurate and scalable FVC estimates directly from field photographs, without requiring specialized training datasets. This approach offers a promising alternative for rapid field surveys and reference data generation, particularly in dryland ecosystems where background complexity and data scarcity limit the effectiveness of conventional methods.
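The accuracy metrics reported above (MAE, signed bias, and RMSE) follow standard definitions and can be computed against the expert-average reference in a few lines. The sketch below is a minimal illustration only, not the study's analysis code: the FVC values for the two experts and the model predictions are hypothetical placeholders.

```python
import numpy as np

def fvc_error_metrics(pred, ref):
    """Error metrics between model-predicted and reference FVC, both in percent."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    err = pred - ref
    return {
        "MAE": float(np.mean(np.abs(err))),      # mean absolute error
        "bias": float(np.mean(err)),             # mean signed error (systematic bias)
        "RMSE": float(np.sqrt(np.mean(err**2)))  # root-mean-square error
    }

# Hypothetical FVC estimates (%) for five quadrats, not data from the study
expert_a = [2.0, 10.0, 25.0, 5.0, 40.0]
expert_b = [4.0, 12.0, 23.0, 7.0, 38.0]
reference = np.mean([expert_a, expert_b], axis=0)  # average of the two experts
model_pred = [5.0, 9.0, 30.0, 4.0, 42.0]

print(fvc_error_metrics(model_pred, reference))
```

Using the signed mean error as the bias term makes systematic over- or underestimation visible even when MAE is small, which is why the abstract reports both.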
How to cite: Lopez Camargo, O. A., Elias Lara, M., El Hajj, M., Cheng, H., Scilla, D., Angulo, V., Al wahas, A., Johansen, K., and McCabe, M. F.: Evaluating Fractional Vegetation Cover using Multimodal Large Language Models: A Comparative study with Human Observations, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-18939, https://doi.org/10.5194/egusphere-egu26-18939, 2026.