- Hongik University, Civil and Environmental Engineering, Korea, Republic of (dhkim141@hongik.ac.kr)
This study investigates the domain adaptation of Vision-Language Models (VLMs) for road damage assessment, focusing on a fine-tuning strategy optimized for resource-constrained engineering environments. Unlike conventional object detection models that operate within fixed label spaces, VLMs provide superior semantic understanding and generalization in complex scenarios. To facilitate practical deployment, this research systematically analyzes the key variables of Parameter-Efficient Fine-Tuning (PEFT) to mitigate the high computational demands inherent in large-scale VLMs.
In the experimental phase, hyperparameter tuning was conducted using the Low-Rank Adaptation (LoRA) technique. The primary variables included LoRA rank (16, 32, 64, and 96), training data scale, and image resolution (1,024×28×28 vs. 1,536×28×28). A comprehensive dataset of 26,796 images comprising six damage categories and negative samples was established, utilizing a 7n sampling strategy (n = 500, 750, 1,000) to address class imbalance. The impact of data volume was evaluated by augmenting the 7,000-sample set (corresponding to n = 1,000) to match the full dataset size of 26,796, with zero-shot inference serving as the performance baseline.
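The practical appeal of the LoRA rank sweep can be illustrated with a back-of-the-envelope parameter count: LoRA freezes a weight matrix W and trains only a low-rank update BA, so the trainable budget scales with the rank rather than the full matrix size. The sketch below is illustrative only; the hidden size is a hypothetical value, not a figure from the study.

```python
# Minimal sketch: trainable-parameter counts for a LoRA update on a
# single d_out x d_in projection. LoRA learns factors A (r x d_in) and
# B (d_out x r), so trainable params drop from d_out*d_in to r*(d_in + d_out).

def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank factors A and B for one weight matrix."""
    return rank * (d_in + d_out)

d_in = d_out = 4096  # illustrative hidden size, not taken from the study
full = d_in * d_out
for r in (16, 32, 64, 96):  # the ranks swept in this study
    lora = lora_trainable_params(d_in, d_out, r)
    print(f"rank {r:>2}: {lora:,} trainable params "
          f"({100 * lora / full:.2f}% of full fine-tuning for this matrix)")
```

Even the largest rank tested (96) trains well under 5% of the parameters of the corresponding full weight matrix, which is what makes the sweep feasible in a resource-constrained setting.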
Experimental results demonstrated substantial improvements over the zero-shot baseline, indicating that performance correlates positively with larger (augmented) training data and higher image resolution, while lower LoRA ranks (16, 32) proved most effective for this domain. Furthermore, two specialized ad-hoc metrics, MmAP and MF1, verified a stable trade-off between False Positives and False Negatives. Notably, to minimize safety-critical False Negatives, a prompt-engineering-based 'Double Check' mechanism with multi-turn interactions was employed. This approach successfully leveraged the model's inherent reasoning capabilities to refine damage identification through iterative feedback.
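The 'Double Check' idea described above can be sketched as a two-turn control flow: a negative first answer is not accepted until a second, more pointed query re-examines the image. This is a hedged reconstruction, not the authors' implementation; `query_vlm` is a hypothetical interface standing in for a real VLM client, and the prompt wording is illustrative.

```python
# Sketch of a 'Double Check' multi-turn mechanism for reducing
# safety-critical False Negatives. Assumes a hypothetical query_vlm(image,
# prompt) -> str interface; the prompts are illustrative, not the study's.

def query_vlm(image, prompt):
    """Placeholder for a vision-language model call; replace with a real client."""
    raise NotImplementedError

def assess_with_double_check(image, query=query_vlm):
    # First pass: open-ended damage classification.
    first = query(image, "Identify any road damage in this image. "
                         "Answer with the damage category or 'none'.")
    if first.strip().lower() != "none":
        return first  # a positive finding is accepted directly
    # Double check: explicitly challenge a negative answer before accepting it,
    # trading some False Positives for fewer safety-critical False Negatives.
    second = query(image, "Look again carefully, including fine cracks and "
                          "faint surface defects. Is the road truly undamaged? "
                          "Answer with a damage category or 'none'.")
    return second
```

The design choice is asymmetric on purpose: only negatives trigger the second turn, so the extra inference cost is paid exactly where a miss would be most consequential.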
Acknowledgements: This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (RS-2025-25437298).
How to cite: Kim, D. and Youn, H.: Optimizing Vision-Language Model for Robust Road Damage Assessment via Parameter-Efficient Fine-Tuning, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-3674, https://doi.org/10.5194/egusphere-egu26-3674, 2026.