- ¹Nanjing University, China
- ²Technical University of Munich, Germany
Vision-language foundation models (VLFMs), such as CLIP, have demonstrated remarkable generalizability across diverse downstream tasks, including both cross-modal and vision-centric tasks. Leveraging large-scale textual supervision, VLFMs capture a broad spectrum of visual concepts and achieve breakthrough performance in zero-shot image understanding. However, current remote sensing (RS)-specific VLFMs, while performing well on image-level tasks, exhibit limited capability in fine-grained tasks such as open-vocabulary semantic segmentation (OVSS). This limitation stems from their adherence to the CLIP training paradigm, which aligns image and text features only at the global level, thereby degrading performance on tasks that require high-quality visual representations at the local level. Moreover, existing VLFMs that incorporate fine-grained alignment mechanisms still perform poorly on RS tasks, whether transferred directly to RS scenarios or fine-tuned on RS image-caption datasets. This further underscores the need for RS-tailored fine-grained VLFMs.
To address this, we construct the first multi-granularity RS image-text dataset, MGRS-200k (Figure 1). MGRS-200k contains approximately 200k RS images, each annotated with both short and long global captions, as well as multiple object-level bounding boxes with corresponding categories, totaling over one million instances. We further investigate existing fine-grained VLFM training methods and find that their explicit region-text alignment strategies often disrupt semantic coherence, as their underlying assumptions do not hold in RS scenarios, and thus degrade fine-grained understanding.
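To make the annotation granularity concrete, the sketch below shows what a single MGRS-200k record could look like. The field names, values, and file format are illustrative assumptions only, not the released schema.

```python
# Hypothetical MGRS-200k record (illustrative field names and values only;
# the actual release format may differ).
sample = {
    "image": "images/000001.tif",  # one of ~200k RS images
    "short_caption": "An airport apron with parked aircraft.",
    "long_caption": (
        "A large airport scene showing a terminal building, taxiways, and "
        "several aircraft parked along the apron ..."
    ),
    # Object-level annotations: bounding boxes (x_min, y_min, x_max, y_max)
    # with category labels; over one million such instances across the dataset.
    "regions": [
        {"bbox": [104, 220, 310, 395], "category": "airplane"},
        {"bbox": [480, 132, 640, 260], "category": "terminal"},
    ],
}
```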
Building on these findings, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework (Figure 2). FarSLIP first employs patch-to-patch self-distillation to align local and global visual cues, enhancing feature discriminability while preserving semantic coherence. It then applies CLS token-based region-category alignment on the MGRS-200k dataset to further improve spatial awareness. FarSLIP achieves state-of-the-art performance in zero-shot RS image understanding, excelling not only on image-level tasks such as scene classification and image-text retrieval, but more importantly on fine-grained tasks like OVSS. Additionally, it serves as a strong foundation for multimodal large language models (MLLMs) in RS image comprehension.
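As a rough illustration of the two-stage training described above, the PyTorch-style sketch below pairs a standard CLIP image-caption contrastive loss with a patch-to-patch self-distillation term (Stage I) and a CLS token-based region-category alignment term (Stage II). The function names, loss forms, and temperatures are assumptions made for illustration; the exact formulation used in FarSLIP may differ.

```python
# Minimal sketch of the two-stage FarSLIP objectives (hypothetical forms).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Image-caption alignment: standard symmetric InfoNCE over a batch."""
    img_emb = F.normalize(img_emb, dim=-1)          # (B, D)
    txt_emb = F.normalize(txt_emb, dim=-1)          # (B, D)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb), device=img_emb.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def patch_self_distillation_loss(student_patches, teacher_patches):
    """Stage I: pull each student patch feature toward the corresponding
    teacher patch feature (cosine-distance variant, assumed)."""
    s = F.normalize(student_patches, dim=-1)        # (B, N_patches, D)
    t = F.normalize(teacher_patches, dim=-1)        # (B, N_patches, D)
    return (1 - (s * t).sum(dim=-1)).mean()

def region_category_loss(region_cls_emb, category_txt_emb, labels, temperature=0.07):
    """Stage II: match each region's CLS-token embedding against the text
    embeddings of the candidate categories from MGRS-200k."""
    r = F.normalize(region_cls_emb, dim=-1)         # (N_regions, D)
    c = F.normalize(category_txt_emb, dim=-1)       # (N_categories, D)
    logits = r @ c.t() / temperature
    return F.cross_entropy(logits, labels)          # labels: category indices
```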
Figure 1. Examples of our proposed MGRS-200k dataset.
Figure 2. Overall architecture of FarSLIP. The model is trained in a two-stage manner. In Stage I, FarSLIP is optimized with image-caption alignment and patch-to-patch self-distillation. In Stage II, image-caption alignment and region-category alignment are jointly employed on the MGRS-200k dataset.
How to cite: Li, Z., Zhang, X., Xiao, P., and Zhu, X.: FarSLIP: A Vision-Language Foundation Model for Fine-Grained Remote Sensing Understanding, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12871, https://doi.org/10.5194/egusphere-egu26-12871, 2026.