WBF2026-760, updated on 10 Mar 2026
https://doi.org/10.5194/wbf2026-760
World Biodiversity Forum 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Wednesday, 17 Jun, 09:30–09:45 (CEST) | Room Sanada 1
Bridging Ground and Satellite Views with Contrastive Learning for Scalable Habitat Monitoring
Theo Larcher1,3,4, Alexis Joly3, Joseph Salmon3, Pierre Bonnet2, and Marijn Van Der Velde4
  • 1IMAG, Université de Montpellier, Hérault, France (theo.larcher@inria.fr)
  • 2CIRAD (AMAP), Hérault, France
  • 3INRIA, Hérault, France
  • 4European Commission (JRC), Ispra, Italy

Monitoring species and habitat distributions across space and time is critical for biodiversity conservation, as it allows ecologists and decision-makers to assess ecosystem dynamics, detect emerging threats, and prioritize interventions to mitigate biodiversity loss.

Species and Habitat Distribution Modelling (SDM/HDM) enables this by identifying correlations between environmental variables and species or habitat occurrences. In particular, Deep Neural Networks for Habitat Distribution Modelling (Deep-HDM) have demonstrated strong scalability and cross-modal learning capabilities across images, tabular data, and text.

However, most Deep-HDM approaches rely on mono-scale data only and overlook the complementary information carried by ground-level images, which encode medium-grained ecological and structural landscape cues absent from remote sensing or tabular data alone. To address this gap, we propose a multi-scale, image-focused habitat classification pipeline that jointly leverages satellite/remote-sensing observations and landscape photographs.

Our method uses pre-trained modality-specific visual encoders (e.g., GeoCLIP, SwinV2, ResNet) to generate initial representations, which are then refined with contrastive learning so that features from geographically close samples are spatially aligned. A downstream habitat classifier is finally trained on this shared representation space, allowing habitats to be inferred from several possible input data types.
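The abstract does not specify the exact contrastive objective, so the following is only a minimal sketch of what such spatial alignment could look like, assuming an InfoNCE-style loss in which a ground-photo embedding and the embedding of the co-located satellite view form a positive pair; the function names and the nearest-neighbour pairing rule are illustrative, not the authors' implementation.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """InfoNCE contrastive loss: row i of `anchors` (e.g. a ground-photo
    embedding) should match row i of `positives` (e.g. the embedding of
    the satellite patch covering the same location); all other rows
    serve as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature             # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_softmax)))  # matching pairs lie on the diagonal

def nearest_pairs(coords_a, coords_b):
    """For every sample in A, index of the geographically closest sample in B
    (plain Euclidean distance on projected coordinates, for simplicity)."""
    d2 = ((coords_a[:, None, :] - coords_b[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```

In a real training loop the loss would be backpropagated through the modality-specific encoders so that their representation spaces converge; numpy is used here only to make the objective itself concrete.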

To carry out our experiments, we rely on three European vegetation and land-cover datasets: (i) LUCAS (EUROSTAT) landscape images, a minority of which carry level-2 EUNIS habitat labels (~70k samples, ~15k survey sites); (ii) EVA (European Vegetation Archive), containing presence-absence plant observations with level-1 to level-3 EUNIS habitat labels (~500k survey sites); and (iii) EMBAL (European Commission, DG ENV), which includes transect images, a minority of which carry level-2 EUNIS habitat labels (~75k samples, ~4k survey sites). The code will be open-sourced in the future, and details on accessing the datasets will be provided.

Results show that contrastive spatial pre-training improves Deep-HDM performance, particularly for fine-grained habitat identification. This demonstrates that learning shared representations over multi-scale input data strengthens habitat prediction relative to mono-scale baselines. Better habitat classification from multi-modal data can improve not only habitat monitoring but also the spatial delineation of habitats across Europe, and thus help target more pertinent regional measures under the Nature Restoration Regulation.

How to cite: Larcher, T., Joly, A., Salmon, J., Bonnet, P., and Van Der Velde, M.: Bridging Ground and Satellite Views with Contrastive Learning for Scalable Habitat Monitoring, World Biodiversity Forum 2026, Davos, Switzerland, 14–19 Jun 2026, WBF2026-760, https://doi.org/10.5194/wbf2026-760, 2026.