- (1) INRIA, Montpellier, France
- (2) Université de Montpellier, Montpellier, France
- (3) INRAE, Montpellier, France
An important part of ecology is understanding where species occur in relation to their environment, a question addressed through Species Distribution Modelling (SDM). Species distribution models (SDMs) provide spatially explicit estimates of species distributions, which are essential for understanding biodiversity at large scales. They extend our knowledge far beyond what direct observation can capture and help identify which sites should be prioritised for monitoring or conservation.
SDMs are typically supported by two types of data: Presence–Absence (PA) and Presence-Only (PO). PA data, collected through structured surveys, provide high-quality information but are costly and therefore sparse. PO data, by contrast, come from opportunistic observations and contain only positive detections, making them abundant but biased. These datasets present a clear trade-off between quality and quantity, and data-integration approaches seek to combine PA and PO information so that models can exploit the complementary strengths of both.
It is not yet clear how much integration can improve our models, particularly when the available PA training data are geographically or environmentally distant from the prediction region. This issue is especially relevant for under-sampled areas, where PO data may be more readily available. Previous studies have explored this problem using simulations, but these lack the complexity and noise of real ecosystems. Using real data is therefore essential for understanding how these methods behave in practice and for assessing their value in biodiversity modelling.
Our work studies this question using real-world datasets. We propose a partitioning framework that creates controlled spatial separation between PA training and prediction regions while respecting the environmental constraints of each dataset. Using this framework, we evaluate several state-of-the-art SDM approaches, including deep neural network models adapted to support integrated PA–PO training through specialised loss functions and data-fusion strategies.
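To illustrate the idea of controlled spatial separation, here is a minimal sketch of one way such a partition could be built. It is not the framework proposed in this work: the function name `partition_with_separation`, the `in_prediction_region` predicate, the planar kilometre coordinates, and the buffer-distance rule are all assumptions made for illustration, and the sketch ignores the environmental constraints that the framework respects.

```python
import math


def partition_with_separation(sites, in_prediction_region, min_separation_km):
    """Split PA sites into training, prediction, and discarded buffer sets.

    sites: list of (x_km, y_km) tuples in a projected (planar) coordinate system.
    in_prediction_region: hypothetical predicate, True if a site lies in the
        region where predictions are evaluated.
    min_separation_km: minimum distance required between any training site
        and any prediction site.
    """
    prediction = [s for s in sites if in_prediction_region(s)]
    training, buffer = [], []
    for s in sites:
        if in_prediction_region(s):
            continue
        # Distance from this candidate training site to the closest prediction site.
        nearest = min((math.dist(s, p) for p in prediction), default=math.inf)
        # Sites closer than the required separation are dropped into the buffer.
        (training if nearest >= min_separation_km else buffer).append(s)
    return training, prediction, buffer


# Toy example: five sites along a line, prediction region is x >= 60 km.
sites = [(0.0, 0.0), (10.0, 0.0), (40.0, 0.0), (60.0, 0.0), (100.0, 0.0)]
training, prediction, buffer = partition_with_separation(
    sites, lambda s: s[0] >= 60.0, min_separation_km=30.0
)
```

Increasing `min_separation_km` in this sketch widens the gap between PA training sites and the prediction region, which is one simple way to generate a range of separation scenarios from a single dataset.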
Our preliminary findings suggest that integrating PA and PO data consistently improves model performance across a range of spatial separation scenarios. This indicates that both PA-rich and PO-rich contexts can benefit from incorporating the complementary data source, highlighting data integration as a robust and broadly effective strategy for enhancing SDM generalisation and improving biodiversity assessment at scale.
How to cite: Ubilla Pavez, P., Marcos, D., Botella, C., Joly, A., Benerradi, R., and Marti, R.: Data Integration for Species Distribution Models Under Spatial and Environmental Separation, World Biodiversity Forum 2026, Davos, Switzerland, 14–19 Jun 2026, WBF2026-666, https://doi.org/10.5194/wbf2026-666, 2026.