EGU24-10278, updated on 02 Sep 2024
https://doi.org/10.5194/egusphere-egu24-10278
EGU General Assembly 2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Assessing the area of applicability of spatial prediction models through a local data point density approach

Fabian Schumacher, Christian Knoth, Marvin Ludwig, and Hanna Meyer
Fabian Schumacher et al.
  • Institute for Geoinformatics, University of Münster, Münster, Germany (fabian.schumacher@uni-muenster.de)

Machine learning is frequently used in the field of earth and environmental sciences to produce spatial or spatio-temporal predictions of environmental variables based on limited field samples - increasingly even on a global scale and far beyond the extent of available training data. Since new geographic space often goes along with new environmental properties, the spatial applicability and transferability of models is often questionable. Predictions should be constrained to environments that exhibit properties the model has been enabled to learn.

Meyer and Pebesma (2021) have made a first proposal to estimate the area of applicability (AOA) of spatial prediction models. Their method is based on distances - in the predictor space - of the prediction data point to the nearest reference data point to derive a dissimilarity Index (DI). Prediction locations with a DI larger than DI values observed through cross-validation during model training are considered outside of the AOA. As a consequence, the AOA is defined as the area where the model has been enabled to learn about relationships between predictors and target variables and where, on average, the cross-validation performance applies. The method, however, is only based on the distance - in the predictor space - to the nearest reference data point. Hence, a single data point in an environment may define a model as “applicable” in this environment. Here we suggest extending this approach by considering the densitiy of reference data points in the predictor space, as we assume that this is highly decisive for the prediction quality.

We suggest extending the methodology with a newly developed local data point density (LPD) approach based on the given concepts of the original method to allow for a better assessment of the applicability of a model. The LPD is a quantitative measure for a new data point that indicates how many similar (in terms of predictor values) reference data points have been included in the model training, assuming a positive relationship between LPD values and prediction performance. A reference data point is considered similar if it defines a new data point as being within the AOA, i.e. the model is considered applicable for the corresponding prediction location. We implemented the LPD approach in the R package CAST. Here we explain the method and show its applicability in simulation studies as well as real-world applications.

Reference:

Meyer, H; Pebesma, E. 2021. ‘Predicting into unknown space? Estimating the area of applicability of spatial prediction models.’ Methods in Ecology and Evolution 12: 1620–1633. doi: 10.1111/2041-210X.13650.

How to cite: Schumacher, F., Knoth, C., Ludwig, M., and Meyer, H.: Assessing the area of applicability of spatial prediction models through a local data point density approach, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-10278, https://doi.org/10.5194/egusphere-egu24-10278, 2024.