Mapping (un)certainty of machine learning-based spatial prediction models based on predictor space distances
 ^{1}University of Münster, Institute of Landscape Ecology, Münster, Germany (hanna.meyer@uni-muenster.de)
 ^{2}University of Münster, Institute for Geoinformatics, Münster, Germany (edzer.pebesma@uni-muenster.de)
Spatial mapping is an important task in environmental science to reveal spatial patterns and changes of the environment. In this context, predictive modelling using flexible machine learning algorithms has become very popular. However, the diversity of modelled (global) maps of environmental variables increasingly gives the impression that machine learning is a magic tool to map everything. Recently, the reliability of such maps has been increasingly questioned, calling for a reliable quantification of uncertainties.
Though spatial (cross-)validation provides a general error estimate for the predictions, models are usually applied to make predictions for a much larger area, or might even be transferred to an area they were not trained on. When predictions are made for heterogeneous landscapes, there will be areas with environmental properties that were not observed in the training data and hence not learned by the algorithm. This is problematic because most machine learning algorithms are weak at extrapolation and can only make reliable predictions for environmental conditions the model has knowledge about. Hence, predictions for environmental conditions that differ significantly from the training data have to be considered uncertain.
To approach this problem, we suggest a measure of uncertainty that allows identifying locations where predictions should be regarded with care. The proposed uncertainty measure is based on distances to the training data in the multidimensional predictor variable space. However, distances are not equally relevant across the feature space: some variables are more important than others in the machine learning model and hence are mainly responsible for prediction patterns. Therefore, we weight the distances by the model-derived importance of the predictors.
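The distance idea in the paragraph above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `weighted_min_distance`, the toy data, and the importance values are hypothetical, and the abstract does not specify how the distances are further scaled into a final index, so only the weighted nearest-neighbour distance is shown.

```python
import numpy as np


def weighted_min_distance(train, new, importance):
    """Distance from each new point to its nearest training point,
    computed in standardized, importance-weighted predictor space.
    (Hypothetical helper for illustration only.)"""
    train = np.asarray(train, dtype=float)
    new = np.asarray(new, dtype=float)
    w = np.asarray(importance, dtype=float)

    # Standardize with the training mean and standard deviation only,
    # then scale each predictor by its (assumed) importance.
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    train_s = (train - mean) / std * w
    new_s = (new - mean) / std * w

    # Pairwise Euclidean distances, shape (n_new, n_train).
    diff = new_s[:, None, :] - train_s[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist.min(axis=1)


# Toy data: two predictors, the first assumed far more important.
train = np.array([[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
importance = np.array([1.0, 0.1])

new = np.array([
    [1.0, 10.0],   # identical to a training point
    [1.0, 25.0],   # differs only in the unimportant predictor
    [6.0, 10.0],   # differs in the important predictor
])
d = weighted_min_distance(train, new, importance)
print(d)
```

In this toy case the point that deviates in the important predictor ends up much farther from the training data than the point that deviates only in the unimportant one, which is the behaviour the weighting is meant to produce.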
As a case study, we use a simulated area-wide response variable for Europe, bioclimatic variables as predictors, and simulated field samples. Random Forest is applied as the algorithm to predict the simulated response. The model is then used to make predictions for the whole of Europe. We then calculate the corresponding uncertainty and compare it to the area-wide true prediction error. The results show that the uncertainty map reflects the patterns in the true error very well and considerably outperforms ensemble-based standard deviations of predictions as an indicator of uncertainty.
The resulting map of uncertainty gives valuable insights into spatial patterns of prediction uncertainty, which is important when the predictions are used as a baseline for decision making or subsequent environmental modelling. Hence, we suggest that a map of distance-based uncertainty should be given in addition to prediction maps.
How to cite: Meyer, H. and Pebesma, E.: Mapping (un)certainty of machine learning-based spatial prediction models based on predictor space distances, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-8492, https://doi.org/10.5194/egusphere-egu2020-8492, 2020
Comments on the presentation
AC: Author Comment  CC: Community Comment

CC1:
Question, Robert Runya, 05 May 2020

AC1:
Reply to CC1, Hanna Meyer, 05 May 2020
Thanks, Robert, for the feedback!
"What would be the recommended sample size for the data to train and validate the model when using machine learning models?"
Unfortunately, there is no universal recommendation for this. That depends on the complexity of the problem.
"What would be the best ML model to use if I was to apply them to predict substrate types from acoustic data?"
Also no universal recommendation here I'm afraid. I usually test a few like Random Forests, Support Vector Machines, GBM,... There was never really a clear "winner". But I find that the choice of the algorithm is secondary in most cases (at least among the just mentioned). It's more important to find suitable predictors for the response variable.

CC2:
Reply to AC1, Robert Runya, 05 May 2020
Thanks Hanna for the response

CC3:
Comment on EGU2020-8492, Alexandre Wadoux, 06 May 2020
Dear Hanna,
Sorry in advance for my long comment, I find this work very interesting and I was planning on doing something similar.
You assumed that the distance in the multidimensional (weighted by importance) covariate space is representative of the dissimilarity. Is this realistic? Intuitively this seems correct, but could it happen that a point far away (in the feature space) is better predicted than a point close to the bulk of the data? Maybe the distance does not matter, but rather whether or not the multidimensional space is covered. In RF you have some abrupt changes in the predicted values due to the splitting of the tree, so it may well happen that a faraway point in the feature space is actually better predicted than a close point! See my comment below, but I also think that all this does not apply equally to all ML algorithms.
Seeing the Figure 3 in your presentation triggers a lot of interest. It makes me think of a hypercube that I like to use to define sampling locations for mapping.
Some thoughts:
 Have you thought about fitting a convex hull around those points in the multidimensional feature space? I imagine that the distance from the border of this hull would be a better indicator than the distance from the nearest point. I feel there might be something better than simply taking the distance from the closest point. Imagine there is one single point to cover a very large area of the multivariate space: is that the same as having a cluster of points (in the multivariate space) to define an area? Maybe taking the distance from the center of all points in the multivariate space is more realistic.
 Have you thought about drawing multivariate strata (as in a Latin hypercube) and considering all the uncovered multivariate strata as potential areas where RF is likely to extrapolate? Intuitively this is what I would do!
I would also suggest that you read the Filzmoser et al. (2005) paper, as it might be very relevant to your work. Filzmoser et al. (2005) (and others before them) showed that, if it can be assumed that the covariates follow a multivariate normal distribution (is this a realistic assumption in your case?) with a certain mean and covariance matrix, then the squared Mahalanobis distances follow a chi-square distribution with m degrees of freedom (m is the number of covariates). If you take a quantile of the chi-square distribution (e.g. the 99% quantile) as a cutoff value, you can define values that are likely not to belong to this distribution if they exceed this cutoff value. Filzmoser et al. (2005) provide an analytical solution for this cutoff value, so no arbitrary decision is needed. In your case, you use the Euclidean distance in the standardized space of the covariates, which is equal to the Mahalanobis distance if I am right. I guess this multivariate outlier detection technique could well be applied to your case: if a new point exceeds the cutoff value, then this point is too far away (in the multivariate space) from the bulk of the data and is likely to be poorly predicted!
Filzmoser, P., Garrett, R. G., & Reimann, C. (2005). Multivariate outlier detection in exploration geochemistry. Computers & Geosciences, 31(5), 579–587.
Last comment: You often use the term ‘machine learning’ but in your case almost always use random forest (considering this work and your last publications). While parts of this work apply largely to machine learning (e.g. your Fig. 1), we are likely to see different results if we apply a different ML algorithm (SVM or ANN). RF, SVM and ANN (and also genetic programming) are all very different in the way they are calibrated. ANN is calibrated by minimizing an objective function, while SVM is calibrated by maximizing the margins. RF uses a splitting criterion. So I think treating all these algorithms as having the same characteristics is misleading. See Section 4.6 in my paper:
Wadoux, A. M. C., Brus, D. J., & Heuvelink, G. B. (2019). Sampling design optimization for soil mapping with random forest. Geoderma, 355, 113913.
Anyway, great work and I am looking forward to reading the paper!
Alexandre
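The outlier test described in the comment above can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure from Filzmoser et al. (2005): it uses only a fixed 99% chi-square quantile, the quantile is obtained with the Wilson-Hilferty approximation to stay within the standard library and NumPy, and the function names and simulated data are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def chi2_quantile_approx(p, m):
    """Wilson-Hilferty approximation to the p-quantile of a chi-square
    distribution with m degrees of freedom (approximate, not exact)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * m)
    return m * (1.0 - a + z * np.sqrt(a)) ** 3


def squared_mahalanobis(train, new):
    """Squared Mahalanobis distance of each new point to the training mean."""
    train = np.asarray(train, dtype=float)
    new = np.asarray(new, dtype=float)
    mean = train.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(train, rowvar=False))
    diff = new - mean
    # Row-wise quadratic form diff_i^T * cov_inv * diff_i.
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


# Simulated covariates that really are multivariate normal (an assumption).
rng = np.random.default_rng(0)
m = 3
train = rng.normal(size=(500, m))
new = np.array([[0.1, -0.2, 0.3],    # close to the bulk of the data
                [6.0, 6.0, 6.0]])    # far from the bulk of the data

cutoff = chi2_quantile_approx(0.99, m)
d2 = squared_mahalanobis(train, new)
flagged = d2 > cutoff
print(cutoff, d2, flagged)
```

Two caveats apply. The Mahalanobis distance coincides with the Euclidean distance in standardized space only when the covariates are uncorrelated. It is also a distance to the centre of the data, so a point lying in a gap near the centre would not be flagged.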

AC3:
Reply to CC3, Hanna Meyer, 06 May 2020
Thanks, Alexandre, for these great comments!
Yes, I agree: large differences between training data and new data do not necessarily lead to a high prediction error. See the figure below: moving away from the last training data point (x = 1.7), the value of the “dissimilarity index” (DI) increases (b). However, the error does not necessarily increase in the same way (comparing the predictions with the truth in a). But locations with a high dissimilarity are associated with a high uncertainty because the environment, and hence the prediction success, is unknown. So we simply don’t know if the predictions for this area are correct or not and therefore should not apply the model there.
“Have you thought about fitting a convex hull around those point in the multidimensional feature space?”
The problem with the convex hull is that we cannot account for gaps in the predictor space, only for space outside the range of the data. The same applies to the idea of taking the distance from the center of all points in the multivariate space (which could in theory be located in a large gap not covered by any training data).
“I feel there might be something better than simply taking the distance from the closest point.”
We were thinking about data point densities as an alternative. But this is still work in progress… Also, taking into account distances to more than just a single point might be an option, or your multivariate strata suggestion. So we see our project as a first attempt here, and it contains a number of aspects that are still up for discussion, continuation and potential collaboration :)
Thanks for the Filzmoser et al. (2005) suggestion. I’ll look into it, but I need more time to think it through.
Your last question was whether the approach can be generalized across algorithms. We developed it using Random Forests (because it’s the most frequently applied algorithm), but we are confident it should apply to other algorithms as well. Even though their calibration is very different, most algorithms used to fit complex relationships also cause problems when we predict beyond the range of the training data (or into gaps in the predictor space). We ran a few smaller tests with neural networks and support vector machines, and the results fit well. But more systematic tests are still needed here, I agree.
Best
Hanna
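The point about gaps made in the reply above can be illustrated with a one-dimensional toy example. It is only a sketch with made-up numbers: in one dimension the convex hull reduces to the interval between the smallest and largest training value, and the training data are deliberately placed in two clusters with a gap in between.

```python
import numpy as np

# One predictor, with training values in two clusters and a gap in between.
train = np.array([0.0, 0.5, 1.0, 9.0, 9.5, 10.0])
# New points: inside a cluster, in the gap, outside the training range.
new = np.array([0.7, 5.0, 12.0])

# In one dimension the convex hull is simply the interval [min, max].
inside_hull = (new >= train.min()) & (new <= train.max())

# Distance to the centre (mean) of all training points.
dist_to_centre = np.abs(new - train.mean())

# Distance to the nearest training point.
dist_to_nearest = np.abs(new[:, None] - train[None, :]).min(axis=1)

for x, h, c, n in zip(new, inside_hull, dist_to_centre, dist_to_nearest):
    print(f"x={x:5.1f}  inside hull={h}  to centre={c:4.1f}  to nearest={n:4.1f}")
```

The point in the gap (x = 5.0) lies inside the hull and exactly at the centre of the training data, so both of those criteria would treat it as well covered. Only the nearest-point distance identifies it as the least supported of the three.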
AC4:
Reply to AC3, Edzer Pebesma, 06 May 2020
Hi Alexandre, good questions. An alternative we have been thinking about is to use density estimates of the validation data in feature space; nevertheless, this would also require a distance measure and doesn't alleviate the curse of dimensionality. The advantage would be that it might better reveal how much information there is in the neighbourhood, not just that there is at least one point at this distance.

CC4:
Re: Variable Importance, Benedikt Knüsel, 06 May 2020
Dear Dr. Meyer,
This is a very interesting proposal, I like it a lot!
I was wondering about the weighting according to variable importance. Couldn't it be that the variable importance changes for areas that are dissimilar to your training dataset? If so, wouldn't it be safer to also estimate an AOA for (normalized) unweighted variables?
Best,
Benedikt Knüsel

AC2:
Reply to CC4, Hanna Meyer, 06 May 2020
Thanks, Benedikt.
"I was wondering about the weighting according to variable importance. Couldn't it be that the variable importance changes for areas that are dissimilar to your training dataset? If so, wouldn't it be safer to also estimate an AOA for (normalized) unweighted variables?"
Yes, the importance of variables would probably change if we had a chance to estimate it based on the entire area of interest. But because our model is based on the training data, the variable importance ranking of the model is also based on the training data only. So the prediction patterns are mainly driven by the variables with the highest importance in the training set. And that doesn't change outside the AOA. So we need to treat distances for the important variables differently from those for variables that are not relevant in the model. In the extreme case, a predictor is not used in the model at all, so it won't matter if a new data point has very different values for this predictor, because it has no effect on the prediction.
But our method, as implemented in the R package "CAST", also allows for unweighted variables. We would use this option especially prior to model training, e.g. to locate areas where additional sampling effort is required to increase the AOA.

CC1:
Comment on EGU2020-8492, András Zlinszky, 05 May 2020
Dear Dr. Meyer,
Have you considered comparing the prediction certainty metric you propose with other random forest prediction certainty metrics? Several such metrics have been proposed, based on the prediction probabilities, without investigating and weighing the input variables. Some examples: probability surplus by Zlinszky & Kania (ISPRS 2016), maximum probability by Immitzer et al., confusion index (Burrough 1997, Geoderma), probability entropy index (Maselli 1994, ISPRS). Is there an added value in going back to the input variables and their importance, compared to just looking at the prediction certainties? Probability surplus has also been proposed for identifying areas where the prediction should be treated critically, and for directed collection of new training information from such locations (active learning).
Best regards,
András Zlinszky

AC1:
Reply to CC1, Hanna Meyer, 05 May 2020
Thanks András for starting the discussion!
Yes, we compared the approach to other uncertainty methods. As we worked with regression models here, we didn’t look at prediction probabilities but at standard deviations of predictions made by individual trees/ensembles (e.g. by 500 trees of a random forest model).
For the case study shown here, you see the results for standard deviations of predictions in Figure 5c. As you can see, they do not match the true prediction error well (Figure 5d). This is not surprising because, as shown in Figure 1b, when Random Forest is applied to make predictions beyond the range of observed predictor values, each tree will make quite similar predictions (which will be comparable to the last known training point), leading to low standard deviations.
In our example this is especially obvious in the Alps or on the west coast of Norway, where the environments are very different from the environments covered by training data (Figure 4). In these areas the prediction error is very high, which is nicely reflected by our suggested “Dissimilarity Index”. In contrast, it cannot be reflected by standard deviations of predictions, which are, for the reason mentioned above, very low for these locations.
So yes, there is an added value in going back to the input variables. They give very different information compared to e.g. standard deviations of predictions or prediction probabilities, by reflecting the missing knowledge about environments. And when we make predictions for new areas, we need to make sure that our model has knowledge about such environments; otherwise we had better not apply it there.
Best
Hanna
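The behaviour described in the reply above, low ensemble spread despite high error beyond the training range, can be reproduced with a small stand-in experiment. This is a sketch under explicit assumptions and not a random forest: each ensemble member is a 1-nearest-neighbour predictor fitted to a bootstrap sample, chosen because, like a tree, it can only return values seen in training and cannot continue a trend. The data, the true function y = x², and all names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Training data cover x in [0, 1.7]; the assumed true relationship is y = x**2.
x_train = rng.uniform(0.0, 1.7, size=60)
y_train = x_train ** 2 + rng.normal(0.0, 0.05, size=60)


def ensemble_predict(x_new, n_members=200):
    """Bootstrap ensemble of 1-nearest-neighbour predictors.

    Stand-in for the trees of a random forest: each member returns a
    response value seen in training and cannot extrapolate a trend.
    """
    preds = np.empty((n_members, len(x_new)))
    for i in range(n_members):
        # Draw a bootstrap sample of the training data.
        idx = rng.integers(0, len(x_train), size=len(x_train))
        xb, yb = x_train[idx], y_train[idx]
        # Predict with the response of the nearest sampled training point.
        nearest = np.abs(x_new[:, None] - xb[None, :]).argmin(axis=1)
        preds[i] = yb[nearest]
    return preds


x_new = np.array([1.0, 3.0])   # inside and far outside the training range
preds = ensemble_predict(x_new)
mean_pred = preds.mean(axis=0)
spread = preds.std(axis=0)                    # ensemble standard deviation
abs_error = np.abs(mean_pred - x_new ** 2)    # error against the assumed truth

for x, s, e in zip(x_new, spread, abs_error):
    print(f"x={x:.1f}  ensemble sd={s:.3f}  abs. error={e:.3f}")
```

At x = 3.0 all members fall back to values near the upper end of the training range, so the spread stays small while the error is large.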

CC1:
Question, Robert Runya, 05 May 2020
Hello,
This is great, I like the method you have proposed in this work, and indeed I could borrow this approach for my work. What would be the recommended sample size for the data to train and validate the model when using machine learning models? What would be the best ML model to use if I was to apply them to predict substrate types from acoustic data?
Thanks