EGU21-8363
https://doi.org/10.5194/egusphere-egu21-8363
EGU General Assembly 2021
© Author(s) 2021. This work is distributed under
the Creative Commons Attribution 4.0 License.

Method to obtain optimum number of clusters in geo-spatial data using stability analysis of cluster centres

Dave O Leary, Eve Daly, and Colin Brown
  • Earth and Ocean Sciences and Ryan Institute, College of Science and Engineering, National University of Ireland, Galway, Galway, Ireland.

Large geo-spatial datasets have become increasingly available in recent years. These range from soil, Quaternary, and geology maps to airborne geophysical and satellite remote sensing data. Such datasets may provide a means to spatially map hydrologically relevant properties (porosity, permeability, texture, depth, etc.) that are traditionally mapped via in-situ measurements. However, such datasets, and the relationships between them, are often complex.

Clustering of multidimensional data offers a means to simplify our understanding of the relationships in the data by dividing it into groups containing similar relationships. Clustering can be used as a form of initial exploratory analysis when little is known about how the data relate to the underlying processes that created them. Such analysis is particularly useful for geo-spatial datasets, because the resulting clusters can be re-projected back to the original spatial coordinates and their spatial distribution visualised.
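The re-projection step can be illustrated with a minimal sketch. It assumes gridded samples that carry a (row, col) position alongside their feature values, and a set of cluster centres already obtained from some clustering run. The sample values, the centres, and the helper names `assign_labels` and `labels_to_grid` are all hypothetical and are not taken from the work described here.

```python
import math

# Hypothetical gridded samples: (row, col) position plus two rescaled feature
# values (for instance a radiometric channel and elevation). Values are invented.
samples = [
    ((0, 0), (0.10, 0.20)), ((0, 1), (0.15, 0.25)), ((0, 2), (0.90, 0.80)),
    ((1, 0), (0.12, 0.18)), ((1, 1), (0.85, 0.90)), ((1, 2), (0.95, 0.85)),
]

# Hypothetical cluster centres in feature space, as a clustering run might return.
centres = [(0.12, 0.21), (0.91, 0.84)]


def assign_labels(samples, centres):
    """Label each sample with the index of its nearest centre in feature space."""
    return [
        min(range(len(centres)), key=lambda j: math.dist(features, centres[j]))
        for _, features in samples
    ]


def labels_to_grid(samples, labels, n_rows, n_cols, fill=-1):
    """Put each label back at its sample's (row, col) cell; `fill` marks no data."""
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for ((row, col), _), label in zip(samples, labels):
        grid[row][col] = label
    return grid


labels = assign_labels(samples, centres)
cluster_map = labels_to_grid(samples, labels, n_rows=2, n_cols=3)
for row in cluster_map:
    print(row)
```

The clustering itself only sees the feature values; the spatial position is carried alongside each sample and used afterwards to lay the labels out as a map.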

Various clustering algorithms are available, and the choice of algorithm is often as important as the choice of clustering parameters within it. This work compares traditional K-Means clustering with a more recent machine learning technique known as Self-Organising Maps (SOMs). The choice of the number of clusters for such an analysis is also ambiguous and typically requires a priori knowledge of the result. This undermines the general idea behind unsupervised clustering, whereby the result should be driven by the data.

Here, a method is proposed that allows the choice of the number of clusters to depend on a metric derived from within the cluster analysis itself. The method uses the variation in (dataspace) distance between each data point and its cluster centre over 100 runs, for an increasing number of clusters. The number of clusters at which this variation remains low is then taken to be the natural optimum number of clusters for a particular dataset.
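One possible reading of this stability metric is sketched below. It is an illustration under stated assumptions, not the authors' implementation: it uses a plain Lloyd's K-Means written with the standard library only, measures the per-point standard deviation of the point-to-centre distance across repeated randomly initialised runs, and summarises it as a mean over all points. The summary statistic, the synthetic test data, and the function names `kmeans`, `distance_variation`, and `make_blobs` are assumptions introduced for the example.

```python
import math
import random
import statistics


def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd's k-means on a list of tuples; returns the k final centres."""
    centres = rng.sample(points, k)  # random initial centres drawn from the data
    for _ in range(max_iter):
        # Assign every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centres[j]))
            groups[nearest].append(p)
        # Move each centre to the mean of its group (unchanged if the group is empty).
        new_centres = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centres[j]
            for j, g in enumerate(groups)
        ]
        if new_centres == centres:  # converged
            break
        centres = new_centres
    return centres


def distance_variation(points, k_values, n_runs=100, seed=0):
    """For each k: mean over points of the std. dev., across runs, of the
    distance from the point to its nearest cluster centre (assumed summary)."""
    rng = random.Random(seed)
    result = {}
    for k in k_values:
        distances = [[] for _ in points]  # distances[i] holds one value per run
        for _ in range(n_runs):
            centres = kmeans(points, k, rng)
            for i, p in enumerate(points):
                distances[i].append(min(math.dist(p, c) for c in centres))
        result[k] = statistics.mean([statistics.pstdev(d) for d in distances])
    return result


def make_blobs(rng, blob_centres, n_per_blob=30, spread=0.3):
    """Synthetic 2-D test data: Gaussian blobs around the given centres."""
    return [
        (rng.gauss(cx, spread), rng.gauss(cy, spread))
        for cx, cy in blob_centres
        for _ in range(n_per_blob)
    ]


demo_points = make_blobs(random.Random(42), [(0, 0), (5, 0), (0, 5)])
variation = distance_variation(demo_points, k_values=range(1, 6), n_runs=20)
for k, v in variation.items():
    print(k, round(v, 4))
```

The sketch returns the variation curve rather than a single answer, since the rule for deciding when the variation "remains low" is not specified here and would have to be chosen for the dataset at hand. For real data volumes, a vectorised K-Means or SOM implementation would normally replace the pure-Python loop.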

This method is tested on a dataset where the number of clusters is already known and on a real-world example. The first dataset is the “Rice Grain Model” with four known clusters, which the method accurately reconstructs. The real-world dataset is a combination of airborne radiometric geophysical data and a digital elevation model over peatland in the Republic of Ireland. The method outputs three as the optimum number of clusters, and the result divides the peatland into three zones (confirmed with a ground geophysics survey), which are to be used in the creation of hydrological units within Soil and Water Assessment Tool (SWAT) modelling.

How to cite: O Leary, D., Daly, E., and Brown, C.: Method to obtain optimum number of clusters in geo-spatial data using stability analysis of cluster centres, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8363, https://doi.org/10.5194/egusphere-egu21-8363, 2021.

Corresponding displays formerly uploaded have been withdrawn.