HS3.5

Clustering in hydrology: methods, applications and challenges

Clustering analysis is a well-known exploratory task for partitioning databases into smaller groups based on patterns or inherent similarity in data. Clustering methods have found applications in many disciplines due to growing interest in unravelling the hidden and meaningful patterns that exist in large amounts of available data. Due to its unsupervised nature, clustering is a complex task that requires careful choices regarding methods, model parameters and performance metrics. Moreover, the suitability of a clustering algorithm depends on its application, and many different methods and approaches co-exist. The challenge is to obtain application-specific insights while enabling a practical knowledge perspective for benchmarking. There are still research gaps in the wider clustering literature, and hydrology-specific knowledge is fragmented and difficult to find.

In hydrology, unsupervised classification of multivariate data is often used but typically in rather basic forms and as an intermediate step. Recently, the number of studies using clustering methods has rapidly increased. However, a clear and integrative vision on clustering algorithms is currently missing. Despite advances in other fields, the scope of hydrological studies is limited. Knowledge exchange on hydrology-specific ways of dealing with the issues related to clustering is needed.

The aim of this session is to explore theoretical and conceptual underpinnings of well-known clustering methods, offer fresh insights into applications of new clustering methods, gain thorough understanding of pearls and pitfalls in clustering algorithms, provide a critical overview of the main challenges associated with data clustering process, discuss major research trends and highlight open research issues. It is expected to improve scientific practice within the hydrology domain, and foster scientific debate on benchmarking in cluster analysis.

We invite contributions (case studies, comparative analyses, theoretical experiments) on a wide range of topics including (but not limited to): hard vs fuzzy clustering; comparison of clustering algorithms; benchmarking in cluster analysis; clustering as an exploratory tool vs clustering as a hypothesis testing tool; determination of number of clusters; selecting variables to cluster upon; evaluation of clustering performance; alternative clustering methods (sequential, evolutionary, deep, ensemble, etc.)

Public information:
Please join us in the first year of this new Hydroinformatics session at #vEGU21! We are looking forward to your participation!:)
Co-organized by ESSI1/NP4
Convener: Nilay Dogulu | Co-conveners: Svenja Fischer (ECS), Wouter Knoben (ECS)
vPICO presentations | Thu, 29 Apr, 13:30–14:15 (CEST)

vPICO presentations: Thu, 29 Apr

Chairpersons: Nilay Dogulu, Svenja Fischer, Wouter Knoben
Hydrological analysis
13:30–13:40 | EGU21-375 | solicited | Highlight
Manuela Irene Brunner, Reinhard Furrer, and Eric Gilleland

Grouping catchments according to their seasonal streamflow or flood behavior can be essential in regionalization studies, climate impact assessments, or model choice and evaluation. Classical clustering approaches often rely on a selection of indices derived from streamflow/flood hydrographs to identify groups of similar hydrographs and ignore valuable information provided through the temporal (auto-)correlation pattern. To exploit this temporal information, we propose a functional clustering approach to identify catchments with similar streamflow regimes or flood hydrographs. Functional data clustering expresses hydrograph shapes as continuous functions by projecting them onto a set of basis functions (here B-splines) and clusters the resulting basis coefficients using classical clustering algorithms such as hierarchical or k-means clustering.
We apply this functional clustering approach to (1) a large set of catchments in the United States in order to identify regions with similar streamflow regimes and (2) a large set of catchments in Switzerland in order to identify regions with similar flood reactivity. We show that both the streamflow regime and flood reactivity regions are not only similar in terms of their streamflow/hydrograph behavior but also in terms of physiography and climate. We use the streamflow regime clusters derived using functional data clustering to assess future streamflow regime changes in the United States and demonstrate that they are beneficial in climate impact assessments, e.g. to indicate which types of catchments are particularly prone to future change. Further, we use the flood reactivity regions in a regionalization study to derive design hydrographs in ungauged catchments. We conclude that functional clustering approaches are beneficial in climate impact assessments and regionalization studies and might potentially also be valuable to cluster other types of hydrological phenomena such as drought events or long-term streamflow behavior.
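
The two steps described above (projection onto B-splines, then clustering of the basis coefficients) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hydrographs are synthetic, the number of interior knots and the spline degree are arbitrary, and `bspline_coefficients` and `kmeans` are hypothetical helper names. Only `scipy.interpolate.make_lsq_spline` is a real library call.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

def bspline_coefficients(t, curves, n_interior=6, degree=3):
    """Least-squares B-spline coefficients, one row per curve."""
    interior = np.linspace(t[0], t[-1], n_interior + 2)[1:-1]
    knots = np.r_[[t[0]] * (degree + 1), interior, [t[-1]] * (degree + 1)]
    spline = make_lsq_spline(t, curves.T, knots, k=degree)  # fits all curves at once
    return spline.c.T  # shape: (n_curves, n_interior + degree + 1)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centres = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d2)])
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels, centres

# Synthetic regimes (assumption): 20 early-peak and 20 late-peak hydrographs.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 120)
early = np.exp(-((t - 0.3) / 0.08) ** 2) + 0.05 * rng.standard_normal((20, t.size))
late = np.exp(-((t - 0.7) / 0.08) ** 2) + 0.05 * rng.standard_normal((20, t.size))
curves = np.vstack([early, late])

coef = bspline_coefficients(t, curves)   # functional representation
labels, _ = kmeans(coef, k=2)            # classical clustering on the coefficients
```

Clustering the coefficients rather than hand-picked indices keeps the full hydrograph shape in the feature vector. Hierarchical clustering could be substituted for the k-means step without changing the projection.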

How to cite: Brunner, M. I., Furrer, R., and Gilleland, E.: Functional data clustering as a powerful tool to group streamflow regimes and flood hydrographs, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-375, https://doi.org/10.5194/egusphere-egu21-375, 2020.

13:40–13:42 | EGU21-206
Eliot Sicaud, Jan Franssen, Jean-Pierre Dedieu, and Daniel Fortier

For remote and vast northern watersheds, hydrological data are often sparse and incomplete. Fortunately, remote sensing approaches can provide considerable information about the structural properties of watersheds, which is useful for the indirect assessment of their hydrological characteristics and behavior. Our main objective is to produce a high-resolution territorial clustering based on key hydrologic landscape metrics for the entire 42 000 km² George River watershed (GRW), located in Nunavik, northern Québec (Canada). This project is being conducted in partnership with the local Inuit communities of the GRW for the purpose of generating and sharing knowledge to anticipate the impact of climate and socio-environmental change in the GRW.

Our clustering approach employs Unsupervised Geographic Object-Based Image Analysis (GeOBIA) applied to the entire GRW with the subwatersheds as our objects of analysis. The landscape metric datasets used to generate the input variables of our GeOBIA classification are raster layers with a 30 m × 30 m pixel resolution. Topographic metrics are derived from a Digital Elevation Model (DEM) and include elevation, slope, aspect, drainage density and watershed elongation. Land cover spectral metrics included in our analysis are the Normalized Difference Vegetation Index (NDVI), the Normalized Difference Moisture Index (NDMI) (Gao, 1996) and the Normalized Difference Water Index (NDWI) (McFeeters, 1996), which are all computed from a Landsat-8 cloud-free surface reflectance mosaic dating from 2015. Rasterized maps of surface deposit distribution and permafrost distribution, both produced by the Ministère des Forêts, de la Faune et des Parcs of Québec (MFFP), respectively constitute the surface and subsurface metrics of our GeOBIA.

The clustering algorithm used in this Unsupervised GeOBIA is the Fuzzy C-Means (FCM) algorithm. The FCM algorithm provides the objects a set of membership coefficients corresponding to each cluster. The greatest membership coefficient is then used to attribute the distinct subwatersheds to a cluster of watersheds with similar hydro-geomorphometric characteristics. The classification returns a Fuzzy Partition Coefficient (FPC), which describes how well-partitioned our dataset is. The FPC can vary greatly depending on the number of clusters we want to produce. Thus, we find the optimal number of clusters by maximizing the FPC.
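
As a hedged illustration of this procedure, the sketch below implements a basic Fuzzy C-Means loop, computes the FPC as the mean of the squared membership coefficients, and picks the number of clusters that maximizes it. The data are synthetic, the fuzzifier `m = 2` and the farthest-point initialisation are assumptions, and `fuzzy_c_means` is a hypothetical helper rather than the routine used in the study.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100):
    """Basic FCM; returns memberships U (n x c), cluster centres and the FPC."""
    centres = [X[0]]                      # deterministic farthest-point start (assumption)
    for _ in range(c - 1):
        d2 = np.min([((X - v) ** 2).sum(axis=1) for v in centres], axis=0)
        centres.append(X[np.argmax(d2)])
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)      # memberships of each object sum to 1
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted centres
    fpc = float((U ** 2).sum() / len(X))              # fuzzy partition coefficient
    return U, centres, fpc

# Synthetic stand-in for standardized subwatershed metrics: three compact groups.
rng = np.random.default_rng(1)
group_centres = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X = np.vstack([g + 0.5 * rng.standard_normal((40, 2)) for g in group_centres])

fpc_by_c = {c: fuzzy_c_means(X, c)[2] for c in range(2, 6)}
best_c = max(fpc_by_c, key=fpc_by_c.get)      # number of clusters with the highest FPC
U, centres, fpc = fuzzy_c_means(X, best_c)
hard_labels = U.argmax(axis=1)                # greatest membership decides the cluster
```

The FPC ranges from 1/c (completely fuzzy) to 1 (crisp partition), so values close to 1 indicate well-separated clusters.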

Preliminary clustering results, computed only with topographic and land cover metrics, have identified two distinct watershed classes/clusters. In general, “Type 1” subwatersheds are clustered over the southern and northwestern portion of the GRW and are characterized by low to moderate elevation, high vegetation cover, high moisture and high surface water cover. In contrast, “Type 2” subwatersheds, located over the northeastern portion of the GRW, are characterized by high elevation, low vegetation cover, low moisture and low surface water cover. These results will be refined with the use of additional metrics and will provide the detailed understanding necessary to assess how the hydrological regime of the river and its tributaries will respond to climate change, and how landscape change and human activities (e.g., planned mining development) may impact the water quality of the George River and its tributaries.

How to cite: Sicaud, E., Franssen, J., Dedieu, J.-P., and Fortier, D.: Clustering analysis for the hydro-geomorphometric characterization of the George River watershed (Nunavik, Canada), EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-206, https://doi.org/10.5194/egusphere-egu21-206, 2020.

13:42–13:44 | EGU21-8363 | ECS | Highlight
Dave O Leary, Eve Daly, and Colin Brown

Recently, the availability of large geo-spatial datasets has increased. These range from soil, quaternary, and geology maps to airborne geophysical and satellite remote sensing data. Such datasets may provide a means to spatially map hydrologically relevant properties (porosity, permeability, texture, depth, etc.), traditionally mapped via in-situ measurements. However, such datasets, and the relationships between them, are often complex.

Clustering of multidimensional data offers a means to simplify our understanding of the relationships in the data by dividing it into groups containing similar relationships. Clustering can be used as a form of initial exploratory analysis when little is known about how the data relates to the underlying processes in its creation. Such analysis can be useful for geo-spatial datasets whereby the resulting clusters can be re-projected back to the original spatial coordinates and the spatial distribution of the clusters can be visualised.

There are various clustering algorithms available and the choice of algorithm is often as important as choosing the clustering parameters within it. This work shows a comparison between a more traditional K-Means clustering and a more modern machine learning technique known as Self-Organising Maps (SOMs). The choice of number of clusters for such an analysis is also ambiguous, requiring a priori knowledge of the result. This undermines the general idea behind unsupervised clustering, whereby the result should be driven by the data.

Here, a method is proposed to allow the choice of clusters to be dependent on a metric derived from within the cluster analysis itself. The method presented uses the variation in (dataspace) distance between each data point and its cluster centre over 100 runs, for an increasing number of clusters. The number of clusters at which this variation remains low is then determined to be the natural optimum number of clusters for a particular dataset.
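
One possible reading of this stability metric is sketched below. It is an assumption-laden illustration, not the authors' code: k-means with k-means++ seeding stands in for the clustering step, only 10 runs are used instead of 100 to keep the example fast, the variation is summarized as the mean over points of the across-run standard deviation, and `kmeans_pp` and `distance_variation` are hypothetical names.

```python
import numpy as np

def kmeans_pp(X, k, rng, n_iter=30):
    """Lloyd's k-means with k-means++ seeding; returns final labels and centres."""
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centres

def distance_variation(X, k, n_runs=10, seed=0):
    """Mean over points of the across-run std of the point-to-centre distance."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_runs):
        labels, centres = kmeans_pp(X, k, rng)
        runs.append(np.linalg.norm(X - centres[labels], axis=1))
    return float(np.std(runs, axis=0).mean())

# Synthetic data with three well-separated groups (assumption).
rng = np.random.default_rng(2)
group_centres = np.array([[0.0, 0.0], [60.0, 0.0], [0.0, 60.0]])
X = np.vstack([g + 0.5 * rng.standard_normal((40, 2)) for g in group_centres])

variation = {k: distance_variation(X, k) for k in range(2, 7)}
```

When the requested number of clusters matches the structure in the data, repeated runs tend to converge to the same partition and the variation stays close to zero. With too many or too few clusters, different runs split or merge groups differently and the variation rises.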

This method is tested on a dataset where the cluster number is already known and a real-world example. Dataset 1 is the “Rice Grain Model” with four known clusters, which the method can accurately reconstruct. The real-world dataset is a combination of airborne radiometric geophysical data and a digital elevation model over peatland in the Republic of Ireland. The method outputs three as the optimum number of clusters and the result divides the peatland into three zones (confirmed with a ground geophysics survey) which are to be used in the creation of hydrological units within Soil and Water Assessment Tool (SWAT) modelling.

How to cite: O Leary, D., Daly, E., and Brown, C.: Method to obtain optimum number of clusters in geo-spatial data using stability analysis of cluster centres, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8363, https://doi.org/10.5194/egusphere-egu21-8363, 2021.

13:44–13:46 | EGU21-7952 | ECS
Laura Torres-Rojas, Noemi Vergopolan, Jonathan D. Herman, and Nathaniel W. Chaney

The representation of sub-grid land surface heterogeneity in Earth System models remains a persistent challenge. Grid-cell partitioning techniques have evolved from user-defined equally sized tiles (Chen et al., 1997) to structural partition techniques based on vegetation or soil spatial distribution (Melton & Arora, 2014), and finally, to advanced clustering techniques, based on the concept of Hydrological Response Units (HRU) (Chaney et al., 2018). These sub-grid tiling schemes for Land Surface Models (LSM) have emerged as efficient and effective options to represent sub-grid heterogeneity. However, such approaches rely on an arbitrarily defined number of tiles per macroscale grid cell with no assurance of a robust representation of heterogeneity. To address this challenge, we introduce a physically coherent approach that uses a Random Forest Model (RFM) to precompute the optimal tile configuration per macro-grid cell. An RFM is trained on a set of environmental covariates, their spatial organization features over the modeling domain (i.e., correlation lengths), and hydrological target-variable errors of several model outputs.

We assemble and run the HydroBlocks LSM for 100 tile configurations on 100 domains of 0.5×0.5-degree resolution in the Contiguous United States (CONUS). Each tile configuration is defined by two clustering-algorithm parameters and one height-discretization parameter. These combinations result in 10,000 simulations. For each simulation, we compiled the spatial standard deviation of specific hydrological target variables and evaluated the convergence of the tile configurations by comparing various multi-objective optimization methodologies to determine the optimal compromise solutions for each study domain. Preliminary results show that as the number of tiles increases, the hydrological fluxes and states converge toward stable conditions. With the optimal parameter combination set for each domain and information on the environmental characteristics, an RFM is trained to predict the optimal cluster configuration. Using this approach, we demonstrate how a reduced-order model can effectively compute a priori the appropriate tile complexity based solely on environmental characteristics.

References

Chaney, N. W. et al. (2018). Harnessing big data to rethink land heterogeneity in Earth system models. Hydrology and Earth System Sciences, 22(6), 3311–3330. https://doi.org/10.5194/hess-22-3311-2018

Chen, T. H. et al. (1997). Cabauw experimental results from the Project for Intercomparison of Land-Surface Parameterization Schemes. Journal of Climate, 10(6), 1194–1215. https://doi.org/10.1175/1520-0442(1997)010<1194:CERFTP>2.0.CO;2

Melton, J. R., & Arora, V. K. (2014). Sub-grid scale representation of vegetation in global land surface schemes: implications for estimation of the terrestrial carbon sink. Biogeosciences, 11, 1021–1036. https://doi.org/10.5194/bg-11-1021-2014

How to cite: Torres-Rojas, L., Vergopolan, N., Herman, J. D., and Chaney, N. W.: Leveraging unsupervised learning for optimizing the number of sub-grid tiles for land surface modeling over the Contiguous United States, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7952, https://doi.org/10.5194/egusphere-egu21-7952, 2021.

Precipitation analysis
13:46–13:48 | EGU21-13843 | ECS
Gabriela Urgilés, Rolando Célleri, Katja Trachte, Jörg Bendix, and Johanna Orellana-Alvear

Information about temporal rainfall variability at high resolution is scarce, especially in regions with complex topography such as the tropical Andes, and this hinders the study of rainfall dynamics. Rainfall types are usually identified using thresholds on rainfall characteristics such as rain rate and velocity. However, these thresholds are derived for a specific study area and thus cannot be extrapolated to other places to identify rainfall classes. The aim of this study is therefore to investigate rainfall-event classes based on a clustering approach using the k-means algorithm. The clustering analysis groups objects (i.e., rainfall events) based on their characteristics (e.g., duration, intensity, drop size distribution, melting layer identification). This study was carried out using data retrieved from a vertically pointing Micro Rain Radar (MRR) and a laser disdrometer. The instruments were located in the tropical Andes, at 2600 m a.s.l., in the city of Cuenca, Ecuador. Three years of data were available for the study. First, rainfall events were selected using three criteria: minimum inter-event time, minimum total accumulation and minimum duration. Then, using the k-means algorithm, two principal rainfall classes were identified in the study area. These rainfall classes (i.e., convective, stratiform) showed marked differences in their rainfall characteristics. In addition, a third rainfall class (mixed class) was identified as a subclass of the stratiform class. The stratiform class was the more common one during the year in the study area. Also, short-duration rainfall events (less than 70 min) were dominant. Furthermore, the melting layer characteristic, which is commonly used to determine rainfall classes, did not influence the rainfall class identification using the clustering analysis, especially in two classes; thus, its prior study is not necessary, and this makes the clustering analysis highly beneficial.
Finally, this clustering analysis ensured an objective separation of rainfall classes in the tropical high Andes. This rainfall classification provided new insights about the rainfall dynamics in this tropical mountain area.
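
The event-selection step that precedes the clustering can be sketched as below. The abstract does not report the threshold values, so the 30-minute inter-event time, 1 mm accumulation and 5-minute duration used here are placeholders, and `split_events` is a hypothetical helper operating on a regularly sampled rain-depth series.

```python
import numpy as np

def split_events(rain, step_min=1, min_gap_min=30, min_total_mm=1.0, min_duration_min=5):
    """Return (start, end) index pairs of rainfall events in a regular rain-depth series.

    All thresholds are illustrative placeholders, not values from the study.
    """
    wet = np.flatnonzero(rain > 0)
    if wet.size == 0:
        return []
    dry_steps = np.diff(wet) - 1                        # dry steps between wet steps
    breaks = np.flatnonzero(dry_steps * step_min >= min_gap_min)
    starts = np.r_[wet[0], wet[breaks + 1]]             # a long dry gap starts a new event
    ends = np.r_[wet[breaks], wet[-1]]
    events = []
    for s, e in zip(starts, ends):
        duration = (e - s + 1) * step_min
        total = rain[s:e + 1].sum()
        if duration >= min_duration_min and total >= min_total_mm:
            events.append((int(s), int(e)))
    return events

# Toy 1-minute series: two bursts 5 minutes apart, a short shower, and a later storm.
rain = np.zeros(200)
rain[10:20] = 0.5
rain[25:30] = 0.2
rain[100:103] = 0.1
rain[150:160] = 1.0
events = split_events(rain)
```

In this toy series the first two bursts merge into one event because the dry gap is shorter than the minimum inter-event time, and the short shower is discarded by the duration criterion. Features such as duration and mean intensity would then be computed per event and passed to k-means.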

How to cite: Urgilés, G., Célleri, R., Trachte, K., Bendix, J., and Orellana-Alvear, J.: An objective separation of rainfall classes in the high tropical Andes by using a clustering analysis, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13843, https://doi.org/10.5194/egusphere-egu21-13843, 2021.

13:48–13:50 | EGU21-14758 | ECS
Konstantinos Vantas and Epaminondas Sidiropoulos

Rainfall time series analysis using clustering involves the identification of temporal patterns, with each data item representing an individual storm. This analysis results in clusters of data items that trend in a common way and can be utilized in stochastic simulation, water resources planning and the identification of future directions due to climate change. A comparative analysis of several methods that use intra- versus inter-cluster distances is carried out to estimate the relevant number of clusters, using a large dataset of the described rainfall time series. Visualization using topographic maps produced via nonlinear projection techniques is applied to validate the presence of both distance and density structures and to assist in the final determination of the number of clusters. This stands in contrast to empirical and not completely data-driven approaches in the literature, in which constrained clustering methods are employed with assumptions on the presence of four classes.
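
The abstract does not name the indices that were compared. As one common example of a criterion built on intra- versus inter-cluster distances, the sketch below computes the Calinski–Harabasz index (between-cluster over within-cluster dispersion, each divided by its degrees of freedom) for several candidate numbers of clusters. The data are synthetic, and `kmeans` and `calinski_harabasz` are hypothetical helpers written for this illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centres = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d2)])
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels

def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    ids = np.unique(labels)
    n, k = len(X), len(ids)
    overall = X.mean(axis=0)
    between = within = 0.0
    for j in ids:
        members = X[labels == j]
        centre = members.mean(axis=0)
        between += len(members) * ((centre - overall) ** 2).sum()
        within += ((members - centre) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

# Synthetic storm descriptors with four groups (assumption).
rng = np.random.default_rng(3)
corners = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in corners])

scores = {k: calinski_harabasz(X, kmeans(X, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
```

Indices of this kind only see distance structure, which is one reason to complement them with visual checks such as the projection-based maps mentioned above.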

How to cite: Vantas, K. and Sidiropoulos, E.: Knowledge discovery using clustering analysis of rainfall timeseries, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14758, https://doi.org/10.5194/egusphere-egu21-14758, 2021.

13:50–13:52 | EGU21-9090 | ECS
Sabrine Derouiche, Cécile Mallet, Zoubeida Bargaoui, and Abdelwahab Hannachi

The use of artificial neural networks in problems related to water resources, hydrology and meteorology has received steadily increasing interest over the last decade or so. In this study, the methodology proposed to analyse rainfall features and to investigate their relationships with global climate change is based on the Self-Organizing Map (SOM) and is generic in character.

As a first step, daily winter precipitation records of northern Tunisia, collected between 1960 and 2009 at 70 rain gauge stations, are transformed into separate events. This separation is based on the determination of the minimum inter-event time (dry interval) between two independent and consecutive rain events. Six rainfall event features (e.g., average rain event accumulation, average event duration, seasonal accumulation, number of rainy days…) are thus extracted for each of the 70 stations × 50 winter seasons.

In the second step, SOM is applied to analyse the six rainfall features. The SOM is an unsupervised learning algorithm, used as a vector quantization technique, that allows the modeling of probability density functions. It divides the set of multidimensional data (vectors of six features in our case) into clusters. As in k-means, rainfall stations and years with similar characteristics are grouped in a cluster represented by its centroid point, named the referent. Moreover, SOM enables the projection of high-dimensional data onto a low-dimensional (usually two-dimensional) discrete lattice of neurons as an output layer (map space). The structure of the neurons in the map and the cost function used for its training ensure that neighboring neurons in the map space are associated with neighboring referents in the initial space. This conservation of the topology allows the analysis of multidimensional nonlinear relationships between the six selected descriptors by visualizing their projection in the map space.
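
A minimal online SOM is sketched below to make the roles of the referents and the map-space neighbourhood concrete. It is not the training scheme used in the study: the small 4×5 map, the linear decay of the learning rate and neighbourhood width, the Gaussian neighbourhood function and the synthetic six-feature data are all assumptions, and `train_som` is a hypothetical name.

```python
import numpy as np

def train_som(X, rows=4, cols=5, n_epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Online SOM training; returns neuron grid coordinates and their referents."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    referents = X[rng.choice(len(X), size=rows * cols, replace=True)].astype(float)
    n_steps = n_epochs * len(X)
    step = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                       # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac    # neighbourhood shrinks to 0.5
            bmu = np.argmin(((referents - X[i]) ** 2).sum(axis=1))  # best-matching unit
            # Neurons close to the BMU on the map are pulled more strongly.
            h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
            referents += lr * h[:, None] * (X[i] - referents)
            step += 1
    return grid, referents

# Synthetic six-feature vectors forming three regimes (assumption).
rng = np.random.default_rng(4)
X = np.vstack([m + 0.5 * rng.standard_normal((100, 6)) for m in (-2.0, 0.0, 2.0)])

grid, referents = train_som(X)
dist = np.linalg.norm(X[:, None, :] - referents[None, :, :], axis=2)
bmus = dist.argmin(axis=1)                   # map position of each feature vector
quantisation_error = float(dist.min(axis=1).mean())
```

Because each update also moves the neighbours of the best-matching unit, referents of adjacent neurons end up close in feature space, which is the topology conservation described above.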

For a better representation of the input dataset, a map of 16×20 neurons is used. However, such a large number of neurons may complicate the synthesis of some spatial or temporal specificities, so the neurons are aggregated into a smaller number of clusters. For that purpose, a hierarchical agglomerative clustering (HAC) is applied in the third step. This hierarchical process is initiated by treating each neuron as a separate cluster. Then, at each stage of the algorithm, the most similar clusters, according to the Ward distance, are combined in pairs.

The fourth step determines the final number of clusters using a visually based method known as the data image. This consists of mapping the dissimilarity matrix of the referents into an image framework where each pixel reflects the magnitude of each value. Here, rows and columns can be reordered based on hierarchical clustering of the referents. The blocks observed along the diagonal of each image represent the clusters.
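
The third and fourth steps can be illustrated with SciPy's hierarchical clustering routines. In this sketch the referents are synthetic stand-ins (40 neurons in four groups) rather than a trained 16×20 map, and cutting the dendrogram at four clusters is imposed for the example, whereas in the study the number is read from the data image.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical referents: 40 neurons in 6-D forming four groups, in shuffled order.
rng = np.random.default_rng(5)
group_means = 3.0 * np.arange(4)[:, None] * np.ones((1, 6))
referents = np.vstack([m + 0.3 * rng.standard_normal((10, 6)) for m in group_means])
referents = referents[rng.permutation(len(referents))]

# Step 3: agglomerative clustering of the referents with Ward's criterion.
Z = linkage(referents, method="ward")
labels = fcluster(Z, t=4, criterion="maxclust")   # cut the dendrogram into 4 clusters

# Step 4: data image, i.e. the dissimilarity matrix reordered by dendrogram leaves.
D = squareform(pdist(referents))
order = leaves_list(Z)
data_image = D[np.ix_(order, order)]

# Along the reordered axis each cluster occupies one contiguous block.
n_blocks = int(np.count_nonzero(np.diff(labels[order])) + 1)
```

Displaying `data_image` as a grey-scale image (for example with matplotlib's `imshow`) would show dark blocks along the diagonal, one per cluster.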

Finally, the northern Tunisia winter precipitation is classified into four rainfall situations, from the driest to the wettest, while also taking into account the rainy-day frequency during the season and the rainfall event types. The projection of external climatic variables on the map will make it possible to analyse the links between the four observed rain regimes and the global climate.

How to cite: Derouiche, S., Mallet, C., Bargaoui, Z., and Hannachi, A.: Statistical analysis of rainfall event features using the Self Organizing Map with application to Northern Tunisia, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9090, https://doi.org/10.5194/egusphere-egu21-9090, 2021.

13:52–13:54 | EGU21-12378 | ECS
Abbas El Hachem, András Bárdossy, Jochen Seidel, Golbarg Goshtsasbpour, and Uwe Haberlandt

Precipitation extremes vary in space and time. Understanding how they vary from one location to another is essential for identifying spatially homogeneous and heterogeneous areas. By identifying the boundaries of these areas, a better characterization of the underlying spatial behavior is possible. Intensity-duration-frequency (IDF) curves are mathematical functions that relate the rainfall intensity to its duration and frequency of occurrence. The clustering approach is helpful for finding homogeneous regions for grouping stations (or radar cells) to estimate regional cumulative distribution functions (CDF) or regional IDF curves. This offers a new possibility to include the spatial aspect of rainfall extremes. For this purpose, CDF and IDF curves were calculated from the observed rainfall data at the rain gauges of the German weather service network. Data from almost 5000 stations with daily resolution and 1000 stations with higher temporal resolution were used. The Kolmogorov–Smirnov (KS) test, a nonparametric statistical test, was used to compare the similarity (or dissimilarity) between the distribution functions. Eventually, a KS distance matrix was obtained, and a multidimensional scaling analysis along with a K-means clustering algorithm was performed. As a main result, similar and dissimilar regions within the stations were identified.
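
The chain of KS distance matrix, multidimensional scaling and K-means can be sketched as follows. The station samples are synthetic gamma variates, classical (metric) MDS is assumed because the abstract does not state which MDS variant was used, and `ks_distance`, `classical_mds` and `kmeans` are hypothetical helper names.

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

def classical_mds(D, n_dims=2):
    """Classical MDS: embed a distance matrix via double centring and eigenvectors."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigval, eigvec = np.linalg.eigh(B)
    top = np.argsort(eigval)[::-1][:n_dims]
    return eigvec[:, top] * np.sqrt(np.clip(eigval[top], 0.0, None))

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centres = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d2)])
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels

# Synthetic "stations": 10 with gamma(2, 1) rainfall and 10 with gamma(2, 2).
rng = np.random.default_rng(6)
stations = [rng.gamma(2.0, 1.0, 500) for _ in range(10)]
stations += [rng.gamma(2.0, 2.0, 500) for _ in range(10)]

n = len(stations)
D = np.array([[ks_distance(stations[i], stations[j]) for j in range(n)] for i in range(n)])
coords = classical_mds(D)        # low-dimensional embedding of the KS distances
labels = kmeans(coords, k=2)     # group stations with similar distributions
```

The MDS step is what makes K-means applicable here, since K-means needs coordinates rather than a pairwise distance matrix.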

How to cite: El Hachem, A., Bárdossy, A., Seidel, J., Goshtsasbpour, G., and Haberlandt, U.: Clustering CDF and IDF curves of rainfall extremes, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12378, https://doi.org/10.5194/egusphere-egu21-12378, 2021.

13:54–13:56 | EGU21-12026 | ECS | Highlight
Philomène Le Gall, Pauline Rivoire, Anne-Catherine Favre, Philippe Naveau, and Olivia Romppainen-Martius

Extreme precipitation often causes floods and leads to significant societal and economic damage. Rainfall is subject to local orographic features, and its intensity can be highly variable. In this context, identifying climatically coherent regions for extremes is paramount to understand and analyze rainfall at the correct spatial scale. We assume that the region of interest can be partitioned into homogeneous regions, in other words, sub-regions with a common marginal distribution up to a scale factor. As an example, considering extremes as block maxima or excesses over a threshold, a sub-region corresponds to a constant shape parameter. We develop a non-parametric clustering algorithm based on a ratio of Probability Weighted Moments to identify these homogeneous regions and gather weather stations. By construction, this ratio does not depend on the location and scale parameters for the Generalized Extreme Value and Generalized Pareto distributions. Our method has the advantage of relying only on raw precipitation data and not on station covariates.
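
The abstract does not give the exact ratio, so the sketch below uses a deliberately simpler stand-in to show why such ratios are useful. With α_j = E[X(1 − F(X))^j], a Generalized Pareto distribution with scale σ and shape ξ has α_j = σ / ((j + 1)(j + 1 − ξ)), so α_1/α_0 = (1 − ξ) / (2(2 − ξ)) depends on the shape only. This stand-in is scale-free but, unlike the authors' ratio, not location-free. The plotting-position estimator and the function names are assumptions.

```python
import numpy as np

def sample_pwm(x, j):
    """Plotting-position estimate of alpha_j = E[X (1 - F(X))^j]."""
    xs = np.sort(x)
    n = len(xs)
    survival = 1.0 - (np.arange(1, n + 1) - 0.5) / n   # estimate of 1 - F at each order statistic
    return float(np.mean(xs * survival ** j))

def pwm_ratio(x):
    """Illustrative scale-free statistic alpha_1 / alpha_0 (not the authors' ratio)."""
    return sample_pwm(x, 1) / sample_pwm(x, 0)

def sample_gpd(n, xi, sigma, rng):
    """Inverse-transform sampling from a GPD with shape xi (non-zero) and scale sigma."""
    u = rng.random(n)
    return sigma / xi * (u ** (-xi) - 1.0)

rng = np.random.default_rng(7)
xi = 0.2
x = sample_gpd(20000, xi, 1.0, rng)

ratio = pwm_ratio(x)
ratio_rescaled = pwm_ratio(3.0 * x)                # same shape, three times the scale
theoretical = (1.0 - xi) / (2.0 * (2.0 - xi))      # value implied by the GPD formula
```

Stations whose samples yield similar ratios are candidates for the same homogeneous region, which is why a clustering algorithm can operate on this single statistic per station.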

A simulation study is performed based on the extended GPD distribution, which appears to capture low, moderate and heavy rainfall intensities well. Sensitivity to the number of clusters is analyzed. Simulation results reveal that the method detects homogeneous regions. We apply our clustering algorithm to ERA-5 precipitation over Europe. We obtain coherent homogeneous regions consistent with local orography. The marginal precipitation behaviour is analyzed through regional fitting of an extended GPD.

How to cite: Le Gall, P., Rivoire, P., Favre, A.-C., Naveau, P., and Romppainen-Martius, O.: Spatial clustering of heavy clustering in ERA-5 precipitation over Europe, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12026, https://doi.org/10.5194/egusphere-egu21-12026, 2021.

13:56–14:15