EGU26-15343, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-15343
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Friday, 08 May, 16:15–18:00 (CEST), Display time Friday, 08 May, 14:00–18:00
Hall A, A.4
The impact of benchmark selection on spatial patterns of model evaluation metrics
Paul Coderre, Wouter Knoben, Cyril Thébault, Nicolas Vásquez, Martyn Clark, and Alain Pietroniro
  • University of Calgary, Schulich School of Engineering, Civil Engineering, Canada

Hydrological model evaluation is often performed with aggregated metrics such as the widely used Nash-Sutcliffe Efficiency (NSE). The NSE is a skill score that can be interpreted as using the mean observed flow as a benchmark against which to compare model performance. However, this results in strong spatial patterns of scores that conflate model skill with flow variability, depending on how appropriate the benchmark model is for the catchment at hand. These patterns make it difficult to compare NSE scores across catchments, which complicates model evaluation and comparison. This work addresses this limitation by using alternative formulations of the NSE that replace the mean observed flow term with various other benchmark simulations (called "benchmark efficiencies", BME). BME values were calculated for an ensemble of 20 simple benchmarks, using hydrological model simulations from 960 basins in North America as a test case. The benchmarks range from simple statistics calculated directly from the streamflow series to extremely simple models that try to capture the main features of catchment behavior.
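The relationship between the NSE and a benchmark efficiency can be sketched in a few lines: both divide the model's squared errors by the benchmark's squared errors, and the NSE is simply the special case where the benchmark is the mean observed flow. The function and the example series below are illustrative assumptions, not the authors' actual implementation or data.

```python
import numpy as np

def benchmark_efficiency(obs, sim, bench):
    """Skill score of simulation `sim` against benchmark series `bench`:
    1 - SSE(model) / SSE(benchmark). With a constant benchmark equal to
    the observed mean, this reduces to the classic Nash-Sutcliffe
    Efficiency (NSE)."""
    obs, sim, bench = map(np.asarray, (obs, sim, bench))
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - bench) ** 2)

# Toy streamflow series (hypothetical values, for illustration only)
obs = np.array([1.0, 2.0, 4.0, 3.0, 2.0])
sim = np.array([1.2, 1.8, 3.5, 3.1, 2.2])

# NSE: benchmark is the mean observed flow
nse = benchmark_efficiency(obs, sim, np.full_like(obs, obs.mean()))

# BME against a (hypothetical) seasonal benchmark that tracks the
# flow regime closely, making it much harder for the model to beat
seasonal = np.array([1.1, 2.1, 3.8, 2.9, 2.1])
bme = benchmark_efficiency(obs, sim, seasonal)
```

In this toy case the NSE is high while the BME against the seasonal benchmark is negative, mirroring the conflation the abstract describes: a strongly seasonal regime is easy to beat with the mean-flow benchmark but hard to beat with a benchmark that captures the seasonality.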

Results show that alternative benchmarks produce spatial patterns of model performance that differ from those of the NSE, due to differences in how well the individual benchmarks capture flow variability in different regions. Benchmarks that effectively capture flow variability in a given catchment result in a low BME score and are a more challenging test of model performance. As such, selecting the lowest BME score in each catchment can reduce the spatial patterns in model scores by ensuring that the model is always compared to the benchmark that best captures the flow variability of the catchment. The highest NSE scores were all found in catchments with strongly seasonal flow regimes, but the highest BME scores came from a more even distribution of flow regimes. Indeed, several catchments with strongly seasonal flow regimes had NSE scores above 0.5 but negative corresponding BME scores. This indicates that failing to use appropriate benchmarks in catchments with a strongly seasonal flow regime can mask the fact that the model cannot beat simple benchmarks, and may provide an overly optimistic assessment of model performance. By selecting the most appropriate benchmark in each basin from a larger benchmark ensemble, the resulting spatial overview of model performance found through the BME approach is less conflated with flow variability. This yields BME values that are more strongly focused on the added value of using the model over alternative ways to predict the variable of interest. This strongly affects the conclusions one might draw about where a model is fit-for-purpose, and where improvements in model performance may be most readily achieved.
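The per-catchment selection step described above (take the benchmark ensemble, pick the hardest-to-beat member, and score the model against it) can be sketched as follows. Note that the benchmark with the smallest error sum is exactly the one yielding the lowest BME, since the model's error sum is fixed. The function name and toy data are assumptions for illustration, not the authors' code.

```python
import numpy as np

def strictest_bme(obs, sim, benchmarks):
    """Return the BME of `sim` against the hardest-to-beat benchmark
    in `benchmarks` (the one with the smallest squared-error sum,
    i.e. the one that best captures observed flow variability),
    along with that benchmark's index in the ensemble."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    model_err = np.sum((obs - sim) ** 2)
    bench_errs = np.array([np.sum((obs - np.asarray(b)) ** 2)
                           for b in benchmarks])
    best = int(np.argmin(bench_errs))  # lowest-BME benchmark
    return 1.0 - model_err / bench_errs[best], best

# Toy example: a two-member benchmark ensemble (hypothetical values)
obs = np.array([1.0, 2.0, 4.0, 3.0, 2.0])
sim = np.array([1.2, 1.8, 3.5, 3.1, 2.2])
ensemble = [
    np.full_like(obs, obs.mean()),            # mean-flow benchmark (NSE)
    np.array([1.1, 2.1, 3.8, 2.9, 2.1]),      # seasonal benchmark
]
score, idx = strictest_bme(obs, sim, ensemble)
```

Here the seasonal benchmark is selected because it tracks the observations far more closely than the mean flow, and the resulting score reflects the model's added value over that stricter baseline rather than over the observed mean.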

How to cite: Coderre, P., Knoben, W., Thébault, C., Vásquez, N., Clark, M., and Pietroniro, A.: The impact of benchmark selection on spatial patterns of model evaluation metrics, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-15343, https://doi.org/10.5194/egusphere-egu26-15343, 2026.