- 1Institute of Meteorology and Climate Research ‐ Atmospheric Trace Gases and Remote Sensing, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany (benedikt.heudorfer@kit.edu)
- 2Institute for Water and Environment, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
With increasingly large samples being used in deep learning hydrology, the computational cost of training models such as Long Short-Term Memory (LSTM) networks raises fundamental questions about how much (and which type of) data are actually needed. This study investigates the information content of different training data subsets and their impact on predictive skill. To do so, we systematically train LSTM models on progressively larger subsamples of the CAMELS-US dataset, using 11 ablation/subsampling strategies that emphasize different parts of the training data with respect to hydrological regimes, statistical representativity, temporal context, and spatial coverage. We then evaluate LSTM performance gains as a function of subsample size.
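The experimental design above can be sketched as a simple loop: draw random training subsets of increasing size, fit a model on each, and score it on a fixed test set with the Nash-Sutcliffe Efficiency (NSE). The sketch below is illustrative only and uses synthetic data and an ordinary least-squares fit as a cheap stand-in for LSTM training; the data, fractions, and model are assumptions, not the study's actual setup.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 - SSE / variance of observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(0)

# Synthetic stand-in data: predictors X and a noisy "streamflow" target y.
n = 2000
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=n)

# Hold out a fixed test set; subsample only the training pool.
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

scores = {}
for frac in (0.01, 0.1, 0.5, 1.0):
    # Random subsampling strategy: draw frac of the training pool.
    idx = rng.choice(len(y_train), size=max(2, int(frac * len(y_train))),
                     replace=False)
    # Least-squares fit as a placeholder for LSTM training.
    w, *_ = np.linalg.lstsq(X_train[idx], y_train[idx], rcond=None)
    scores[frac] = nse(y_test, X_test @ w)
```

In the real experiment, each of the 11 strategies would supply its own `idx` (e.g. high-flow events only, shortened records, or spatial subsets of catchments), and the resulting NSE-versus-fraction curves reveal where performance gains saturate.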
As training data volume increases, performance gains saturate at a rate that depends on the strategy tested. Random sampling emerges as the most robust and efficient strategy, achieving strong predictive skill (NSE > 0.7) with roughly 10% of the available data, illustrating the high representativity of even small random subsets of the full dataset. Temporal ablations reveal that surprisingly short input sequences (≈ 2 weeks) and limited historical records (≈ 2 years) suffice for competitive performance (NSE > 0.7), suggesting that time series much shorter than previously assumed valuable could be included in datasets like CAMELS. In contrast, although high-flow conditions have been shown in the literature to be particularly information-rich, training exclusively on extremes underperforms the above-mentioned ablation strategies in our setup. Likewise, we show that spatial subsampling substantially limits generalization performance, underscoring the importance of spatial hydro-climatic diversity.
Overall, the results demonstrate that training efficiency in data-driven hydrology is governed more by data representativity than by targeted selection of, for example, specific event types. These findings provide practical guidance for cost-effective model development, pre-training, and experimental design in large-sample hydrologic deep learning.
How to cite: Heudorfer, B. and Loritz, R.: Is smart sampling helping to train more efficient deep learning models in Hydrology?, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-16673, https://doi.org/10.5194/egusphere-egu26-16673, 2026.