- Department of Civil Engineering, The University of Tokyo, Tokyo, Japan (mizuki-funato@g.ecc.u-tokyo.ac.jp)
Accurate rainfall-runoff analysis is vital for flood prediction, water resources management, and climate impact assessment. While data-driven hydrological models such as Long Short-Term Memory (LSTM) networks have shown promise, developing a globally applicable framework that is accurate, interpretable, and computationally efficient remains a grand challenge, primarily because most catchments worldwide are ungauged. We address this by employing HYdrologic Prediction with multi-model Ensemble and Reservoir computing (HYPER). This hybrid method combines Bayesian Model Averaging (BMA), a multi-model ensemble, with Reservoir Computing (RC), a type of machine learning model. The framework infers model weights for ungauged basins by linking catchment attributes to the model weights learned from gauged basins. While HYPER has previously demonstrated higher accuracy and lower uncertainty than LSTMs, particularly when training data are limited, its global applicability remains unassessed. Therefore, in this study, we evaluate the global applicability of HYPER using a pseudo-ungauged approach, where gauged basins are treated as ungauged for validation. We challenge the conventional assumption that more data is better by investigating whether selecting a strategic subset of gauged basins for training outperforms using the entire available dataset. Initial experiments revealed that prediction accuracy remained robust regardless of whether 90 % or only 3 % of available basins were used for training. Furthermore, training on basins from a single, hydrologically similar region often yielded higher accuracy than training on a diverse multi-regional dataset.
To identify the optimal training subset, we compared three distinct data selection methods: 1) Greedy selection, which identifies donor basins by selecting the nearest neighbors within the static catchment attribute space; 2) Physics-Informed selection, which calculates the distance between target and candidate basins while applying heavier penalty weights to slope and aridity to strictly enforce physical similarity; and 3) Meta-Learning, which uses a Random Forest to learn the relationship between attribute differences and model weight correlations and then predicts the donor basins expected to have the highest weight correlation with the target. While all three methods outperformed the baseline of using all available data (Kling-Gupta Efficiency (KGE): 0.12), the Physics-Informed and Meta-Learning approaches achieved the highest consistency and accuracy. Even when only 5 out of 1,505 basins were used for training, these methods achieved KGE scores of 0.26 and 0.31, respectively, narrowing the performance gap to fully gauged basins (KGE: 0.54). These findings demonstrate that for global prediction in ungauged regions, data quality, especially the strategic selection of training basins, is more important than data quantity, marking a step towards robust, globally applicable runoff analysis.
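The two distance-based strategies described above (Greedy and Physics-Informed selection) can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function name `select_donors`, the z-score standardisation of attributes, the weighted Euclidean distance, and the example weight values are all assumptions made for illustration. Leaving `weights` unset gives a plain nearest-neighbour (greedy) selection, while larger weights on chosen attributes, such as slope or aridity, penalise mismatches in those attributes more heavily.

```python
import numpy as np


def select_donors(target, candidates, k=5, weights=None):
    """Return indices of the k candidate basins closest to the target.

    Hypothetical helper, not taken from the study.
    target     : 1-D array of static catchment attributes for the target basin
    candidates : 2-D array (n_basins, n_attributes) for gauged candidate basins
    weights    : optional per-attribute penalty weights (assumed values);
                 None means every attribute counts equally (greedy selection)
    """
    candidates = np.asarray(candidates, dtype=float)
    target = np.asarray(target, dtype=float)

    # Standardise each attribute using the candidate statistics so that
    # attributes with large numeric ranges do not dominate the distance.
    mean = candidates.mean(axis=0)
    std = candidates.std(axis=0)
    std[std == 0] = 1.0  # guard against constant attributes
    z_candidates = (candidates - mean) / std
    z_target = (target - mean) / std

    if weights is None:
        w = np.ones(candidates.shape[1])
    else:
        w = np.asarray(weights, dtype=float)

    # Weighted Euclidean distance in the standardised attribute space.
    dist = np.sqrt((w * (z_candidates - z_target) ** 2).sum(axis=1))

    # Stable sort keeps the original order when distances tie.
    return np.argsort(dist, kind="stable")[:k]


# Toy example with two attributes per basin: [slope, area] (made-up values).
candidates = [[1, 0], [0, 2], [3, 3], [-3, -3]]
target = [0, 0]

greedy = select_donors(target, candidates, k=2)
physics = select_donors(target, candidates, k=1, weights=[10, 1])
```

In this toy case the unweighted call ranks basin 0 first, whereas the heavier slope weight promotes basin 1, whose slope matches the target exactly. The Meta-Learning variant would replace the fixed distance with a learned predictor of weight correlation and is not sketched here.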
How to cite: Funato, M. and Sawada, Y.: Data Quality over Quantity: Optimized Data Selection for Data-driven Global Prediction in Ungauged Basins, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9189, https://doi.org/10.5194/egusphere-egu26-9189, 2026.