- ¹University of Cambridge, Department of Engineering, Cambridge, UK
- ²Water Balance Consulting, Boulder, USA
- ³UiT the Arctic University of Norway, Department of Physics and Technology, Tromsø, Norway
- ⁴University of Cambridge, Department of Computer Science and Technology, Cambridge, UK
Hydrological analysis and prediction with sparse and discontinuous data remain a key challenge for water resources planning and climate adaptation, especially in large river basins across the Global South. Traditional stochastic hydrology methods and process-based models often fail to capture the complexity of these systems. Recent efforts to apply machine learning to river discharge imputation (assigning values to gaps in the target variable) and reconstruction (imputation further informed by proxy data, such as climatic variables) show promise for creating complete historical datasets from a limited set of discontinuous observations. However, these methods have not been tested on datasets from large river basins with a high proportion of missing values. Here, we address this gap and investigate the suitability of machine learning methods for streamflow imputation and reconstruction in a case study of the Nile River basin. We tested a range of common regression models, imputers (algorithms designed specifically to estimate missing data points, but with limited flexibility), and Conditional Neural Processes (CNPs; models that combine the advantages of deep neural networks and Gaussian Processes). We modelled 13 stations with different observational periods to fill a dataset with 53% missing values between 1900 and 2002. The first set of benchmarking experiments relied solely on spatio-temporal gauged streamflow data as input to the models (imputation). The second set also incorporated climate proxies from ECMWF ERA5 reanalysis data to model streamflow from 1964 to 2002 (reconstruction). For this, we used monthly average precipitation, temperature, relative humidity, wind speed, and soil moisture data.
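To make the idea of imputation concrete, the sketch below fills gaps at one station from a single neighbouring station using a least-squares line fitted on the months where both were observed. This is a deliberately simplified stand-in, not the method used in the study (which tested random forest, gradient-boosting, imputers, and CNPs on 13 stations); the function names and the flow values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_from_neighbour(target, neighbour):
    """Fill None gaps in `target` from `neighbour` (hypothetical helper).

    The line is fitted only on months where both stations have observations.
    A gap stays None if the neighbour is also missing in that month.
    """
    pairs = [(n, t) for t, n in zip(target, neighbour)
             if t is not None and n is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    filled = []
    for t, n in zip(target, neighbour):
        if t is not None:
            filled.append(t)            # keep the observation untouched
        elif n is not None:
            filled.append(slope * n + intercept)
        else:
            filled.append(None)         # no information available
    return filled


# Made-up monthly flows; None marks a missing month at the target station.
neighbour = [100.0, 200.0, 300.0, 400.0, 500.0]
target = [55.0, None, 155.0, 205.0, None]
filled = impute_from_neighbour(target, neighbour)
```

In this toy setting, reconstruction would correspond to adding further predictors, such as monthly precipitation or temperature, alongside the neighbouring station.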
In the imputation experiments, random forest and gradient-boosting regressors achieved the most consistent mean and median scores for Root Mean Squared Error (RMSE), Coefficient of Determination (R²), and Nash-Sutcliffe Efficiency (NSE) across all stations. Bayesian ridge regression and the CNP performed the worst on these metrics. Reconstruction experiments using the same models with the added input of climate proxies yielded similar findings, with gradient-boosting regression again outperforming the other methods. The CNP's metrics improved markedly when these proxies were included, while the regressors modelled the data less accurately. This suggests that contextual data benefit the meta-learning capabilities of the CNP but provide more information than the regressors can capture. The CNP was the only well-performing model tested that provided uncertainty estimates for its predictions. Nearly all models achieved an average NSE > 0.7 across all stations in all experiments, suggesting that machine learning can offer a reliable and scalable approach to streamflow imputation. The approach developed in this study can be applied to other river basins with sparse observations to build more complete hydrological datasets for water resources management and planning applications.
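For readers unfamiliar with the evaluation metrics, the following minimal sketch computes RMSE and NSE from their standard definitions. The abstract does not state how R² was calculated, so the squared Pearson correlation shown here is an assumption; if R² is instead computed as one minus the ratio of residual to total sum of squares, it coincides with NSE. The sample values are made up.

```python
import math


def rmse(obs, sim):
    """Root Mean Squared Error between observed and simulated flows."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency.

    1 is a perfect fit; 0 means the model is no better than
    predicting the mean of the observations.
    """
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def r2_pearson(obs, sim):
    """Squared Pearson correlation (one common convention for R²; assumed here)."""
    mean_obs = sum(obs) / len(obs)
    mean_sim = sum(sim) / len(sim)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    return cov ** 2 / (var_obs * var_sim)


# Made-up held-out observations and model predictions.
obs = [10.0, 20.0, 30.0, 40.0]
sim = [12.0, 18.0, 33.0, 37.0]
scores = {"rmse": rmse(obs, sim), "nse": nse(obs, sim), "r2": r2_pearson(obs, sim)}
```

Unlike the squared correlation, NSE penalises bias as well as scatter, which is one reason it is widely used as a headline metric for streamflow.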
How to cite: Billari, C. G., Girona-Mata, M., Wheeler, K., Marinoni, A., and Borgomeo, E.: Machine Learning for Reconstructing Streamflow Time Series: An Application to the Nile River, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7429, https://doi.org/10.5194/egusphere-egu25-7429, 2025.