- 1Karlsruhe Institute of Technology (KIT), Institute of Water and Environment, Karlsruhe, Germany
- 2Zhejiang Institute of Hydraulics and Estuary (Zhejiang Institute of Marine Planning and Design), Hangzhou 310020, China
Accurate streamflow prediction is essential for reliable water resource management and flood forecasting. Recently, deep learning methods, especially Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art performance for streamflow prediction when trained in supervised learning (SL) settings. However, robust SL requires large volumes of “labeled” training data, i.e., meteorological inputs paired with corresponding streamflow observations as ground truth. This poses a problem at the global scale, as only a small fraction of catchments worldwide are monitored with stream gauges. Most regions therefore have abundant “unlabeled” meteorological data but few “labels”, i.e., discharge observations. This scarcity of labels limits the performance of SL models in sparsely gauged regions, as well as their generalization and transferability.
To overcome this challenge, we propose a two-stage semi-supervised learning (SSL) method for streamflow prediction based on the Contrastive Predictive Coding (CPC) approach [1]. CPC is a self-supervised learning method that extracts informative feature representations from sequential data (e.g., meteorological time series) without labeled targets (e.g., streamflow observations), by contrasting correct future predictions against incorrect ones. In the first stage, we use CPC to pre-train an encoder (i.e., fully connected layers) and an LSTM network followed by a projection head (i.e., a linear layer without bias), using a large amount of meteorological data (28 years). In the second stage, we add a linear layer to the pre-trained encoder and LSTM, and fine-tune the model for streamflow prediction using a small amount of meteorological data paired with streamflow observations (1 year).
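As a rough illustration of the contrastive objective behind the first stage, the following NumPy sketch computes an InfoNCE-style loss in the spirit of [1] for a single prediction step. The function name `info_nce_loss`, the array shapes, and the choice to use the other samples in a batch as negatives are assumptions made for illustration; they are not details taken from this work.

```python
import numpy as np


def info_nce_loss(context, future, W):
    """InfoNCE-style loss for one prediction step (illustrative sketch).

    context: (B, C) context vectors, e.g. LSTM outputs at time t
    future:  (B, Z) encoded inputs at a later time step t + k
    W:       (C, Z) bias-free linear projection applied to the context
    Assumption: for sample i, the futures of all other samples in the
    batch serve as the incorrect (negative) candidates.
    """
    pred = context @ W            # (B, Z) predicted future encodings
    scores = pred @ future.T      # (B, B); entry [i, j] scores prediction i against future j
    # Subtract the row-wise maximum for a numerically stable log-softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The correct (positive) pair for row i sits on the diagonal.
    return -np.mean(np.diag(log_softmax))


# Toy usage with hand-made one-hot vectors (hypothetical data).
eye = np.eye(4)
loss_matched = info_nce_loss(eye, 10 * eye, eye)
loss_mismatched = info_nce_loss(eye, 10 * np.roll(eye, 1, axis=0), eye)
print(loss_matched < loss_mismatched)
```

The loss is small when each context predicts its own future better than the futures of other samples, and large otherwise. With an uninformative projection (all zeros) it equals the logarithm of the batch size, which corresponds to guessing uniformly among the candidates.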
We demonstrate the effectiveness and robustness of our methodology on the CAMELS-DE dataset [2], comparing it against a baseline supervised learning model that uses the same LSTM network. The results suggest that our method improves both in-sample and out-of-sample generalization performance over the SL method when only a limited amount of discharge data is available. Additionally, the results indicate that transfer learning via CPC pre-training provides informative representations for the streamflow prediction task, enabling faster convergence and more efficient training than the baseline model trained from scratch.
Our findings highlight a promising direction for leveraging self-supervised learning methods to develop hydrological foundation models. Foundation models have revolutionized artificial intelligence applications across diverse domains and hold great promise for hydrological applications. By scaling our proposed approach to larger and more diverse datasets, we can make significant strides towards multiple downstream prediction tasks, including the prediction of climate-driven variables (e.g., discharge, groundwater, and soil moisture).
References:
[1] van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[2] Loritz, R., Dolich, A., Acuña Espinoza, E., Ebeling, P., Guse, B., Götte, J., ... & Tarasova, L. (2024). CAMELS-DE: hydro-meteorological time series and attributes for 1555 catchments in Germany. Earth System Science Data Discussions, 2024, 1-30.
How to cite: Jia, T., Chen, G., and Ehret, U.: Semi-Supervised Deep Learning for Streamflow Prediction in Data-scarce Regions, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-8108, https://doi.org/10.5194/egusphere-egu26-8108, 2026.