Effect of merging large datasets on prediction accuracy of low flow estimation by random forest
- University of Natural Resources and Life Sciences, Institute of Statistics, Department of Landscape, Spatial and Infrastructure Sciences, Austria (johannes.laimighofer@boku.ac.at)
Low flow estimation is a crucial part of water management. Low flow in ungauged basins is often predicted with statistical models. These are either regionalization approaches, in which models are fitted within homogeneous regions, or single-model frameworks that range from simple linear models to more complex methods such as random forest, support vector regression or deep learning. Although there are large-sample studies for the US (e.g. Tyralis et al. 2021) or Australia (e.g. Worland et al. 2018), we are not aware of a study that combines different large datasets and analyzes the effect on prediction accuracy. We hypothesize that the heterogeneity of several datasets taken together can improve prediction accuracy for tree-based models relative to linear models. Hence, we propose to combine several similar datasets and analyze the effect on the prediction accuracy of Q95 estimates from a simple random forest model.
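For readers unfamiliar with the target index: Q95 is commonly defined as the discharge equalled or exceeded on 95 % of days, which corresponds to the 5 % quantile of the daily flow record. The following is a minimal sketch of that calculation, assuming a daily series with possible gaps and NumPy's default linear quantile interpolation. The function name `q95` and the synthetic record are illustrative and not taken from the study.

```python
import numpy as np


def q95(daily_flow):
    """Return the flow exceeded on 95 % of days (the 5 % quantile)."""
    flow = np.asarray(daily_flow, dtype=float)
    flow = flow[~np.isnan(flow)]  # drop gaps in the record
    if flow.size == 0:
        raise ValueError("no valid discharge values")
    return float(np.quantile(flow, 0.05))


# Synthetic record (assumption): 100 daily values from 1 to 100 m3/s
example_record = np.arange(1.0, 101.0)
example_q95 = q95(example_record)
print(example_q95)
```

In practice Q95 is often standardized by catchment area before modeling; whether the study does so is not stated here.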
Our study uses four large hydrological datasets: CAMELS-GB (Coxon et al. 2020), CAMELS-US (Addor et al. 2017), CAMELS-AUS (Fowler et al. 2021) and LamaH-CE (Klingler et al. 2021). We apply a random forest model so that interactions and non-linearities can be captured. Prediction accuracy is evaluated by leave-one-out cross-validation (LOOCV) with several performance metrics, e.g. the median absolute error (MDAE) and the root mean squared error (RMSE). LOOCV is run separately for each individual dataset and once for the merged dataset. Results indicate that merging datasets can improve prediction accuracy, but the models fail to correctly predict low flows close to zero.
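The evaluation scheme can be sketched as follows. This is a hedged illustration using scikit-learn, not the authors' implementation: the two synthetic datasets, the three predictors, the helper names (`make_dataset`, `loocv_predict`, `scores`) and the forest settings are all assumptions. Scoring the merged run separately per dataset is likewise an assumption made here so that the two setups can be compared on the same catchments.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, median_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)


def make_dataset(n, offset):
    """Synthetic stand-in for one dataset: 3 descriptors, Q95-like target."""
    X = rng.uniform(0.0, 1.0, size=(n, 3)) + offset
    y = 2.0 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(0.0, 0.1, size=n)
    return X, y


def loocv_predict(X, y):
    """Leave-one-out predictions from an untuned random forest."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    return cross_val_predict(model, X, y, cv=LeaveOneOut())


def scores(y_true, y_pred):
    """Median absolute error and root mean squared error."""
    return {
        "MDAE": float(median_absolute_error(y_true, y_pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    }


# Two hypothetical datasets with shifted predictor ranges
datasets = {"A": make_dataset(30, 0.0), "B": make_dataset(30, 0.5)}

# LOOCV within each individual dataset
individual = {
    name: scores(y, loocv_predict(X, y)) for name, (X, y) in datasets.items()
}

# One LOOCV run on the merged dataset, scored per source dataset
X_all = np.vstack([X for X, _ in datasets.values()])
y_all = np.concatenate([y for _, y in datasets.values()])
labels = np.concatenate([[name] * len(y) for name, (_, y) in datasets.items()])
pred_all = loocv_predict(X_all, y_all)
merged = {
    name: scores(y_all[labels == name], pred_all[labels == name])
    for name in datasets
}

for name in datasets:
    print(name, "individual:", individual[name], "merged:", merged[name])
```

Comparing `individual` and `merged` for the same dataset shows whether the additional catchments helped or hurt for that region; with real data the predictor sets of the datasets would first need to be harmonized.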
References
- Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, https://doi.org/10.5194/hess-21-5293-2017, 2017.
- Fowler, K. J. A., Acharya, S. C., Addor, N., Chou, C., and Peel, M. C.: CAMELS-AUS: hydrometeorological time series and landscape attributes for 222 catchments in Australia, Earth Syst. Sci. Data, 13, 3847–3867, https://doi.org/10.5194/essd-13-3847-2021, 2021.
- Coxon, G., Addor, N., Bloomfield, J. P., Freer, J., Fry, M., Hannaford, J., Howden, N. J. K., Lane, R., Lewis, M., Robinson, E. L., Wagener, T., and Woods, R.: CAMELS-GB: hydrometeorological time series and landscape attributes for 671 catchments in Great Britain, Earth Syst. Sci. Data, 12, 2459–2483, https://doi.org/10.5194/essd-12-2459-2020, 2020.
- Klingler, C., Schulz, K., and Herrnegger, M.: LamaH-CE: LArge-SaMple DAta for Hydrology and Environmental Sciences for Central Europe, Earth Syst. Sci. Data, 13, 4529–4565, https://doi.org/10.5194/essd-13-4529-2021, 2021.
- Tyralis, H., Papacharalampous, G., Langousis, A., and Papalexiou, S. M.: Explanation and probabilistic prediction of hydrological signatures with statistical boosting algorithms, Remote Sens., 13, 333, https://doi.org/10.3390/rs13030333, 2021.
- Worland, S. C., Farmer, W. H., and Kiang, J. E.: Improving predictions of hydrological low-flow indices in ungaged basins using machine learning, Environmental Modelling & Software, 101, 169–182, https://doi.org/10.1016/j.envsoft.2017.12.021, 2018.
How to cite: Laimighofer, J. and Laaha, G.: Effect of merging large datasets on prediction accuracy of low flow estimation by random forest, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7312, https://doi.org/10.5194/egusphere-egu22-7312, 2022.