EGU23-5256, updated on 22 Feb 2023
EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

Caravan - A global community dataset for large-sample hydrology

Frederik Kratzert1, Grey Nearing2, Nans Addor3,4, Tyler Erickson5, Martin Gauch1,6, Oren Gilon7, Lukas Gudmundsson8, Avinatan Hassidim7, Daniel Klotz6, Sella Nevo7, Guy Shalev7, and Yossi Matias7
  • 1Google Research, Vienna, Austria
  • 2Google Research, Mountain View, CA, United States
  • 3Fathom, Square Works, Bristol, UK
  • 4Geography, University of Exeter, Exeter, UK
  • 5Google, Mountain View, CA, USA
  • 6Institute for Machine Learning, Johannes Kepler University, Linz, Austria
  • 7Google Research, Tel Aviv, Israel
  • 8Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland

High-quality datasets are essential to support hydrological science and modeling. Several datasets exist for specific countries or regions (e.g. the various CAMELS datasets). However, these datasets lack standardization, which makes global studies difficult. Additionally, creating large-sample datasets is a time- and resource-consuming task, which often prevents the release of data that would otherwise be open. Caravan (as in “a series of camels”) is an initiative that tries to solve both of these problems by creating an open data processing environment in the cloud for the community to use.

Caravan is a globally consistent and open dataset

Caravan leverages globally available data sources that are published under an open license to derive meteorological forcings and attributes for any catchment. We use ERA5-Land for meteorological forcings and hydrological reference states (SWE and four levels of soil moisture) and HydroATLAS for the catchment attributes. Currently, Caravan consists of 6830 gauges with daily streamflow data (median record length ~30 years), 9 meteorological variables (from 1981 to 2020) in different daily aggregations, 4 hydrological reference states, and a total of 221 catchment attributes.
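To illustrate what "different daily aggregations" of a meteorological variable can look like, here is a minimal sketch in Python. It assumes hourly input values and uses mean, minimum, and maximum as the aggregations; the function name `aggregate_daily` and the sample data are hypothetical and are not taken from the Caravan code.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean


def aggregate_daily(hourly):
    """Group (timestamp, value) pairs by calendar day.

    Returns {date: {"mean": ..., "min": ..., "max": ...}}.
    Assumption: timestamps are already in the time zone of interest.
    """
    by_day = defaultdict(list)
    for timestamp, value in hourly:
        by_day[timestamp.date()].append(value)
    return {
        day: {"mean": fmean(values), "min": min(values), "max": max(values)}
        for day, values in sorted(by_day.items())
    }


# Hypothetical sample: 48 hourly values covering two full days.
start = datetime(2000, 1, 1)
hourly_temperature = [
    (start + timedelta(hours=h), float(h % 24)) for h in range(48)
]
daily = aggregate_daily(hourly_temperature)
```

Each day in `daily` then holds one value per aggregation, which is the shape a daily forcing table needs.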

Caravan is derived entirely in the cloud

All meteorological time series (and hydrological reference states) from ERA5-Land are processed on Google Earth Engine, which removes the burden of downloading and processing large amounts of raw gridded data. Similarly, all catchment attributes are computed on Earth Engine. The code used to derive Caravan is publicly available. Once you have streamflow records and the corresponding catchment polygons, deriving all other data (forcing data and attributes) is a matter of a few hours of actual work. Depending on the number, size, and spatial distribution of the catchments being processed at once, it might take a day or two for Earth Engine to extract the meteorological data and catchment attributes.
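Conceptually, extracting a catchment-level value from gridded data means averaging the grid cells that overlap the catchment polygon. The following Python sketch shows one common way to do this, an average weighted by the fraction of each cell inside the catchment. This is an assumption made for illustration only: it does not call Earth Engine, it is not the Caravan code, and the function name `catchment_mean` and the inputs are hypothetical.

```python
def catchment_mean(cell_values, overlap_fractions):
    """Average grid-cell values, weighted by the fraction of each
    cell that lies inside the catchment polygon (0.0 to 1.0)."""
    if len(cell_values) != len(overlap_fractions):
        raise ValueError("values and fractions must have the same length")
    total_weight = sum(overlap_fractions)
    if total_weight == 0:
        raise ValueError("no grid cell overlaps the catchment")
    weighted_sum = sum(v * w for v, w in zip(cell_values, overlap_fractions))
    return weighted_sum / total_weight


# Hypothetical example: three cells, the third lies outside the catchment.
precipitation = [2.0, 4.0, 6.0]
fractions = [1.0, 0.5, 0.0]
mean_precipitation = catchment_mean(precipitation, fractions)
```

Cells that only partly overlap the polygon contribute proportionally, so small catchments that cover a fraction of a grid cell still receive a value.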

Most importantly: Caravan is a community project

Even though the existing data in Caravan has good coverage over most climate zones, the spatial coverage is still patchy. This is where we see Caravan as a community effort. Using the provided code, anyone with access to streamflow data and the authorisation to redistribute it can create a Caravan extension with minimal effort and share it with the community, thus contributing to a dynamically growing dataset. A full step-by-step tutorial is available. We envision that, with many people participating, this will result in a truly global and spatially consistent large-sample hydrology dataset. A first Caravan extension has already been published by Julian Koch, adding 308 gauges in Denmark and increasing the total number of gauges to 7138.

How to cite: Kratzert, F., Nearing, G., Addor, N., Erickson, T., Gauch, M., Gilon, O., Gudmundsson, L., Hassidim, A., Klotz, D., Nevo, S., Shalev, G., and Matias, Y.: Caravan - A global community dataset for large-sample hydrology, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-5256, 2023.