From Disparate Datasets to Analysis-Ready Data Cubes with Pangeo on EarthCODE

Krasen Samardzhiev; Deyan Samardzhiev; Anca Anghelea; Ewelina Dobrowolska

doi:https://doi.org/10.5194/egusphere-egu26-21395

Session ESSI2.3

EGU26-21395, updated on 14 Mar 2026

EGU General Assembly 2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Krasen Samardzhiev¹, Deyan Samardzhiev¹, Anca Anghelea², and Ewelina Dobrowolska³

¹Lampata (krasen@lampata.co.uk)
²ESA
³Serco

The EarthCODE Open Science Catalog (https://opensciencedata.esa.int/catalog) currently contains over 300 data products, most of them the result of peer-reviewed scientific research. These exist as disparate individual datasets, mostly grouped under themes or variables. This fragmentation is a barrier to interoperability: a scientist has to combine the datasets manually, for example by reprojecting, regridding, or temporally resampling heterogeneous data.
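As an illustration of the manual harmonisation described above, the following minimal sketch aligns two series with different time steps by aggregating the daily one to monthly means. It is not taken from the EarthCODE workflow: the function `monthly_mean` and both sample series are hypothetical, and only the Python standard library is used so that the example stays self-contained. In a Pangeo setting this step would more likely be done with xarray.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def monthly_mean(samples):
    """Aggregate (date, value) samples into {(year, month): mean value}."""
    buckets = defaultdict(list)
    for day, value in samples:
        buckets[(day.year, day.month)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Hypothetical daily product: the value is simply the day of the month,
# covering January and February 2021 (31 + 28 = 59 days).
start = date(2021, 1, 1)
days = [start + timedelta(days=i) for i in range(59)]
daily = [(day, float(day.day)) for day in days]

# Hypothetical second product that is already monthly.
monthly_other = {(2021, 1): 10.0, (2021, 2): 12.0}

# Bring the daily product onto the monthly time axis, then join on shared months.
daily_as_monthly = monthly_mean(daily)
combined = {
    month: (daily_as_monthly[month], monthly_other[month])
    for month in daily_as_monthly
    if month in monthly_other
}

for month, pair in combined.items():
    print(month, pair)
```

Every researcher who needs both products has to repeat a step like this, which is the overhead a pre-combined data cube is meant to remove.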

EarthCODE is creating a new category of products, combined data cubes for each of the Open Science Catalog’s themes, to streamline access for researchers and ensure the data is truly "Analysis-Ready" (ARD). Combining the data products onto a single grid and a single projection will substantially reduce the overhead researchers face when harmonizing the relevant datasets. The workflow focuses on combining different datasets and on collaborating with scientists to curate the appropriate data and to minimise disruption during the transformation process, since any reprojection or regridding introduces uncertainties.
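A toy example can show why regridding is not lossless. The sketch below, which is purely illustrative and not the method used in the project, block-averages a one-dimensional field onto a coarser grid and then maps it back with nearest-neighbour replication. The helper names `coarsen` and `refine_nearest` and the sample values are assumptions made for this example.

```python
def coarsen(values, factor):
    """Block-average a 1-D sequence by an integer factor."""
    if len(values) % factor != 0:
        raise ValueError("length must be a multiple of the coarsening factor")
    return [
        sum(values[i:i + factor]) / factor
        for i in range(0, len(values), factor)
    ]


def refine_nearest(values, factor):
    """Map a coarse 1-D sequence back to the fine grid by repeating each cell."""
    return [value for value in values for _ in range(factor)]


# Hypothetical field sampled on a fine grid.
original = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

coarse = coarsen(original, 2)
round_trip = refine_nearest(coarse, 2)

# The round trip does not recover the original values.
errors = [abs(a - b) for a, b in zip(original, round_trip)]
max_error = max(errors)

print("coarse grid:", coarse)
print("max round-trip error:", max_error)
```

The size of such errors depends on the data and on the resampling method, which is one reason the choice of target grid benefits from input by the scientists who produced the data.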

We demonstrate the efficacy of this Pangeo-aligned workflow through the Antarctica InSync project (https://discourse-earthcode.eox.at/t/antartica-insync-data-cubes/107). This was a multi-stage pipeline that included close collaboration with the scientific community. The first step was aggregating the relevant Antarctic datasets. This step is important in its own right, since it centralizes domain knowledge and ensures the Open Science Catalog contains the latest datasets relevant to the research community.

The second step involved processing the data with cloud-native tools to bring it onto the same projection, a common grid, and in some cases the same resolution, producing coherent STAC Collections. The third step involved generating detailed variable-level metadata for all datasets to ensure high Findability and Reusability. We also provide visualisation tools to explore the data cube through cloud-optimized formats without downloading it, as well as a discussion forum. To foster open science and reproducibility, our accompanying library will contain all generalizable functions used to generate this data, allowing the community to reuse these workflows in other domains.
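The second and third steps can be sketched in miniature as follows. This is a simplified illustration under stated assumptions, not the accompanying library: it uses bilinear interpolation in pure Python, it only handles regular ascending axes that already share a projection (reprojection between coordinate reference systems is not covered), and the names `regrid_bilinear`, `surface_elevation`, `ice_velocity`, and the metadata keys other than the CF-style `units` and `long_name` are hypothetical.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Return (i, w): x lies between axis[i] and axis[i + 1] with weight w."""
    if not axis[0] <= x <= axis[-1]:
        raise ValueError(f"{x} lies outside the source grid")
    i = min(bisect_right(axis, x) - 1, len(axis) - 2)
    w = (x - axis[i]) / (axis[i + 1] - axis[i])
    return i, w


def regrid_bilinear(src_x, src_y, src, dst_x, dst_y):
    """Bilinearly interpolate src[y][x] onto the target axes."""
    out = []
    for y in dst_y:
        j, wy = _bracket(src_y, y)
        row = []
        for x in dst_x:
            i, wx = _bracket(src_x, x)
            lower = (1 - wx) * src[j][i] + wx * src[j][i + 1]
            upper = (1 - wx) * src[j + 1][i] + wx * src[j + 1][i + 1]
            row.append((1 - wy) * lower + wy * upper)
        out.append(row)
    return out


# Common target grid shared by every variable in the cube (hypothetical).
grid_x = [0.0, 1.0, 2.0]
grid_y = [0.0, 1.0, 2.0]

# Hypothetical coarse source product, with values following x + 10 * y.
coarse_x, coarse_y = [0.0, 2.0], [0.0, 2.0]
coarse_data = [[0.0, 2.0], [20.0, 22.0]]

# Hypothetical second product that already sits on the common grid.
native_data = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

cube = {
    "grid": {"x": grid_x, "y": grid_y},
    "variables": {
        "surface_elevation": {
            "data": regrid_bilinear(coarse_x, coarse_y, coarse_data, grid_x, grid_y),
            # Variable-level metadata; units and long_name follow CF-style naming,
            # regridding_method is a hypothetical provenance key.
            "attrs": {
                "units": "m",
                "long_name": "Surface elevation",
                "regridding_method": "bilinear",
            },
        },
        "ice_velocity": {
            "data": native_data,
            "attrs": {
                "units": "m/yr",
                "long_name": "Ice velocity magnitude",
                "regridding_method": "none",
            },
        },
    },
}

for name, variable in cube["variables"].items():
    print(name, variable["attrs"]["units"], variable["data"])
```

Recording the regridding method next to each variable keeps the transformation history visible to users of the cube, which supports the reusability goal described above.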

How to cite: Samardzhiev, K., Samardzhiev, D., Anghelea, A., and Dobrowolska, E.: From Disparate Datasets to Analysis-Ready Data Cubes with Pangeo on EarthCODE, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-21395, https://doi.org/10.5194/egusphere-egu26-21395, 2026.