ESSI2.13 | Big Data in Earth System Sciences: The Challenges of Data Compression & Data Spaces in the cloud computing era
Co-organized by AS5/CL5/GD10/GI2/NP4
Convener: Clément BouvierECSECS | Co-conveners: William Ray, Mattia Santoro, Juniper TyreeECSECS, Weronika Borejko, Oriol TintoECSECS, Sara Faghih-NainiECSECS
Orals | Thu, 01 May, 16:15–18:00 (CEST) | Room -2.92
Posters on site | Attendance Thu, 01 May, 10:45–12:30 (CEST) | Display Thu, 01 May, 08:30–12:30 | Hall X4
Recent Earth System Sciences (ESS) datasets, such as those resulting from high-resolution numerical modelling, have increased both in terms of precision and size. These datasets are central to the advancement of ESS for the benefit of all stakeholders and for public policymaking on climate change. Extracting the full value from these datasets requires novel approaches to access, process, and share data. It is apparent that datasets produced by state-of-the-art applications are becoming so large that even current high-capacity data infrastructures are incapable of storing them, let alone ensuring their usability. With future investment in hardware being limited, a viable way forward is to explore the possibilities of data compression and of new data space implementations.

Data compression has gained interest as a way to make data more manageable, speed up transfer times, and reduce resource needs without affecting the quality of scientific analyses. Reproducing recent ML and forecasting results has become essential for developing new methods in operational settings. At the same time, replicability is a major concern for ESS and downstream applications, and the data accuracy they require needs further investigation. Research on data reduction and prediction interpretability helps improve understanding of data relationships and prediction stability.

In addition, new data spaces are being developed in Europe, such as the Copernicus Data Space Ecosystem and the Green Deal Data Space, as well as multiple national data spaces. These provide streamlined data access, cloud processing and online visualization, generating actionable knowledge that enables more effective decision-making. Analysis-ready data can easily be accessed via APIs, transforming data access and processing scalability. Developers and users will share opportunities and challenges of designing and using data spaces for research and industry.

This session connects developers and users of ESS big data, discussing how to facilitate the sharing, integration, and compression of these datasets, focusing on:
1) Approaches and techniques to enhance shareability of high-volume ESS datasets: data compression, novel data space implementation and evolution.
2) The effect of reduced data on the quality of scientific analyses.
3) Ongoing efforts to build data spaces and connect with existing initiatives on data sharing and processing, and examples of innovative services that can be built upon data spaces.

Orals: Thu, 1 May | Room -2.92

The oral presentations are given in a hybrid format supported by a Zoom meeting featuring on-site and virtual presentations. The button to access the Zoom meeting appears just before the time block starts.
Chairpersons: Clément Bouvier, William Ray, Juniper Tyree
16:15–16:20
16:20–16:30
|
EGU25-4155
|
solicited
|
On-site presentation
Wolfgang Wagner, Matthias Schramm, Martin Schobben, Christoph Reimer, and Christian Briese

One of the most time-consuming and cumbersome tasks in Earth observation data science is finding, accessing and pre-processing geoscientific data generated by satellites, ground-based networks, and Earth system models. While the much increased availability of free and open Earth observation datasets has made this task easier in principle, scientific standards have evolved according to data availability, now emphasizing research that integrates multiple data sources, analyses longer time series, and covers larger study areas. As a result of this “rebound effect”, scientists and students may find themselves spending even more of their time on data handling and management than in the past. Fortunately, cloud platform services such as Google Earth Engine can save significant time and effort. However, until recently, there were no standardized methods for users to interact with these platforms, meaning that code written for one service could not easily be transferred to another (Schramm et al., 2021). This created a dilemma for many geoscientists: should they use proprietary cloud platforms to save time and resources at the risk of lock-in effects, or rely on publicly-funded collaborative scientific infrastructures, which require more effort for data handling? In this contribution, we argue that this dilemma is about to become obsolete thanks to rapid advancements in open source tools that allow building open, reproducible, and scalable workflows. These tools facilitate access to and integration of data from various platforms and data spaces, paving the way for the “Web of FAIR data and services” as envisioned by the European Open Science Cloud (Burgelman, 2021). We will illustrate this through distributed workflows that connect Austrian infrastructures with European platforms like the Copernicus Data Space Ecosystem and the DestinE Data Lake (Wagner et al., 2023). These workflows can be built using Pangeo-supported software libraries such as Dask, Jupyter, Xarray, or Zarr (Reimer et al., 2023). Beyond advancing scientific research, these workflows are also valuable assets for university education and training. For instance, at TU Wien, Jupyter notebooks are increasingly used in exercises involving Earth observation and climate data, and as templates for student projects and theses. Building on these educational resources, we are working on an Earth Observation Data Science Cookbook to be published on the Project Pythia website, a hub for education and training in the geoscientific Python community.
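As a loose illustration of the kind of workflow described above, the following sketch opens a Zarr store lazily with Xarray and computes a small Dask-backed aggregation. The store URL, variable name and region are hypothetical placeholders, not part of the contribution itself.

    import xarray as xr

    # Lazily open a (hypothetical) Zarr store; nothing is downloaded yet.
    ds = xr.open_zarr(
        "https://example-data-space.example.org/era5-subset.zarr",  # hypothetical URL
        consolidated=True,
    )

    # Dask-backed computation: a monthly climatology over a small region.
    clim = (
        ds["t2m"]                                   # assumed variable name
        .sel(latitude=slice(49, 46), longitude=slice(9, 17))
        .groupby("time.month")
        .mean("time")
    )
    result = clim.compute()  # the computation is only triggered here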

References

Burgelman (2021) Politics and Open Science: How the European Open Science Cloud Became Reality (the Untold Story). Data Intelligence 3, 5–19. https://doi.org/10.1162/dint_a_00069

Reimer et al. (2023) Multi-cloud processing with Dask: Demonstrating the capabilities of DestinE Data Lake (DEDL), Conference on Big Data from Space (BiDS’23), Vienna, Austria. https://doi.org/10.2760/46796

Schramm et al. (2021) The openEO API–Harmonising the Use of Earth Observation Cloud Services Using Virtual Data Cube Functionalities. Remote Sensing 13, 1125. https://doi.org/10.3390/rs13061125

Wagner et al. (2023) Federating scientific infrastructure and services for cross-domain applications of Earth observation and climate data, Conference on Big Data from Space (BiDS’23), Vienna, Austria. https://doi.org/10.34726/5309

How to cite: Wagner, W., Schramm, M., Schobben, M., Reimer, C., and Briese, C.: How open software, data and platforms are transforming Earth observation data science, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-4155, https://doi.org/10.5194/egusphere-egu25-4155, 2025.

16:30–16:40
|
EGU25-17326
|
On-site presentation
Philip Kershaw, Rhys Evans, Fede Moscato, Dave Poulter, Alex Manning, Jen Bulpett, Ed Williamson, John Remedios, Alastair Graham, Daniel Tipping, and Piotr Zaborowski

The EO DataHub is a new national data space which has been under development as part of a two-year pathfinder programme to facilitate the greater exploitation of EO data for UK industry, public sector and academia. The project has been led by the UK National Centre for Earth Observation partnered with public sector bodies, the UK Space Agency, Met Office, Satellite Applications Catapult and National Physical Laboratory and enlisting commercial suppliers for the development and delivery of the software.

The Hub enters a crowded sector, joining a growing number of similar platforms. However, as a national platform (with government as an anchor tenant), it seeks to provide a unique offering as a trusted source of data, integrating curated data products from the science community and building on UK strengths in climate research.

The architecture can be considered as a three-layer model. At the base layer, different data sources are integrated, from both commercial providers (Airbus and Planet Labs) and academic ones, including the CEDA data archive (https://archive.ceda.ac.uk) hosted on the JASMIN supercomputer (https://jasmin.ac.uk). The data catalogue now includes high and very high resolution SAR and optical products, Sentinel, UK Climate Projections, CMIP (https://wcrp-cmip.org), CORDEX (https://cordex.org) and outputs from EOCIS (https://eocis.org), consisting of a range of satellite-derived climate data products.

The middle layer, the Hub Platform provides services and APIs including federated search which integrates the data from the various providers, image visualisation, a workflow engine, user workspaces and interactive analysis environments. These build on the work of ESA's EOEPCA (https://eoepca.org) and apply open standards from the Open Geospatial Consortium and STAC (https://stacspec.org/) for cataloguing. In providing this suite of services, the goal is to provide a toolkit to facilitate application developers and EO specialists in building new applications and tools to exploit the data. This forms the final layer in the architecture: as part of the programme, three example application scenarios have been funded, each partnered with a target set of users. These include 1) an application taking climate projections and land surface temperature datasets to provide risk assessments for land assets (led by SparkGeo); 2) a land cover application (Spyrosoft) and finally 3), rather than an application in its own right, a project to develop a client toolkit for use with Jupyter Notebooks and a plugin integrating the Hub’s functionality into the open source GIS desktop application QGIS (work led by Oxidian).
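To give a general flavour of the STAC-based federated search layer (the sketch below is generic and does not use the Hub's actual endpoint, which is not given here), a minimal pystac-client query against a hypothetical STAC API could look like this:

    import pystac_client

    # Connect to a hypothetical STAC API endpoint.
    catalog = pystac_client.Client.open("https://catalogue.example-datahub.org/stac")

    # Query items by collection, bounding box and date range.
    search = catalog.search(
        collections=["sentinel-2-l2a"],          # assumed collection id
        bbox=[-1.5, 51.0, 0.5, 52.5],
        datetime="2024-06-01/2024-06-30",
        max_items=10,
    )
    for item in search.items():
        print(item.id, list(item.assets))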

Over the course of the programme, running in parallel to the system development, a dedicated study has been undertaken to develop a model for future sustainability of the platform tackling engagement with potential users and cost models. At the beginning, a funding call seeded early pilots to investigate application scenarios that the platform could support. As this initial phase of the Hub completes, work is underway to engage with early adopters and provide training resources for new users.

How to cite: Kershaw, P., Evans, R., Moscato, F., Poulter, D., Manning, A., Bulpett, J., Williamson, E., Remedios, J., Graham, A., Tipping, D., and Zaborowski, P.: The UK EO DataHub - a pathfinder programme to develop a data space for UK industry, public and academic sectors, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-17326, https://doi.org/10.5194/egusphere-egu25-17326, 2025.

16:40–16:50
|
EGU25-15282
|
On-site presentation
András Zlinszky and Grega Milcinski

Early career scientists rarely have the resources to work with earth observation data at continental to global scale. This is caused by a combination of factors: large-scale data analysis often involves teamwork, connecting data scientists, code developers, IT specialists, statisticians and geoscientists. Young researchers are rarely able to coordinate such a team. Meanwhile, all scientists can have relevant ideas or pose powerful research questions that merit investigation. The Copernicus Data Space Ecosystem provides a public, free platform for large-scale processing of earth observation data. It combines instant access to all Sentinel satellite imagery with cloud-based processing in the form of API requests and a powerful browser-based viewing interface. This new approach is enabled by storing the data in a different way: cloud-friendly formats such as JPEG2000, COG or Zarr support subsetting and querying the image rasters without first downloading and unpacking whole product files, thereby allowing direct streaming of only the area and bands that the user requests. Additionally, this means that most calculations and visualization tasks can be carried out on the server side, directly within the request process. The backend tasks of data storage and management are taken care of by the system, while the user can concentrate on the research itself.

The Copernicus Data Space Ecosystem supports several API families. OGC APIs directly enable the creation of Open Geospatial Consortium compatible map products such as WMS, WMTS, WFS or WCS services. These can be accessed with GIS software or displayed in web map tools. OData, STAC, and OpenSearch are catalogue APIs, supporting the discovery and querying of datasets in preparation for analysis. Sentinel Hub is an API family that can handle queries, raster operations, and raster-vector integration for deriving statistics. The main advantages of the Sentinel Hub APIs are their efficient use and integration with advanced visualization in the Copernicus Browser.

openEO is a fully open-source data analysis framework designed specifically to support FAIR principles. It is independent of underlying data formats, working with its own data cube abstraction, and can be used from several coding languages. openEO connects to all STAC-compliant repositories, enabling integration between Sentinel data and other sources. Processing tools include many mathematical operations, but also standard machine learning processes. The system is designed with upscaling in mind: the command structure is the same for small and large areas, with storage and asynchronous processing managed by the backend.
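For illustration, a minimal openEO request against the Copernicus Data Space Ecosystem backend could look as follows; the area, dates and band maths are arbitrary examples, and small results can be downloaded synchronously:

    import openeo

    # Connect to the openEO backend and authenticate.
    connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()

    # Load a small Sentinel-2 cube and compute NDVI on the server side.
    cube = connection.load_collection(
        "SENTINEL2_L2A",
        spatial_extent={"west": 16.3, "south": 48.1, "east": 16.5, "north": 48.3},
        temporal_extent=["2024-06-01", "2024-06-30"],
        bands=["B04", "B08"],
    )
    ndvi = cube.ndvi(red="B04", nir="B08")
    ndvi.download("ndvi_june_2024.tiff")  # small requests run synchronously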

Both API families come with a comprehensive scheme of tutorials and documentation to allow step-by-step learning, and an online Jupyter Lab virtual machine facility. Therefore, early-career scientists with a basic understanding of programming can quickly learn to apply their domain knowledge, while creating solutions that are easy to share and replicate.

All in all, the Copernicus Data Space Ecosystem is a transformative tool for earth observation, significantly lowering the barrier to applying earth observation at large scale in the geosciences.

How to cite: Zlinszky, A. and Milcinski, G.: Copernicus Data Space Ecosystem empowers early-career scientists to do global scale earth observation data analysis, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-15282, https://doi.org/10.5194/egusphere-egu25-15282, 2025.

16:50–17:00
|
EGU25-11118
|
On-site presentation
Bo Møller Stensgaard, Casper Bramm, Marie Katrine Traun, and Søren Lund Jensen

The exponential growth of LIDAR and satellite data in geoscience presents both opportunities and challenges for users. Traditional data handling methods often struggle with the sheer volume and complexity of these datasets, hindering easy accessibility, efficient analysis and decision-making processes. This presentation introduces the Scandinavian Highlands HEX-Responder platform and database structure, a cutting-edge solution that leverages the power of hexagonal discrete global grid system, Uber H3, and developed processes to revolutionize geospatial data management, fast responsive visualization and analysis.

We will showcase real-world applications, highlighting the platform's potential to accelerate scientific discovery and improve decision-making processes using satellite and remote sensing data.

The platform’s approach offers several advantages over conventional methods:

  • Efficient data organization and retrieval
  • Improved advanced spatial data analyses opportunities
  • Seamless integration of multi-scale and multi-dimensional data without losing information
  • Enhanced, responsive and fast visualisation capabilities

Our ELT (extract, load, transform) and subsequent visualisation procedure can be applied to any big raster data format. First, the raw raster data is transformed into optimised Parquet files through chunked reading and compression based on a low-resolution H3 hexagon cell index (hexagonization), enabling rapid data import into a column-oriented database management system for big data storage, processing and analytics. The H3 cell organisation is preserved in the database through partitioned fetching for visualisation on the platform. This method allows for horizontal scaling and accurate multi-resolution aggregation, preserves data integrity across scales, and largely overcomes typical computational memory limitations.
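A minimal sketch of the hexagonization idea, using the h3-py (v4 API assumed) and pandas libraries rather than the HEX-Responder pipeline itself, with toy records and an arbitrary resolution:

    import h3          # h3-py, v4 API assumed
    import pandas as pd

    # Toy stand-in for raster pixels flattened to point records.
    df = pd.DataFrame({
        "lat": [55.71, 55.71, 55.72],
        "lon": [12.55, 12.56, 12.55],
        "value": [3.2, 3.4, 2.9],
    })

    # "Hexagonize": index every record by its H3 cell at a chosen resolution.
    RES = 9
    df["h3"] = [h3.latlng_to_cell(lat, lon, RES) for lat, lon in zip(df.lat, df.lon)]

    # Aggregate per cell and persist as a column-oriented Parquet file.
    agg = df.groupby("h3")["value"].mean().reset_index()
    agg.to_parquet("values_h3_r9.parquet", index=False)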

The platform's capabilities are exemplified by its approach to LIDAR and satellite emissivity data processing using the H3 grid. High-resolution LIDAR data is efficiently gridded and visualized to H3 resolution level 15 hexagons (0.9 m² hexagon cells). The gridding preserves all original pixel raster points while providing aggregated views for seamless zooming.

Another prime example of the capabilities is the handling of NASA’s ASTER Global Emissivity Data (100m resolution). Here, our pipeline transformed 2.1 terabytes of extracted raw CSV-data derived from NASA’s emissivity data into a compressed format based on the H3 index occupying only 593 gigabytes in the database.

This approach not only saves data storage space but also dramatically improves data accessibility and processing speed, allowing users to work in a responsive environment with this massive dataset in ways previously not possible. Each hexagon represents an opportunity to store unlimited amounts, types and categories of pre-processed data for more integrative analyses and data insight.

By hexagonizing LIDAR and satellite data, the HEX-Responder platform enables users to explore massive datasets with ease and efficiency in a responsive environment. The integrated procedures allow for detailed information maintenance and retrieval, paving the way for advanced predictive modelling in geoscience applications using earth observation data in a new way.  

How to cite: Stensgaard, B. M., Bramm, C., Traun, M. K., and Jensen, S. L.: Too Big to Handle? Hexagonizing LIDAR and Satellite Data in Geoscience Applications, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-11118, https://doi.org/10.5194/egusphere-egu25-11118, 2025.

17:00–17:05
17:05–17:15
|
EGU25-13394
|
ECS
|
On-site presentation
Milan Klöwer, Tim Reichelt, Juniper Tyree, Ayoub Fatihi, and Hauke Schulz

Climate data compression urgently needs new standards. The continuously growing exascale mountain of data requires compressors that are widely used and supported, essentially hiding the compression details from many users. With the advent of AI revolutionising scientific computing, we have to set the rules of this game. Minimising information loss, maximising compression factors, at any resolution, grid and dataset size, for all variables, with chunks and random access, while preserving all statistics and derivatives, at a reasonable speed: all of this amounts to squaring the compression circle. Many promising compressors are hardly used as trust among domain scientists is hard to gain: the large spectrum of research questions and applications using climate data is very difficult to satisfy simultaneously.

Here, we illustrate the motivation behind the newly defined climate data compression benchmark ClimateBenchPress, designed as a quality check in all those dimensions of the problem. Any benchmark will inevitably undersample this space, but we define datasets from atmosphere, ocean, and land as well as evaluation metrics to pass. Results are presented as score cards, highlighting strengths and weaknesses for every compressor.

The bitwise real information content provides a systematic approach when no error bounds are known. In the case of the ERA5 reanalysis, errors are estimated and allow us to categorize many variables into linear, log and beta distributions, with values bounded on zero, one or both sides, respectively. This allows us to define error thresholds arising directly from observation and model errors, providing another alternative to the still predominant subjective choices. Most error-bounded compressors come with parameters that can be chosen automatically following this analysis.
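As a small illustration of how such thresholds can be acted upon, the sketch below applies bit rounding with a fixed number of kept mantissa bits before lossless compression, using the numcodecs and zarr-python 2.x APIs; the keepbits value is an arbitrary assumption rather than a derived information threshold:

    import numpy as np
    import numcodecs
    import zarr

    data = np.random.rand(256, 256).astype("float32")   # synthetic stand-in field

    # Zero all but 7 mantissa bits (assumed threshold), then compress losslessly;
    # the discarded bits are essentially incompressible noise anyway.
    z = zarr.open(
        "rounded.zarr", mode="w",
        shape=data.shape, chunks=(64, 64), dtype="f4",
        filters=[numcodecs.BitRound(keepbits=7)],
        compressor=numcodecs.Blosc(cname="zstd", clevel=5),
    )
    z[:] = data
    print(z.info)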

New data formats are also on the horizon: chunking and hierarchical data structures both allow and force us to adapt compressors to spatially or length-scale dependent information densities. Extreme events, maybe counterintuitively, often increase compressibility through higher uncertainties, but lie on the edge of or outside the training data for machine-learned compressors. This again increases the need for well-tested compressors. Benchmarks like ClimateBenchPress are required to encourage new standards for safe lossy climate data compression.

How to cite: Klöwer, M., Reichelt, T., Tyree, J., Fatihi, A., and Schulz, H.: Challenges and perspectives of climate data compression in times of kilometre-scale models and generative machine learning, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-13394, https://doi.org/10.5194/egusphere-egu25-13394, 2025.

17:15–17:25
|
EGU25-5977
|
On-site presentation
Uwe Ehret, Jieyu Chen, and Sebastian Lerch

Meteorological observations (e.g. from weather radar) and the output of meteorological models (e.g. from reanalyses or forecasts) are often stored and used in the form of time series of 2-d spatial gridded fields. With increasing spatial and temporal resolution of these products, and with the transition from providing single deterministic fields to providing ensembles, their size has dramatically increased, which makes use, transfer and archiving a challenge. Efficient compression of such fields - lossy or lossless - is required to solve this problem.

The goal of this work was therefore to apply several lossy compression algorithms for 2d spatial gridded meteorological fields, and to compare them in terms of compression rate and information loss compared to the original fields. We used five years of hourly observations of rainfall and 2m air temperature on a 250 x 400 km region over central Germany on a 1x1 km grid for our analysis.

In particular, we applied block averaging as a simple benchmark method, Principal Component Analysis, Autoencoder Neural Network (Hinton and Salakhutdinov, 2006) and the Ramer-Douglas-Peucker algorithm (Ramer, 1972; Douglas and Peucker, 1973) known from image compression. Each method was applied for various compression levels, expressed as the number of objects of the compressed representation, and then the (dis-)similarity of the original field and the fields reconstructed from the compressed fields was measured by Mean Absolute Error, Mean Square Error, and the Image Quality Index (Wang and Bovik, 2002). First results indicate that even for spatially heterogeneous fields like rainfall, very high compression can be achieved with small error.
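A bare-bones version of one of the benchmarked approaches, PCA-based compression of a stack of 2-d fields with reconstruction error metrics, might look like the sketch below (synthetic data and an arbitrary component count, not the study's actual configuration):

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic stand-in for a (n_times, ny, nx) stack of hourly 2-d fields.
    n_times, ny, nx = 500, 100, 160
    fields = np.random.rand(n_times, ny, nx).astype("float32")
    X = fields.reshape(n_times, ny * nx)

    # Keep a fixed number of principal components as the compressed representation.
    pca = PCA(n_components=20).fit(X)
    X_compressed = pca.transform(X)                  # shape (n_times, 20)
    X_reconstructed = pca.inverse_transform(X_compressed)

    mae = np.mean(np.abs(X - X_reconstructed))
    mse = np.mean((X - X_reconstructed) ** 2)
    print(f"MAE={mae:.4f}  MSE={mse:.4f}")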

 

References

Douglas, D. and Peucker, T.: Algorithms for the reduction of the number of points required to represent a digitized line or its caricature, The Canadian Cartographer, 10(2), 112–122, 1973.

Hinton, G. E. and Salakhutdinov, R. R.: Reducing the dimensionality of data with neural networks, Science, 313(5786), 504–507, 2006.

Ramer, U.: An iterative procedure for the polygonal approximation of plane curves, Computer Graphics and Image Processing, 1, 244-256, http://dx.doi.org/10.1016/S0146-664X(72)80017-0, 1972.

Wang, Z. and Bovik, A. C.: A universal image quality index, IEEE Signal Processing Letters, 9, 81–84, https://doi.org/10.1109/97.995823, 2002.

How to cite: Ehret, U., Chen, J., and Lerch, S.: A comparative study of algorithms for lossy compression of 2-d meteorological gridded fields, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-5977, https://doi.org/10.5194/egusphere-egu25-5977, 2025.

17:25–17:35
|
EGU25-7371
|
ECS
|
On-site presentation
Robert Underwood, Jinyang Liu, Kai Zhao, Sheng Di, and Franck Cappello

As climate and weather scientists strive to increase the accuracy and understanding of our world, models of weather and climate have increased in resolution to the square-kilometre scale and become more complex, increasing their demands for data storage. A recent SCREAM study run at 3.5 km resolution produced nearly 4.5 TB of data per simulated day, and the recent CMIP6 simulations produced nearly 28 PB of data. At the same time, storage and power capacity at facilities conducting climate experiments are not increasing at the same rate as the volume of climate and weather datasets, leading to a pressing challenge to reduce data volumes. While some in the weather and climate community have adopted lossless compression, these techniques frequently produce compression ratios on the order of 1.3×, which are insufficient to alleviate storage constraints on facilities. Therefore, additional techniques, such as science-preserving lossy compression that can achieve higher compression ratios, are necessary to overcome these challenges.

While data compression is an important topic for climate and weather applications, many current assessments of its effectiveness on climate and weather datasets do not consider the state of the art in compressor design and instead assess scientific compressors that are 3–11 years old, substantially behind the state of the art. In this report:

  • We assess the current state of the art in advanced scientific lossy compressors against the state of the art in quality assessment criteria proposed for the ERA5 dataset, to identify the gaps between the required performance and the capabilities of current compressors.
  • We present new capabilities that allow us to build an automated, user-friendly, and extensible pipeline for quickly finding compressor configurations that maximize compression ratios while preserving the scientific integrity of the data, using codes developed as part of the NSF FZ project.
  • We demonstrate a number of capabilities that facilitate use within the weather and climate community, including NetCDF, HDF5, and GRIB file format support; support for innovation via Python, R, and Julia as well as low-level languages such as C/C++; and implementations of commonly used climate quality metrics, including dSSIM, with the ability to add new metrics in high-level languages.
  • Utilizing this pipeline, we find that advanced scientific compressors can achieve a 6.4x or greater improvement in compression ratio over previously evaluated compressors (a minimal error-bound sweep in the spirit of such a pipeline is sketched below).
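The following sketch shows the kind of error-bound sweep such a pipeline automates, using the zfpy bindings of the zfp compressor purely as a stand-in; it is not the FZ pipeline itself, and the tolerances and field are arbitrary:

    import numpy as np
    import zfpy   # Python bindings of the zfp compressor, used here as a stand-in

    field = np.random.rand(180, 360)   # synthetic 2-d field, float64

    # Sweep absolute error bounds and report the achieved compression ratio.
    for tol in (1e-2, 1e-3, 1e-4):
        compressed = zfpy.compress_numpy(field, tolerance=tol)
        restored = zfpy.decompress_numpy(compressed)
        ratio = field.nbytes / len(compressed)
        assert np.max(np.abs(restored - field)) <= tol   # error bound respected
        print(f"tolerance={tol:g}  ratio={ratio:.1f}x")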

How to cite: Underwood, R., Liu, J., Zhao, K., Di, S., and Cappello, F.: Evaluating Advanced Scientific Compressors on Climate Datasets, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7371, https://doi.org/10.5194/egusphere-egu25-7371, 2025.

17:35–17:45
|
EGU25-17172
|
ECS
|
On-site presentation
Amelie Koch, Isabelle Wittmann, Carlos Gomez, Rikard Vinge, Michael Marszalek, Conrad Albrecht, and Thomas Brunschwiler

The exponential growth of Earth Observation data presents challenges in storage, transfer, and processing across fields such as climate modeling, disaster response, and agricultural monitoring. Efficient compression algorithms—either lossless or lossy—are critical to reducing storage demands while preserving data utility for specific applications. Conventional methods, such as JPEG and WebP, rely on hand-crafted base functions and are widely used. However, Neural Compression, a data-driven approach leveraging deep neural networks, has demonstrated superior performance by generating embeddings suitable for high levels of entropy encoding, enabling more accurate reconstructions at significantly lower bit rates.

In our prior work, we developed a Neural Compression pipeline utilizing a masked auto-encoder, embedding quantization, and an entropy encoder tailored for satellite imagery [1]. Instead of reconstructing original images, we evaluated the reconstructed embeddings for downstream tasks such as image classification and semantic segmentation. In this study, we conducted an ablation analysis to quantify the contributions of individual pipeline components—encoder, quantizer, and entropy encoder—toward the overall compression rate. Our findings reveal that satellite images achieve higher compression rates compared to ImageNet samples due to their lower entropy. Furthermore, we demonstrate the advantages of learned entropy models over hand-crafted alternatives, achieving better compression rates, particularly for datasets with seasonal or geospatial coherence. Based on these insights, we provide a list of recommendations for optimizing Neural Compression pipelines to enhance their performance and efficiency.

This work was conducted under the Embed2Scale project, supported by the Swiss State Secretariat for Education, Research and Innovation (SERI contract no. 24.00116) and the European Union (Horizon Europe contract no. 101131841).

[1] C. Gomes and T. Brunschwiler, “Neural Embedding Compression for Efficient Multi-Task Earth Observation Modelling,” IGARSS 2024, Athens, Greece, 2024, pp. 8268-8273, doi: 10.1109/IGARSS53475.2024.10642535.

How to cite: Koch, A., Wittmann, I., Gomez, C., Vinge, R., Marszalek, M., Albrecht, C., and Brunschwiler, T.: Neural Embedding Compression for Earth Observation Data – an Ablation Study, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-17172, https://doi.org/10.5194/egusphere-egu25-17172, 2025.

17:45–17:55
|
EGU25-20430
|
On-site presentation
David Hassell, Sadie Bartholomew, Bryan Lawrence, and Daniel Westwood

The CF (Climate and Forecast) metadata conventions for netCDF datasets describe means of "compression-by-convention", i.e. methods for compressing and decompressing data according to algorithms that are fully described within the conventions themselves. These algorithms, which can be lossless or lossy, are not applicable to arbitrary data, rather the data have to exhibit certain characteristics to make the compression worthwhile, or even possible.

Aggregation, available in CF-1.13, provides the ability to view, as a single entity, a dataset that has been partitioned across multiple independent datasets on disk, whilst taking up very little extra space, since the aggregation dataset contains no copies of the data in its component datasets. Aggregation can facilitate a range of activities such as data analysis, by avoiding the computational expense of deriving the aggregation at the time of analysis; archive curation, by acting as a metadata-rich archive index; and the post-processing of model simulation outputs, by spanning multiple datasets written at run time that together constitute a more cohesive and useful product. CF aggregation currently has cf-python and xarray implementations.

The conceptual CF data model recognises neither compression nor aggregation, choosing to view all CF datasets as if they were uncompressed and contained all of their own data. As a result, the cf-python data analysis library, which is built exactly on the CF data model, also presents datasets lazily to the user in this manner, decompressing or re-combining the data in memory only when the user actually accesses it, at which point this happens automatically. This approach allows users to interact with their data in an intuitive and efficient manner, and removes the need to assimilate large parts of the CF conventions or to write their own code for dealing with the compression and aggregation algorithms.

We will introduce compression by ragged arrays (as used by Discrete Sampling Geometry features, such as timeseries and trajectories) and dataset aggregation, with cf-python examples to demonstrate the ease of use that arises from using the CF data model interpretation of the data.
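For illustration, a minimal cf-python session that reads, lazily aggregates and rewrites a set of files could look like the sketch below; the file names are placeholders and the aggregation behaviour follows the library's defaults:

    import cf

    # Read a set of (hypothetical) model output files; cf-python aggregates the
    # fields it can combine into single, lazily evaluated fields.
    fields = cf.read("model_output_*.nc", aggregate=True)
    print(fields)

    # Data are only decompressed / re-combined in memory when actually accessed.
    tas = fields[0]
    print(tas.shape)
    values = tas.array   # access triggers the underlying reads

    # Write the fields back out; compression-by-convention details are handled
    # by the library.
    cf.write(fields, "combined.nc")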

How to cite: Hassell, D., Bartholomew, S., Lawrence, B., and Westwood, D.: Compression and Aggregation: a CF data model approach, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-20430, https://doi.org/10.5194/egusphere-egu25-20430, 2025.

17:55–18:00

Posters on site: Thu, 1 May, 10:45–12:30 | Hall X4

The posters scheduled for on-site presentation are only visible in the poster hall in Vienna. If authors uploaded their presentation files, these files are linked from the abstracts below.
Display time: Thu, 1 May, 08:30–12:30
Chairpersons: Clément Bouvier, Sara Faghih-Naini
X4.68
|
EGU25-15672
|
ECS
Oriol Tinto, Xavier Yepes, and Pierre Antoine Bretonniere

The rapid growth of Earth System Sciences (ESS) datasets, driven by high-resolution numerical modeling, has outpaced storage and data-sharing capabilities. To address these challenges, we investigated lossy compression techniques as part of the EERIE project, aiming to significantly reduce storage demands while maintaining the scientific validity of critical diagnostics.

Our study examined two key diagnostics: Sea Surface Height (SSH) variability and ocean density, essential for understanding climate dynamics. Leveraging tools such as SZ3 and enstools-compression, we achieved data volume reductions by orders of magnitude without compromising the diagnostics' accuracy. Compression-induced differences were found to be negligible compared to the inherent variability between model outputs and observational datasets, underscoring the robustness of these methods.
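A sketch of the kind of check underlying such a comparison, computing the same SSH-variability diagnostic from an original and a lossily compressed copy and inspecting their difference; the file and variable names are hypothetical:

    import numpy as np
    import xarray as xr

    # Hypothetical files: the original output and an SZ3-compressed copy.
    ssh_orig = xr.open_dataset("ssh_original.nc")["zos"]   # "zos": assumed SSH variable
    ssh_comp = xr.open_dataset("ssh_sz3.nc")["zos"]

    # Diagnostic: temporal standard deviation of sea surface height.
    std_orig = ssh_orig.std("time")
    std_comp = ssh_comp.std("time")

    # Compression-induced difference in the diagnostic, to be judged against the
    # spread between model outputs and observations.
    diff = np.abs(std_comp - std_orig)
    print(float(diff.max()), float(diff.mean()))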

Additionally, our work highlighted inefficiencies in current workflows, including the prevalent use of double precision in post-processing. We proposed improvements to align data precision with the original model outputs, further optimizing storage and computation. Integrating lossy compression into existing workflows via widely used formats like NetCDF and HDF5 demonstrates a practical path forward for sustainable ESS data management.

This study showcases the transformative potential of lossy compression to make high-resolution datasets more manageable, ensuring they remain accessible and scientifically reliable for stakeholders while significantly reducing resource demands.

How to cite: Tinto, O., Yepes, X., and Bretonniere, P. A.: Scaling Down ESS Datasets: Lessons from the EERIE Project on Compression, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-15672, https://doi.org/10.5194/egusphere-egu25-15672, 2025.

X4.69
|
EGU25-16791
|
ECS
Jae-Ho Lee, Yong Sun Kim, and Sung-Dae Kim

This study developed a monthly regional atlas of dissolved oxygen (DO) with a quarter-degree horizontal resolution and 73 vertical levels over the northwestern Pacific. We used 586,851 observed profiles and the gridded World Ocean Atlas 2023 (WOA23) at 1° resolution, adopting simple kriging for horizontal interpolation and vertical stabilizing techniques to produce the new atlas. This approach efficiently mitigates artificial water masses and statistical noise. The new DO climatology provides detailed information along coasts and renders a more realistic oxygen distribution associated with the current system in the western North Pacific than WOA23. A meridional section demonstrates that the newly developed atlas does not yield the artificial noise-like spikes frequently observed in WOA23 in the East Sea. We expect that this new atlas will allow biogeochemical numerical models to enhance their diagnostic and forecasting performance.
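For readers unfamiliar with the interpolation step, the sketch below shows a minimal simple-kriging estimator with an exponential covariance model; the coordinates, DO values and covariance parameters are made up and are not those used for the atlas:

    import numpy as np

    def simple_kriging(obs_xy, obs_val, grid_xy, mean, sill=1.0, length=2.0):
        # Minimal simple kriging with an exponential covariance model.
        def cov(a, b):
            d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
            return sill * np.exp(-d / length)

        C = cov(obs_xy, obs_xy) + 1e-10 * np.eye(len(obs_xy))  # numerical jitter
        c0 = cov(obs_xy, grid_xy)
        weights = np.linalg.solve(C, c0)                       # (n_obs, n_grid)
        return mean + weights.T @ (obs_val - mean)

    obs_xy = np.array([[128.0, 36.0], [130.5, 37.2], [132.0, 38.5]])   # lon, lat
    obs_do = np.array([250.0, 230.0, 210.0])       # made-up DO values (umol/kg)
    grid_xy = np.array([[129.0, 36.5], [131.0, 38.0]])
    print(simple_kriging(obs_xy, obs_do, grid_xy, mean=235.0))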

How to cite: Lee, J.-H., Kim, Y. S., and Kim, S.-D.: Development and performance evaluation of dissolved oxygen climatology in the Northwestern Pacific, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-16791, https://doi.org/10.5194/egusphere-egu25-16791, 2025.

X4.70
|
EGU25-19418
Karsten Peters-von Gehlen, Juniper Tyree, Sara Faghih-Naini, Peter Dueben, Jannek Squar, and Anna Fuchs

It is apparent that the data amounts expected to be generated by current and upcoming Earth System Science research and operational activities stress the capabilities of HPC and associated data infrastructures. Individual research projects focusing on running global Earth System Models (ESMs) at spatial resolution of 5km or less can easily occupy several petabytes on disk. With multiple of such projects running on a single HPC infrastructure, the challenge of storing the data alone becomes apparent. Further, community-driven activities like model intercomparison projects – which are conducted for both conventional and high-resolution model setups – add to the aforementioned strain on storage systems. Hence, when planning for next-generation HPC systems, the storage requirements of state-of-the-art ESM-centered projects have to be clear so that systems are still fit-for-use 5 years down the road from the initial planning stage.

As computational hardware costs per performance unit (FLOP or Byte) are not decreasing anymore like they have in the past decades, HPC system key figures do not increase substantially anymore from one generation to the next. The mismatch between demands of research and what future systems can offer is therefore clear.

One apparent solution to this problem is to simply reduce the amount of data from ESM simulations stored on a system. Data compression is one candidate to achieve this. Current ESM projects already utilize application-side lossless compression techniques, which help reduce storage space. However, decompression may incur performance penalties, especially when read patterns misalign with the compression block sizes. Lossy compression offers the potential for higher compression rates, without access penalties for data retrieval. However, its suitability is highly content-dependent, raising questions about which lossy compression methods are best suited for specific datasets. On a large scale, applying lossy compression also prompts the consideration of how such data reduction could shape the design of next-generation HPC architectures.
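As a small example of the application-side lossless route mentioned above, the variables of a netCDF file can be rewritten with deflate compression and read-pattern-aligned chunking via xarray; the file name and the one-time-slice-per-chunk choice are assumptions:

    import xarray as xr

    ds = xr.open_dataset("esm_output.nc")   # hypothetical input file

    # Lossless deflate per variable, with chunk sizes chosen to match an assumed
    # read pattern of one time slice at a time.
    encoding = {
        name: {"zlib": True, "complevel": 4,
               "chunksizes": (1,) + var.shape[1:]}
        for name, var in ds.data_vars.items()
        if var.ndim >= 2
    }
    ds.to_netcdf("esm_output_deflate.nc", encoding=encoding)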

With lossy compression not being very popular in the ESM community so far, we present a key development of the ongoing ESiWACE3 project: an openly accessible Jupyter-based online laboratory for testing lossy compression techniques on ESM output datasets. This online tool currently comes with a set of notebooks allowing users to objectively evaluate the impact lossy compression has on analyses performed on the compressed data compared to the input data. With some compressors promising compression ratios of 10x-1000x, providing such tools to ensure compression quality is essential. The motivation behind the online compression laboratory is to foster the acceptance of lossy compression techniques by conveying first-hand experience and immediate feedback on the benefits or drawbacks of applying lossy compression algorithms.

Going one step further, we illustrate the impacts that applying lossy-compression techniques on ESM data on large-scales can have on the design decisions made for upcoming HPC infrastructures. We illustrate, among others, that increased acceptance and application of lossy compression techniques enables more efficient resource utilization and allows for smarter reinvestment of funds saved from reduced storage demands, potentially leading to the acquisition of smaller systems and thus enabling increased research output per resource used.

How to cite: Peters-von Gehlen, K., Tyree, J., Faghih-Naini, S., Dueben, P., Squar, J., and Fuchs, A.: Lossy Data Compression Exploration in an Online Laboratory and the Link to HPC Design Decisions, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-19418, https://doi.org/10.5194/egusphere-egu25-19418, 2025.

X4.71
|
EGU25-12760
|
ECS
Niklas Böing, Johannes Holke, Achim Basermann, Gregor Gassner, and Hendrik Fuchs

Large-scale Earth system model simulations produce huge amounts of data. Due to limited I/O bandwidth and available storage space, this data often needs to be reduced before being written to disk or stored permanently. Error-bounded lossy compression is an effective approach to tackle the trade-off between accuracy and storage space.

We are exploring and discussing lossless as well as error-bounded lossy compression based on tree-based adaptive mesh refinement/coarsening (AMR) techniques. Our lossy compression schemes allow for absolute and relative error bounds. The data reduction methods are closely linked to an underlying (adaptive) mesh which easily permits error regions of different error tolerances and criteria – in particular, we allow nested domains of varying error tolerances specified by the user. Moreover, some of the compressed data structures allow for an incremental decompression in the resolution of the data which may be favorable for transmission and visualization.

We implement these techniques as the open source tool cmc, which is based on the parallel AMR library t8code. The compression tool can be linked to and used by arbitrary simulation applications or executed as a post-processing step. We show different application results of the compression in comparison to current state-of-the-art compression techniques on several benchmark data sets.
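To make the idea of error-bounded coarsening concrete, here is a deliberately simplified, single-level sketch in NumPy that replaces 2x2 blocks by their mean only where an absolute error bound holds; it illustrates the principle only and is not the cmc/t8code implementation:

    import numpy as np

    def coarsen_if_within_tol(field: np.ndarray, tol: float) -> np.ndarray:
        # Replace each 2x2 block by its mean if that stays within an absolute
        # error bound; otherwise keep the original values (one level only).
        out = field.copy()
        ny, nx = field.shape
        for j in range(0, ny - 1, 2):
            for i in range(0, nx - 1, 2):
                block = field[j:j + 2, i:i + 2]
                mean = block.mean()
                if np.max(np.abs(block - mean)) <= tol:
                    out[j:j + 2, i:i + 2] = mean
        return out

    field = np.random.rand(64, 64)
    reduced = coarsen_if_within_tol(field, tol=0.05)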

How to cite: Böing, N., Holke, J., Basermann, A., Gassner, G., and Fuchs, H.: Tree-Based Adaptive Data Reduction Techniques for Scientific Simulation Data, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-12760, https://doi.org/10.5194/egusphere-egu25-12760, 2025.

X4.72
|
EGU25-17102
Mi-Jin Jang, Jae-Ho Lee, and Yong Sun Kim

Surface ocean currents are crucial for enhancing the safety and efficiency of maritime logistics and transportation, boosting fisheries production and management, and supporting military operations. This study analyzed 25,342 trajectories from NOAA’s Global Drifter Program (1991–2020), 12 from KIOST, and 63 from KHOA (2015–2024). Surface drifters entering the East Sea were extracted, and a five-step quality control process was implemented: unobserved values were removed, and quality control was applied based on drogue loss, abnormal or stuck speeds, and unrealistic accelerations. To estimate gridded ocean currents at high resolution, we removed the Ekman current and tides from the observed velocities and took advantage of a simple kriging approach. Validation against existing datasets confirmed that the major ocean currents exhibited patterns similar to the absolute geostrophic currents from satellite-based altimetry. The constructed dataset is expected to contribute to the accurate identification of surface current movements and to the development of realistic models that incorporate regional characteristics through data assimilation.

How to cite: Jang, M.-J., Lee, J.-H., and Kim, Y. S.: Calculation of Gridded Surface Current from Observed Lagrangian Trajectories in the East Sea, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-17102, https://doi.org/10.5194/egusphere-egu25-17102, 2025.

X4.73
|
EGU25-20188
|
ECS
Pieter Rijsdijk, Henk Eskes, Kazuyuki Miyazaki, Takashi Sekiya, and Sander Houweling

Satellite observations of tropospheric trace gases and aerosols are evolving rapidly. Recently launched instruments provide increasingly higher spatial resolutions with footprint diameters in the range of 2–8 km, with daily global coverage for polar orbiting satellites or hourly observations from geostationary orbit. Often the modelling system has a lower spatial resolution than the satellites used, with a model grid size in the range of 10–100 km. When the resolution mismatch is not properly bridged, the final analysis based on the satellite data may be degraded. Superobservations are averages of individual observations matching the resolution of the model and also serve to reduce the data load on the assimilation system. In this paper, we discuss the construction of superobservations, their kernels and uncertainty estimates. The methodology is applied to nitrogen dioxide tropospheric column measurements of the TROPOMI instrument on the Sentinel-5P satellite. In particular, the construction of realistic uncertainties for the superobservations is non-trivial and crucial to obtaining close to optimal data assimilation results. We present a detailed methodology to account for the representativity error when satellite observations are missing due to, e.g., cloudiness. Furthermore, we account for systematic errors in the retrievals leading to error correlations between nearby individual observations contributing to one superobservation. Correlation information is typically missing in the retrieval products where an error estimate is provided for individual observations. The various contributions to the uncertainty are analysed: from the spectral fitting, the estimate of the stratospheric contribution to the column and the air-mass factor. The method is applied to TROPOMI data but can be generalised to other trace gases such as HCHO, CO, SO2 and other instruments such as OMI, GEMS and TEMPO. The superobservations and uncertainties are tested in the ensemble Kalman filter chemical data assimilation system developed by JAMSTEC. These are shown to improve forecasts compared to thinning or compared to assuming fully correlated or uncorrelated uncertainties within the superobservation. The use of realistic superobservations within model comparisons and data assimilation in this way aids the quantification of air pollution distributions, emissions and their impact on climate.
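A toy sketch of how a superobservation and its uncertainty can be formed when the contributing retrieval errors are assumed to share a single, uniform correlation; the numbers and correlation value are made up, and the full method additionally accounts for representativity and other error terms:

    import numpy as np

    def superobservation(values, sigmas, weights, corr=0.15):
        # Weighted mean of individual retrievals plus an uncertainty assuming a
        # uniform error correlation `corr` between them (a simplification).
        v = np.asarray(values, dtype=float)
        s = np.asarray(sigmas, dtype=float)
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        mean = np.sum(w * v)
        var = corr * np.sum(w * s) ** 2 + (1.0 - corr) * np.sum((w * s) ** 2)
        return mean, np.sqrt(var)

    # Toy example: four pixels falling into one model grid cell.
    mean, sigma = superobservation(
        values=[2.1e15, 2.4e15, 1.9e15, 2.2e15],   # molecules/cm^2, made-up numbers
        sigmas=[0.5e15, 0.6e15, 0.5e15, 0.7e15],
        weights=[0.9, 1.0, 0.8, 1.0],              # e.g. overlap fractions
    )
    print(mean, sigma)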

How to cite: Rijsdijk, P., Eskes, H., Miyazaki, K., Sekiya, T., and Houweling, S.: Creating TROPOMI superobservations for data assimilation and model evaluation, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-20188, https://doi.org/10.5194/egusphere-egu25-20188, 2025.

X4.74
|
EGU25-1294
Flavien Gouillon, Cédric Penard, Xavier Delaunay, and Sylvain Herlédan

NetCDF (Network Common Data Form) is a self-describing, portable and platform-independent format for array-oriented scientific data which has become a community standard for sharing measurements and analysis results in the fields of oceanography, meteorology but also in the space domain.

The volume of scientific data is continuously increasing at a very fast rate. Object storage, a new paradigm that appeared with cloud infrastructures, can help with data storage and parallel access issues, but NetCDF may not be able to get the most out of this technology without some tweaks and fine tuning.

The availability of ample network bandwidth within cloud infrastructures allows for the utilization of large amounts of data. Processing data where the data is located is preferable as it can result in substantial resource savings. But for some use cases downloading data from the cloud is required (e.g. processing also involving confidential data) and results still have to be fetched once processing tasks have been executed on the cloud.

Networks exhibit significant variations in capacity and quality (ranging from fiber-optic and copper connections to satellite connections with poor reception in degraded conditions on boats, among other scenarios). Therefore, it is crucial for formats and software libraries to be specifically designed to optimize access to data by minimizing the transfer to only what is strictly necessary.

In this context, a new approach has emerged in the form of a library that indexes the content of netCDF-4 datasets. This indexing enables the retrieval of sub-chunks, which are pieces of data smaller than a chunk, without the need to reformat the existing files. This approach targets access patterns such as time series in netCDF-4 datasets formatted with large chunks.
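The chunk-level byte offsets that such an index builds upon can be inspected with h5py's low-level API (a recent h5py/HDF5 combination is assumed, and the file and variable names are hypothetical):

    import h5py

    # netCDF-4 files are HDF5 files, so the chunk layout of a variable and the
    # byte offset and size of each chunk can be listed directly; these are the
    # quantities a sub-chunk index needs for targeted range requests.
    with h5py.File("sst_large_chunks.nc", "r") as f:
        dset = f["sst"]
        print("chunk shape:", dset.chunks)
        for i in range(min(dset.id.get_num_chunks(), 5)):
            info = dset.id.get_chunk_info(i)
            print(info.chunk_offset, info.byte_offset, info.size)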

This report provides a performance assessment of netCDF-4 data access for varied use cases. This assessment executes these use cases under various conditions, including local POSIX and S3 filesystems, as well as a simulated degraded network connection. The results of this assessment may provide guidance on the most suitable and most efficient library for reading netCDF data in different situations.

How to cite: Gouillon, F., Penard, C., Delaunay, X., and Herlédan, S.: A new sub-chunking strategy for fast netCDF-4 access in local, remote and cloud infrastructures. , EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-1294, https://doi.org/10.5194/egusphere-egu25-1294, 2025.

X4.75
|
EGU25-4277
Tjerk Krijger, Peter Thijsse, Robin Kooyman, and Dick Schaap

As part of European projects, such as EOSC related Blue-Cloud2026, EOSC-FUTURE and FAIR-EASE, MARIS has developed and demonstrated a software system called BEACON with a unique indexing system that can, on the fly with high performance, extract data subsets based on the user’s request from millions of heterogeneous observational data files. The system returns one single harmonised file as output, regardless of whether the input contains many different data types or dimensions. 

Since in many cases the original data collections that are imported into a BEACON instance contain millions of files (e.g. Euro-Argo, SeaDataNet, ERA5, World Ocean Database), it is hard to achieve fast responses. In addition, these large collections require a large storage capacity. To mitigate these issues, we wanted to optimize the internal file format used within BEACON, with the aim of reducing data storage size and speeding up data transfer, while guaranteeing that the information in the original data files is maintained. As a result, the BEACON software now includes a unique file format called the “BEACON Binary Format (BBF)” that meets these requirements.

The BBF is a binary data format that allows for storing multi-dimensional data as Apache Arrow arrays with zero deserialization costs. This means that computers can read the data stored on disk as if it were computer memory, significantly reducing access time by eliminating the cost of translating the on-disk representation into an in-memory one.
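The zero-deserialization idea can be illustrated with the open Apache Arrow IPC format and pyarrow's memory mapping; this is a generic Arrow example, not the BBF format itself:

    import pyarrow as pa

    # Write a small table in the Arrow IPC file format.
    table = pa.table({"time": [1, 2, 3], "temp": [7.9, 8.1, 8.0]})
    with pa.OSFile("obs.arrow", "wb") as sink:
        writer = pa.ipc.new_file(sink, table.schema)
        writer.write_table(table)
        writer.close()

    # Read it back via memory mapping: record batches reference the mapped bytes
    # directly, so there is essentially no per-value deserialization step.
    with pa.memory_map("obs.arrow", "r") as source:
        loaded = pa.ipc.open_file(source).read_all()
    print(loaded.num_rows, loaded.column_names)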

The entire data format is also “non-blocking”, which means that all computer cores can access the file at the same time and simultaneously use the jump table to read millions of datasets in parallel. This enables a level of performance that reaches speeds of multiple GB/s, making the hardware the bottleneck instead of the software.

Furthermore, the format takes a unique approach to compressing data by adjusting the way it compresses and decompresses on a per-dataset level. This means that every dataset is compressed in a slightly different manner, making it much more effective in terms of size reduction and time to decompress the data, which can get close to the effective memory speed of a computer.

It does this while retaining full data integrity. No data is ever lost within this format, nor is any data adjusted. If one were to import a NetCDF file into BBF, one could fully rebuild the original NetCDF file from the BBF file itself. In the presentation the added benefits of using the BBF will be highlighted by comparing and benchmarking it to traditional formats such as NetCDF, CSV, ASCII, etc.

In January 2025, BEACON 1.0.0 was made publicly available as open-source software, allowing everyone to set up their own BEACON node to enhance access to their data, while at the same time being able to reduce the storage size of their entire data collection without losing any information. More technical details, example applications and general information on BEACON can be found on the website https://beacon.maris.nl/.

How to cite: Krijger, T., Thijsse, P., Kooyman, R., and Schaap, D.: BEACON Binary Format (BBF) - Optimizing data storage and access to large data collections, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-4277, https://doi.org/10.5194/egusphere-egu25-4277, 2025.

X4.76
|
EGU25-13567
Joan Masó, Marta Olivé, Alba Brobia, Nuria Julia, Nuria Cartell, and Uta Wehn

The Green Deal Data Space is born into the big data paradigm, where a variety of data formats and data models are exposed as files or web APIs. As a result, we need to default to a simple data structure that is transversal enough to represent most of the more specific data models, formats and API payloads. Many data models present a structure that can be represented as tables.

TAPIS stands for "Tables from APIS". It is a JavaScript application built around a common data model: an array of objects, each with a list of properties that can contain a simple or a complex value. TAPIS offers a series of operations that use one or more arrays of objects as inputs and produce a new array of objects as an output. There are operations that create arrays of objects from files or API queries (a.k.a. data import), others that manipulate the objects (e.g. merging two arrays into a single one), and some operations that generate visual representations of the common data structure, including a table, a map, a graph, etc.
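Conceptually, the common data model and its operations can be pictured as in the Python stand-in below (TAPIS itself is implemented in JavaScript); the records and the toy merge operation are purely illustrative:

    # The common data model: an array of objects, i.e. a list of dicts.
    stations = [
        {"id": "ST1", "name": "Lake A", "lat": 41.4, "lon": 2.1},
        {"id": "ST2", "name": "Lake B", "lat": 41.6, "lon": 2.3},
    ]
    observations = [
        {"station": "ST1", "parameter": "chlorophyll-a", "value": 3.2},
        {"station": "ST2", "parameter": "chlorophyll-a", "value": 4.8},
    ]

    def merge(left, right, left_key, right_key):
        # Join two arrays of objects into a single one on a shared key.
        index = {row[right_key]: row for row in right}
        return [{**row, **index.get(row[left_key], {})} for row in left]

    table = merge(observations, stations, "station", "id")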

TAPIS is limited by its own data model. While many data models can be mapped to the common data model, a multidimensional data cube or a data tree cannot be represented in a single table in an efficient way. In the context of the Green Deal Data Space, most of the sensor data, statistical data, geospatial feature-based data and administrative data can be considered object-based data and can be used in TAPIS. TAPIS is able to connect to the SensorThings API (the sensor protocol selected in AD4GD and CitiObs), S3 buckets (the internal cloud repository used in AD4GD), GeoNetwork (the geospatial metadata catalogue selected in AD4GD and more4nature), and OGC API Features and its derivatives (the modern web API interfaces standardized by the OGC), but other data inputs will be incorporated, such as Citizen Science data sources and other popular APIs used in the more4nature project. More analytical functionalities are going to be incorporated in the CitiObs project. As part of the AD4GD Green Deal Information Model, there is an operation to associate semantics with each column of a table by linking it to a URI that defines the concept in an external vocabulary (as well as units of measure if appropriate). In order to be compatible with the data space architecture recommended by the International Data Space Association, we are working on supporting the catalogue of the Eclipse Data Connector, and on being able to negotiate a digital contract as a preliminary step to requesting access to the relevant data offered in the data space. To do so, we are working on incorporating the data space protocol as part of the TAPIS operations for data import. TAPIS is available as open source at https://github.com/joanma747/TAPIS.

AD4GD, CitiObs and more4nature are Horizon Europe projects co-funded by the European Union, Switzerland and the United Kingdom.

How to cite: Masó, J., Olivé, M., Brobia, A., Julia, N., Cartell, N., and Wehn, U.: Tables as a way to deal with a variety of data formats and APIs in data spaces, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-13567, https://doi.org/10.5194/egusphere-egu25-13567, 2025.

X4.77
|
EGU25-17799
Matthes Rieke, Benjamin Proß, Simon Jikra, Sotiris Aspragkathos, Iasonas Sotiropoulos, Stamatia Rizou, and Lisa Pourcher

The concept of Data Spaces has gained traction in recent years. Major representatives have emerged that have the technological maturity as well as the support of relevant decision and policy makers (e.g. the International Data Spaces Association (IDSA) or Gaia-X). These follow different architectural approaches. In this session we want to illustrate the challenges of integrating Data Space architectures with established concepts of Spatial Data Infrastructures.

During the next 4 years, the ENFORCE project (Empower citizeNs to join Forces with public authORities in proteCting the Environment) is dedicated to fostering sustainable practices and ensuring environmental regulatory compliance by integrating citizen science with innovative technologies. By employing Living Labs and citizen science methodologies, ENFORCE will create innovative tools that bridge the gap between data reporting, monitoring, and policy enforcement. The project integrates data collection (e.g. Copernicus satellite data), analysis, and stakeholder participation to meet these goals. ENFORCE will leverage geospatial intelligence and explainable AI to enhance environmental governance. These tools and strategies will be tested and refined at eight pilot sites in seven countries, supplemented by capacity-building and policy recommendation efforts.

The design and development of a geospatial information infrastructure that supports the envisioned data workflows is a key challenge addressed by ENFORCE. This infrastructure will prioritize the integration of OGC API-driven systems into the Data Space ecosystem, forming a central component of the project’s agenda. Through development of a blueprint architecture for integration, the project will identify gaps and missing components in current systems, aligning with standards such as the FAIR principles and open data. The concepts will be facilitated in an ENFORCE “Tools Plaza”, an innovative platform providing data science and analytical capabilities for environmental compliance workflows.

How to cite: Rieke, M., Proß, B., Jikra, S., Aspragkathos, S., Sotiropoulos, I., Rizou, S., and Pourcher, L.: Data Spaces and geodata workflows for environmental protection, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-17799, https://doi.org/10.5194/egusphere-egu25-17799, 2025.

X4.78
|
EGU25-15864
Sabrina H. Szeto, Julia Wagemann, Emmanuel Mathot, and James Banting

The Standard Archive Format for Europe (SAFE) specification has been the established approach to publishing Copernicus Sentinel data products for over a decade. While SAFE has pushed the ecosystem forward through new ways to search and access the data, it is not ideal for processing large volumes of data using cloud computing. Over the last few years, data standards like STAC and cloud-native data formats like Zarr and COGs have revolutionised how scientific communities work with large-scale geospatial data and are becoming a key component of new data spaces, especially for cloud-based systems.

The ESA Copernicus Earth Observation Processor Framework (EOPF) will provide access to “live” sample data from the Copernicus Sentinel-1, -2 and -3 missions in the new Zarr data format. This set of reprocessed data allows users to try out accessing and processing data in the new format and to experience its benefits within their own workflows.
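
As an example of what working with such sample data might look like, the following sketch lazily opens a Zarr store with xarray and computes a small subset; the URL, variable and dimension names are placeholders, not actual EOPF sample-data paths.

    # Minimal sketch: lazily open a (hypothetical) Sentinel Zarr sample product with xarray.
    # Remote access requires fsspec/aiohttp; computation on chunked data requires dask.
    import xarray as xr

    ds = xr.open_zarr(
        "https://example.org/eopf-samples/S2_MSIL2A_sample.zarr",  # placeholder URL
        consolidated=True,   # use consolidated metadata if the store provides it
    )
    print(ds)                # inspect variables, dimensions and chunking without loading data

    # Variable and dimension names below are assumptions for illustration
    subset = ds["b04"].isel(x=slice(0, 512), y=slice(0, 512))
    print(subset.mean().compute())   # only the required chunks are read and reduced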

This presentation introduces a community-driven toolkit that facilitates the adoption of the Zarr data format for Copernicus Sentinel data. The creation of this toolkit was driven by several motivating questions: 

  • What common challenges do users face and how can we help them overcome them? 
  • What resources would make it easier for Sentinel data users to use the new Zarr data format? 
  • How can we foster a community of users who will actively contribute to the creation of this toolkit and support each other?

The Sentinels EOPF Toolkit team, comprising Development Seed, SparkGeo and thriveGEO, together with a group of champion users (early adopters), is creating a set of Jupyter Notebooks and plug-ins that showcase the use of Zarr-format Sentinel data for applications across multiple domains. In addition, community engagement activities such as a notebook competition and social media outreach will bring Sentinel users together and spark interaction with the new data format in a creative yet supportive environment. Such community and user engagement efforts are necessary to overcome adoption and uptake barriers and to build trust and excitement around trying out new technologies and new developments around data spaces.

In addition to introducing the Sentinels EOPF Toolkit, this presentation will also highlight lessons learned from working closely with users on barriers they face in adopting the new Zarr format and how to address them. 

How to cite: Szeto, S. H., Wagemann, J., Mathot, E., and Banting, J.: The Sentinels EOPF Toolkit: Driving Community Adoption of the Zarr data format for Copernicus Sentinel Data, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-15864, https://doi.org/10.5194/egusphere-egu25-15864, 2025.

X4.79
|
EGU25-17171
|
ECS
Marcin Niemyjski and Jan Musiał

The Copernicus Program is the largest and most successful public space program globally. It provides continuous data across various spectral ranges, with an archive exceeding 84 petabytes and a daily growth of approximately 20 TB, both of which are expected to increase further. The openness of its data has contributed to the widespread use of Earth observation and the development of commercial products utilizing open data in Europe and worldwide. The entire archive, along with cloud-based data processing capabilities, is available free of charge through the Copernicus Data Space Ecosystem initiative and continues to evolve to meet global user standards. 

This paper presents the process of creating the STAC Copernicus Data Space Ecosystem catalog—the largest and most comprehensive STAC catalog globally in terms of metadata. It details the workflow, starting from the development of a metadata model for Sentinel data, through efficient indexing based on the original metadata files accompanying the products, to result validation and backend system ingestion (via database DSN). A particular highlight is that this entire process is executed using a single tool, eometadatatool, initially developed by DLR, further enhanced, and released as open-source software by the CloudFerro team. eometadatatool extracts metadata from the original files accompanying Copernicus program products and others (e.g., Landsat, Copernicus Contributing Missions) using a CSV file that lists each metadata field, the file in which it occurs, and the path to the key within that file. Since the CDSE repository operates as an S3 resource offering users free access, the tool supports product access via S3 by default, configurable through environment variables. All of the above characterizes eometadatatool as the most powerful stactools package available (stactools is a high-level command-line tool and Python library for working with STAC), providing both valid STAC items and a method for uploading them to the selected backend.
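
For readers unfamiliar with the target format, the following pystac sketch shows the kind of STAC Item such a pipeline produces; the identifiers, geometry and S3 path are illustrative placeholders rather than the CDSE metadata model itself.

    # Minimal sketch: building and validating a STAC Item with pystac (placeholder values).
    # Validation requires the optional jsonschema dependency.
    from datetime import datetime, timezone
    import pystac

    item = pystac.Item(
        id="S2A_MSIL2A_20240601_example",                      # hypothetical product id
        geometry={
            "type": "Polygon",
            "coordinates": [[[14.0, 49.0], [14.0, 50.0], [15.0, 50.0],
                             [15.0, 49.0], [14.0, 49.0]]],
        },
        bbox=[14.0, 49.0, 15.0, 50.0],
        datetime=datetime(2024, 6, 1, 10, 0, 31, tzinfo=timezone.utc),
        properties={"platform": "sentinel-2a"},
    )

    # Point an asset at the product's location in an S3 repository (placeholder path)
    item.add_asset(
        "product",
        pystac.Asset(
            href="s3://eodata/Sentinel-2/MSI/L2A/2024/06/01/S2A_MSIL2A_example.SAFE",
            media_type="application/octet-stream",
            roles=["data"],
        ),
    )

    item.validate()          # check the item against the STAC JSON schema before ingestion
    print(item.to_dict())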

The CDSE catalog development process has in turn influenced the STAC specification itself, contributing to the introduction of version 1.1 and to updated extensions (storage, eo, proj) that better meet user needs. The paper discusses the most significant modifications, their impact on the catalog’s functionality, and the main differences between the versions.

Particular attention is given to performance optimization due to the substantial data volume and high update frequency. The study examines the configuration and performance testing (using Locust) of the frontend layer (stac-fastapi-pgstac) and the backend (pgstac). The stac-fastapi-pgstac implementation was deployed on a scalable Kubernetes cluster and performed product hydration (the pgstac-specific step of reassembling full item JSON), leveraging Python's native capabilities for this task. The pgstac schema was deployed on a dedicated bare-metal server with a PostgreSQL database, using master-worker replication enabled through the appropriate pgstac configuration. Both software tools are open source, and the achieved optimal configurations are documented and will be presented in detail.
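
As a rough indication of how such load tests can be expressed, the following Locust sketch exercises a STAC API search endpoint; the host, collection id and query parameters are placeholders and do not reproduce the authors' actual test plan.

    # Minimal Locust sketch for load-testing a STAC API
    # (run with: locust -f locustfile.py --host <stac-api-url>)
    from locust import HttpUser, task, between

    class StacSearchUser(HttpUser):
        wait_time = between(1, 3)    # simulated users pause 1-3 s between requests

        @task(3)
        def search_items(self):
            # Typical catalogue query: POST /search with collection, bbox and time filters
            self.client.post("/search", json={
                "collections": ["sentinel-2-l2a"],            # hypothetical collection id
                "bbox": [14.0, 49.0, 24.2, 55.0],
                "datetime": "2024-06-01T00:00:00Z/2024-06-30T23:59:59Z",
                "limit": 100,
            })

        @task(1)
        def landing_page(self):
            self.client.get("/")     # lightweight request mixed into the workload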

The presented solution empowers the community to fully utilize the new catalog, leverage its functionalities, and access open tools that enable independent construction of STAC catalogs compliant with ESA and community recommendations. 

How to cite: Niemyjski, M. and Musiał, J.: Building the Copernicus Data Space Ecosystem STAC Catalog: Methodologies, Optimizations, and Community Impact, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-17171, https://doi.org/10.5194/egusphere-egu25-17171, 2025.