EGU26-12951, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-12951
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Wednesday, 06 May, 14:00–15:45 (CEST), Display time Wednesday, 06 May, 14:00–18:00 | Hall X2, X2.2
Testing cloud-optimized formats for future data archival & distribution
Jonathan Schaeffer1, Albane Lecointre2, Laura Ermert2, Alex Hamilton3, and Javier Quinteros4
  • 1Observatoire des Sciences de l'Univers de Grenoble, CNRS, INSU, France
  • 2Univ. Grenoble Alpes, Univ. Savoie Mont Blanc, CNRS, IRD, Univ. Gustave Eiffel, ISTerre, Grenoble, France
  • 3Earthscope, United States
  • 4Helmholtz Centre Potsdam, GFZ German Research Centre for Geosciences, Germany

The seismological community is producing ever more datasets, and the datasets themselves are growing in size and complexity. To help the community archive these data in line with the FAIR principles, seismological data centers must find solutions for storing and distributing such large and diverse data. Several aspects of the FAIRness of these datasets are under consideration, among others within the GeoInquire project: appropriate metadata to describe experiments, access control for both data and metadata, and improved formats for raw data. For the first two topics, two proposals are currently under review by FDSN Working Groups, and a positive decision is expected in the coming months. Access control is particularly important for distributed acoustic sensing (DAS) data, given the sensitivity of detailed cable location information.

Concerning improved raw data storage, one possible way forward is to combine cloud storage, already in place at several data centers, with asynchronous services that provide direct links to the data. Ideally, this will allow users to load specific data segments from cloud storage using high-level languages such as Python and to parallelize their processing on a large scale. In such a setting, analysis-ready, cloud-optimized data might offer advantages over traditional miniSEED archives; previous studies also suggest advantages over the HDF5 formats that DAS devices commonly output and that currently hold the majority of DAS data.
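As an illustration of loading only a specific segment, the following minimal Python sketch computes which fixed-size chunks overlap a requested sample window and fetches just those chunks in parallel. It is not the services or formats evaluated in this work: the in-memory dict standing in for a bucket, the key scheme "chunk/<index>", the chunk length and all function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: a 1-D trace split into fixed-size chunks, each stored
# as one object under the key "chunk/<index>". A dict stands in for the bucket.
CHUNK_LEN = 1000  # samples per chunk (assumed value)


def build_store(samples, chunk_len=CHUNK_LEN):
    """Split a list of samples into chunk objects keyed like bucket objects."""
    return {
        f"chunk/{i // chunk_len}": samples[i:i + chunk_len]
        for i in range(0, len(samples), chunk_len)
    }


def chunks_for_window(start, stop, chunk_len=CHUNK_LEN):
    """Indices of the chunks that overlap the half-open window [start, stop)."""
    if stop <= start:
        return []
    return list(range(start // chunk_len, (stop - 1) // chunk_len + 1))


def read_window(store, start, stop, chunk_len=CHUNK_LEN, workers=4):
    """Fetch only the overlapping chunks, in parallel, then trim to the window."""
    indices = chunks_for_window(start, stop, chunk_len)
    if not indices:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The dict lookup stands in for one object-store GET request per chunk.
        parts = list(pool.map(lambda i: store[f"chunk/{i}"], indices))
    joined = [x for part in parts for x in part]
    offset = indices[0] * chunk_len  # sample index of the first fetched sample
    return joined[start - offset:stop - offset]


store = build_store(list(range(10_000)))
segment = read_window(store, 2_500, 4_200)
print(len(segment), segment[0], segment[-1])
```

In this toy case only three of the ten chunks are touched for the requested window; with a real object store, the same index arithmetic would decide which objects or byte ranges to request.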

In this contribution, we report on ongoing collaborative work to systematically evaluate cloud-optimized formats on commercial and on-premise (university or institute) cloud storage services, assessing their usability for archival, distribution and analysis-ready access to large datasets. We tested I/O performance and storage aspects of Zarr, TileDB and Apache Iceberg on AWS and self-hosted S3 buckets. We will present the test results and a first scientific use case that uses data on an on-premise cloud. We will also compare the challenges and opportunities of these storage solutions for DAS and for large-N nodal data.
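To make the idea of an I/O performance test concrete, here is a minimal, generic timing harness in Python. It is a sketch only and not the benchmark used in this work: the in-memory store, the chunk size, the access patterns and the function name `time_reads` are hypothetical, and in a real test `read_chunk` would wrap a request to a Zarr, TileDB or Iceberg backend on a bucket.

```python
import statistics
import time


def time_reads(read_chunk, keys, repeats=5):
    """Time `repeats` passes over `keys`; return (median seconds, bytes per pass)."""
    durations = []
    n_bytes = 0
    for _ in range(repeats):
        n_bytes = 0
        t0 = time.perf_counter()
        for key in keys:
            n_bytes += len(read_chunk(key))
        durations.append(time.perf_counter() - t0)
    return statistics.median(durations), n_bytes


# Stand-in store: 16 chunks of 1 MiB each, held in memory (hypothetical).
store = {f"chunk/{i}": bytes(1024 * 1024) for i in range(16)}

# Two access patterns over the same store: all chunks vs. a 4-chunk subset.
full_scan = sorted(store)
subset = full_scan[:4]

for label, keys in [("full scan", full_scan), ("subset", subset)]:
    seconds, n_bytes = time_reads(store.__getitem__, keys)
    print(f"{label}: {n_bytes / 2**20:.0f} MiB per pass, median {seconds:.6f} s")
```

Reporting the median over several passes reduces the influence of one-off delays; the timings printed here only reflect in-memory lookups and say nothing about any of the formats or storage services named above.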

The Zarr format is already used in the Earth Science community and, combined with rich metadata and the xarray library, turns out to provide very user-friendly access and data slicing for DAS data. The TileDB format provides similarly good access and slicing, but is less well known in the Earth Science community and requires careful engineering of data ingestion and maintenance. With this presentation, we aim to provide updates on the ongoing collaboration, show first usage examples for scientific workflows, and stimulate discussion about future seismological data archives.
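The kind of label-based slicing that xarray offers can be sketched as follows. The example builds a small synthetic array in memory rather than reading an actual archive; the variable name `strain_rate`, the dimension names, the sampling rate, the channel spacing and the attributes are all assumptions for illustration. For a Zarr store, a dataset would typically be opened with `xarray.open_zarr` and then sliced in the same way.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a DAS recording: 10 s at 100 Hz on 200 channels
# spaced 5 m apart. All values, names and attributes here are made up.
fs = 100.0
n_time, n_chan = 1000, 200
rng = np.random.default_rng(0)

das = xr.DataArray(
    rng.standard_normal((n_time, n_chan)).astype("float32"),
    dims=("time", "distance"),
    coords={
        "time": np.arange(n_time) / fs,          # seconds since recording start
        "distance": np.arange(n_chan) * 5.0,     # metres along the cable
    },
    name="strain_rate",
    attrs={"units": "1/s", "sampling_rate_hz": fs},
)

# Select by physical coordinates instead of array indices:
# 2 s to 4 s of data on the cable section between 100 m and 250 m.
window = das.sel(time=slice(2.0, 4.0), distance=slice(100.0, 250.0))

# Per-channel standard deviation over the selected time window.
channel_std = window.std(dim="time")

print(window.shape, channel_std.shape)
```

Label-based slices in xarray include both end points, so the selection above covers the samples at exactly 2 s and 4 s and the channels at exactly 100 m and 250 m.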

How to cite: Schaeffer, J., Lecointre, A., Ermert, L., Hamilton, A., and Quinteros, J.: Testing cloud-optimized formats for future data archival & distribution, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12951, https://doi.org/10.5194/egusphere-egu26-12951, 2026.