- German Climate Computing Centre (DKRZ), Hamburg, Germany
Discoverability of and access to climate and Earth-system datasets are foundational for effective scientific analysis workflows, yet these datasets are often hosted across diverse storage systems and follow a variety of organisational conventions. Researchers and infrastructure engineers face challenges in ingesting distributed metadata into unified, searchable catalogues without sacrificing interoperability or scalability. Efficient metadata harvesting, normalisation, and ingestion at scale are therefore critical enablers of data discovery and FAIR (Findable, Accessible, Interoperable, and Reusable) data practices.
To address this need, we present the Metadata Crawler, a metadata ingestion tool designed to automate the collection and indexing of climate dataset metadata across heterogeneous storage backends. The Metadata Crawler supports multi-backend discovery, including POSIX file systems, S3/MinIO object stores, and OpenStack Swift, enabling infrastructure administrators to aggregate metadata from local archives, cloud object storage, and institutional repositories.
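To make the multi-backend idea concrete, the following minimal sketch shows how discovery over POSIX and S3/MinIO storage can share a single listing interface. All names here (StorageBackend, PosixBackend, S3Backend, list_objects) are hypothetical illustrations under these assumptions, not the Metadata Crawler's actual API.

```python
# Minimal sketch of a storage-backend abstraction for metadata discovery.
# Class and method names are hypothetical, not the Metadata Crawler's API.
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator


class StorageBackend(ABC):
    """Common interface: yield dataset paths/keys to harvest metadata from."""

    @abstractmethod
    def list_objects(self, prefix: str) -> Iterator[str]: ...


class PosixBackend(StorageBackend):
    def list_objects(self, prefix: str) -> Iterator[str]:
        # Walk a local archive directory tree for netCDF files.
        yield from (str(p) for p in Path(prefix).rglob("*.nc"))


class S3Backend(StorageBackend):
    def __init__(self, client, bucket: str):
        # `client` is assumed to be a boto3-style S3/MinIO client.
        self.client, self.bucket = client, bucket

    def list_objects(self, prefix: str) -> Iterator[str]:
        # Page through the bucket listing rather than loading it at once.
        paginator = self.client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]
```

A harvester written against StorageBackend can then treat a local archive and a cloud bucket identically, which is the property that lets one crawler aggregate metadata from several storage systems.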
At its core, the Metadata Crawler implements a two-stage pipeline: harvested metadata are first collected into a temporary catalogue and then indexed into downstream systems such as Apache Solr or MongoDB. Dataset definitions, directory structures, and extraction logic are governed by a flexible TOML configuration that encodes Data Reference Syntax (DRS) dialects for different standards; users can apply pre-defined standards or define their own, as sketched below. This schema-driven approach, combined with path and data specifications, conditional rules, and computed fields, ensures consistent representation of key facets such as temporal coverage, geospatial bounds, and variables.
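The sketch below illustrates the schema-driven idea: a DRS-style path template in TOML is used to map directory components onto metadata facets. The configuration keys (root_path, drs_format, defaults) and the facet names are illustrative assumptions in the spirit of the abstract, not the tool's actual schema.

```python
# Sketch of a DRS-dialect TOML configuration and path-based facet extraction.
# Keys and facet names are assumptions, not the Metadata Crawler's schema.
import tomllib  # standard library, Python >= 3.11

CONFIG = """
[cmip6]
root_path = "/arch/cmip6"
drs_format = "{project}/{institution_id}/{source_id}/{experiment_id}/{variable}"

[cmip6.defaults]
project = "CMIP6"
"""

cfg = tomllib.loads(CONFIG)


def parse_path(dialect: dict, rel_path: str) -> dict:
    """Map each path component onto the facet named in the DRS template."""
    facets = [part.strip("{}") for part in dialect["drs_format"].split("/")]
    record = dict(zip(facets, rel_path.split("/")))
    # Apply configured defaults, standing in for computed/conditional fields.
    record.update(dialect.get("defaults", {}))
    return record


print(parse_path(cfg["cmip6"], "CMIP6/DKRZ/ICON/historical/tas"))
# {'project': 'CMIP6', 'institution_id': 'DKRZ', 'source_id': 'ICON',
#  'experiment_id': 'historical', 'variable': 'tas'}
```

Records produced this way would populate the temporary catalogue in the first pipeline stage, before being pushed to an index such as Apache Solr or MongoDB in the second.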
The tool provides both a command-line interface (CLI) and a Python API, with synchronous, asynchronous, and multi-threaded execution modes (illustrated below) that facilitate integration into operational workflows. By normalising and indexing previously siloed metadata into searchable catalogues, the Metadata Crawler enhances data findability and enables portals and analysis platforms to deliver efficient discovery services. Its modular design also allows deployment in diverse environments and easy extension to additional backends or indexing targets.
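As a rough illustration of the execution modes, the snippet below contrasts a synchronous crawl with a multi-threaded one using only the Python standard library. The helper harvest_one and the path list are hypothetical stand-ins, not the Metadata Crawler's actual Python API.

```python
# Illustration of synchronous vs. multi-threaded crawling; harvest_one is a
# hypothetical stand-in for per-dataset metadata extraction.
from concurrent.futures import ThreadPoolExecutor


def harvest_one(path: str) -> dict:
    # Placeholder: extract facets (time range, bounds, variable) for one path.
    return {"file": path}


paths = [f"/arch/cmip6/file_{i}.nc" for i in range(8)]

# Synchronous crawl: one path at a time.
records = [harvest_one(p) for p in paths]

# Multi-threaded crawl: overlap I/O-bound metadata reads across workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(harvest_one, paths))

print(len(records), "records ready for indexing")
```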
How to cite: Kadow, C., Bergemann, M., Hadizadeh, M., Reis, M., and Lucio Eceiza, E.: Accelerating Data Discovery: Automated, Scalable Harvesting and Indexing of Metadata Across Heterogeneous Storage Backends, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9951, https://doi.org/10.5194/egusphere-egu26-9951, 2026.