- German Climate Computing Centre (DKRZ), Hamburg, Germany
Discoverability of and access to climate and Earth-system datasets are foundational for effective scientific analysis workflows, yet these datasets are often hosted across diverse storage systems and follow a variety of organisational conventions. Researchers and infrastructure engineers face challenges in ingesting distributed metadata into unified, searchable catalogues without sacrificing interoperability or scalability. Efficient metadata harvesting, normalisation, and ingestion at scale are therefore critical enablers of data discovery and FAIR (Findable, Accessible, Interoperable, and Reusable) data practices.
To address this need, we present the Metadata Crawler, a metadata ingestion tool designed to automate the collection and indexing of climate dataset metadata across heterogeneous storage backends. The Metadata Crawler supports multi-backend discovery, including POSIX file systems, S3/MinIO object stores, and OpenStack Swift, enabling infrastructure administrators to aggregate metadata from local archives, cloud object storage, and institutional repositories.
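To make the multi-backend idea concrete, the following minimal sketch shows how discovery over POSIX and S3/MinIO storage can share a single listing interface. All names here (StorageBackend, PosixBackend, S3Backend, list_objects) are hypothetical illustrations under these assumptions, not the Metadata Crawler's actual API.

```python
# Minimal sketch of a storage-backend abstraction for metadata discovery.
# Class and method names are hypothetical, not the Metadata Crawler's API.
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator


class StorageBackend(ABC):
    """Common interface: yield dataset paths/keys to harvest metadata from."""

    @abstractmethod
    def list_objects(self, prefix: str) -> Iterator[str]: ...


class PosixBackend(StorageBackend):
    def list_objects(self, prefix: str) -> Iterator[str]:
        # Walk a local archive directory tree for netCDF files.
        yield from (str(p) for p in Path(prefix).rglob("*.nc"))


class S3Backend(StorageBackend):
    def __init__(self, client, bucket: str):
        # `client` is assumed to be a boto3-style S3/MinIO client.
        self.client, self.bucket = client, bucket

    def list_objects(self, prefix: str) -> Iterator[str]:
        # Page through the bucket listing rather than loading it at once.
        paginator = self.client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]
```

A harvester written against StorageBackend can then treat a local archive and a cloud bucket identically, which is the property that lets one crawler aggregate metadata from several storage systems.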
At its core, the Metadata Crawler implements a two-stage pipeline: harvested metadata are first collected into a temporary catalogue and then indexed into downstream systems such as Apache Solr or MongoDB. Dataset definitions, directory structures, and extraction logic are governed by a flexible TOML configuration that encodes Data Reference Syntax (DRS) dialects for different standards; users can apply pre-defined standards or define their own, as sketched below. This schema-driven approach, combined with path and data specifications, conditional rules, and computed fields, ensures consistent representation of key facets such as temporal coverage, geospatial bounds, and variables.
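The sketch below illustrates the schema-driven idea: a DRS-style path template in TOML is used to map directory components onto metadata facets. The configuration keys (root_path, drs_format, defaults) and the facet names are illustrative assumptions in the spirit of the abstract, not the tool's actual schema.

```python
# Sketch of a DRS-dialect TOML configuration and path-based facet extraction.
# Keys and facet names are assumptions, not the Metadata Crawler's schema.
import tomllib  # standard library, Python >= 3.11

CONFIG = """
[cmip6]
root_path = "/arch/cmip6"
drs_format = "{project}/{institution_id}/{source_id}/{experiment_id}/{variable}"

[cmip6.defaults]
project = "CMIP6"
"""

cfg = tomllib.loads(CONFIG)


def parse_path(dialect: dict, rel_path: str) -> dict:
    """Map each path component onto the facet named in the DRS template."""
    facets = [part.strip("{}") for part in dialect["drs_format"].split("/")]
    record = dict(zip(facets, rel_path.split("/")))
    # Apply configured defaults, standing in for computed/conditional fields.
    record.update(dialect.get("defaults", {}))
    return record


print(parse_path(cfg["cmip6"], "CMIP6/DKRZ/ICON/historical/tas"))
# {'project': 'CMIP6', 'institution_id': 'DKRZ', 'source_id': 'ICON',
#  'experiment_id': 'historical', 'variable': 'tas'}
```

Records produced this way would populate the temporary catalogue in the first pipeline stage, before being pushed to an index such as Apache Solr or MongoDB in the second.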
The tool provides both a command-line interface (CLI) and a Python API, with synchronous, asynchronous, and multi-threaded execution modes (illustrated below) that facilitate integration into operational workflows. By normalising and indexing previously siloed metadata into searchable catalogues, the Metadata Crawler enhances data findability and enables portals and analysis platforms to deliver efficient discovery services. Its modular design also allows deployment in diverse environments and easy extension to additional backends or indexing targets.
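As a rough illustration of the execution modes, the snippet below contrasts a synchronous crawl with a multi-threaded one using only the Python standard library. The helper harvest_one and the path list are hypothetical stand-ins, not the Metadata Crawler's actual Python API.

```python
# Illustration of synchronous vs. multi-threaded crawling; harvest_one is a
# hypothetical stand-in for per-dataset metadata extraction.
from concurrent.futures import ThreadPoolExecutor


def harvest_one(path: str) -> dict:
    # Placeholder: extract facets (time range, bounds, variable) for one path.
    return {"file": path}


paths = [f"/arch/cmip6/file_{i}.nc" for i in range(8)]

# Synchronous crawl: one path at a time.
records = [harvest_one(p) for p in paths]

# Multi-threaded crawl: overlap I/O-bound metadata reads across workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(harvest_one, paths))

print(len(records), "records ready for indexing")
```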
How to cite: Kadow, C., Bergemann, M., Hadizadeh, M., Reis, M., and Lucio Eceiza, E.: Accelerating Data Discovery: Automated, Scalable Harvesting and Indexing of Metadata Across Heterogeneous Storage Backends, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9951, https://doi.org/10.5194/egusphere-egu26-9951, 2026.