EGU23-17494
https://doi.org/10.5194/egusphere-egu23-17494
EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

Enabling simple access to a data lake both from HPC and Cloud using Kerchunk and Intake

Thierry Carval (1), Erwan Bodere (1), Julien Meillon (1), Mathiew Woillez (2), Jean Francois Le Roux (3), Justus Magin (3), and Tina Odaka (3)
  • (1) IRSI (Department of Marine and Digital Infrastructures), IFREMER, Plouzane, France
  • (2) DECOD (Ecosystem Dynamics and Sustainability), IFREMER-Institut Agro-INRAE, Plouzane, France
  • (3) LOPS (Laboratoire d'Oceanographie Physique et Spatiale UMR 6523), CNRS-IFREMER-IRD-Univ.Brest-IUEM, Plouzane, France

We are experimenting with hybrid access from Cloud and HPC environments, using the Pangeo platform to exploit a data lake hosted on the HPC infrastructure “DATARMOR”. DATARMOR hosts ODATIS services (https://www.odatis-ocean.fr) and is located at the “Pôle de Calcul et de Données pour la Mer” at IFREMER. Its parallel file system has disk space dedicated to shared data, called “dataref”. Users of DATARMOR can access these data directly; some of the data are also cataloged by the Sextant service (https://sextant.ifremer.fr/Ressources/Liste-des-catalogues-thematiques/Datarmor-Donnees-de-reference) and are open and accessible from the internet, without duplicating the data.

In the cloud environment, the ability to access files in parallel is essential for speeding up calculations. The Zarr format (https://zarr.readthedocs.io) enables parallel access to data sets, as it consists of numerous chunked “object data” files and a few “metadata” files. Although many chunks can be read concurrently, Zarr remains simple to use, since all the data stored in a Zarr collection are accessible through one access point.
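The chunk layout described above can be illustrated with a minimal sketch in plain Python. It does not use the Zarr library itself: the store is a toy dictionary, the array shape and chunk shape are made-up values, and the metadata entry is heavily abridged. It only shows how one array splits into independently addressable chunk objects (named with Zarr v2 style keys such as "0.0.1") that can be fetched concurrently through a single entry point.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from math import ceil


def chunk_keys(shape, chunks):
    """List the Zarr v2 style chunk keys ("i.j.k") that cover an array."""
    grid = [range(ceil(s / c)) for s, c in zip(shape, chunks)]
    return [".".join(str(i) for i in idx) for idx in product(*grid)]


# Assumed example: (time, lat, lon) array split into 100 x 180 x 180 chunks.
shape, chunks = (365, 180, 360), (100, 180, 180)
keys = chunk_keys(shape, chunks)

# Toy store: one (abridged) metadata entry plus one object per chunk.
store = {".zarray": json.dumps({"shape": shape, "chunks": chunks}).encode()}
store.update({key: f"payload of chunk {key}".encode() for key in keys})


def fetch(key):
    """Stand-in for reading one chunk object from the store."""
    return key, store[key]


# Chunks are independent objects, so they can be fetched in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    fetched = dict(pool.map(fetch, keys))
```

With these assumed shapes the array is covered by 4 x 1 x 2 = 8 chunk objects, which also hints at why a large archive stored this way produces a very large number of small files.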

For HPC centers, the numerous “object data” files create a heavy metadata load on parallel file systems, which slows data access. Recent progress in the development of Kerchunk (https://fsspec.github.io/kerchunk/), which can treat the chunks inside a file (e.g. NetCDF / HDF5) as Zarr chunks and present a series of files as one Zarr data set, is solving these technical difficulties in our Pangeo use cases at DATARMOR. Thanks to Kerchunk and Intake (https://intake.readthedocs.io/), it is now possible to use the different data sets stored in DATARMOR in an efficient and simple manner.
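The idea behind such references can be sketched conceptually in plain Python. This is not Kerchunk's API: two toy binary files stand in for NetCDF / HDF5 files, and the file names, the variable name "temp" and the helper functions are all hypothetical. Each chunk key maps to a [path, offset, length] triple, which is the shape a Kerchunk reference entry takes, so chunks stay inside the original files and a series of files appears under one key space.

```python
import os
import tempfile


def write_toy_file(path, header, payloads):
    """Write a header followed by chunk payloads; return (offset, length) spans."""
    spans = []
    with open(path, "wb") as f:
        f.write(header)
        for payload in payloads:
            spans.append((f.tell(), len(payload)))
            f.write(payload)
    return spans


def read_chunk(refs, key):
    """Read one chunk by seeking to its byte range inside the original file."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


tmpdir = tempfile.mkdtemp()
refs = {}

# One toy file per time step; each holds two chunks ("west" and "east").
for t, name in enumerate(["day1.bin", "day2.bin"]):
    path = os.path.join(tmpdir, name)
    payloads = [f"{name}:west".encode(), f"{name}:east".encode()]
    spans = write_toy_file(path, b"HEADER--", payloads)
    for x, (offset, length) in enumerate(spans):
        # The time index comes from the file, the space index from the chunk.
        refs[f"temp/{t}.{x}"] = [path, offset, length]

chunk = read_chunk(refs, "temp/1.0")  # first chunk of the second file
```

No chunk is copied or rewritten: the reference set is only an index, so the file system holds two files rather than four chunk objects. In the real library, scanning a single HDF5 file and combining several reference sets are handled by `kerchunk.hdf.SingleHdf5ToZarr` and `kerchunk.combine.MultiZarrToZarr`.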

We are further experimenting with this workflow by running the same use cases on the PANGEO-EOSC cloud. We use the same data stored in the data lake at DATARMOR, reached this time through Kerchunk references and an Intake catalog via ODATIS access, without duplicating the source data. In the presentation we will share our recent experience from these experiments.
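The abstract does not say how the references are made usable from the cloud. One possible approach, shown here purely as an assumption, is to keep a single reference set and rewrite only the file locations, from paths on the HPC file system to URLs served over the internet. The paths, the URL root and the `retarget` helper below are hypothetical.

```python
def retarget(refs, local_root, remote_root):
    """Return a copy of refs with local file paths swapped for remote URLs."""
    out = {}
    for key, value in refs.items():
        if isinstance(value, list) and value[0].startswith(local_root):
            url = remote_root + value[0][len(local_root):]
            out[key] = [url, *value[1:]]  # byte offset and length are unchanged
        else:
            out[key] = value  # inline metadata is stored as-is
    return out


# Hypothetical reference set as it might look on the HPC side.
hpc_refs = {
    ".zgroup": '{"zarr_format": 2}',
    "temp/0.0": ["/hypothetical/dataref/model/day1.nc", 8, 13],
    "temp/1.0": ["/hypothetical/dataref/model/day2.nc", 8, 13],
}

# Same chunks, same byte ranges, different access route.
cloud_refs = retarget(
    hpc_refs, "/hypothetical/dataref", "https://example.org/dataref"
)
```

Because only the location strings differ, the source files exist once, and a catalog entry for each environment can point at the matching reference set.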

How to cite: Carval, T., Bodere, E., Meillon, J., Woillez, M., Le Roux, J. F., Magin, J., and Odaka, T.: Enabling simple access to a data lake both from HPC and Cloud using Kerchunk and Intake, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-17494, https://doi.org/10.5194/egusphere-egu23-17494, 2023.