EGU25-6544, updated on 14 Mar 2025
https://doi.org/10.5194/egusphere-egu25-6544
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Tuesday, 29 Apr, 10:45–12:30 (CEST), Display time Tuesday, 29 Apr, 08:30–12:30 | Hall X4, X4.15
PyActiveStorage: Efficient distributed data analysis using Active Storage for HDF5/NetCDF4
Bryan N. Lawrence1, David Hassell1, Grenville Lister, Predoi Valeriu1, Scott Davidson2, Mark Goddard2, Matt Pryor2, Stig Telfer2, Konstantinos Chasapis3, and Jean-Thomas Acquaviva4
  • 1NCAS, Department of Meteorology, University of Reading, UK (bryan.lawrence@ncas.ac.uk)
  • 2StackHPC, Bristol, UK
  • 3DDN, Germany
  • 4DDN, France

Active storage (also known as computational storage) is a concept that has often been proposed but seldom delivered. The idea is that modern storage systems contain a lot of under-utilised compute power, which could be used to carry out parts of data analysis workflows. Such a facility would reduce the cost of moving data and make distributed data analysis much more efficient.

For storage to handle compute, either an entire compute stack has to be migrated to the storage (with all the attendant security and dependency problems) or the storage has to offer suitable compute interfaces. Here we take the second approach: borrowing the concept of the reduction operations provided by the MPI interface on HPC systems, we define and implement a reduction interface for the complex layout of HDF5 (and NetCDF4) data.
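The reduction idea can be illustrated with a minimal sketch. It is not the PyActiveStorage interface: the names `Partial`, `reduce_chunk`, and `combine` are hypothetical, the chunks are plain Python lists standing in for HDF5 chunks, and every chunk is assumed to be non-empty. The point is only that each chunk is reduced to a handful of numbers where it is stored, and the client merges those partial results much as an MPI reduce would.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Partial:
    """Partial result of reducing one chunk (hypothetical format)."""
    total: float
    count: int
    minimum: float
    maximum: float


def reduce_chunk(values: List[float]) -> Partial:
    """Stand-in for work done at the storage: reduce one non-empty chunk."""
    return Partial(sum(values), len(values), min(values), max(values))


def combine(partials: Iterable[Partial]) -> Partial:
    """Stand-in for work done at the client: merge per-chunk partials."""
    parts = list(partials)
    return Partial(
        total=sum(p.total for p in parts),
        count=sum(p.count for p in parts),
        minimum=min(p.minimum for p in parts),
        maximum=max(p.maximum for p in parts),
    )


# A dataset stored as three chunks (stand-ins for HDF5 chunks).
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

# Only the small Partial objects would cross the network, not the chunk data.
result = combine(reduce_chunk(c) for c in chunks)
mean = result.total / result.count
print(mean, result.minimum, result.maximum)
```

In this sketch the mean is derived from a sum and a count rather than from per-chunk means, so chunks of unequal size are weighted correctly.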

We demonstrate a near-production-quality deployment of the technology (PyActiveStorage) fronting JASMIN object storage, and describe how we have built a POSIX prototype. The first provides compute “near” the storage; the second is truly “in” the storage. With the object store, performance is such that, for some tasks, distributed workflows based on reduction operations on HDF5 data can be competitive with local workflow speeds, a result with significant implications for avoiding expensive copies of data and unnecessary data movement. As a byproduct of this work, we have also upgraded a pre-existing pure-Python HDF5 reader to support lazy access, which opens up thread-safe read operations on suitable HDF5 and NetCDF4 data.
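One way lazy access can make concurrent reads safe is sketched below. This is an assumption for illustration, not a description of the upgraded reader: `LazyArray` is a hypothetical class, and a flat binary file of float64 values with a 16-byte placeholder header stands in for an HDF5 file. The object holds only metadata (path, offset, length), and each read opens its own file handle, so threads never share a seek position.

```python
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor


class LazyArray:
    """Hypothetical lazy view of float64 values in a flat binary file."""

    ITEM = struct.calcsize("<d")  # bytes per stored value

    def __init__(self, path: str, offset: int, length: int):
        # Only metadata is kept; no data is read at construction time.
        self.path, self.offset, self.length = path, offset, length

    def __getitem__(self, key: slice):
        start, stop, step = key.indices(self.length)
        if step != 1:
            raise ValueError("only contiguous slices in this sketch")
        n = max(stop - start, 0)
        # Each read uses its own file handle, so no state is shared
        # between threads.
        with open(self.path, "rb") as f:
            f.seek(self.offset + start * self.ITEM)
            raw = f.read(n * self.ITEM)
        return list(struct.unpack(f"<{n}d", raw))


# Write 100 values after a 16-byte placeholder header.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(b"\0" * 16)
    f.write(struct.pack("<100d", *[float(i) for i in range(100)]))

data = LazyArray(path, offset=16, length=100)

# Four threads read disjoint slices concurrently.
slices = [slice(i, i + 25) for i in range(0, 100, 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda s: data[s], slices))

total = sum(sum(p) for p in parts)
print(total)
```

Opening a handle per read trades some overhead for simplicity; a real reader might instead pool handles or use positional reads.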

To our knowledge, there has been no previous practical demonstration of active storage for scientific data held in HDF5 files. While we have developed this technology for application in distributed weather and climate workflows, we believe it will find utility in a wide range of scientific workflows.

How to cite: Lawrence, B. N., Hassell, D., Lister, G., Valeriu, P., Davidson, S., Goddard, M., Pryor, M., Telfer, S., Chasapis, K., and Acquaviva, J.-T.: PyActiveStorage: Efficient distributed data analysis using Active Storage for HDF5/NetCDF4, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-6544, https://doi.org/10.5194/egusphere-egu25-6544, 2025.