EGU26-18064, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-18064
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Monday, 04 May, 09:45–09:55 (CEST)
 
Room -2.92
Virtual Zarr for Ensemble Prediction Systems: VirtualiZarr Custom Parsers for Cloud-Native GRIB Access 
Hillary Koros, Nishadh Kalladath, Max Jones, Sean Harkins, Jason Kinyua, Mark Lelaono, Ezra Limo, Masilin Gudoshava, and Ahmed Amdihun
  • IGAD-ICPAC, Disaster Risk Management, Nairobi, Kenya (hillary.koros@igad.int)

Hillary Koros, Nishadh Kalladath, Max Jones, Sean Harkins, Jason Kinyua, Mark Lelaono, Ezra Kiplimo, Masilin Gudoshava, and Ahmed Amdihun

IGAD Climate Prediction and Applications Centre, Nairobi, Kenya 

Development Seed, United States of America 

 

Global Ensemble Prediction Systems (EPS) from ECMWF and NOAA, such as IFS and GEFS, generate petabyte-scale datasets essential for early warning systems, probabilistic forecasting, and AI/ML weather applications. However, the GRIB format, designed for efficient archival storage, resists cloud-native random access patterns. Converting archives to Analysis Ready Cloud Optimized (ARCO) formats would require prohibitive storage duplication. Virtual Zarr datasets, enabled by the VirtualiZarr library, offer a transformative alternative: lightweight reference layers that expose the original GRIB files through cloud-native interfaces without data conversion.

This approach creates a win-win-win solution. Data producers maintain GRIB files without additional processing. Cloud providers serve data efficiently through byte-range requests. End users access ensemble forecasts via familiar tools (xarray, Dask) as if the data were in Zarr format. Previous work on the Grib-Index-Kerchunk (GIK) method (https://github.com/icpac-igad/grib-index-kerchunk) demonstrated this paradigm by exploiting a critical insight: GRIB index files (.idx text for GEFS, .index JSON for ECMWF) contain all the byte offset information needed to create virtual references. Scanning the entire corpus of GRIB files is compute-expensive, at ~2,400 files per GEFS run or ~85 files of 5 GB each for ECMWF. The GIK method instead reads only the lightweight index files (from kilobytes to a few megabytes each) plus 1-2 sample GRIB files to extract the metadata structure. This achieves regional data access while reading less than 5% of the original GRIB data.
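
As a minimal sketch of the index-file idea, the snippet below parses a few lines in the colon-separated .idx text layout and derives a byte range for each GRIB message from consecutive offsets. The sample lines, the file size, and the names `GribMessageRef`, `parse_idx`, and `range_header` are illustrative assumptions, not code from the GIK repository.

```python
from typing import NamedTuple, Optional


class GribMessageRef(NamedTuple):
    variable: str
    level: str
    forecast: str
    offset: int
    length: Optional[int]  # None when the message runs to the end of the file


# Made-up sample lines, assuming the layout
# message_number:byte_offset:d=date:variable:level:forecast:extra
SAMPLE_IDX = """\
1:0:d=2026010100:HGT:500 mb:6 hour fcst:ENS=+1
2:104512:d=2026010100:TMP:2 m above ground:6 hour fcst:ENS=+1
3:250880:d=2026010100:APCP:surface:0-6 hour acc fcst:ENS=+1
"""


def parse_idx(text: str, file_size: Optional[int] = None) -> list[GribMessageRef]:
    """Turn index text into per-message byte references."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(":")
        rows.append((int(fields[1]), fields[3], fields[4], fields[5]))

    refs = []
    for i, (offset, variable, level, forecast) in enumerate(rows):
        if i + 1 < len(rows):
            # A message ends where the next one begins.
            length = rows[i + 1][0] - offset
        elif file_size is not None:
            length = file_size - offset
        else:
            length = None
        refs.append(GribMessageRef(variable, level, forecast, offset, length))
    return refs


def range_header(ref: GribMessageRef) -> str:
    """Build the value of an HTTP Range header for one message."""
    if ref.length is None:
        return f"bytes={ref.offset}-"
    return f"bytes={ref.offset}-{ref.offset + ref.length - 1}"


refs = parse_idx(SAMPLE_IDX, file_size=400000)
tmp = next(r for r in refs if r.variable == "TMP")
print(tmp.variable, range_header(tmp))
```

Only the index text is read here; the GRIB file itself is touched later, and only for the byte ranges a user actually requests.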

Building on this foundation, we develop custom GEFS and ECMWF parsers that follow the VirtualiZarr Parser protocol, with a native Zarr v3 ArrayBytesCodec built on gribberish, a Rust-based GRIB decoder delivering order-of-magnitude performance improvements. Following HRRRparser (https://github.com/virtual-zarr/hrrr-parser) patterns, our parsers construct a chunk manifest store. Virtual references persist to Icechunk transactional storage following the Zarr specification, enabling version-controlled datasets whose chunks reference the original GRIB bytes. The resulting stores integrate with xarray and Dask for parallel ensemble processing across 30-51 members and 85+ forecast timesteps.
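
To illustrate what a chunk manifest holds, the sketch below maps chunk keys of a hypothetical (member, step, y, x) array to the path, offset, and length of the GRIB message behind each chunk. It uses plain dictionaries and does not call VirtualiZarr or Icechunk; the bucket paths, byte values, dot-separated key style, and function names are assumptions made for illustration.

```python
def build_manifest(messages):
    """Map chunk keys to byte references into the original GRIB files.

    messages: iterable of (member, step, path, offset, length) tuples.
    Each GRIB message is assumed to hold one full 2-D field, so the
    spatial chunk indices are always 0.
    """
    manifest = {}
    for member, step, path, offset, length in messages:
        key = f"{member}.{step}.0.0"
        manifest[key] = {"path": path, "offset": offset, "length": length}
    return manifest


def chunk_grid_shape(manifest):
    """Infer the number of chunks along each axis from the keys."""
    indices = [tuple(int(part) for part in key.split(".")) for key in manifest]
    return tuple(max(axis) + 1 for axis in zip(*indices))


# Hypothetical byte references for two members and two forecast steps.
messages = [
    (0, 0, "s3://example-bucket/gefs/gep01.f006.grib2", 104512, 146368),
    (0, 1, "s3://example-bucket/gefs/gep01.f012.grib2", 98304, 150016),
    (1, 0, "s3://example-bucket/gefs/gep02.f006.grib2", 104960, 145920),
    (1, 1, "s3://example-bucket/gefs/gep02.f012.grib2", 99328, 149504),
]

manifest = build_manifest(messages)
print(chunk_grid_shape(manifest))
print(manifest["1.0.0.0"])
```

No GRIB bytes are copied at any point: the manifest stores only where each chunk lives, and a codec decodes the referenced bytes when a chunk is read.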

For regional climate centers, this replaces custom pipelines with community-extensible parsers. By contributing product-specific GEFS and IFS parsers to VirtualiZarr, we turn an operational necessity into reusable infrastructure, enabling cloud-native ensemble access such as `xr.open_zarr("icechunk://gefs")`.

How to cite: Koros, H., Kalladath, N., Jones, M., Harkins, S., Kinyua, J., Lelaono, M., Limo, E., Gudoshava, M., and Amdihun, A.: Virtual Zarr for Ensemble Prediction Systems: VirtualiZarr Custom Parsers for Cloud-Native GRIB Access, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-18064, https://doi.org/10.5194/egusphere-egu26-18064, 2026.