EGU25-4277, updated on 14 Mar 2025
https://doi.org/10.5194/egusphere-egu25-4277
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Thursday, 01 May, 10:45–12:30 (CEST), Display time Thursday, 01 May, 08:30–12:30
Hall X4, X4.75
BEACON Binary Format (BBF) - Optimizing data storage and access to large data collections
Tjerk Krijger1, Peter Thijsse2, Robin Kooyman3, and Dick Schaap4
  • 1MARIS, Nootdorp, the Netherlands (tjerk@maris.nl)
  • 2MARIS, Nootdorp, the Netherlands (peter@maris.nl)
  • 3MARIS, Nootdorp, the Netherlands (robin@maris.nl)
  • 4MARIS, Nootdorp, the Netherlands (dick@maris.nl)

As part of European projects such as the EOSC-related Blue-Cloud2026, EOSC-FUTURE and FAIR-EASE, MARIS has developed and demonstrated a software system called BEACON. Its unique indexing system extracts data subsets on the fly and with high performance, based on the user’s request, from millions of heterogeneous observational data files. The system returns a single harmonised file as output, regardless of whether the input contains many different data types or dimensions.

Because the original data collections imported into a BEACON installation often contain millions of files (e.g. Euro-Argo, SeaDataNet, ERA5, World Ocean Database), fast responses are hard to achieve. These large collections also require a large storage capacity. To mitigate both issues, we set out to optimize the internal file format used within BEACON, with the aim of reducing the data storage size and speeding up data transfer while guaranteeing that the information in the original data files is preserved. As a result, the BEACON software now includes a unique file format, the “BEACON Binary Format (BBF)”, that meets these requirements.

The BBF is a binary data format that stores multi-dimensional data as Apache Arrow arrays with zero deserialization cost. This means that a computer can read the data stored on disk as if it were already in memory, which significantly reduces access time by eliminating the cost of translating the on-disk representation into an in-memory one.
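
The idea of reading on-disk data as if it were memory can be illustrated with a minimal Python sketch. It does not use the BBF layout or Apache Arrow; it assumes a hypothetical file that holds nothing but raw native-endian float64 values, and the function names are invented for this illustration. The file is memory-mapped and reinterpreted in place, so no parsing or copying step is needed before the values can be used.

```python
import mmap
from array import array
from contextlib import contextmanager


def write_float64_column(path, values):
    """Write values as raw native-endian float64, with no header or encoding step."""
    with open(path, "wb") as f:
        f.write(array("d", values).tobytes())


@contextmanager
def mapped_float64_column(path):
    """Yield a read-only float64 view backed directly by the mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            base = memoryview(mm)
            # Reinterpret the mapped bytes in place; nothing is copied or parsed.
            view = base.cast("d")
            try:
                yield view
            finally:
                # Views must be released before the mapping can be closed.
                view.release()
                base.release()
```

Inside a `with mapped_float64_column(path) as col:` block, `col[i]` reads a value straight from the mapped pages, which the operating system loads on demand. A real format would additionally need a header describing types, shapes and byte order.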

In addition, the entire data format is “non-blocking”: all CPU cores can access the file at the same time and simultaneously use the jump table to read millions of datasets in parallel. This enables read speeds of multiple GB/s, making the hardware the bottleneck instead of the software.
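
A jump table that supports parallel, lock-free reads can be sketched as follows. The container layout here (a dataset count, then one offset/length pair per dataset, then the payloads) is an assumption made for this example and is not the actual BBF layout; the function names are likewise hypothetical. Because each reader only slices a shared read-only memory map, there is no shared file cursor to contend for.

```python
import mmap
import struct
from concurrent.futures import ThreadPoolExecutor

HEADER = struct.Struct("<Q")   # number of datasets in the file
ENTRY = struct.Struct("<QQ")   # jump-table entry: (byte offset, byte length)


def write_container(path, payloads):
    """Write a jump table followed by the raw payload of each dataset."""
    offset = HEADER.size + ENTRY.size * len(payloads)
    entries = []
    for payload in payloads:
        entries.append((offset, len(payload)))
        offset += len(payload)
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(payloads)))
        for entry in entries:
            f.write(ENTRY.pack(*entry))
        for payload in payloads:
            f.write(payload)


def read_all_parallel(path, workers=4):
    """Read every dataset concurrently by looking up its jump-table entry."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (count,) = HEADER.unpack_from(mm, 0)

            def read_one(index):
                # Each worker resolves its own entry; no seek or lock is shared.
                off, length = ENTRY.unpack_from(mm, HEADER.size + index * ENTRY.size)
                return mm[off:off + length]

            with ThreadPoolExecutor(max_workers=workers) as pool:
                return list(pool.map(read_one, range(count)))
```

Python threads are used here only to show the access pattern; the global interpreter lock limits CPU parallelism, so this sketch says nothing about the throughput a native implementation can reach.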

Furthermore, the format takes a unique approach to compression by adjusting how it compresses and decompresses at the level of each individual dataset. Every dataset is therefore compressed in a slightly different manner, which makes the format much more effective in terms of both size reduction and decompression time; decompression speed can get close to the effective memory speed of a computer.
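
The abstract does not describe how BBF adapts its compression, so the following is only a generic sketch of per-dataset compression and not the BBF algorithm. It assumes a small set of standard-library codecs, tries each on a dataset, keeps the smallest result, and records the choice in a one-byte tag so the decompressor knows which codec to apply. The tag values and function names are hypothetical.

```python
import bz2
import lzma
import zlib

# Hypothetical codec table: tag -> (compress, decompress). Tag 0 stores data as-is.
CODECS = {
    0: (lambda raw: raw, lambda body: body),
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress, bz2.decompress),
    3: (lzma.compress, lzma.decompress),
}


def compress_dataset(raw: bytes) -> bytes:
    """Compress one dataset with whichever codec yields the smallest output."""
    candidates = ((tag, encode(raw)) for tag, (encode, _) in CODECS.items())
    best_tag, best_body = min(candidates, key=lambda item: len(item[1]))
    # Prefix the body with the codec tag so decompression is self-describing.
    return bytes([best_tag]) + best_body


def decompress_dataset(blob: bytes) -> bytes:
    """Restore the original bytes using the codec named by the leading tag."""
    tag, body = blob[0], blob[1:]
    _, decode = CODECS[tag]
    return decode(body)
```

Since incompressible data falls back to tag 0, a dataset never grows by more than the one tag byte, and `decompress_dataset(compress_dataset(raw)) == raw` holds for any input. Trying every codec costs write time; a production format would more plausibly choose based on data type or a cheap sample.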

It does this while retaining full data integrity. No data is ever lost within this format, nor is any data altered. If one imports a NetCDF file into BBF, the original NetCDF file can be fully rebuilt from the BBF file itself. The presentation will highlight the added benefits of BBF by benchmarking it against traditional formats such as NetCDF, CSV and ASCII.

In January 2025, BEACON 1.0.0 was made publicly available as open-source software, allowing everyone to set up their own BEACON node to enhance access to their data while also reducing the storage size of their entire data collection without losing any information. More technical details, example applications and general information on BEACON can be found on the website https://beacon.maris.nl/.

How to cite: Krijger, T., Thijsse, P., Kooyman, R., and Schaap, D.: BEACON Binary Format (BBF) - Optimizing data storage and access to large data collections, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-4277, https://doi.org/10.5194/egusphere-egu25-4277, 2025.