As Earth System Sciences (ESS) datasets from high-resolution models reach petabyte scales, the scientific community faces severe constraints on storage, transfer efficiency, and data accessibility. Within the vast ecosystem of lossy and lossless compression algorithms, identifying parameters that achieve high compression ratios while preserving strict scientific fidelity is a complex and delicate technical challenge.
We present dc_toolkit (https://github.com/C2SM/data-compression): an open-source, parallelized pipeline designed to help researchers navigate this complex landscape. It provides a set of user-friendly, customizable command-line tools that let users make informed, data-driven decisions. By systematically evaluating over 40,000 combinations of compressors, filters, and serializers, it autonomously identifies the most suitable configuration for both structured and unstructured data with single or multiple variables.
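The core idea of such a search can be sketched in a few lines: sweep (filter, compressor) combinations, score each by compression ratio, and discard any that breach an error tolerance. The sketch below is illustrative only — it uses a hand-rolled quantization filter and stdlib codecs (zlib, lzma) as stand-ins for the far larger codec ecosystem dc_toolkit actually searches, and none of the names reflect dc_toolkit's API.

```python
# Illustrative sketch (NOT dc_toolkit's API): brute-force sweep over
# (filter, compressor) pairs, keeping only configurations whose
# L-infinity error stays within a user-defined tolerance.
import itertools
import lzma
import zlib
import numpy as np

data = np.random.default_rng(0).normal(size=100_000).astype("float32")
raw_nbytes = data.nbytes

def quantize(arr, scale=1024.0):
    """Simple lossy filter: round values to multiples of 1/scale."""
    return np.round(arr * scale) / scale

filters = {"none": lambda a: a, "quantize": quantize}
compressors = {"zlib": zlib.compress, "lzma": lzma.compress}

results = {}
for (fname, filt), (cname, comp) in itertools.product(
        filters.items(), compressors.items()):
    filtered = filt(data).astype("float32")
    encoded = comp(filtered.tobytes())
    ratio = raw_nbytes / len(encoded)           # higher is better
    linf = float(np.max(np.abs(filtered - data)))  # L-infinity error
    results[(fname, cname)] = (ratio, linf)

# Admissible = within tolerance; rank by compression ratio, best first.
tol = 1e-2
admissible = sorted(
    ((k, v) for k, v in results.items() if v[1] <= tol),
    key=lambda kv: -kv[1][0])
```

In practice the search space is vastly larger (tens of thousands of combinations), which is why dc_toolkit parallelizes the sweep rather than running it serially as above.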
The workflow comprises three stages: (1) Evaluation & Optimization: the toolkit leverages parallel processing (via Dask and mpi4py) to rapidly evaluate combinations while filtering out those that violate scientific precision requirements and user-defined error tolerances (L-norms). (2) Analysis & Visualization: to help scientists analyze the trade-offs between data reduction and information loss, the tool performs k-means clustering on the outputs to present the results in a clear, organized form. It also provides spatial error plotting to verify that domain-specific features (such as periodicity in global grids) are preserved. (3) Application & Interoperability: once the user has settled on a configuration, the toolkit handles the high-throughput compression of the dataset into Zarr-based storage. It ensures seamless integration into existing workflows through utilities such as inspecting compressed files and converting compressed data back to standard NetCDF format.
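The clustering step in stage (2) can be pictured as grouping (compression ratio, error) pairs so that distinct trade-off regimes stand out. The sketch below uses a minimal NumPy k-means on synthetic results; the data, the helper names, and the choice of two clusters are all illustrative assumptions, not dc_toolkit's actual implementation.

```python
# Illustrative sketch of the analysis stage: cluster (compression ratio,
# error) results with a minimal k-means so trade-off regimes stand out.
# Synthetic data and hand-rolled k-means; NOT dc_toolkit's implementation.
import numpy as np

rng = np.random.default_rng(42)
# Synthetic evaluation results: columns = (compression ratio, L2 error).
points = np.vstack([
    rng.normal([2.0, 1e-6], [0.2, 1e-7], size=(30, 2)),  # near-lossless
    rng.normal([8.0, 1e-3], [0.5, 2e-4], size=(30, 2)),  # aggressive lossy
])

def kmeans(x, k, iters=50, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and update."""
    r = np.random.default_rng(seed)
    centroids = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centroids) ** 2).sum(-1), axis=1)
        centroids = np.array([
            x[labels == j].mean(0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Standardize axes so the tiny error magnitudes still affect the distance.
z = (points - points.mean(0)) / points.std(0)
labels, centroids = kmeans(z, k=2)
```

Standardizing before clustering matters here: compression ratios are O(1–10) while error norms can be orders of magnitude smaller, so without it the error axis would barely influence the grouping.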
By providing a streamlined, automated, and verifiable method for selecting compression parameters, dc_toolkit lowers the entry barrier for lossy compression. It allows ESS researchers to more easily apply data reduction strategies with the confidence that the integrity of their downstream analysis remains intact. Accessibility is further enhanced through available web-based tools and GUI implementations for users with varying levels of technical expertise.
How to cite: Farabullini, N. and Kotsalos, C.: dc_toolkit: A parallelized pipeline to navigate the complex ecosystem of compression algorithms, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-1872, https://doi.org/10.5194/egusphere-egu26-1872, 2026.