- 1Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany
- 2IBM Research, Zurich (Switzerland) and UK
- 3School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
Earth observation (EO) yields large-scale, multimodal datasets collected from various satellite missions, digital elevation models (DEMs), land-use data, and textual metadata. Foundation models like 4M (Massively Multimodal Masked Modeling) can learn a joint embedding space that bridges modality gaps, mitigates missing-data issues, and facilitates partial spatio-temporal alignment [1]. However, directly training such foundation models on the vast, high-dimensional original EO datasets is not only computationally intensive but also imposes substantial demands on storage resources.
To address this, one can leverage VQ-VAEs (Vector Quantized Variational Autoencoders) as neural compressors that transform high-dimensional multimodal inputs into a small set of discrete indices, significantly reducing data volume while preserving critical information. By inverting the tokenization process, we can reconstruct the original high-dimensional data with minimal quality loss, aided by adversarial and perceptual losses that enhance reconstruction fidelity.
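The tokenize/detokenize round trip described above can be sketched as a nearest-neighbor codebook lookup. This is a minimal illustration, not the project's implementation: the codebook here is random, the sizes (K = 512 entries, D = 16 dimensions) are hypothetical, and a trained VQ-VAE would produce the latents with a learned encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a codebook of K vectors, each of dimension D.
K, D = 512, 16
codebook = rng.normal(size=(K, D))

def tokenize(latents):
    """Map continuous encoder outputs (N, D) to discrete codebook indices (N,)."""
    # Squared Euclidean distance from each latent to every codebook entry.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def detokenize(indices):
    """Invert tokenization: look up the quantized vectors for the decoder."""
    return codebook[indices]

latents = rng.normal(size=(4, D))
idx = tokenize(latents)      # 4 integers replace 4 * D floats
recon = detokenize(idx)      # quantized approximation of the latents
```

The storage saving is the point: each latent vector of D floats collapses to a single integer index, i.e. log2(K) bits per token before any entropy coding.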
Traditional VQ-based approaches, however, face challenges such as inefficient codebook utilization and limited latent-space representation. To overcome these, we propose scaling strategies that complement 4M's tokenizer-based architecture. By expanding the codebook size, latent dimensions, and network depth, our method captures the complexity of EO modalities more effectively. Specifically, we employ spherical quantization techniques like Grouped Spherical Quantization (GSQ) to address limitations of traditional approaches [2]. GSQ constrains codebook vectors to a spherical surface, stabilizing training, preventing code collapse, and promoting uniform codebook usage. Unlike standard VQ, GSQ uses spherical initialization and normalization to maintain consistent distances among codebook entries, ensuring robust latent-space coverage even under extreme compression or large codebooks. Our empirical and ablation studies show that alternative methods such as LFQ (Lookup-Free Quantization), FSQ (Finite Scalar Quantization), and RVQ (Residual Vector Quantization) often exhibit limitations, such as tightly coupling the latent dimension to the codebook size or relying on specialized training losses. In contrast, spherical techniques effectively decouple the latent dimension from the codebook vocabulary, providing greater flexibility and scalability as data demands increase.
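The grouping and spherical constraint can be sketched as follows. This is an illustrative toy, assuming hypothetical sizes (latent dimension D = 16 split into G = 4 groups, K = 256 codes per group) and a single shared random codebook; the actual GSQ formulation is in [2].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: latent dim D split into G groups of d dims each,
# each group quantized against a K-entry codebook on the unit sphere.
D, G, K = 16, 4, 256
d = D // G

# Spherical constraint: every codebook entry is L2-normalized.
codebook = rng.normal(size=(K, d))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def gsq_tokenize(latents):
    """Quantize (N, D) latents to (N, G) indices, one index per group."""
    n = latents.shape[0]
    groups = latents.reshape(n, G, d)
    # Project each group onto the unit sphere before the lookup.
    groups = groups / np.linalg.norm(groups, axis=-1, keepdims=True)
    # On the unit sphere, nearest neighbor == maximum cosine similarity.
    sims = groups @ codebook.T          # (N, G, K)
    return sims.argmax(axis=-1)

def gsq_detokenize(indices):
    """Recover the quantized latents (N, D) from group indices (N, G)."""
    n = indices.shape[0]
    return codebook[indices].reshape(n, D)
```

The decoupling claimed in the text is visible here: the effective vocabulary is K per group (K^G combinations overall), so the codebook can grow without changing the latent dimension D, and vice versa.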
Our approach enables neural compressors to adapt to varying scales of compression and complexity without compromising performance. Comprehensive scalability experiments, examining large codebooks, deeper networks, and diverse compression ratios, assessed the generalizability of the proposed compression strategies and demonstrated their effectiveness on high-dimensional, large-scale EO data with minimal information loss. By integrating advanced compression techniques with scalable architectures, this framework establishes a robust foundation for multimodal EO research and significantly reduces the cost of training foundation models on high-dimensional multimodal EO data.
References
[1] Mizrahi, D., Bachmann, R., Kar, O. F., Yeo, T., Gao, M., Dehghan, A., & Zamir, A. (2023). 4M: Massively Multimodal Masked Modeling (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2312.06647
[2] Wang, J., Qin, Z., Zhang, Y., Hu, V. T., Ommer, B., Briq, R., & Kesselheim, S. (2024). Scaling Image Tokenizers with Grouped Spherical Quantization (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2412.02632
Acknowledgments
This work is performed in the Embed2Scale (Earth Observation & Weather Data Federation With AI Embeddings) project, funded by the EU’s Horizon Europe program under Grant Agreement number 101131841.
How to cite: Scheurer, E., Wang, J., Sedona, R., Maurogiovanni, S., Blumenstiel, B., Jakubik, J., Fraccaro, P., Brunschwiler, T., Kesselheim, S., and Cavallaro, G.: Scalable Efficient Compression in Large-Scale Earth Observation, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-19016, https://doi.org/10.5194/egusphere-egu25-19016, 2025.