EGU26-2530, updated on 13 Mar 2026
https://doi.org/10.5194/egusphere-egu26-2530
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Tuesday, 05 May, 14:00–15:45 (CEST), Display time Tuesday, 05 May, 14:00–18:00
 
Hall X4, X4.50
Efficient Earth Observation Representation Learning Using Metadata-Aware Mixture-of-Experts Masked Autoencoder
Mohanad Albughdadi1, Marica Antonacci2, Vasileios Baousis3, Federico Fornari2, Tolga Kaprol1, and Claudio Pisa2
  • 1European Centre for Medium-Range Weather Forecasts, Bonn, Germany
  • 2European Centre for Medium-Range Weather Forecasts, Bologna, Italy
  • 3European Centre for Medium-Range Weather Forecasts, Reading, UK

Large-scale foundation models trained on multi-sensor satellite imagery have driven recent advances in Earth Observation (EO) tasks. Although such models achieve impressive transferability across diverse downstream tasks, their computational and memory demands hinder accessibility, reproducibility, and deployment in resource-constrained environments. This work explores a compact and efficient alternative: a metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) for EO representation learning (Albughdadi, 2025).

The proposed MoE-MAE is a self-supervised transformer-based architecture with only 2.5 million parameters. It combines sparse expert routing with geo-temporal conditioning: sparse routing allows token specialization while keeping active computation low, and geo-temporal conditioning injects latitude, longitude, and cyclic temporal attributes directly into the model. This design lets the model exploit the spatial and temporal regularities inherent in EO data without requiring dense, computationally costly transformers.
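These two ingredients can be illustrated with a minimal sketch, in NumPy rather than the model's actual framework; the exact feature set and routing details are hypothetical and stand in for the scheme described in Albughdadi (2025):

```python
import numpy as np

def geo_temporal_features(lat, lon, day_of_year):
    """Encode metadata as bounded, cyclic features (illustrative scheme:
    sin/cos of latitude, longitude, and day-of-year, so that nearby
    locations and adjacent dates map to nearby feature vectors)."""
    d = 2.0 * np.pi * day_of_year / 365.25
    return np.array([
        np.sin(np.radians(lat)), np.cos(np.radians(lat)),
        np.sin(np.radians(lon)), np.cos(np.radians(lon)),
        np.sin(d), np.cos(d),
    ])

def top_k_routing(gate_logits, k=2):
    """Sparse expert routing: keep only the k largest gate logits per
    token and renormalize over the selected experts, so most expert
    feed-forward blocks stay inactive for any given token."""
    idx = np.argsort(gate_logits)[::-1][:k]       # top-k expert indices
    w = np.exp(gate_logits[idx] - gate_logits[idx].max())
    return idx, w / w.sum()                        # indices + mixture weights
```

Only the experts returned by `top_k_routing` run for a token, which is what keeps active computation low even as total parameter count grows.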

The model is pretrained on the BigEarthNet-Landsat (BEN-LS) dataset (Corley et al., 2025) using a masked reconstruction loss augmented with auxiliary unmasked and load-balancing losses to encourage stable expert utilization. The learned encoder representations are then evaluated via linear probing on two benchmark datasets: (1) BEN-LS, a multi-label land-cover dataset with explicit metadata, and (2) EuroSAT-Landsat (EuroSAT-LS) (Corley et al., 2025), a single-label classification dataset without metadata. Despite the encoder’s small size (~2.3 M parameters), the proposed MoE-MAE achieves results competitive with models orders of magnitude larger. On BEN-LS, the frozen encoder reaches a micro mean average precision of 0.767, comparable to SSL4EO-L ViT-S/16 MoCo v2 (0.775) (Stewart et al., 2023). On EuroSAT-LS, the model maintains strong transferability, achieving 84.2% accuracy even in the absence of geo-temporal metadata.
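The two loss components can be sketched as follows; the masked MSE is the standard MAE objective, while the load-balancing term uses a common Switch-Transformer-style formulation, which is an assumption and not necessarily the paper's exact auxiliary loss:

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """MSE computed only on masked patches, as in MAE-style pretraining.
    pred, target: (patches, dim); mask: (patches,) with 1 = masked."""
    se = ((pred - target) ** 2).mean(axis=-1)      # per-patch squared error
    return float((se * mask).sum() / mask.sum())

def load_balancing_loss(gate_probs, expert_assignment, n_experts):
    """Auxiliary load-balancing loss (Switch-Transformer style): penalizes
    routers that send most tokens to a few experts. Equals 1.0 under
    perfectly uniform routing, larger when routing is skewed.
    gate_probs: (tokens, experts) softmax router outputs.
    expert_assignment: (tokens,) routed expert index per token."""
    f = np.bincount(expert_assignment, minlength=n_experts) / len(expert_assignment)
    p = gate_probs.mean(axis=0)                    # mean router prob per expert
    return n_experts * float(np.dot(f, p))
```

In training, the total objective would be the masked reconstruction term plus small weighted copies of the auxiliary terms, keeping expert utilization stable without dominating the reconstruction signal.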

Ablation and visualization studies reveal expert specialization across spatial patterns: some experts respond primarily to vegetation, others to water or textured regions, demonstrating interpretable behaviour and complementary feature learning. Additionally, only about half of the model’s expert feed-forward capacity is activated per token, confirming computational sparsity in practice. These findings suggest that such models can retain strong representational power while substantially reducing training and inference costs.

This work presents a first step toward small-scale architectures for EO representation learning, integrating metadata and leveraging sparse computation to approach the performance of massive transformers. Future work will extend the framework to multi-sensor and multi-temporal datasets to capture dynamic Earth processes efficiently.

Albughdadi, M. (2025). Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation. arXiv:2509.10919.

Stewart, A. J., Lehmann, N., Corley, I. A., Wang, Y., Chang, Y.-C., Braham, N. A. A., Sehgal, S., Robinson, C., & Banerjee, A. (2023). SSL4EO-L: Datasets and Foundation Models for Landsat Imagery. arXiv:2312.05241.

Corley, I., Sharma, L., & Crasto, R. (2025). Landsat-Bench: Datasets and Benchmarks for Landsat Foundation Models. arXiv:2506.08780.

How to cite: Albughdadi, M., Antonacci, M., Baousis, V., Fornari, F., Kaprol, T., and Pisa, C.: Efficient Earth Observation Representation Learning Using Metadata-Aware Mixture-of-Experts Masked Autoencoder, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-2530, https://doi.org/10.5194/egusphere-egu26-2530, 2026.