Learning rich and robust representations of Earth Observation (EO) data is critical for effective and accessible geoanalytics. While the ever-growing volume of EO data suggests high potential for self-supervised learning, most approaches are limited to fixed scales, resolutions, or modalities—thus failing to generalize beyond their original sensor configurations. To address these shortcomings, we introduce AnySat, a novel multimodal framework capable of self-supervised training on multiple, diverse EO datasets simultaneously.
AnySat’s design centers on two key innovations. First, we propose a Joint Embedding Predictive Architecture (JEPA) adapted for multimodal EO. Unlike pixel-level reconstruction methods, JEPA operates in latent space—making it inherently more resilient to cloud cover, time-of-day shifts, and varying acquisition angles. Second, scale-adaptive spatial encoders allow a single network to handle variable spatial and temporal resolutions. Notably, more than 75% of AnySat’s 100M parameters are shared across all supported modalities, scales, and resolutions, enabling the model to fully exploit diverse training corpora—a fundamental requirement for developing a true EO foundation model.
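To make the latent-space objective concrete, here is a minimal NumPy sketch of JEPA-style training, under assumptions of ours (toy linear-plus-tanh encoders standing in for the actual transformer encoders; all names and shapes are hypothetical, not AnySat's implementation). The key point it illustrates: the loss compares predicted and target *embeddings*, never raw pixels, so nuisance variation the target encoder discards (clouds, illumination) does not dominate the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    """Toy patch encoder: linear map + tanh (stand-in for a ViT backbone)."""
    return np.tanh(x @ W)

def jepa_loss(context_patches, target_patches, W_ctx, W_tgt, W_pred):
    """Predict target embeddings from context embeddings; score in latent space."""
    z_ctx = encoder(context_patches, W_ctx)   # embeddings of visible/context patches
    z_tgt = encoder(target_patches, W_tgt)    # target embeddings (EMA encoder, no grad, in practice)
    z_hat = z_ctx @ W_pred                    # predictor maps latent -> latent
    return np.mean((z_hat - z_tgt) ** 2)      # L2 distance between embeddings, not pixels

# Synthetic example: 16 patches of dim 32, embedding dim 8
d_in, d_emb = 32, 8
ctx = rng.normal(size=(16, d_in))
tgt = rng.normal(size=(16, d_in))
W_ctx = 0.1 * rng.normal(size=(d_in, d_emb))
W_tgt = 0.1 * rng.normal(size=(d_in, d_emb))
W_pred = 0.1 * rng.normal(size=(d_emb, d_emb))
loss = jepa_loss(ctx, tgt, W_ctx, W_tgt, W_pred)
print(f"latent prediction loss: {loss:.4f}")
```

In the multimodal setting described above, context and target patches would come from different sensors covering the same location, turning the same objective into a cross-modal alignment signal.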
To train AnySat, we compile GeoPlex, a collection of five multimodal datasets (PASTIS-HD, TreeSatAI-TS, PLANTED, FLAIR, and S2NAIP) chosen for diversity: 11 distinct sensors spanning radar and optical modalities, 0.2–250 m resolution, single images and time series, and 0.3–2,600 ha per input sample. Thanks to its versatility, a single AnySat model learns powerful representations by training on all five datasets simultaneously. We use only cross-modal alignment as a source of self-supervision and require no labels for pretraining.
We fine-tune and evaluate our model on the GeoPlex datasets, as well as on four external datasets to assess generalization. We report state-of-the-art results on seven downstream tasks, including land cover mapping, crop-type classification, tree-species identification, deforestation detection, and disaster mapping. Notably, AnySat yields significant gains across multiple benchmarks, such as +2.8 mIoU on PASTIS-HD, +3.6 mIoU on SICKLE, +11.0 accuracy points on TimeSen2Crop, and +10.2 IoU on BraDD-S1TS.
A major benefit of AnySat is its strong performance under linear probing with fixed representations, even for semantic segmentation tasks. This combination of versatility, generalizability, and ease of use positions AnySat as a valuable tool for practitioners facing diverse sensor types, specialized data distributions, and limited annotations.
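To illustrate what linear probing on fixed representations means in practice, here is a small self-contained sketch (purely illustrative: a random frozen projection stands in for a pretrained AnySat encoder, and the labels are synthetic). The encoder weights are never updated; only a linear classifier, fit here in closed form by ridge regression onto one-hot targets, is trained on top.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "pretrained" encoder: a fixed random projection stands in for AnySat.
n, d_raw, d_feat, n_classes = 200, 20, 16, 3
X_raw = rng.normal(size=(n, d_raw))
W_frozen = rng.normal(size=(d_raw, d_feat))   # pretrained weights, kept fixed
feats = np.tanh(X_raw @ W_frozen)             # fixed representations

# Synthetic labels that are linearly decodable from the frozen features.
W_true = rng.normal(size=(d_feat, n_classes))
y = (feats @ W_true).argmax(axis=1)

# Linear probe: closed-form ridge regression onto one-hot targets.
Y = np.eye(n_classes)[y]
lam = 1e-2
W_probe = np.linalg.solve(feats.T @ feats + lam * np.eye(d_feat), feats.T @ Y)
pred = (feats @ W_probe).argmax(axis=1)
acc = float((pred == y).mean())
print(f"linear-probe accuracy on frozen features: {acc:.2f}")
```

For dense tasks such as semantic segmentation, the same recipe applies per pixel or per patch embedding: a single linear layer over frozen features, which is what makes the strong probing results above notable.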