- Hydrosat, Science, Luxembourg (jvinholi@hydrosat.com)
At continental scale, crop classification needs models that capture phenology through temporal analysis without degrading field boundaries. We introduce a decoupled architecture that uses static foundation‑model features across multi‑sensor time series and fuses them with high‑resolution spatial features. The temporal stream ingests paired multispectral and SAR sequences plus a static DEM and metadata, extracts foundation‑model token features per timestep, and compresses them with a Perceiver‑style bottleneck that cross‑attends from a fixed latent bank to the full foundation‑model token volume. This compression reduces sequence length by orders of magnitude, so longer temporal windows and larger batches fit within consumer‑grade GPU memory while preserving the temporal signatures needed to separate crops with similar single‑date appearance.
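As a minimal sketch of the Perceiver‑style bottleneck described above, the NumPy snippet below lets a small latent bank cross‑attend to a flattened token volume. It is an illustration under stated assumptions, not the authors' implementation: the sizes (24 timesteps, 196 tokens per timestep, 64 channels, 32 latents), the single attention head, and the random projection weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_bottleneck(tokens, latents, w_q, w_k, w_v):
    """Single-head cross-attention: latents are queries, tokens are keys/values.

    tokens:  (N, D) flattened token features over all timesteps
    latents: (M, D) fixed latent bank, with M much smaller than N
    returns: (M, D_h) compressed latents and the (M, N) attention weights
    """
    q = latents @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (M, N)
    attn = softmax(scores, axis=-1)          # each latent distributes weight over all tokens
    return attn @ v, attn

# Hypothetical sizes, chosen only for illustration.
T, P, D, M = 24, 196, 64, 32
rng = np.random.default_rng(0)
tokens = rng.standard_normal((T * P, D))   # stand-in for foundation-model tokens
latents = rng.standard_normal((M, D))      # stand-in for a learned latent bank
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

compressed, attn = perceiver_bottleneck(tokens, latents, w_q, w_k, w_v)
print(tokens.shape, "->", compressed.shape)  # (4704, 64) -> (32, 64)
```

In this toy configuration the sequence shrinks from 4704 tokens to 32 latents; the cost of the cross‑attention scales with N × M rather than N², which is what makes longer temporal windows affordable.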
The spatial stream stays purely static: it selects a single high‑quality multispectral reference frame and passes it through a high‑resolution backbone to retain fine geometry and crisp boundaries. The two streams are joined in a query‑based decoder, where dynamic queries generated from the compressed temporal latents attend to multi‑scale spatial features, aligning phenological signatures with precise field edges. This fusion mechanism prevents coarse temporal features from blurring geometry and makes delineation robust to shifts in timing or crop management practice. The temporal queries encode crop‑specific growth signatures, the spatial stream supplies the pixel‑level evidence for boundary localization, and the decoder enforces instance‑aware segmentation through iterative cross‑attention and masked refinement.
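The final step of a query‑based decoder can be sketched as follows: each query is compared with every per‑pixel spatial feature to produce a mask, and each query also carries a class prediction. This is a generic illustration of query‑to‑mask decoding, assuming a single feature scale and omitting the iterative cross‑attention and masked refinement mentioned above; the function name `decode_masks` and the toy inputs are hypothetical.

```python
import numpy as np

def decode_masks(queries, pixel_feats, class_logits):
    """Turn queries into a per-pixel class map.

    queries:      (Q, C) one embedding per candidate field instance
    pixel_feats:  (C, H, W) high-resolution spatial features
    class_logits: (Q, K) crop-class scores per query
    returns:      (H, W) class map and (H, W) index of the owning query
    """
    # Dot product between every query and every pixel feature gives mask logits.
    mask_logits = np.einsum("qc,chw->qhw", queries, pixel_feats)
    owner = mask_logits.argmax(axis=0)         # which query claims each pixel
    query_class = class_logits.argmax(axis=1)  # class predicted by each query
    return query_class[owner], owner

# Toy scene: two fields side by side, encoded in two feature channels.
C, H, W = 2, 4, 6
pixel_feats = np.zeros((C, H, W))
pixel_feats[0, :, :3] = 1.0   # left field
pixel_feats[1, :, 3:] = 1.0   # right field

queries = np.eye(2)                        # query 0 matches left, query 1 matches right
class_logits = np.array([[0.0, 0.0, 5.0],  # query 0 predicts class 2
                         [5.0, 0.0, 0.0]]) # query 1 predicts class 0

class_map, owner = decode_masks(queries, pixel_feats, class_logits)
print(class_map[0])  # [2 2 2 0 0 0]
```

Because the mask comes from the high‑resolution features while the class comes from the query, the boundary in the output follows the spatial evidence exactly, regardless of how coarse the information that produced the query was.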
We evaluate on EuroCrops crop‑class labels, achieving a Micro Recall of 84.1% and a Segmentation Quality of 84.2%. Transferability is tested with a spatial holdout protocol using geographically disjoint train/test regions, reliability is summarized by aggregate metrics on these strict splits, and uncertainty is communicated through per‑class performance variability and label‑noise sensitivity analyses that bound achievable scores.
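To make the evaluation protocol concrete, the sketch below shows a region‑level holdout split and a pooled (micro) recall alongside per‑class recall. It assumes single‑label predictions, in which case micro recall equals the fraction of correctly labelled samples; the region names, crop labels, and helper functions are hypothetical, and the abstract's exact metric definitions (including Segmentation Quality) are not reproduced here.

```python
from collections import Counter

def spatial_holdout(samples, test_regions):
    """Split so that no region contributes to both train and test."""
    train = [s for s in samples if s["region"] not in test_regions]
    test = [s for s in samples if s["region"] in test_regions]
    return train, test

def micro_recall(y_true, y_pred):
    """Pool true positives and false negatives over all classes."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall computed separately for each class present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical samples: each field belongs to one region.
samples = [
    {"id": 1, "region": "A"}, {"id": 2, "region": "A"},
    {"id": 3, "region": "B"}, {"id": 4, "region": "C"},
]
train, test = spatial_holdout(samples, test_regions={"C"})

# Hypothetical labels and predictions for six test fields.
y_true = ["wheat", "wheat", "maize", "maize", "maize", "rapeseed"]
y_pred = ["wheat", "maize", "maize", "maize", "wheat", "rapeseed"]

print(round(micro_recall(y_true, y_pred), 3))  # 0.667
print(per_class_recall(y_true, y_pred))
```

Reporting the per‑class values next to the pooled score exposes the variability that a single aggregate number hides, since frequent classes dominate the micro average.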
How to cite: Vinholi, J. G., Sleimi, R., Werner, F., and Abelló, A.: A Two‑Stream Spatiotemporal Architecture with Foundation‑Model Features Applied to Crop Classification, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19669, https://doi.org/10.5194/egusphere-egu26-19669, 2026.