EGU26-12595, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-12595
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Friday, 08 May, 10:45–12:30 (CEST), Display time Friday, 08 May, 08:30–12:30
Hall X1, X1.69
Transformer-Based Adaptive Multimodal Fusion Model for Remote Sensing Winter Wheat Yield Prediction
Haoran Meng, Joel Segarra, Shawn Carlisle Kefauver, and José Luis Araus Ortega

Large-scale, highly accurate wheat yield prediction is of great importance for ensuring food security, supporting agricultural policymaking, and guiding grain allocation. In recent years, the rapid development of remote sensing technologies and deep learning algorithms has provided powerful tools for large-scale crop yield prediction. However, crop yield is jointly influenced by multiple environmental factors, such as climate, soil, and topography, and existing studies often adopt simple feature concatenation or fixed-weight fusion strategies that lack adaptive modeling of the relative importance of each modality, which limits further improvement in prediction accuracy. To address this issue, this study proposes a Transformer-based multimodal adaptive Gated Fusion model (TMMGF). The model employs Transformer encoders to model the dynamic time series of remote sensing spectral data and climate variables, and applies multilayer perceptrons (MLPs) to the static environmental factors, soil and topography. The resulting modality embeddings are then integrated through a gated fusion mechanism that learns adaptive weights for each data source. The study was conducted across the conterminous United States using county-level winter wheat yield records from 2008 to 2023. TMMGF was systematically compared with an LSTM-based multimodal adaptive Gated Fusion model (MMGF) and four single-modal baselines: a Transformer remote sensing model, a Transformer climate model, an MLP soil model, and an MLP topography model. TMMGF achieves the best performance, with an average R² of 0.813, an RMSE of 0.571 t/ha, and a MAPE of 14.49% in 10-fold cross-validation, significantly outperforming all baselines. In particular, compared with the LSTM-based MMGF (R² = 0.796, RMSE = 0.598 t/ha, MAPE = 15.11%), TMMGF shows clear advantages in both accuracy and stability. This study demonstrates that a Transformer-based adaptive multimodal fusion framework can effectively integrate heterogeneous data sources and provides a promising technical pathway for high-accuracy, large-scale wheat yield prediction.
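
For concreteness, the sketch below shows one way the described architecture could be assembled in PyTorch: Transformer encoders for the two temporal modalities (remote sensing spectra and climate series), MLPs for the two static ones (soil and topography), and a learned gate that softmax-weights the four modality embeddings before a regression head. The module sizes, mean-pooling over time, and exact gating formulation are illustrative assumptions on our part; the abstract does not specify these details, and this is not the authors' released code.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Learns per-sample, per-modality weights and returns the weighted sum.
    def __init__(self, dim, n_modalities):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, embeddings):
        # embeddings: list of (batch, dim) tensors, one per modality
        stacked = torch.stack(embeddings, dim=1)                  # (batch, M, dim)
        weights = torch.softmax(
            self.gate(torch.cat(embeddings, dim=-1)), dim=-1)    # (batch, M)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)      # (batch, dim)

def temporal_encoder(n_feats, dim):
    # Projects a (batch, T, n_feats) series to dim, then Transformer-encodes it.
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                       dim_feedforward=128, batch_first=True)
    return nn.Sequential(nn.Linear(n_feats, dim),
                         nn.TransformerEncoder(layer, num_layers=2))

def static_encoder(n_feats, dim):
    # Small MLP for static (non-temporal) environmental features.
    return nn.Sequential(nn.Linear(n_feats, dim), nn.ReLU(), nn.Linear(dim, dim))

class TMMGFSketch(nn.Module):
    def __init__(self, rs_feats=8, clim_feats=6, soil_feats=10, topo_feats=4, dim=64):
        super().__init__()
        self.rs = temporal_encoder(rs_feats, dim)      # remote sensing spectra over time
        self.clim = temporal_encoder(clim_feats, dim)  # climate variables over time
        self.soil = static_encoder(soil_feats, dim)    # static soil properties
        self.topo = static_encoder(topo_feats, dim)    # static topography
        self.fuse = GatedFusion(dim, n_modalities=4)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, rs, clim, soil, topo):
        h = [self.rs(rs).mean(dim=1),      # mean-pool over time steps (an assumption)
             self.clim(clim).mean(dim=1),
             self.soil(soil),
             self.topo(topo)]
        return self.head(self.fuse(h)).squeeze(-1)     # predicted yield, t/ha

# Smoke test on random county-level inputs: batch of 32 counties, 20 time steps.
model = TMMGFSketch()
y_hat = model(torch.randn(32, 20, 8), torch.randn(32, 20, 6),
              torch.randn(32, 10), torch.randn(32, 4))
print(y_hat.shape)  # torch.Size([32])

The softmax over modalities is what makes the fusion adaptive: each county-year sample receives its own weighting of the remote sensing, climate, soil, and topography embeddings, rather than a fixed concatenation.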

How to cite: Meng, H., Segarra, J., Kefauver, S. C., and Araus Ortega, J. L.: Transformer-Based Adaptive Multimodal Fusion Model for Remote Sensing Winter Wheat Yield Prediction, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12595, https://doi.org/10.5194/egusphere-egu26-12595, 2026.