EGU26-12595, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-12595
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Friday, 08 May, 10:45–12:30 (CEST), Display time Friday, 08 May, 08:30–12:30
Hall X1, X1.69
Transformer-Based Adaptive Multimodal Fusion Model for Remote Sensing Winter Wheat Yield Prediction
Haoran Meng, Joel Segarra, Shawn Carlisle Kefauver, and José Luis Araus Ortega

Large-scale, highly accurate wheat yield prediction is of great importance for ensuring food security, supporting agricultural policymaking, and guiding grain allocation. In recent years, the rapid development of remote sensing technologies and deep learning algorithms has provided powerful tools for large-scale crop yield prediction. However, crop yield is jointly influenced by multiple environmental factors, such as climate, soil, and topography, and existing studies often adopt simple feature concatenation or fixed-weight fusion strategies that lack adaptive modeling of the relative importance of each modality, which limits further improvement in prediction accuracy. To address this issue, this study proposes a Transformer-based multimodal adaptive Gated Fusion model (TMMGF). The model employs Transformer encoders to model the dynamic time series of remote sensing spectral data and climate variables, and applies multilayer perceptrons (MLPs) to the static environmental factors, soil and topography. The resulting modality embeddings are then integrated through a gated fusion mechanism that learns adaptive weights for each data source. The study was conducted across the conterminous United States using county-level winter wheat yield records from 2008 to 2023. TMMGF was systematically compared with an LSTM-based multimodal adaptive Gated Fusion model (MMGF) and four single-modal baselines: a Transformer remote sensing model, a Transformer climate model, an MLP soil model, and an MLP topography model. TMMGF achieves the best performance, with an average R² of 0.813, an RMSE of 0.571 t/ha, and a MAPE of 14.49% in 10-fold cross-validation, significantly outperforming all baselines. In particular, compared with the LSTM-based MMGF (R² = 0.796, RMSE = 0.598 t/ha, MAPE = 15.11%), TMMGF shows clear advantages in both accuracy and stability. This study demonstrates that a Transformer-based adaptive multimodal fusion framework can effectively integrate heterogeneous data sources and provides a promising technical pathway for high-accuracy, large-scale wheat yield prediction.
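
For concreteness, the sketch below shows one way the described architecture could be assembled in PyTorch: Transformer encoders for the two temporal modalities (remote sensing spectra and climate series), MLPs for the two static ones (soil and topography), and a learned gate that softmax-weights the four modality embeddings before a regression head. The module sizes, mean-pooling over time, and exact gating formulation are illustrative assumptions on our part; the abstract does not specify these details, and this is not the authors' released code.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Learns per-sample, per-modality weights and returns the weighted sum.
    def __init__(self, dim, n_modalities):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, embeddings):
        # embeddings: list of (batch, dim) tensors, one per modality
        stacked = torch.stack(embeddings, dim=1)                  # (batch, M, dim)
        weights = torch.softmax(
            self.gate(torch.cat(embeddings, dim=-1)), dim=-1)    # (batch, M)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)      # (batch, dim)

def temporal_encoder(n_feats, dim):
    # Projects a (batch, T, n_feats) series to dim, then Transformer-encodes it.
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                       dim_feedforward=128, batch_first=True)
    return nn.Sequential(nn.Linear(n_feats, dim),
                         nn.TransformerEncoder(layer, num_layers=2))

def static_encoder(n_feats, dim):
    # Small MLP for static (non-temporal) environmental features.
    return nn.Sequential(nn.Linear(n_feats, dim), nn.ReLU(), nn.Linear(dim, dim))

class TMMGFSketch(nn.Module):
    def __init__(self, rs_feats=8, clim_feats=6, soil_feats=10, topo_feats=4, dim=64):
        super().__init__()
        self.rs = temporal_encoder(rs_feats, dim)      # remote sensing spectra over time
        self.clim = temporal_encoder(clim_feats, dim)  # climate variables over time
        self.soil = static_encoder(soil_feats, dim)    # static soil properties
        self.topo = static_encoder(topo_feats, dim)    # static topography
        self.fuse = GatedFusion(dim, n_modalities=4)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, rs, clim, soil, topo):
        h = [self.rs(rs).mean(dim=1),      # mean-pool over time steps (an assumption)
             self.clim(clim).mean(dim=1),
             self.soil(soil),
             self.topo(topo)]
        return self.head(self.fuse(h)).squeeze(-1)     # predicted yield, t/ha

# Smoke test on random county-level inputs: batch of 32 counties, 20 time steps.
model = TMMGFSketch()
y_hat = model(torch.randn(32, 20, 8), torch.randn(32, 20, 6),
              torch.randn(32, 10), torch.randn(32, 4))
print(y_hat.shape)  # torch.Size([32])

The softmax over modalities is what makes the fusion adaptive: each county-year sample receives its own weighting of the remote sensing, climate, soil, and topography embeddings, rather than a fixed concatenation.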

How to cite: Meng, H., Segarra, J., Kefauver, S. C., and Araus Ortega, J. L.: Transformer-Based Adaptive Multimodal Fusion Model for Remote Sensing Winter Wheat Yield Prediction, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12595, https://doi.org/10.5194/egusphere-egu26-12595, 2026.