EGU25-16009, updated on 15 Mar 2025
https://doi.org/10.5194/egusphere-egu25-16009
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Advancing Crop Yield Predictions: The Potential of Diffusion Models in Machine Learning for Agriculture
Amit Kumar Srivastava1,8, Krishnagopal Halder1, Yue Shi2, Liangxiu Han2, Radwa EI Shawi3, Jan Timko3, Wenzhi Zheng4, Gang Zhao5, Karam Alsafadi6, Manmeet Singh7, Dominik Behrend8, Thomas Gaiser8, and Frank Ewert1,8
Amit Kumar Srivastava et al.
  • 1Leibniz Centre for Agricultural Landscape Research (ZALF), Data Analysis and Simulation - Multiscale Modelling and Forecasting, Muencheberg, Germany (amitkumar.srivastava@zalf.de)
  • 2Department of Computing and Mathematics, Faculty of Science and Engineering, Manchester Metropolitan University, John Dalton Building, Chester Street, Manchester, M1 5GD, UK
  • 3Data Systems Group, Institute of Computer Science, University of Tartu, Narva mnt 18, Tartu 51008, Estonia
  • 4College of Agricultural Science and Engineering, Hohai University, No. 8 West Focheng Road, Nanjing, Jiangsu Province, China
  • 5College of Soil and Water Conservation Science and Engineering, Northwest A&F University, Yangling, Shaanxi, China
  • 6College of the Environment and Ecology, Xiamen, Fujian 361102, China
  • 7Department of Earth and Planetary Sciences, Jackson School of Geosciences, Austin, USA
  • 8Institute of Crop Science and Resource Conservation, Katzenburgweg 5, 53115 University of Bonn, Germany

The dual challenges of climate change and a growing population, projected to exceed 9 billion by 2030, call for precise regional crop yield prediction models to optimize management, ensure food security, and guide agricultural decisions. Machine learning (ML), leveraging big data and high-performance computing, provides powerful tools for addressing these complexities but faces challenges such as inconsistent data quality and variable algorithm performance. While ML algorithms such as Convolutional Neural Networks (CNNs), Random Forests (RF), and Long Short-Term Memory (LSTM) networks show promise in crop yield prediction, their performance can be hindered by noisy and incomplete data. Diffusion models, a class of probabilistic generative models with iterative denoising capabilities, offer resilience to these issues and hold significant potential to improve accuracy and reliability in crop forecasting, though their use in this domain remains largely unexplored.

This study compared XGBoost (XGB), a state-of-the-art tree-based ML model, with our proposed Diffusion-reg (DR) model. The input data were compiled from multiple sources: crop calendar data from MIRCA2000, net primary production (NPP) data from WaPOR, soil data from the SoilGrids database, and maize yield data from the FAO database. Climate variables (precipitation, air temperature, and solar radiation) were obtained from ERA5, with all data aggregated into dekadal (10-day) periods. Leaf Area Index (LAI) and Normalized Difference Vegetation Index (NDVI) data from MODIS were collected at 16-day intervals. Country-level maize yield data from the FAO were then spatially disaggregated to produce pixel-scale estimates (250 m resolution, matching the resolution of the soil input data). The disaggregation was restricted to cropland areas within the five major maize-producing countries in Sub-Saharan Africa.
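
The abstract does not state how the country-level yields were disaggregated to pixels. As a purely illustrative sketch, one simple option is to scale the national yield by a pixel-level proxy (for example NPP) so that the mean over cropland pixels reproduces the national value. The function name, the choice of proxy, and the equal-pixel-area assumption below are all hypothetical and are not taken from the study.

```python
def disaggregate_national_yield(national_yield, proxy, is_cropland):
    """Spread a national mean yield over pixels in proportion to a proxy.

    Assumptions (not from the abstract): every pixel covers the same area,
    and yield scales linearly with the proxy (e.g. NPP). Non-cropland
    pixels receive None and are excluded from the scaling.
    """
    if len(proxy) != len(is_cropland):
        raise ValueError("proxy and is_cropland must have the same length")

    # Only cropland pixels contribute to the reference mean.
    crop_values = [p for p, c in zip(proxy, is_cropland) if c]
    if not crop_values:
        raise ValueError("no cropland pixels")

    mean_proxy = sum(crop_values) / len(crop_values)
    if mean_proxy <= 0:
        raise ValueError("mean proxy over cropland must be positive")

    # Each cropland pixel gets the national yield scaled by its relative proxy.
    return [
        national_yield * p / mean_proxy if c else None
        for p, c in zip(proxy, is_cropland)
    ]


# Hypothetical toy values: four pixels, one of them not cropland.
pixel_yields = disaggregate_national_yield(
    national_yield=2.0,
    proxy=[1.0, 3.0, 5.0, 2.0],
    is_cropland=[True, True, False, True],
)
```

By construction, the mean of the cropland pixel estimates equals the national yield, so the disaggregated map stays consistent with the reported country total under these assumptions.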

The performance metrics show that the DR model outperforms XGB in every country analyzed. The R² values, which measure the proportion of variance explained by a model, are higher for DR in every case. In Ethiopia, for example, DR achieves an R² of 0.98 compared with 0.95 for XGB, while the largest gap is observed in South Africa, with R² values of 0.86 for DR and 0.76 for XGB. These results indicate that the DR model captures complex data patterns effectively, even in regions that are harder to predict.
The RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) metrics point the same way. DR has lower errors than XGB in every country, with Ethiopia showing the best performance (RMSE: 0.02, MAE: 0.01). Although South Africa records the highest RMSE (0.25) and MAE (0.13) for the DR model, these values are still lower than those of XGB. Similar patterns in Uganda and Mozambique, where the DR model substantially reduces error, further support its robustness and reliability.
In summary, the DR model outperforms XGBoost in every country analyzed, highlighting its potential for broader application in predictive tasks requiring high accuracy and resilience to noisy data.
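
For reference, the three metrics reported above can be computed from paired observed and predicted yields as in the following minimal sketch. It uses the standard definitions of R², RMSE, and MAE; the function name and the toy numbers are illustrative and are not taken from the study.

```python
import math


def regression_metrics(observed, predicted):
    """Return (r2, rmse, mae) for paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [o - p for o, p in zip(observed, predicted)]

    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares

    # R² is undefined when the observations have no variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r2, rmse, mae


# Hypothetical toy values, not data from the abstract.
r2, rmse, mae = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

RMSE and MAE carry the units of the target variable, so they are only comparable between countries if yields are expressed on the same scale.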

How to cite: Srivastava, A. K., Halder, K., Shi, Y., Han, L., EI Shawi, R., Timko, J., Zheng, W., Zhao, G., Alsafadi, K., Singh, M., Behrend, D., Gaiser, T., and Ewert, F.: Advancing Crop Yield Predictions: The Potential of Diffusion Models in Machine Learning for Agriculture, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-16009, https://doi.org/10.5194/egusphere-egu25-16009, 2025.