EGU24-22153, updated on 11 Mar 2024
https://doi.org/10.5194/egusphere-egu24-22153
EGU General Assembly 2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Spatial cross-validation of wheat yield estimations using remote sensing and machine learning

Keltoum Khechba1,2, Mariana Belgiu2, Ahmed Laamrani1,3, Qi Dong4,1, Alfred Stein2, and Abdelghani Chehbouni1
  • 1Center for Remote Sensing Applications, Mohammed VI Polytechnic University, Morocco (Keltoum.khechba@um6p.ma)
  • 2Department of Earth Observation Science, Faculty of Geo-Information Science and Earth Observation, The Netherlands
  • 3Department of Geography, Environment & Geomatics, University of Guelph, Guelph, Ontario, Canada
  • 4Key Laboratory of Earth Surface Processes and Resource Ecology, Faculty of Geographical Science, Beijing Normal University, China

Integrating machine learning (ML) with remote sensing data has been used successfully to create detailed agricultural yield maps at both local and global scales. Despite this progress, a critical issue is often overlooked: spatial autocorrelation in the geospatial data used to train and validate ML models. Random cross-validation (CV), the usual choice, does not account for it. This study assessed wheat yield estimations using both random and spatial CV. Whereas random CV splits the data randomly, spatial CV splits the data by location so that spatially close data points are grouped together and fall entirely in either the training set or the test set, never both. Conducted in Northern Morocco during the 2020-2021 agricultural season, our research uses Sentinel-1 and Sentinel-2 satellite images as input variables, together with 1329 field data locations, to estimate wheat yield. Three ML models were employed: Random Forest, XGBoost, and Multiple Linear Regression. Spatial CV was applied at three spatial scales: province, grid2, and grid1. The province is a predefined administrative division, while grid2 and grid1 are equally sized spatial blocks of 20 × 20 km and 10 × 10 km, respectively. Our findings show that with random CV all models achieve higher accuracies (R² = 0.58 and RMSE = 840 kg ha⁻¹ for the XGBoost model) than with spatial CV. Among the spatial strategies, the 10 × 10 km grid gave the highest R² (0.23, with an RMSE of 1140 kg ha⁻¹ for the XGBoost model), followed by the 20 × 20 km grid (R² = 0.11 and RMSE = 1227 kg ha⁻¹ for the XGBoost model). Province-based spatial CV resulted in the lowest accuracy, with an R² of 0.032 and an RMSE of 1282 kg ha⁻¹. These results confirm that spatial CV is essential to avoid overoptimistic estimates of model performance.
The study further highlights the importance of selecting an appropriate CV method to obtain realistic and reliable wheat yield predictions, because accuracy inflated by spatial autocorrelation does not reflect model performance under real-world conditions.
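
The grid-based splitting described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `block_id` and `spatial_folds`, the use of projected coordinates in metres, the five folds, and the synthetic point locations are all assumptions made for illustration. The sketch only shows how points can be assigned to equally sized square blocks and how whole blocks can then be held out together.

```python
import math
import random
from collections import defaultdict


def block_id(x_m, y_m, block_size_m):
    """Return the (column, row) index of the square block containing a point.

    Coordinates are assumed to be in a projected system with metre units.
    """
    return (math.floor(x_m / block_size_m), math.floor(y_m / block_size_m))


def spatial_folds(coords, block_size_m, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs in which each block is never split."""
    # Group point indices by the block they fall in.
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[block_id(x, y, block_size_m)].append(i)

    # Shuffle the blocks (not the points) and deal them out to the folds.
    block_keys = sorted(blocks)
    random.Random(seed).shuffle(block_keys)
    for k in range(n_folds):
        test_blocks = set(block_keys[k::n_folds])
        test_idx = sorted(i for b in test_blocks for i in blocks[b])
        train_idx = sorted(
            i for b in block_keys if b not in test_blocks for i in blocks[b]
        )
        yield train_idx, test_idx


# Hypothetical field locations over a 60 km x 60 km area (not the study data).
rng = random.Random(42)
coords = [(rng.uniform(0, 60_000), rng.uniform(0, 60_000)) for _ in range(200)]

# 10 km blocks, analogous to the finer grid; 5 folds is an assumed value.
folds = list(spatial_folds(coords, block_size_m=10_000, n_folds=5))
for train_idx, test_idx in folds:
    train_blocks = {block_id(*coords[i], 10_000) for i in train_idx}
    test_blocks = {block_id(*coords[i], 10_000) for i in test_idx}
    # No block contributes points to both the training and the test set.
    assert train_blocks.isdisjoint(test_blocks)
```

Changing `block_size_m` to 20 000 gives the coarser grid, and replacing `block_id` with a lookup of each point's province would give an administrative split. A random CV, by contrast, would shuffle the point indices directly, so neighbouring points could end up on both sides of the split.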

How to cite: Khechba, K., Belgiu, M., Laamrani, A., Dong, Q., Stein, A., and Chehbouni, A.: Spatial cross-validation of wheat yield estimations using remote sensing and machine learning, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-22153, https://doi.org/10.5194/egusphere-egu24-22153, 2024.