- Helmholtz Centre for Environmental Research - UFZ, Department of Aquatic Ecosystem Analysis and Management (ASAM), Magdeburg, Germany (younes.garosi@ufz.de)
Accurate estimation of soil organic carbon (SOC) content at large scales is essential for sustainable agriculture, climate change mitigation, and land management. This study assessed how different soil sampling algorithms (SSA), used to select optimal soil samples from legacy soil datasets, affect the prediction of SOC content in the bare soil areas of the State of Bavaria, Germany. For this purpose, a matrix of soil samples and the corresponding covariate values at each sample point was compiled under three scenarios. In the first scenario, the one most commonly used in digital soil mapping (DSM) studies, each covariate value was taken from the exact pixel containing the soil sample location (the sample pixel). In the second and third scenarios, based on a filter-based parameterization, the covariate values at each soil sample location were calculated from 3 × 3 and 5 × 5 pixel windows, respectively. For each scenario, three SSA, namely simple random sampling (SRS), conditioned Latin hypercube sampling (cLHS), and feature space coverage sampling (FSCS), were applied to select the optimal number of soil samples to be used as the calibration dataset; the soil samples that were not selected formed the validation dataset. The SSA were applied at four calibration/validation (cal/val) splitting ratios: 50–50, 60–40, 70–30, and 80–20. For each scenario, SSA, and splitting ratio, the cal/val split was repeated 50 times to assess how consistently each SSA selects the same soil samples across repetitions. A random forest (RF) model was trained on each calibration dataset to predict the SOC content in the corresponding validation dataset.
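The covariate scenarios and the repeated cal/val splitting described above can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not state which filter was applied within the 3 × 3 and 5 × 5 windows, so a window mean is assumed here, and only SRS is shown (cLHS, FSCS, and the RF model are not implemented). The function names `window_mean` and `srs_split`, the toy raster, and the sample count of 100 are hypothetical.

```python
import random
from statistics import mean


def window_mean(raster, row, col, size=1):
    """Mean of a size x size window centred on (row, col).

    size=1 returns the sample pixel itself (scenario 1); size=3 and size=5
    correspond to scenarios 2 and 3. Using the mean is an assumption.
    Windows are clipped at the raster edges.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd integer")
    half = size // 2
    n_rows, n_cols = len(raster), len(raster[0])
    values = [
        raster[r][c]
        for r in range(max(0, row - half), min(n_rows, row + half + 1))
        for c in range(max(0, col - half), min(n_cols, col + half + 1))
    ]
    return mean(values)


def srs_split(n_samples, cal_fraction, seed):
    """Simple random sampling: return (calibration, validation) index lists."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_cal = round(n_samples * cal_fraction)
    return sorted(indices[:n_cal]), sorted(indices[n_cal:])


# Toy 5 x 5 covariate raster with a single high value at the sample pixel.
raster = [[0.0] * 5 for _ in range(5)]
raster[2][2] = 9.0

# Covariate value at the sample location under the three scenarios.
scenario_values = {size: window_mean(raster, 2, 2, size) for size in (1, 3, 5)}

# 50 repetitions of an 80-20 cal/val split for 100 hypothetical samples.
splits = [srs_split(100, 0.8, seed) for seed in range(50)]
```

In this toy case the window scenarios smooth the isolated pixel value, which is the kind of difference in covariate parameterization the three scenarios are meant to compare. Because SRS is random, the 50 splits generally differ from one another, whereas a fully deterministic algorithm would return the same calibration set each time.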
The performance analysis showed that cLHS with an 80–20 splitting ratio in the second scenario (3 × 3 pixel window) outperformed the other SSA and scenarios for predicting SOC content. For this combination, the median values of the root mean square error (RMSE, %), coefficient of determination (R²), and mean error (ME) were 1.13, 0.73, and –0.07, respectively. These results indicate that the type of SSA, the cal/val splitting ratio, and the parameterization of covariate values around the sample pixel can influence the performance of the machine learning model in predicting SOC content. Before these findings are generalized, however, further studies are needed using other SSA and different pixel windows around the sample pixel under different conditions (e.g., climate, soil types, and geology).
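For reference, the three indices can be computed as in the sketch below. The abstract does not give the formulas, so two conventions are assumed: R² is taken as 1 minus the ratio of residual to total sum of squares, and ME as the mean of predicted minus observed values (negative values then indicate underprediction). The function names and the toy data are hypothetical and are not results from the study.

```python
from math import sqrt
from statistics import mean, median


def rmse(observed, predicted):
    """Root mean square error."""
    return sqrt(mean([(p - o) ** 2 for o, p in zip(observed, predicted)]))


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_error(observed, predicted):
    """Mean error (bias), assumed here as mean(predicted - observed)."""
    return mean([p - o for o, p in zip(observed, predicted)])


# Toy validation data (illustrative SOC values, %).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 2.5, 3.5]

scores = {
    "RMSE": rmse(observed, predicted),
    "R2": r_squared(observed, predicted),
    "ME": mean_error(observed, predicted),
}

# With repeated splits, each index is summarised by its median over repetitions.
rmse_per_repetition = [1.2, 1.1, 1.3]  # hypothetical values
median_rmse = median(rmse_per_repetition)
```

Reporting the median over the 50 repetitions, rather than a single split, reduces the influence of any one particularly favourable or unfavourable calibration set.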
How to cite: Garosi, Y., Nussbaum, M., Htitiou, A., Gabriel, D., Rode, M., and Möller, M.: Selecting the best combination of different parameterizations of covariates and sampling algorithms for the spatial prediction of the soil organic carbon contents, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-1458, https://doi.org/10.5194/egusphere-egu26-1458, 2026.