- Institute of Bio- and Geosciences, IBG-3 Agrosphere, Forschungszentrum Jülich GmbH, Jülich D-52425, Germany
Machine-learning (ML) models are increasingly used to predict environmental process parameters from literature-derived datasets. A common but rarely scrutinized practice is random data splitting for model training and evaluation, which implicitly assumes independence among samples. However, environmental datasets often contain strong group structures arising from shared soil sources. Samples originating from the same soil may share substantial, unquantified microbial information, including community composition, functional potential, and legacy effects, which cannot be fully represented by standard physicochemical descriptors. Ignoring such group structure may therefore induce group-wise information leakage and lead to overoptimistic assessments of model performance.
Here, we systematically examine the consequences of random versus group-wise data splitting for ML-based prediction of the distribution coefficient (Kd) and the first-order degradation rate constant (μ) of atrazine in soils. Data were compiled from published batch experiments and incubation studies, comprising 306 datasets from 205 distinct soils (adsorption) and 329 datasets from 77 distinct soils (degradation); groups were defined exclusively by shared soil source. This grouping strategy explicitly reflects the presence of latent microbial controls that remain unobservable to the model. ML models were trained using identical algorithms but evaluated under two contrasting strategies: (i) conventional random splitting, which ignores the soil-based group structure, and (ii) group-wise splitting, which enforces complete separation of soil sources between training and testing sets.
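The difference between the two evaluation strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`random_split`, `group_split`, `shared_groups`), the 20% test fraction, and the toy soil labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def random_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them, ignoring any group structure."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = max(1, round(test_fraction * n_samples))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def group_split(groups, test_fraction=0.2, seed=0):
    """Assign whole groups (here: soil sources) to either train or test."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    group_ids = sorted(by_group)
    rng.shuffle(group_ids)
    n_test_groups = max(1, round(test_fraction * len(group_ids)))
    test_groups = set(group_ids[:n_test_groups])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test


def shared_groups(groups, train, test):
    """Return the groups that occur on both sides of a split (leakage)."""
    return {groups[i] for i in train} & {groups[i] for i in test}


# Hypothetical toy data: 5 soils with 4 samples each.
soil_ids = [f"soil_{k}" for k in range(5) for _ in range(4)]

rand_train, rand_test = random_split(len(soil_ids), 0.2, seed=0)
grp_train, grp_test = group_split(soil_ids, 0.2, seed=0)

# Soils seen in both training and testing under random splitting.
print("random split, shared soils:", shared_groups(soil_ids, rand_train, rand_test))
# Under group-wise splitting this set is empty by construction.
print("group split, shared soils:", shared_groups(soil_ids, grp_train, grp_test))
```

In practice, library routines such as scikit-learn's `GroupShuffleSplit` or `GroupKFold` provide the same guarantee when given the soil identifier as the grouping variable.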
Taking atrazine degradation as an example, under random splitting, models exhibit apparently strong predictive performance, characterized by near-zero mean bias and an inflated coefficient of determination (R² against the 1:1 line = 0.835; RMSE = 0.037; MAE = 0.019). In contrast, group-wise splitting reveals a pronounced decline in performance, with the coefficient of determination against the 1:1 line dropping to R² = 0.099, accompanied by substantially increased errors (RMSE = 0.093; MAE = 0.053) and systematic overestimation of μ, reflected by a positive bias of 0.013 (≈ 24%). A similar pattern emerges for atrazine adsorption. These results demonstrate that random data splitting can fundamentally overstate the predictive capability of ML models trained on literature-derived soil datasets when shared soil sources are present. Therefore, we argue that soil-based group-wise evaluation is essential for ensuring robust assessment of model generalizability in data-driven studies of soil biogeochemical processes.
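For readers unfamiliar with the reported metrics, the following Python sketch shows one common way to compute them. It assumes that R² against the 1:1 line is calculated as 1 − SS_res/SS_tot with residuals taken directly between predictions and observations (no regression fit), and that bias is the mean of predicted minus observed values, so that a positive value indicates overestimation. The function name `evaluation_metrics` and the toy values are hypothetical and are not taken from the study.

```python
import math


def evaluation_metrics(observed, predicted):
    """Compute R² against the 1:1 line, RMSE, MAE, and mean bias."""
    n = len(observed)
    # Residuals as predicted minus observed: positive means overestimation.
    residuals = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "r2_1to1": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "bias": sum(residuals) / n,
    }


# Hypothetical toy values: predictions are shifted upward by a constant 0.01.
observed = [0.01, 0.02, 0.03, 0.04]
predicted = [0.02, 0.03, 0.04, 0.05]

metrics = evaluation_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the predictions are perfectly correlated with the observations, yet R² against the 1:1 line is low because the constant offset counts fully toward the residual sum of squares. This is why the metric is sensitive to the kind of systematic overestimation described above.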
How to cite: Chen, F. and Vanderborght, J.: Random data splitting of literature-derived data ignoring group structure leads to group-wise information leakage in machine-learning models: Evidence from atrazine adsorption and degradation in soils, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19636, https://doi.org/10.5194/egusphere-egu26-19636, 2026.