EGU2020-5414, updated on 12 Jun 2020
EGU General Assembly 2020
© Author(s) 2020. This work is distributed under
the Creative Commons Attribution 4.0 License.

Predictive modelling of groundwater nitrate pollution at a regional scale using machine learning and feature selection

Aaron Cardenas-Martinez1, Victor Rodriguez-Galiano1, Juan Antonio Luque-Espinar2, and Maria Paula Mendes3
  • 1Universidad de Sevilla, Geografía Física y Análisis Geográfico Regional, 41004 Sevilla, España
  • 2Instituto Geológico y Minero de España (IGME), Granada, España
  • 3Civil Engineering Research and Innovation for Sustainability (CERIS), Instituto Superior Técnico, Universidade de Lisboa, Portugal

Establishing the sources and driving forces of groundwater nitrate pollution is of paramount importance, as it contributes to the implementation and evaluation of agro-environmental measures. High concentrations of nitrates in groundwater occur all around the world, in both rich and less developed countries.

In the case of Spain, 21.5% of the wells of the groundwater quality monitoring network showed mean concentrations above the quality standard (QS) of 50 mg/l. The objectives of this work were: i) to predict the current probability of nitrate concentrations exceeding the QS in Andalusian groundwater bodies (Spain) using features from past years, some of them obtained from satellite observations; ii) to assess the importance of features in the prediction; iii) to evaluate different machine learning (ML) approaches and feature selection (FS) techniques.

Several predictive models based on an ML algorithm, Random Forest, were used together with FS techniques. A total of 321 nitrate samples and their respective predictive features were obtained from different groundwater bodies. These predictive features were divided into three groups according to their focus: agricultural production (phenology); livestock pressure (excretion rates); and environmental settings (soil characteristics and texture, geomorphology, and local climate conditions). Models were trained with the features of a given year [YEAR(t0)] and then applied to new features obtained for the following year [YEAR(t0+1)], performing k-fold cross-validation. Additionally, a further prediction was carried out for the present time [YEAR(t0+n)] and validated with an independent test set. This methodology examined whether a model trained with previous nitrate concentrations and predictive features can predict current nitrate concentrations from present features. Our findings showed that a wrapper with sequential search for FS improved predictive performance compared with using the Random Forest algorithm alone. Phenology features, derived from remotely sensed variables, were the most explanatory features, performing better than static land-use maps or vegetation index images (e.g., NDVI). They also provided much more comprehensive information and, more importantly, relied only on extrinsic features of groundwater bodies.
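The workflow described above (Random Forest, a wrapper with sequential search for FS, k-fold cross-validation, and transfer from YEAR(t0) to YEAR(t0+1)) could be sketched roughly as follows. This is an illustrative sketch using scikit-learn, not the authors' implementation: the data are synthetic, and the number of features, the forward search direction, the number of selected features, the three folds, the ROC AUC score, and the helper `synthetic_year` are all assumptions. Only the 321 samples and the 50 mg/l QS come from the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

QS_MG_L = 50.0               # quality standard stated in the abstract
N_WELLS, N_FEATURES = 321, 8  # 321 samples from the abstract; 8 features is an assumption


def synthetic_year(rng):
    """Hypothetical stand-in for one year of predictive features and nitrate data."""
    X = rng.normal(size=(N_WELLS, N_FEATURES))
    # Assumption: only the first two features drive nitrate concentration.
    nitrate = 40 + 15 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=5, size=N_WELLS)
    y = (nitrate > QS_MG_L).astype(int)  # 1 = concentration above the QS
    return X, y


rng = np.random.default_rng(0)
X_t0, y_t0 = synthetic_year(rng)  # YEAR(t0): training year
X_t1, y_t1 = synthetic_year(rng)  # YEAR(t0+1): year the model is applied to

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
rf = RandomForestClassifier(n_estimators=25, random_state=0)

# Baseline: Random Forest alone, scored by k-fold cross-validation on YEAR(t0).
baseline_auc = cross_val_score(rf, X_t0, y_t0, cv=cv, scoring="roc_auc").mean()

# Wrapper FS: forward sequential search, scoring each candidate subset with the RF.
selector = SequentialFeatureSelector(
    rf, n_features_to_select=3, direction="forward", scoring="roc_auc", cv=cv
)
selector.fit(X_t0, y_t0)
selected = np.flatnonzero(selector.get_support())

# Train on YEAR(t0) with the selected features, then predict the probability
# of exceeding the QS from the YEAR(t0+1) features.
rf_fs = RandomForestClassifier(n_estimators=25, random_state=0)
rf_fs.fit(X_t0[:, selected], y_t0)
prob_t1 = rf_fs.predict_proba(X_t1[:, selected])[:, 1]
auc_t1 = roc_auc_score(y_t1, prob_t1)

print("selected feature indices:", selected)
print("baseline CV AUC on t0:", round(baseline_auc, 3))
print("AUC on t0+1 with FS:", round(auc_t1, 3))
```

With real data, each synthetic matrix would be replaced by the phenology, livestock and environmental features of the corresponding year, and the final model would be checked against an independent test set for YEAR(t0+n).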

How to cite: Cardenas-Martinez, A., Rodriguez-Galiano, V., Luque-Espinar, J. A., and Mendes, M. P.: Predictive modelling of groundwater nitrate pollution at a regional scale using machine learning and feature selection, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-5414, 2020

Comments on the presentation

AC: Author Comment | CC: Community Comment

Presentation version 2 – uploaded on 03 May 2020
  • CC1: Comment on EGU2020-5414, Joseph Guttman, 04 May 2020
    1. Is there a similar behavior (trend) between chloride and nitrate in the wells? It could show that both come from the same source.
    2. I recommend checking whether there is a correlation between the pumping depth (top of screen) and the nitrate concentration. A high correlation would show that the pumping depth is an important parameter that must be taken into consideration in the model.
    • AC1: Reply to CC1, Victor Rodriguez-Galiano, 04 May 2020

      Thanks for your comment. Yes, that could help to explain nitrate concentrations. However, this is a regional study that uses many different wells with limited information. The dataset is at country level, and we do not have access to certain information such as well depth. This is why we decided to build our model using extrinsic features, i.e. remote sensing, to account for pressures from agriculture.

      • CC2: Reply to AC1, Joseph Guttman, 04 May 2020

        Thanks for the quick reply. I understand your limitation with data availability. From my experience, I think my suggestion can be helpful, and I recommend putting some pressure on the authorities to provide you with technical data from the wells.

Presentation version 1 – uploaded on 03 May 2020, no comments