Comparative analysis of Machine Learning models for predicting the trihalomethanes formation potential in a Drinking Water Treatment Plant in Spain

Mireia Pla-Castellana; Oriol Gutierrez; Jordi Raich-Montiu; Wolfgang Gernjak

doi:https://doi.org/10.5194/egusphere-egu24-10984

Session GI6.4

EGU24-10984, updated on 08 Mar 2024

EGU General Assembly 2024

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Mireia Pla-Castellana¹, Oriol Gutierrez²,⁴, Jordi Raich-Montiu³, and Wolfgang Gernjak²,⁵

¹Karlsruhe Institute of Technology, Institute of Meteorology and Climate Research, Atmospheric Environmental Research, Stuttgart, Germany (mireiapla1988@gmail.com)
²Catalan Institute for Water Research (ICRA), Girona, Spain
³s::can Iberia Sistemas de Medición, Barcelona, Spain
⁴Universitat de Girona (UdG), Girona, Spain
⁵Institució Catalana de Recerca i Estudis Avançats (ICREA), Spain

Trihalomethanes (THMs), which may be harmful to human health if ingested or inhaled, are produced when organic matter reacts with chlorine. Their formation during potabilization must therefore be controlled to ensure safe drinking water.

In this study, the predictive capacity of a Multiple Linear Regression (MLR) model and an Artificial Neural Network (ANN) model was compared using real-time, field-scale data on the THM formation potential (THM FP) from a Spanish Drinking Water Treatment Plant (DWTP). Spectral absorbance data obtained with Spectro::lyser® probes installed at several treatment steps of the plant were the independent variables used to construct the models. Variable selection was based on the Stepwise Selection (SS) procedure.
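As an illustration of how a stepwise procedure can pick wavelengths for an MLR model, the following is a minimal sketch of forward selection. The abstract does not state the direction of the search or the entry criterion, so the adjusted R² criterion, the `min_gain` threshold, the function names, and the synthetic absorbance values are all assumptions made for this example, not the authors' method.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def adjusted_r2(X, y, cols):
    """Adjusted R² of a least-squares fit of y on the chosen columns plus an intercept."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    p = len(cols) + 1  # number of fitted coefficients, intercept included
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)  # normal equations; adequate for a small sketch
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean) ** 2 for yi in y)
    n = len(y)  # assumes n > p so the correction factor is defined
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - p)


def forward_select(X, y, min_gain=1e-6):
    """Greedily add the column that most improves adjusted R²; stop when the gain is too small."""
    selected, best = [], 0.0  # the intercept-only model has adjusted R² = 0
    remaining = list(range(len(X[0])))
    while remaining:
        score, col = max((adjusted_r2(X, y, selected + [c]), c) for c in remaining)
        if score - best < min_gain:
            break
        selected.append(col)
        remaining.remove(col)
        best = score
    return selected


# Synthetic, purely illustrative absorbances at three hypothetical wavelengths.
x0 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
x1 = [0.4, 0.4, 0.2, 0.2, 0.2, 0.2, 0.4, 0.4]  # built to be unrelated to the response
x2 = [0.6, 0.4, 0.6, 0.4, 0.6, 0.4, 0.6, 0.4]
X = [list(row) for row in zip(x0, x1, x2)]
y = [2.0 + 30.0 * a + 50.0 * c for a, _, c in X]  # made-up response, not THM FP data

print(forward_select(X, y))
```

In this toy data set the response depends only on the first and third columns, so the search should retain those two and stop before adding the unrelated one. Real stepwise implementations often use p-values or information criteria instead, and may also remove variables that were added earlier.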

After fitting, the ANN showed a good fit (R² = 0.92; RMSE = 0.77) and clearly outperformed the MLR model (R² = 0.35; RMSE = 1.65). Severe multicollinearity among wavelengths is responsible for the difference in accuracy between the two models. Although a prior screening based on the Variance Inflation Factor (VIF) reduced this multicollinearity, it remained very high for some of the retained wavelengths. As a result, large spurious correlations arose, which degraded the MLR model's predictive performance (R² = 0.30 in the validation set). For the ANN, R² also decreased in the validation set, perhaps indicating slight overtraining, but the resulting value (0.72) was still high.
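For readers unfamiliar with the quantities reported above, the sketch below shows how R², RMSE, and the VIF are commonly computed. It is an illustration only: the function names and the synthetic absorbance values are assumptions, and the VIF is shown for the special case of exactly two predictors, where it reduces to 1 / (1 - r²) with r the Pearson correlation. With more than two predictors, r² is replaced by the R² obtained from regressing one predictor on all the others.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error, in the units of the observations."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def pairwise_vif(a, b):
    """VIF of either predictor in a model with exactly two predictors."""
    return 1.0 / (1.0 - pearson_r(a, b) ** 2)


# Synthetic absorbances at two neighbouring hypothetical wavelengths (not plant data).
abs_a = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
abs_b = [0.11, 0.19, 0.31, 0.40, 0.49, 0.61]

print(pairwise_vif(abs_a, abs_b))
```

Because the two synthetic series are almost proportional, their VIF comes out far above 10, a commonly cited rule-of-thumb threshold. Absorbances at adjacent wavelengths tend to behave this way, which is why multicollinearity can inflate the variance of MLR coefficients.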

This study showed that Machine Learning models such as Artificial Neural Networks, when based on spectral absorbance data, can enhance operators' ability to respond to critical events and can become a decisive component of the daily management of drinking water in DWTPs.

How to cite: Pla-Castellana, M., Gutierrez, O., Raich-Montiu, J., and Gernjak, W.: Comparative analysis of Machine Learning models for predicting the trihalomethanes formation potential in a Drinking Water Treatment Plant in Spain, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-10984, https://doi.org/10.5194/egusphere-egu24-10984, 2024.