ITS1.12/HS12.1 | Handling data imperfections across disciplines: a panorama of current practices and new avenues in Geosciences
EDI
Convener: Nanee Chahinian | Co-conveners: Franco Alberto Cardillo, Minh Thu Tran Nguyen, Jeremy Rohmer, Carole Delenne
Orals | Wed, 30 Apr, 16:15–18:00 (CEST) | Room 2.17
Posters on site | Attendance Wed, 30 Apr, 14:00–15:45 (CEST) | Display Wed, 30 Apr, 14:00–18:00 | Hall A

Orals: Wed, 30 Apr | Room 2.17

The oral presentations are given in a hybrid format supported by a Zoom meeting featuring on-site and virtual presentations. The button to access the Zoom meeting appears just before the time block starts.
Chairperson: Nanee Chahinian
16:15–16:20
16:20–16:40
16:40–16:50 | EGU25-12477 | ECS | On-site presentation
Muhammed Denizoğlu, İsmail Sezen, Ali Deniz, and Alper Ünal

Conducting accurate air quality measurements is of critical importance for sustaining environmental and public health; however, gaps in the respective datasets, arising for various reasons, often undermine the reliability of subsequent analyses. This study therefore presents a novel hybrid methodology that leverages the Optuna framework to optimize the hyperparameters of the Extreme Gradient Boosting (XGBoost) model for imputing missing values in PM2.5 data, one of the most significant indicators of air quality. The proposed approach was systematically evaluated under varying data loss scenarios, using synthetic datasets generated under the Missing Completely at Random (MCAR) mechanism with missing rates of 5%, 10%, 20%, and 30%. Traditional interpolation methods (such as linear and spline) and widely adopted machine learning techniques (i.e., random forest, multivariate adaptive regression splines) were also applied, providing both benchmarks and a comparative setting. Three experimental configurations were examined: (1) imputation based solely on the PM2.5 time series, (2) integration of ERA5 reanalysis covariates, and (3) inclusion of data from neighboring monitoring stations. The results indicate that the XGBoost-Optuna model outperformed its counterparts across all missing data scenarios, with R² values of 0.852, 0.874, 0.862, and 0.866 for missing rates of 5%, 10%, 20%, and 30%, respectively. These findings highlight the potential of the XGBoost-Optuna model as a robust tool for handling missing air quality data, ensuring enhanced accuracy across varying data gaps and scenarios.
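The evaluation protocol described in the abstract, removing observed values completely at random and scoring the imputed values against the held-out truth, can be sketched in a few lines. This is a minimal NumPy illustration with a toy series and a linear-interpolation baseline, not the authors' XGBoost-Optuna pipeline; all data and names are ours.

```python
import numpy as np

def mcar_mask(n, missing_rate, rng):
    """Boolean mask, True where a value is removed (MCAR mechanism)."""
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(round(n * missing_rate)), replace=False)] = True
    return mask

def r2_score(y_true, y_pred):
    """Coefficient of determination on the held-out (masked) values."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
t = np.arange(2000)
# toy hourly PM2.5-like series with a diurnal cycle and noise
pm25 = 25 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, t.size)

for rate in (0.05, 0.10, 0.20, 0.30):
    mask = mcar_mask(pm25.size, rate, rng)
    imputed = np.interp(t, t[~mask], pm25[~mask])   # linear baseline
    print(f"missing {rate:.0%}: R2 = {r2_score(pm25[mask], imputed[mask]):.3f}")
```

Any candidate imputer (spline, random forest, tuned XGBoost) can be dropped in place of `np.interp` and scored on exactly the same masks.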

How to cite: Denizoğlu, M., Sezen, İ., Deniz, A., and Ünal, A.: A Novel Hybrid Approach for Missing PM2.5 Data Imputation Using Optuna-Optimized Extreme Gradient Boosting, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-12477, https://doi.org/10.5194/egusphere-egu25-12477, 2025.

16:50–17:00 | EGU25-2634 | On-site presentation
Ju-Yong Lee, Seung-Hee Han, Kwon Jang, Kyung-Hui Wang, Hui-Young Yun, and Dae-Ryun Choi

PM-2.5 is a critical pollutant for air quality evaluation and public health policymaking, necessitating accurate data for reliable analysis. However, environmental data often contain missing values due to equipment malfunctions or extreme weather conditions, which undermine the credibility of analysis and predictions. In particular, the frequent fluctuations of PM-2.5 levels in Seoul highlight the importance of addressing missing data issues.

This study systematically compares the performance of various missing data imputation methods for PM-2.5 data in Seoul, aiming to identify the optimal approach for medium- and long-term predictions. By generating and evaluating missing data during high- and low-concentration periods, this research differentiates itself from prior studies and enhances practical applicability.

A range of statistical and machine learning-based methods, including FFILL, KNN, MICE, SARIMAX, DNN, and LSTM, were applied to impute missing data. The performance of each method was evaluated over 6-hour, 12-hour, and 24-hour intervals using metrics such as RMSE, MAE, and correlation coefficients. The experimental design incorporated real-world air quality conditions by selecting data from periods of significant PM-2.5 variation.

KNN demonstrated balanced performance across all time intervals and yielded the best results for medium- and long-term predictions. FFILL showed excellent accuracy over short time intervals but exhibited declining performance as the interval length increased. Conversely, deep learning-based models, such as DNN and LSTM, showed relatively poor performance, indicating the need for further optimization to account for the characteristics of time-series data.

This study confirms that KNN is the most suitable method for PM-2.5 missing data imputation due to its simplicity and computational efficiency. These findings enhance the reliability of air quality data analysis and provide a valuable foundation for effective air quality management and policymaking. Furthermore, the results underscore the importance of selecting appropriate imputation methods to improve predictive accuracy and analytical reliability.
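Two of the compared imputers, FFILL and a temporal KNN (filling each gap with the mean of the k nearest observed values in time), can be sketched as follows. This is a minimal illustration of the two techniques, not the study's implementation; the toy series is ours.

```python
import numpy as np

def ffill(x):
    """Forward-fill: carry the last observed value into each gap."""
    out = x.copy()
    last = np.nan
    for i, v in enumerate(out):
        if np.isnan(v):
            out[i] = last
        else:
            last = v
    return out

def knn_impute(x, k=5):
    """Temporal KNN: fill each gap with the mean of the k closest
    observed values (distance measured along the time axis)."""
    out = x.copy()
    obs = np.flatnonzero(~np.isnan(x))
    for i in np.flatnonzero(np.isnan(x)):
        nearest = obs[np.argsort(np.abs(obs - i))[:k]]
        out[i] = x[nearest].mean()
    return out

x = np.array([10.0, 11.0, np.nan, np.nan, 14.0, 15.0])
print(ffill(x))         # gaps take the last observed value, 11.0
print(knn_impute(x, 3)) # gaps take a local average from both sides
```

The contrast in the abstract is visible even here: FFILL extrapolates one side of the gap, so its error grows with gap length, while KNN draws on observations from both sides.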

This research was supported by the Particulate Matter Management Specialized Graduate Program through the Korea Environmental Industry & Technology Institute (KEITI), funded by the Ministry of Environment (MOE).

 

How to cite: Lee, J.-Y., Han, S.-H., Jang, K., Wang, K.-H., Yun, H.-Y., and Choi, D.-R.: Comparison of Models for Missing Data Imputation in Environmental Data: A Case Study of PM-2.5 in Seoul, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-2634, https://doi.org/10.5194/egusphere-egu25-2634, 2025.

17:00–17:10 | EGU25-5609 | ECS | On-site presentation
Jose Araya, Yiannis Proestos, and Jos Lelieveld

With the advent of Machine Learning methods and the development of new techniques in data mining, knowledge representation and data extraction, new possibilities have emerged to address the challenges posed by data imperfections. In this context, there are different methods for producing synthetic time series, which vary across goals and disciplines. In certain situations, it can be challenging to obtain the relevant data required to test assumptions about the skill and performance of machine learning models. Synthetic data generation approaches provide an effective solution by enabling the testing of machine learning algorithms in the absence of real data.

Although data availability is seemingly ubiquitous these days, a paradox arises in situations where bureaucratic, practical, or technical limitations make it difficult for researchers to rely on the required data, particularly when accessing real measurements (e.g., time series data) for specific purposes.

Our preliminary study features a case in operational meteorology where synthetic data proves particularly useful, addressing challenges associated with limited or inaccessible real measurements. Specifically, we investigate the capability of machine learning algorithms to generate high-quality synthetic time series that can be applied in meteorological data processing and analysis. To achieve this, synthetic datasets were developed based on informed criteria that integrate dynamical features of near-surface temperature data, tailored to the unique geographic and environmental context of Cyprus. These criteria include key characteristics such as trends, extreme values, diurnal cycles and vertical temperature gradients, ensuring a realistic and comprehensive representation of near-surface temperature behavior. This approach facilitates the testing and validation of data-driven models in operational settings, providing a robust framework for evaluating their performance under controlled, yet realistic, conditions.

We characterized the general features of these synthetic datasets and evaluated their utility as benchmarks for data quality control purposes. Our findings underscore the potential value of synthetic datasets in operational meteorology, particularly in supporting the development and evaluation of robust, purpose-specific, machine learning algorithms. 
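The kind of criteria-driven generation described above (trends, diurnal cycles, extremes) can be sketched as an additive model. This is an illustrative construction with parameter values chosen by us, not the authors' generator for Cyprus.

```python
import numpy as np

def synthetic_temperature(days, rng, base=20.0, trend_per_day=0.005,
                          diurnal_amp=6.0, noise_sd=0.8, extreme_prob=0.01):
    """Hourly near-surface temperature built from informed criteria:
    a linear trend, a diurnal cycle peaking mid-afternoon, stochastic
    noise, and rare positive extremes."""
    hours = np.arange(days * 24)
    trend = trend_per_day * hours / 24.0
    diurnal = diurnal_amp * np.sin(2 * np.pi * (hours % 24 - 9) / 24)
    noise = rng.normal(0, noise_sd, hours.size)
    extremes = (rng.random(hours.size) < extreme_prob) * rng.normal(8, 2, hours.size)
    return base + trend + diurnal + noise + extremes

rng = np.random.default_rng(7)
series = synthetic_temperature(365, rng)
print(series.shape, round(float(series.mean()), 1))
```

Because every component is injected deliberately, a quality-control algorithm can be benchmarked on exactly the anomalies the series is known to contain.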

How to cite: Araya, J., Proestos, Y., and Lelieveld, J.: Use of synthetic time series datasets for quality control of meteorological data, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-5609, https://doi.org/10.5194/egusphere-egu25-5609, 2025.

17:10–17:20 | EGU25-8379 | ECS | On-site presentation
Omar Et-targuy, Carole Delenne, Ahlame Begdouri, and Salem Benferhat

Wastewater networks are inherently interconnected systems, yet the Shapefile model commonly used in Geographic Information Systems (GIS) fails to adequately represent their connectivity. This limitation arises from the non-topological nature of the Shapefile model, which stores different components—such as manholes, pipes and pumps—in separate databases without preserving their real-world interconnections. Positional imprecision and the lack of explicit topological relationships further aggravate this issue, resulting in a representation that fails to reflect the interconnected nature of the objects. To address this problem, we propose a graph-based representation where network components are modeled as nodes and their connections as edges. This approach captures the true structure of wastewater networks while resolving disconnections and accounting for missing elements through the introduction of dummy nodes. Validation on real-world datasets demonstrates the efficacy of this method in delivering a cohesive and precise representation.
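The core idea, snapping pipe endpoints to nearby manholes and inserting dummy nodes where no match exists, can be sketched as below. Coordinates, the tolerance value, and all names are illustrative assumptions, not the authors' algorithm.

```python
import math

def build_network(manholes, pipes, tol=0.5):
    """Sewer network as a graph: manholes are nodes; each pipe endpoint
    is snapped to the nearest node within `tol`, otherwise a dummy node
    is created so the pipe stays connected to the graph."""
    nodes = dict(manholes)            # id -> (x, y)
    edges, n_dummy = [], 0

    def snap(pt):
        nonlocal n_dummy
        best, best_d = None, tol
        for nid, (x, y) in nodes.items():
            d = math.hypot(pt[0] - x, pt[1] - y)
            if d <= best_d:
                best, best_d = nid, d
        if best is None:              # positional imprecision: no match
            n_dummy += 1
            best = f"dummy_{n_dummy}"
            nodes[best] = pt          # dummy node keeps connectivity
        return best

    for p_start, p_end in pipes:
        edges.append((snap(p_start), snap(p_end)))
    return nodes, edges

manholes = {"MH1": (0.0, 0.0), "MH2": (10.0, 0.0)}
pipes = [((0.1, 0.0), (9.8, 0.1)),      # imprecise but snaps to MH1 -> MH2
         ((10.0, 0.2), (20.0, 0.0))]    # far endpoint -> dummy node
nodes, edges = build_network(manholes, pipes)
print(edges)
```

Unlike the per-layer Shapefile view, the resulting node/edge structure makes connectivity queries (paths, disconnected components) direct.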

How to cite: Et-targuy, O., Delenne, C., Begdouri, A., and Benferhat, S.: Graphs as Tools for Wastewater Network Representation: Benefits and Insights, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-8379, https://doi.org/10.5194/egusphere-egu25-8379, 2025.

17:20–17:30 | EGU25-14054 | ECS | On-site presentation
Yulin Xu, Naru Sato, Yoko Ohtomo, and Youhei Kawamura

Acquiring sufficient and reliable data for tunnel construction is challenging due to high costs, data scarcity, and the site-specific nature of geological conditions. This study introduces a Geologically Constrained Conditional Tabular GAN (CTGAN) framework to address these challenges by generating synthetic data that accurately reflects the geological characteristics of tunnels. Traditional approaches often overlook inherent geological variability, leading to synthetic data that lacks real-world relevance, particularly in industrial scenarios where each tunnel or its sections exhibit unique geological environments.

The proposed framework incorporates geological attributes defined by tunneling standards, including Face condition, Compressive strength, Weathering, and Crack/fissure characteristics. These attributes are categorized into levels that represent distinct geological states while maintaining consistency with practical engineering scenarios. A physical constraint module ensures logical relationships among these features, preserving the geological and physical validity of the generated data.

Designed for industrial applications, this approach enables the augmentation of limited real-world data with samples tailored to the geological characteristics of specific tunnels. It addresses data scarcity while avoiding the generation of artificially balanced samples, instead ensuring alignment with naturally occurring geological conditions. Initial results demonstrate that the constrained CTGAN effectively replicates field-observed patterns, providing a valuable tool for improving data-driven methodologies in tunnel construction and monitoring. This research highlights the importance of leveraging domain-specific constraints in generative models, contributing to reliable, context-aware data generation for geotechnical engineering applications.
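The role of a physical constraint module can be illustrated as a post-hoc validity filter over generated tabular samples. The concrete rules and level encodings below are invented for illustration; the actual constraints come from tunneling standards and the authors' framework.

```python
# Each generated sample: attribute -> ordinal level (higher = worse state).
RULES = [
    # illustrative constraint: a heavily weathered face should not carry
    # the highest compressive-strength class
    lambda s: not (s["weathering"] >= 4 and s["compressive_strength"] >= 4),
    # illustrative constraint: dense cracking implies a degraded face condition
    lambda s: s["crack_fissure"] < 4 or s["face_condition"] >= 3,
]

def is_physically_valid(sample):
    """True if the sample satisfies every logical relationship."""
    return all(rule(sample) for rule in RULES)

def filter_generated(samples):
    """Keep only generated rows consistent with the constraint module."""
    return [s for s in samples if is_physically_valid(s)]

generated = [
    {"face_condition": 3, "compressive_strength": 2, "weathering": 4, "crack_fissure": 4},
    {"face_condition": 1, "compressive_strength": 5, "weathering": 5, "crack_fissure": 1},
]
print([is_physically_valid(s) for s in generated])
```

In the full framework such checks can also be enforced during sampling rather than by rejection, but the filter view shows what "geologically constrained" means operationally.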

How to cite: Xu, Y., Sato, N., Ohtomo, Y., and Kawamura, Y.: Geologically Constrained CTGAN for Reliable Prediction of Tunnel Overbreak and Blasting Variables, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-14054, https://doi.org/10.5194/egusphere-egu25-14054, 2025.

17:30–17:45
17:45–18:00

Posters on site: Wed, 30 Apr, 14:00–15:45 | Hall A

The posters scheduled for on-site presentation are only visible in the poster hall in Vienna. If authors uploaded their presentation files, these files are linked from the abstracts below.
Display time: Wed, 30 Apr, 14:00–18:00
A.110 | EGU25-21637 | ECS
Priscillia Labourg, Sébastien Desterck, Romain Guillaume, Jeremy Rohmer, Benjamin Quost, and Stéphane Belbèze

Processing geospatial data requires managing many sources of uncertainty; some appear in classical inference problems, while others are specific to this setting. The goal of this work is to study the management of these uncertainties via standard intervals and sets when the inference model relies on inverse distance weighting, which, together with ordinary kriging, is among the most widely used interpolation methods. We provide a general discussion with examples, together with a study of the optimisation problems induced by the different sources of uncertainty. We conclude with an illustration on a semi-synthetic use case generated from data recorded in real studies.
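One reason inverse distance weighting is convenient for interval-valued data is that its weights are non-negative, so an interval observation propagates to an interval prediction by weighting each bound separately. A minimal NumPy sketch under that observation (the power parameter and data are illustrative, not the paper's case study):

```python
import numpy as np

def idw_interval(x0, sites, lower, upper, power=2.0):
    """Inverse distance weighting with interval-valued observations
    [lower_i, upper_i]. Since IDW weights are non-negative and sum to 1,
    the prediction interval is IDW applied bound-by-bound."""
    d = np.linalg.norm(sites - x0, axis=1)
    if np.any(d == 0):                    # exact hit: return that interval
        i = int(np.argmin(d))
        return float(lower[i]), float(upper[i])
    w = d ** -power
    w /= w.sum()
    return float(w @ lower), float(w @ upper)

sites = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
lower = np.array([1.0, 2.0, 3.0])
upper = np.array([2.0, 3.0, 4.0])
lo, hi = idw_interval(np.array([0.4, 0.4]), sites, lower, upper)
print(round(lo, 3), round(hi, 3))
```

Note that the predicted width `hi - lo` is the weighted average of the observation widths, so here (all widths equal to 1) it is exactly 1.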

How to cite: Labourg, P., Desterck, S., Guillaume, R., Rohmer, J., Quost, B., and Belbèze, S.: Geospatial uncertainties: a focus on intervals and spatial models based on inverse distance weighting, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-21637, https://doi.org/10.5194/egusphere-egu25-21637, 2025.

A.111 | EGU25-18759 | ECS
Mahmoud Hashoush and Emmanuelle Cadot

The effective utilization of data in research is often hindered by inherent challenges, including inconsistency, imprecision, missing information, and redundancy. Data imperfections are a ubiquitous challenge in scientific research, and environmental epidemiology is no exception. Environmental epidemiology relies heavily on high-quality data to establish robust associations between environmental exposures and health outcomes. This work will explore common data imperfections encountered in environmental epidemiology research, focusing on their impact on research findings and presenting strategies for mitigation. Examples from an ongoing project in the Ecuadorian Amazon will be used to illustrate these challenges and solutions. This study aims to investigate links between environmental exposure to gold mining and adverse birth outcomes in communities living in the Ecuadorian Amazon. The present study underscores the substantial ramifications of outcome data imperfections, encompassing imprecision, inconsistency over time, and the existence of missing values. It also addresses exposure data imperfection, which may arise from its unavailability and the challenges associated with its detection, particularly when it comes to illegal mining. Moreover, we will discuss the challenges of integrating these two types of data and the measures that can be taken to mitigate the adverse effects of these shortcomings. We will present our findings and explore potential strategies for addressing these limitations, such as the use of remote sensing and spatial analysis tools. This research emphasizes the critical need for robust data collection and analysis methods to accurately assess environmental health risks and inform effective public health interventions.

How to cite: Hashoush, M. and Cadot, E.: Data Imperfections in Environmental Epidemiology: A Case Study from Ecuadorian Amazon, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-18759, https://doi.org/10.5194/egusphere-egu25-18759, 2025.

A.112 | EGU25-6651
Karim Douch, Peter Naylor, and Peyman Saemian

Quantifying the long-term evolution of the water cycle at the basin scale requires the estimation and integration of time series for various hydrological variables, e.g. precipitation, runoff, groundwater, and soil moisture, to name a few. The availability of Earth observation data, along with advancements in computational modelling and the expansion of in situ data networks, has led to a diverse array of products designed to estimate these variables. As a result, selecting the most appropriate products has become a significant challenge. This challenge is further complicated by the fact that estimates for a given variable can vary considerably across different products due to the inherent complexity of the variable or the uncertainties associated with the measurement process.

This study aims to tap into this wealth of products to provide single estimates of the key basin-scale hydrological variables involved in the water mass balance equation dS/dt=P−E−Q, namely precipitation rate (P), discharge (Q), evaporation rate (E) and terrestrial water storage (S), for the period 1990-2023. The approach is two-fold:

  • To start, various products for P, E, and S are selected and pre-processed. The goal of this pre-processing is to address data gaps and extend certain products back to 1990. This is particularly relevant for water storage time series, as they depend on the GRACE and GRACE-FO missions, which were launched in 2002 and 2018, respectively, and suffer from numerous gaps. To tackle this issue, we jointly process the selected time series using low-rank matrix completion and approximation techniques. The key idea is to exploit the low-rank structure of the time series data matrix to recover the underlying noise- and gap-free matrix. In addition, we analyse the potential benefits of applying this pre-processing to the multi-channel Hankel data matrix in order to take into account the autocorrelation of the signals.
  • The second step combines the pre-processed products by solving a constrained least-squares problem to generate a single estimate for each variable. This approach minimizes water mass balance misclosure while maintaining the non-negativity of discharge (Q≥0) and ensuring that each variable’s final estimate lies within the convex hulls defined by their respective time series products.

We conduct an extensive numerical analysis of the proposed method across 46 basins worldwide, using a selection of five products for precipitation, four for evaporation and four others for terrestrial water storage. Our results demonstrate that a rank-3 or rank-4 matrix strikes a good balance between data fitting and extrapolation, often reducing the average mass balance misclosure. The Hankel structure generally yields more robust and accurate results, although the optimal Hankel parameter and rank are not straightforward to determine and require further investigation. Finally, we validate the merged products by comparing them to independent estimates and assessing improvements in misclosure reduction.
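The joint gap-filling step can be illustrated with a basic iterative truncated-SVD completion (often called "hard impute"): fill the gaps, project to rank r, restore the observed entries, and repeat. This is a simplified sketch of the general technique on synthetic data, not the authors' exact algorithm or its Hankel variant.

```python
import numpy as np

def low_rank_complete(M, rank, n_iter=200):
    """Iterative truncated-SVD completion of a matrix M with NaN gaps."""
    obs = ~np.isnan(M)
    # start by filling gaps with column means
    X = np.where(obs, M, np.nanmean(M, axis=0, keepdims=True))
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # rank-r projection
        X = np.where(obs, M, X_low)                    # keep observed values
    return np.where(obs, M, X_low)

rng = np.random.default_rng(3)
true = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 8))   # exactly rank 3
M = true.copy()
M[rng.random(M.shape) < 0.2] = np.nan                        # 20% gaps
recovered = low_rank_complete(M, rank=3)
print("max abs gap error:", round(float(np.nanmax(np.abs(recovered - true))), 4))
```

In the study's setting the columns would be the different P, E, Q, and S products, and the rank (3 or 4, per the results above) controls the balance between fitting and extrapolation.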

How to cite: Douch, K., Naylor, P., and Saemian, P.: Hydrological data fusion: Joint gap-filling and back reconstruction via low-rank matrix approximation and completion , EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-6651, https://doi.org/10.5194/egusphere-egu25-6651, 2025.

A.113 | EGU25-7012 | ECS
Batoul Haydar, Naneé Chahinian, and Claude Pasquier

In sewer networks, adding a new element involves multiple phases, including planning, installation, and ongoing maintenance. At each stage of the element's lifecycle—whether it is a pipe, a structure, or an apparatus—different stakeholders and experts are involved. Due to variations in data practices, maintaining accurate and standardized data becomes a significant challenge. However, managing these networks requires consistent and reliable data to ensure effective decision-making and operational efficiency.

These data imperfections can stem from various sources, including discrepancies in data collection methods, outdated or incomplete documentation, and human errors during data entry. Additionally, the integration of data from diverse sources, such as GIS systems, maintenance reports, and sensor networks, often leads to inconsistencies and redundancies, complicating data processing and analysis.

For large datasets, which are common in sewer networks, it becomes increasingly difficult to identify and address inconsistencies. To address this, we built an Ontology-Based Data Access (OBDA) system which provides a unified semantic view of the data, facilitating data access and integration. The system consists of a conceptual layer that provides the controlled vocabulary of sewer networks, a data layer where Montpellier Metropole open data is stored in relational databases, and a mapping layer between the two. Through this framework, common inconsistencies were identified in a specific dataset, such as missing node connections, duplicate entries, and conflicting attribute values.
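Once the data sits behind a unified view, the reported inconsistency classes (duplicate entries, missing node connections, conflicting attribute values) reduce to simple queries. A pure-Python sketch over toy records; the field names and records are illustrative, not the Montpellier schema.

```python
from collections import Counter

nodes = [{"id": "N1"}, {"id": "N2"}, {"id": "N2"}, {"id": "N3"}]
pipes = [
    {"id": "P1", "from": "N1", "to": "N2", "diameter_mm": 300},
    {"id": "P2", "from": "N2", "to": "N4", "diameter_mm": 300},   # N4 undefined
    {"id": "P1", "from": "N1", "to": "N2", "diameter_mm": 400},   # conflicting duplicate
]

node_ids = {n["id"] for n in nodes}

# duplicate node entries
duplicates = [i for i, c in Counter(n["id"] for n in nodes).items() if c > 1]

# pipes referencing nodes that do not exist (missing connections)
dangling = [p["id"] for p in pipes
            if p["from"] not in node_ids or p["to"] not in node_ids]

# same pipe id recorded with conflicting attribute values
by_id, conflicts = {}, set()
for p in pipes:
    prev = by_id.setdefault(p["id"], p)
    if prev is not p and prev["diameter_mm"] != p["diameter_mm"]:
        conflicts.add(p["id"])

print(duplicates, dangling, sorted(conflicts))
```

In an OBDA setting these checks would be expressed once against the ontology's vocabulary and translated to SQL through the mapping layer, rather than rewritten per database.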

How to cite: Haydar, B., Chahinian, N., and Pasquier, C.: Addressing Common Inconsistencies in Sewer Networks Data, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7012, https://doi.org/10.5194/egusphere-egu25-7012, 2025.

A.114 | EGU25-16541 | Highlight
Salem Benferhat, Nanee Chahinian, and Carole Delenne
This presentation explores the analysis of heterogeneous geospatial data from various sources through the application of artificial intelligence (AI) tools. Wastewater networks are used as a case study to address challenges such as data completion, multi-source integration, and managing diverse data formats, including Geographic Information Systems (GIS), analog maps, and pipe inspection videos, all derived from real-world data. We will review some solutions developed under the European project STARWARS (STormwAteR and WastewAteR networkS heterogeneous data AI-driven management). These solutions are based on innovative models and tools that employ logical and graph-based representations of heterogeneous data. Specifically, we aim to represent different data types — such as GIS, ITV inspection videos, and maps — as annotated graphs, incorporating the uncertainty stemming from incomplete or inconsistent information.

How to cite: Benferhat, S., Chahinian, N., and Delenne, C.: AI-Driven Analysis of Heterogeneous Wastewater Network Data, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-16541, https://doi.org/10.5194/egusphere-egu25-16541, 2025.

A.115 | EGU25-10892 | ECS
Ikram El miqdadi, Fatima Abouzid, Salem Benferhat, Nanée Chahinian, Carole Delenne, Aicha Alami Hassani, Hicham Ghennioui, and Jamal Kharroubi

Accurate representation of wastewater networks is critical for effective urban infrastructure management. Extracting these networks from low-quality geographical maps presents significant challenges due to incomplete or ambiguous information. So far, we have developed a method for extracting wastewater network structures from geographical maps and representing them as graphs. This method includes detecting key network elements, such as manholes, their identifiers (using Optical Character Recognition, OCR), and the pipelines connecting them. As part of this approach, we developed an efficient algorithm to accurately associate manhole identifiers with their corresponding nodes, achieving acceptable results despite the low quality of the map images. To address the issue of isolated nodes caused by undetected components, we introduced weighted edges in the graph to quantify the likelihood of connections between nodes. This enhancement improved the representation of incomplete graphs. Our current research focuses on two key challenges: creating more complete and reliable graph representations of wastewater networks and detecting arrows that represent the direction of wastewater flow.
Keywords: Wastewater networks, Graphs, Object detection, Geographical Maps.
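The identifier-to-node association step can be pictured as matching OCR'd text boxes to detected manhole centers, closest pairs first. This is an illustrative greedy sketch with invented coordinates and names, not the authors' algorithm.

```python
import math

def associate_ids(text_boxes, node_centers, max_dist=30.0):
    """Greedy association: each OCR'd identifier is assigned to the
    nearest unassigned manhole detection within `max_dist` pixels."""
    assignment, taken = {}, set()
    # enumerate all (identifier, node) pairs, closest first
    pairs = sorted(
        (math.dist(b["center"], c), b["text"], i)
        for b in text_boxes
        for i, c in enumerate(node_centers)
    )
    for d, text, i in pairs:
        if d > max_dist:
            break                      # remaining pairs are farther still
        if text not in assignment and i not in taken:
            assignment[text] = i
            taken.add(i)
    return assignment

boxes = [{"text": "R12", "center": (105, 52)},
         {"text": "R13", "center": (203, 118)}]
centers = [(100, 50), (200, 120), (400, 300)]
print(associate_ids(boxes, centers))
```

Manhole detections left unassigned after this pass are candidates for the weighted "likely connection" edges mentioned above.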


*Ikram El Miqdadi and Fatima Abouzid contributed equally to this work.


ACKNOWLEDGMENT
This research has received support from the European Union’s Horizon research and innovation program under the MSCA (Marie Sklodowska-Curie Actions)-SE (Staff Exchanges) grant agreement 101086252; Call: HORIZON-MSCA-2021-SE-01, Project title: STARWARS (STormwAteR and WastewAteR networkS heterogeneous data AI-driven management). We would like to express our gratitude to "Montpellier Méditerranée Métropole" and "La régie des eaux de Montpellier Méditerranée Métropole" for having provided us with data essential to this research.

How to cite: El miqdadi, I., Abouzid, F., Benferhat, S., Chahinian, N., Delenne, C., Alami Hassani, A., Ghennioui, H., and Kharroubi, J.: Enhancing the Representation of WastewaterNetwork Maps Using Graphs, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-10892, https://doi.org/10.5194/egusphere-egu25-10892, 2025.

A.116 | EGU25-18833
Ti-Hon Nguyen, Carole Delenne, and Minh Thu Tran Nguyen
This presentation addresses the problem of predicting changes in sewer pipeline size from inspection videos. We specifically focus on inspection television (ITV) videos of wastewater pipes, which play a crucial role in the management and maintenance of urban networks. On one hand, they help identify anomalies that may affect the pipes, such as obstructions or degradations. On the other hand, they provide essential information about the structural properties of the pipes and networks, including their diameter and the direction of wastewater flow. We propose a classification algorithm for ITV videos, with a particular focus on detecting diameter changes within the pipes. This task is essential for predictive maintenance and hydraulic modeling of wastewater networks. We build on Video Vision Transformer (ViViT)-based methodologies for video classification, which allow for the effective capture of both spatial and temporal relationships between the different images or frames in the video data. We specifically describe different mechanisms for generating training datasets from a subset of manually annotated images. The experimental study shows promising results on real-world ITV video data.
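One simple mechanism for generating training clips from a small set of manually annotated frames is to label each fixed-length clip with the annotation of the nearest annotated frame. This is an illustrative scheme (frame counts and class names invented), not necessarily the dataset-generation mechanism the authors describe.

```python
def clips_from_annotations(n_frames, clip_len, annotations):
    """Split a video into consecutive clips of `clip_len` frames and
    label each clip by the annotation of the annotated frame closest
    to the clip's center. `annotations` maps frame index -> label
    (e.g. a pipe-diameter class)."""
    ann_frames = sorted(annotations)
    clips = []
    for start in range(0, n_frames - clip_len + 1, clip_len):
        center = start + clip_len // 2
        nearest = min(ann_frames, key=lambda f: abs(f - center))
        clips.append((start, start + clip_len, annotations[nearest]))
    return clips

# frames 0-119; the diameter changes from 300 mm to 400 mm near frame 60
labels = clips_from_annotations(120, 30, {10: "d300", 70: "d400", 110: "d400"})
print(labels)
```

The resulting (clip, label) pairs are exactly what a clip-level classifier such as ViViT consumes, so a handful of annotated frames can supervise many clips.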

How to cite: Nguyen, T.-H., Delenne, C., and Tran Nguyen, M. T.: Predicting Changes in Sewer Pipeline Size from Inspection Videos Using Time Series Models, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-18833, https://doi.org/10.5194/egusphere-egu25-18833, 2025.

A.117 | EGU25-19119
Salem Benferhat, Minh Thu Tran Nguyen, Nanee Chahinian, Carole Delenne, Neda Mashhadi, and Thanh-Nghi Do
In this presentation, we introduce an algorithm for extracting the structure of a wastewater network from a set of sewer inspection videos. This structure is represented as a directed graph of the pipes, automatically constructed from annotations present in the sewer videos. These annotations contain summary information about the inspection process. They include manhole identifiers, direction of inspection, direction of wastewater flow, distance travelled, date of inspection, name of the street where the pipe is located, etc. This graph, where the nodes represent manholes and the directed arcs represent pipes and wastewater flow, will provide valuable data to complement and compare with existing Geographic Information Systems. However, its construction is challenging due to the variable visibility of text in inspection videos, influenced by background brightness and irregular annotation positioning. By leveraging recurring annotations across multiple frames and using fusion strategies as well as regular expressions, we achieve reliable detection of key information such as street names and manhole identifiers, confirmed by experimental results on real wastewater inspection videos.
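The fusion idea, keeping only readings that recur across frames so one-off OCR misreads are discarded, can be sketched with a regular expression and a vote count. The identifier pattern and sample strings are illustrative assumptions, not the authors' annotation format.

```python
import re
from collections import Counter

MANHOLE_ID = re.compile(r"\b[A-Z]{1,3}\d{2,5}\b")   # illustrative pattern

def fuse_ocr(frames_text, min_support=0.5):
    """Keep identifiers read in more than `min_support` of the frames:
    recurring overlay text survives, isolated misreads are dropped."""
    votes = Counter()
    for text in frames_text:
        votes.update(set(MANHOLE_ID.findall(text)))   # one vote per frame
    threshold = min_support * len(frames_text)
    return sorted(i for i, c in votes.items() if c > threshold)

# noisy OCR of the same overlay across consecutive frames
frames = ["from MH123 to MH2O7",   # '0' misread as 'O': rejected by the pattern
          "from MH123 to MH207",
          "fr0m MH123 to MH207"]
print(fuse_ocr(frames))
```

The same recipe extends to the other overlay fields (street names, dates, flow direction), each with its own pattern and vote.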

How to cite: Benferhat, S., Tran Nguyen, M. T., Chahinian, N., Delenne, C., Mashhadi, N., and Do, T.-N.: Exploiting Video Inspection Data in Wastewater Networks, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-19119, https://doi.org/10.5194/egusphere-egu25-19119, 2025.

Additional speaker

  • Salem Benferhat, CNRS, France