EGU26-21181, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-21181
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Monday, 04 May, 16:15–18:00 (CEST), Display time Monday, 04 May, 14:00–18:00
 
Hall X4, X4.131
Automating Data Quality Checks for Heterogeneous Datasets: A scalable approach for IACS data
Yi-Chen Pao and Boineelo Moyo
  • Thünen Institute of Rural Studies, Braunschweig, Germany

The Integrated Administration and Control System (IACS) is a key instrument of the European Union's (EU) Common Agricultural Policy to monitor agricultural subsidies and support evidence-based policy. IACS provides the most comprehensive EU-wide dataset combining detailed geospatial data with thematic attributes on land use, livestock and measures, making it highly valuable for research on agri-environmental policies and agrobiodiversity (Leonhardt et al., 2024). In Germany, these data are collected independently by 14 federal states, resulting in substantial heterogeneity across datasets in terms of file format, encoding, data structure and level of completeness. These inconsistencies present major challenges for efficient data management, scientific assessments, reproducibility and the long-term reuse of the data.

This contribution presents an automated framework, currently under development, designed to standardise and validate raw IACS datasets across our data management pipeline, from data collection and harmonisation to data import and long-term management. Our main goal is to reduce redundancy and manual effort in the data quality check process, while enabling scalable and reproducible data quality assurance. The objective is therefore to develop an optimised, non-redundant data check system that captures structural, semantic and geospatial metadata from heterogeneous datasets using a single-pass folder scan. To achieve this objective, we focus on the following approaches:

  • Develop an inventory-based pipeline architecture: a lightweight inventory object holds metadata for each file in the delivery folder
  • Automate routine and error-prone data quality scripts: replace manual checks with modular, reusable automated components driven by a central inventory system
  • Enable reproducible execution and reporting: implement a Quarto-based framework (an open-source system for reproducible computational documents combining code, results and narrative) that produces human-readable visualisations for technical and non-technical users
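The inventory-building step described above could be sketched in Python roughly as follows. This is a minimal illustration under our own assumptions; `FileRecord` and `build_inventory` are hypothetical names, not the authors' actual implementation:

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path

@dataclass
class FileRecord:
    """Metadata captured for one file during the single-pass scan.
    (Illustrative fields only; a real inventory would also record
    attribute schemas, geospatial extents and identifier patterns.)"""
    path: Path
    suffix: str
    size_bytes: int
    sha256: str

def build_inventory(delivery_folder: str) -> list[FileRecord]:
    """Walk the delivery folder once and record metadata per file,
    so that later checks never have to touch the raw files again."""
    records = []
    for path in sorted(Path(delivery_folder).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        records.append(FileRecord(path, path.suffix.lower(),
                                  path.stat().st_size, digest))
    return records
```

The checksum recorded here would, for instance, support the file integrity verification mentioned later without re-reading the delivery folder.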

Our system leverages a diverse set of programming tools, including R, Quarto, Bash, Python and SQL, spanning the pipeline from data delivery and collection to data management in the database. The approach is based on an inventory-first architecture: a lightweight yet expressive data structure generated from a single scan of the raw input folder, which may contain several data formats. The inventory captures essential metadata for each file, such as file type, attribute schema, geospatial extent, and identifier patterns (e.g., farm identifier, land parcel identifier). A consolidated framework of all data-check scripts then enables subsequent quality-check modules to operate efficiently without repeated file access. Executing this framework performs a range of automated data quality checks, such as file integrity verification, cross-file joinability analysis, schema consistency assessment, and geospatial coherence analysis.
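As one example of a check that runs purely on the captured metadata, cross-file joinability can be assessed by intersecting the recorded attribute schemas with known identifier columns. This is a hedged sketch: the function name, the schema representation (file name mapped to its column set), and the identifier set are our illustrative assumptions, not the authors' API:

```python
def check_joinability(schemas: dict[str, set[str]],
                      id_columns: set[str]) -> dict[tuple[str, str], set[str]]:
    """For each pair of files, report which identifier columns they share.
    Operates only on inventory metadata; the raw files are never re-read."""
    files = sorted(schemas)
    result = {}
    for i, a in enumerate(files):
        for b in files[i + 1:]:
            shared = schemas[a] & schemas[b] & id_columns
            if shared:
                result[(a, b)] = shared
    return result
```

A pair of files with no shared identifier column is simply absent from the result, flagging deliveries where, e.g., parcel and measure tables cannot be linked.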

The resulting output, an interactive Quarto dashboard, provides a comprehensive first assessment of the delivered data: all essential metadata and the errors found in each file can be inspected in one place. This workflow not only minimises the manual effort of checking each file separately and reduces error propagation, but also ensures traceable, documented logs.

Our results show that implementing such automated data checks considerably accelerates harmonisation processes and improves the data management lifecycle.

How to cite: Pao, Y.-C. and Moyo, B.: Automating Data Quality Checks for Heterogeneous Datasets: A scalable approach for IACS data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-21181, https://doi.org/10.5194/egusphere-egu26-21181, 2026.