EGU26-12357, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-12357
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Thursday, 07 May, 11:50–12:00 (CEST)
 
Room -2.33
Using artificial intelligence to automate and expedite the harmonization of environmental data
Tyler Karns1, Cedric Hagen2, Krutika Deshpande3, Michael SanClements2, Christine Laney2, Benjamin Ruddell3, Henry Loescher2, and Tyson Swetnam4
  • 1AS&T, Battelle Memorial Institute, United States of America
  • 2National Ecological Observatory Network, Battelle Memorial Institute, United States of America
  • 3School of Informatics, Computing & Cyber Systems, Northern Arizona University, United States of America
  • 4CyVerse, University of Arizona, United States of America

Data harmonization, the process of unifying disparate datasets into compatible formats and comparable units, is critical for global environmental research but remains prohibitively time-consuming and expensive. Many global environmental datasets could be assembled from existing, available data and could offer transformative insight into pressing environmental issues, but the exhaustive effort required to harmonize these data is currently infeasible within most scientific funding cycles. For example, cross-network studies (such as those between the U.S. National Ecological Observatory Network (NEON), the European Integrated Carbon Observation System (ICOS), and the Australian Terrestrial Ecosystem Research Network (TERN)) require weeks to years of manual schema mapping, unit conversion, alignment, and quality flag standardization for even a small number of data products, with further effort needed before any analysis can begin. Here, we present a large language model (LLM)-based agentic system designed to automate many of these data harmonization steps by leveraging semantic understanding of scientific metadata and documentation. The system ingests raw datasets and metadata, interprets variable semantics within their scientific context, and generates tailored transformation pipelines. We tune this approach using a subset of previously manually harmonized environmental data from NEON, ICOS, and TERN, as well as the South African Environmental Observation Network (SAEON) and the Integrated European Long-Term Ecosystem, Critical Zone and Socio-Ecological Research Infrastructure (eLTER), as part of an effort by the Global Ecosystem Research Infrastructure (GERI) to build globally harmonized ecological drought datasets. Using these harmonized ecological drought datasets from across the globe, we test the efficacy of the LLM-based agentic system by measuring its accuracy, time and labor efficiency, and preservation of data integrity relative to manual data harmonization workflows.
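As a rough illustration of the harmonization steps named above (schema mapping, unit conversion, and quality flag standardization), the following minimal Python sketch applies a declarative mapping to records from two sources. It is not the system described in this abstract: the field names, units, flag vocabularies, and the `harmonize` helper are all hypothetical, and the mapping is written by hand here, whereas an agentic system might generate a comparable specification from dataset metadata.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical unit converters keyed by (source_unit, target_unit).
UNIT_CONVERTERS: Dict[Tuple[str, str], Callable[[float], float]] = {
    ("K", "degC"): lambda v: v - 273.15,
    ("degF", "degC"): lambda v: (v - 32.0) * 5.0 / 9.0,
    ("degC", "degC"): lambda v: v,
}


@dataclass(frozen=True)
class FieldMapping:
    """Maps one source variable onto a harmonized target variable."""
    source_name: str
    target_name: str
    source_unit: str
    target_unit: str


def harmonize(
    records: List[Dict[str, Any]],
    mappings: List[FieldMapping],
    flag_field: str,
    flag_map: Dict[Any, str],
) -> List[Dict[str, Any]]:
    """Rename variables, convert units, and standardize quality flags."""
    harmonized = []
    for record in records:
        row: Dict[str, Any] = {}
        for m in mappings:
            convert = UNIT_CONVERTERS[(m.source_unit, m.target_unit)]
            # Round to suppress floating-point noise from the conversion.
            row[m.target_name] = round(convert(float(record[m.source_name])), 3)
        # Flags outside the source vocabulary are marked rather than dropped.
        row["qc_flag"] = flag_map.get(record[flag_field], "unknown")
        harmonized.append(row)
    return harmonized


# Two invented sources reporting air temperature with different names,
# units, and quality flag conventions (assumptions for illustration only).
rows_a = harmonize(
    records=[{"tempK": 293.15, "qf": 0}, {"tempK": 274.15, "qf": 7}],
    mappings=[FieldMapping("tempK", "air_temp_c", "K", "degC")],
    flag_field="qf",
    flag_map={0: "good", 1: "suspect"},
)
rows_b = harmonize(
    records=[{"TA_F": 68.0, "QC": "ok"}],
    mappings=[FieldMapping("TA_F", "air_temp_c", "degF", "degC")],
    flag_field="QC",
    flag_map={"ok": "good", "bad": "suspect"},
)
```

After this step both sources share one variable name, one unit, and one flag vocabulary, so their rows can be concatenated and compared directly. Keeping the mapping as data rather than code is one possible design choice: it lets a generated mapping be reviewed against a manually harmonized reference before it is applied.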
Pressing global environmental challenges require rapid synthesis of global environmental data. By reducing data harmonization time from months to hours, these artificial intelligence (AI) tools will enable scientists to focus on analysis and modeling rather than data wrangling, ultimately accelerating research in these critical areas of global environmental science.

How to cite: Karns, T., Hagen, C., Deshpande, K., SanClements, M., Laney, C., Ruddell, B., Loescher, H., and Swetnam, T.: Using artificial intelligence to automate and expedite the harmonization of environmental data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12357, https://doi.org/10.5194/egusphere-egu26-12357, 2026.