ITS1.13/NH13.1 | Text-as-data, emerging data sources, and Large Language Models: Transforming Discovery in Geo- and Earth System Sciences
EDI PICO | AGU
Convener: Lina Stein (ECS) | Co-conveners: Jens Klump, Mariana Madruga de Brito (ECS), Ni Li (ECS), Minghua Zhang, Georgia Destouni, Gabriele Messori
PICO: Thu, 01 May, 08:30–12:30 (CEST) | PICO spot 2

PICO presentations are given in a hybrid format supported by a Zoom meeting featuring on-site and virtual presentations. The button to access the Zoom meeting appears just before the time block starts.
Chairpersons: Mariana Madruga de Brito, Jens Klump, Georgia Destouni
08:30–08:35
Use of Large Language Models for data access and analysis
08:35–08:37 | PICO2.1 | EGU25-3038 | ECS | On-site presentation
Mirko Mälicke, Alexander Dolich, and Lucas Reid

Large Language Models (LLMs) have become widespread in only the last few years and are now used in almost every scientific and non-scientific domain. Understanding the opportunities, applications, and limitations of LLMs is crucial for their low-risk, effective, and useful integration into scientific workflows. We demonstrate that their effectiveness is maximized not through autonomous operation but through careful integration with specialized tools and contextual knowledge bases.

Using local deployments of modern LLMs (Qwen2.5-Coder, LLaMA, Mistral) offers a number of benefits in a scientific context. Our approach employs vector embeddings for enhanced context retention and metadata databases for structured data access, enabling guided, context-aware interactions with the LLM. Local deployments allow for improved data handling and privacy, better cost management, and a higher degree of customization. Energy consumption can also be observed and managed more easily, which can be a crucial property of such a system, especially compared to the newest generation of LLMs with their extensive power (and cost) requirements.

Opportunities and limitations are explored through two case studies: (1) an LLM-driven system that queries metadata databases to retrieve data from common open data sources and harmonizes spatio-temporal subsets into data cubes, and (2) a VBA-to-Python code translation project to preserve a legacy selection-system forest management software that was developed in Access/VBA over more than two decades. The LLM's translation process and reasoning are preserved in a vector database for consistent context maintenance, and both the original and the 'new' code are searchable using the LLM to aid in rebuilding a modernized software.

Results suggest that this tool-augmented approach leads to more reliable and maintainable solutions than purely LLM-driven implementations, pointing to a new paradigm for integrating AI into scientific workflows in which LLMs facilitate rather than replace domain-specific tools and human expertise.
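The abstract contains no code; the sketch below illustrates the tool-augmented pattern it describes, with a local model served via Ollama and a SQLite metadata catalogue standing in for the authors' actual stack (the model name, database schema, and file names are our assumptions):

# Minimal sketch of a tool-augmented local LLM query (assumptions: a local
# model served via Ollama and a SQLite metadata catalogue; not the authors' stack).
import json
import sqlite3

import ollama  # pip install ollama; assumes a local Ollama server is running

def lookup_metadata(variable: str) -> list[tuple]:
    """Structured data access: fetch matching dataset entries from a metadata DB."""
    con = sqlite3.connect("metadata.db")  # hypothetical catalogue file
    rows = con.execute(
        "SELECT dataset_id, variable, start_date, end_date FROM datasets "
        "WHERE variable LIKE ?", (f"%{variable}%",)
    ).fetchall()
    con.close()
    return rows

def ask_with_context(question: str, variable: str) -> str:
    """Guided, context-aware interaction: ground the LLM in catalogue hits."""
    context = json.dumps(lookup_metadata(variable))
    response = ollama.chat(
        model="qwen2.5-coder",  # any locally deployed model
        messages=[
            {"role": "system",
             "content": f"Answer using only these catalogue entries: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return response["message"]["content"]

print(ask_with_context("Which datasets cover discharge after 2000?", "discharge"))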

How to cite: Mälicke, M., Dolich, A., and Reid, L.: Augmenting Local LLMs with Specialized Tools for Scientific Workflows, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-3038, https://doi.org/10.5194/egusphere-egu25-3038, 2025.

08:37–08:39 | PICO2.2 | EGU25-18345 | ECS | On-site presentation
Sebastian Willmann, Thomas Ludwig, and Christopher Kadow

As artificial intelligence finds more and more applications in scientific contexts, the question of how to utilize it without sacrificing scientific integrity arises naturally. In this context, FrevaGPT is a novel system that leverages LLMs such as GPT-4o and GPT-4o-mini to enable users to perform advanced analyses. It allows climate datasets to be loaded and analyzed by the LLM and moves the basis of truth to generated code, which can be checked by the user. Its backend was developed and deployed using modern software components (e.g., Rust, Python, Podman), focusing on correctness and reliability. The backend of FrevaGPT and its API are presented, and we discuss how it integrates into the larger Freva ecosystem and the role it plays in improving ad-hoc analyses of climate data. Additionally, a suite of scientific prompts is explored to evaluate the capabilities of GPT-4o and GPT-4o-mini and how they compare in climate data analysis tasks. The prompts differ in difficulty and complexity as well as in the requested output type: from a single number, to a graph, to a plot. This evaluation revealed that while both models demonstrated potential, GPT-4o outperformed GPT-4o-mini in handling more complex tasks involving diverse knowledge domains and programming requirements. GPT-4o-mini exhibited a higher tendency for errors and struggled with issues such as mismatched data dimensions, yet it remained a competitive, cost-effective alternative for simpler tasks. The findings highlight FrevaGPT as a significant step towards integrating advanced AI technologies into the Earth sciences, bridging the gap between computational complexity and accessibility.
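As a rough illustration of how such a prompt suite might be driven, the sketch below sends illustrative prompts to both models through the OpenAI Python client (the prompts and the bare-bones harness are our assumptions, not the FrevaGPT test suite):

# Minimal sketch of a prompt-suite comparison between two models (OpenAI
# Python client; prompt texts and the harness are illustrative assumptions).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPTS = [  # hypothetical examples spanning the requested output types
    "Return the global mean of variable 'tas' in the dataset as a single number.",
    "Write Python code that plots the annual precipitation cycle for Vienna.",
]

def run_suite(model: str) -> list[str]:
    """Collect raw model answers; generated code is the checkable basis of truth."""
    answers = []
    for prompt in PROMPTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(resp.choices[0].message.content)
    return answers

for model in ("gpt-4o", "gpt-4o-mini"):
    print(model, "->", len(run_suite(model)), "answers collected")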

How to cite: Willmann, S., Ludwig, T., and Kadow, C.: Evaluation of GPT-4o and GPT4o-mini for Climate Data Analysis with a novel tool-call software connecting different LLMs with an HPC, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-18345, https://doi.org/10.5194/egusphere-egu25-18345, 2025.

08:39–08:41 | PICO2.3 | EGU25-15507 | ECS | On-site presentation
Christopher Kadow, Jan Saynisch-Wagner, Sebastian Willmann, Simon Lentz, Johanna Baehr, Kevin Sieck, Felix Oertel, Bianca Wentzel, Thomas Ludwig, and Martin Bergemann

The chatbot that writes poems can do climate analysis? Large Language Models (LLMs) promise a paradigm shift as chat-based geoscientific research transformers (chatGRT) by removing technical barriers and empowering scientists to focus on deeper, more innovative inquiries. We introduce FrevaGPT, an LLM-driven "scientific assistant" integrated into Freva, the Free Evaluation System for climate data analysis on high-performance computers. FrevaGPT automatically translates natural-language questions into traceable, editable, and reusable scripts; retrieves relevant data and publications; executes the analyses; and visualizes the results, so the scientist can focus on what matters most: science. By tapping into a wide repository of climate datasets, FrevaGPT ensures transparent, reproducible workflows and lowers the threshold for advanced data handling. Its co-pilot functionality not only delivers answers, tables, and plots, but also proactively suggests next steps, points to relevant climate modes and events, and presents associated scientific findings. Through integrated approaches to model evaluation and observational data comparisons, FrevaGPT accelerates scientific discovery and fosters interdisciplinary collaboration. Real-world use cases highlight FrevaGPT's capacity to guide researchers beyond routine analysis, freeing them to explore innovative questions and deepen their understanding of complex climatic phenomena. As a pioneering application of LLMs in climate science, FrevaGPT illustrates how such tools can fundamentally reshape research processes, unleashing new possibilities for efficiency and creative exploration in the geosciences.

 

How to cite: Kadow, C., Saynisch-Wagner, J., Willmann, S., Lentz, S., Baehr, J., Sieck, K., Oertel, F., Wentzel, B., Ludwig, T., and Bergemann, M.: FrevaGPT: A Large Language Model-Driven Scientific Assistant for Climate Research and Data Analysis, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-15507, https://doi.org/10.5194/egusphere-egu25-15507, 2025.

08:41–08:43 | PICO2.4 | EGU25-20074 | ECS | On-site presentation
Anzhou Li, Zhenyuan Chen, Kewei Zhou, Keyi Yang, Chenxi Yu, Andre Python, Sensen Wu, and Zhenhong Du

Many subfields of the geosciences currently suffer from a long-tail data distribution: while large head databases exist within these fields, they are updated slowly, are few in number, and the few that exist interoperate poorly. More often, data are generated by research groups through experiments and combined with other data collected on the same topic into small datasets that remain hidden in the scientific literature. These chaotic data organization practices result in low utilization rates of new data in the scientific community, hindering the implementation of the FAIR data principles. To help improve the process chain of long-tail data collection and linking in science, we propose GeoDaedalus, a multi-agentic large language model (LLM)-based architecture for on-demand automatic geoscience dataset construction. Starting from the research needs, GeoDaedalus achieves end-to-end automation of the scientific data curation process through a series of steps, including online search, information matching and extraction, and data fusion. To assess the efficiency and accuracy of data extraction in GeoDaedalus, we simulated different use cases, such as those in geochemistry, along with complete human-expert data collection processes, and constructed the first benchmark for evaluating scientific data curation processes: GeoDataBench. Evaluating GeoDaedalus on GeoDataBench with the latest multimodal LLMs suggests better capabilities at lower economic cost, which may set a new baseline for GeoDataBench. We provide a Python API package with an interpretable, fully transparent process-logging module that lets GeoDaedalus users address the highly customized needs of scientific work. Although GeoDaedalus uses geoscience data as a sample, its capabilities, once reorganized, can extend to other scientific fields, marking a solid step towards Open Science for the scientific community.
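The sketch below illustrates the three-stage pipeline named in the abstract (online search, information matching and extraction, data fusion) together with a transparent logging module; all stage bodies are placeholders, not GeoDaedalus internals:

# Minimal sketch of an on-demand dataset-construction pipeline with transparent
# logging (stage names follow the abstract; function bodies are placeholders).
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("geodaedalus-sketch")

def online_search(query: str) -> list[str]:
    log.info("search: %s", query)
    return [f"paper-{i}" for i in range(3)]  # stand-in for literature hits

def match_and_extract(papers: list[str]) -> list[dict]:
    log.info("extracting records from %d papers", len(papers))
    return [{"source": p, "value": 1.0} for p in papers]  # stand-in records

def fuse(records: list[dict]) -> list[dict]:
    log.info("fusing %d records", len(records))
    return records  # deduplication/harmonization would go here

def build_dataset(query: str) -> list[dict]:
    """End-to-end automation: search -> match & extract -> fuse."""
    return fuse(match_and_extract(online_search(query)))

print(build_dataset("basalt trace-element compositions"))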

How to cite: Li, A., Chen, Z., Zhou, K., Yang, K., Yu, C., Python, A., Wu, S., and Du, Z.: GeoDaedalus: Automatic Geoscience Dataset Construction, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-20074, https://doi.org/10.5194/egusphere-egu25-20074, 2025.

08:43–08:45 | PICO2.5 | EGU25-7059 | ECS | On-site presentation
Boris Shapkin, Dmitrii Pantiukhin, Ivan Kuznetsov, Antonia Anna Jost, and Nikolay Koldunov

We present LLM-Enhanced CMIP6 Search, a Python-based tool built with LangChain and LangGraph frameworks that simplifies the discovery of and access to Coupled Model Intercomparison Project Phase 6 (CMIP6) climate data through natural language processing. By combining Large Language Models (LLMs) with retrieval-augmented generation (RAG), our system translates user queries into precise CMIP6 search parameters, bridging the gap between researchers' information needs and CMIP6's structured metadata system. The tool employs a single LLM agent coordinating three specialized tools: a search tool that maps natural language to CMIP6 parameters (such as model, experiment, and variable identifiers), an access tool that both verifies data availability and generates ready-to-use Python code for retrieval, and an adviser tool that helps refine search criteria. To improve search accuracy, we developed a refined database of CMIP6 metadata descriptions, optimizing vector-based similarity matching between user queries and technical CMIP6 terminology, providing a foundation for more intuitive climate data discovery.
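A minimal sketch of the described single-agent, three-tool layout using LangChain's @tool decorator is shown below; the tool bodies, parameter names, and return values are illustrative assumptions rather than the published implementation:

# Minimal sketch of the three-tool agent layout with LangChain's @tool
# decorator (tool bodies and return values are illustrative assumptions).
from langchain_core.tools import tool

@tool
def search_cmip6(query: str) -> dict:
    """Map a natural-language query to CMIP6 facets (model, experiment, variable)."""
    # A vector-similarity lookup against curated metadata descriptions would go here.
    return {"source_id": "MPI-ESM1-2-HR", "experiment_id": "ssp585", "variable_id": "tas"}

@tool
def access_cmip6(facets: str) -> str:
    """Verify data availability and return ready-to-use Python retrieval code."""
    return "import intake; ..."  # placeholder for the generated snippet

@tool
def advise(query: str) -> str:
    """Suggest refinements when a query is ambiguous or over-constrained."""
    return "Try adding an experiment, e.g. 'historical' or 'ssp245'."

tools = [search_cmip6, access_cmip6, advise]  # would be bound to one LLM agent
print(search_cmip6.invoke("monthly near-surface temperature, worst-case scenario"))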

How to cite: Shapkin, B., Pantiukhin, D., Kuznetsov, I., Jost, A. A., and Koldunov, N.: LLM-Enhanced CMIP6 Search, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7059, https://doi.org/10.5194/egusphere-egu25-7059, 2025.

08:45–08:47 | PICO2.6 | EGU25-13656 | ECS | On-site presentation
Dmitrii Pantiukhin, Boris Shapkin, Ivan Kuznetsov, Antonia Anna Jost, Thomas Jung, and Nikolay Koldunov

PANGAEA GPT is a Large Language Model (LLM) multi-agent framework that aims to streamline geoscientists' work with the diverse Earth system datasets held in the PANGAEA archive (pangaea.de), a widely used data repository in the Earth and environmental sciences. Built on top of the LangChain library and the LangGraph framework, it uses a multi-agent collaboration approach with a centralized supervisor agent that interprets incoming user queries and then coordinates specialized agents according to task requirements. These specialized agents include the Search Agent, which performs data lookups via API requests to PANGAEA and locates related publications via Crossref (to answer questions about what has been published based on a particular dataset). They also include an orchestra of Data Agents configured in different modes, such as "oceanographer," "ecologist," or "geologist," to perform dataset-specific analyses. Each Data Agent operates within a dedicated Python environment that allows for code manipulation, data analysis, visualization, and iterative refinement of results. The Supervisor Agent then aggregates the output from these Data Agents and delivers a consolidated response back to the user (including the generated analysis scripts). The current framework has been shown to excel at providing lists of relevant datasets, locating related publications, and performing statistical analyses upon user request, greatly simplifying data discovery and use for geoscientists. In addition to the rapid discovery, analysis, and visualization of heterogeneous datasets, a particularly valuable end goal of PANGAEA GPT is to generate concise documentation for historical or underutilized datasets that currently lack related publications, ensuring that their valuable information endures and drives further scientific discoveries.
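The supervisor pattern described above can be sketched in plain Python as below (a toy illustration of the routing idea, not the authors' LangGraph code; the agent names and routing rule are assumptions):

# Minimal sketch of supervisor-style routing between specialized agents
# (plain-Python illustration of the pattern; not the authors' LangGraph code).
from typing import Callable

def search_agent(query: str) -> str:
    return f"[search] datasets matching '{query}'"  # PANGAEA API call would go here

def data_agent_oceanographer(query: str) -> str:
    return f"[ocean] analysis of '{query}'"  # runs code in a dedicated sandbox

AGENTS: dict[str, Callable[[str], str]] = {
    "search": search_agent,
    "oceanographer": data_agent_oceanographer,
}

def supervisor(query: str) -> str:
    """Interpret the query, dispatch to a specialist, aggregate the reply."""
    role = "search" if "find" in query.lower() else "oceanographer"  # toy routing
    return AGENTS[role](query)

print(supervisor("Find CTD profiles from the Fram Strait"))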

How to cite: Pantiukhin, D., Shapkin, B., Kuznetsov, I., Jost, A. A., Jung, T., and Koldunov, N.: PANGAEA GPT: A Coordinated Multi-Agent Architecture for Earth System Data Discovery and Analysis, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-13656, https://doi.org/10.5194/egusphere-egu25-13656, 2025.

08:47–08:49 | PICO2.7 | EGU25-8220 | On-site presentation
Ivan Kuznetsov, Antonia Anna Jost, Dmitrii Pantiukhin, Boris Shapkin, Maqsood Mubarak Rajput, Thomas Jung, and Nikolay Koldunov

ClimSight is an innovative open-source climate information system that integrates large language models (LLMs) with geographical and climate data to provide climate information to everyone, everywhere. This description builds upon the original paper [1] by presenting the system’s recent developments and updated methodologies. By leveraging high-resolution data, including local conditions and climate projections, combined with retrieval-augmented generation systems (based on climate reports, scientific literature, and other sources), and an agent-based architecture, ClimSight addresses the limitations of general-purpose LLMs in climate data analysis, ensuring accurate, reliable, and reproducible outputs. This presentation details the enhanced methodologies employed in ClimSight to deliver climate assessments for specific locations and activities. The system utilizes the LangGraph and LangChain packages to manage agents and LLM calls, providing flexibility in selecting different LLM models, with current implementations relying on OpenAI’s models. The effectiveness of ClimSight is demonstrated through selected examples and evaluations, highlighting its potential to democratize access to localized climate information.

[1] Koldunov, N., Jung, T. Local climate services for all, courtesy of large language models. Commun Earth Environ 5, 13 (2024). https://doi.org/10.1038/s43247-023-01199-1

 

How to cite: Kuznetsov, I., Jost, A. A., Pantiukhin, D., Shapkin, B., Rajput, M. M., Jung, T., and Koldunov, N.: ClimSight: Leveraging LLMs for Revolutionizing Climate Services, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-8220, https://doi.org/10.5194/egusphere-egu25-8220, 2025.

08:49–08:51 | PICO2.8 | EGU25-18732 | On-site presentation
Aurélie Montarnal, Cécile Gracianne, Gaëtan Caillaut, Alexandre Sabouni, Anouck Adrot, Sylvain Chave, Loïc Rigart, Farid Faï, and Samuel Auclair

The increasing availability of social media data offers valuable opportunities for real-time crisis monitoring and disaster management. However, extracting actionable insights from these unstructured, multilingual, and often ambiguous data sources remains a significant challenge, particularly in non-English contexts. Here, natural language processing (NLP) and machine learning techniques are key tools for automating data extraction and enhancing the situational awareness of crisis managers, particularly during flash floods and earthquakes.

In crisis management, rapidly processing and transforming unstructured social media data into actionable information is essential for effective decision-making. While the literature highlights the value of social media for improving the situational awareness of decision-makers, extracting relevant information remains resource-intensive, especially for most French crisis management units, which lack the necessary tools and resources. Although several systems exist for automatically extracting information from social media, few of them handle the French language. One of the main challenges with social media data lies in its inherent ambiguity, including semantic variability (context-dependent meanings of words and idioms), informal language (abbreviations, typos, emojis, and neologisms), entity ambiguity (e.g., locations or organizations with identical names), and a high proportion of noisy or irrelevant content.

The French ReSoCIO project addresses these challenges by bringing together experts in Earth sciences, AI, and social sciences with specialists and software developers in risk management and forecasting to develop a novel approach to social-media data disambiguation for geospatial visualization of crisis situations. This study introduces an innovative pipeline that combines filtering, entity linking, and geolocation integration to enhance data disambiguation, tailored for real-time predictions. The pipeline first employs a supervised classifier to filter out unrelated tweets. Relevant messages are then processed through an entity-linking module, where detected entities are disambiguated by matching them with Wikidata entries. This process leverages embeddings from Wikipedia and compares them with tweet embeddings computed using CamemBERT, enriching the extracted data with contextual and geospatial information. The final step employs large language models (LLMs) to summarize and link the extracted information, ensuring that stakeholders receive concise and accurate overviews validated against structured event reports. By characterizing and predicting the impacts and damages of crisis events, this pipeline provides a robust framework for transforming fragmented online data into structured, actionable knowledge.
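For illustration, the sketch below shows the embedding-comparison step of such an entity-linking module: mean-pooled CamemBERT embeddings rank candidate Wikidata descriptions by cosine similarity against a tweet (the tweet and candidate texts are invented stand-ins):

# Minimal sketch of embedding-based entity disambiguation with CamemBERT
# (Hugging Face transformers; candidate texts are illustrative stand-ins).
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModel.from_pretrained("camembert-base")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled CamemBERT sentence embedding."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

tweet = "Inondations à Trèbes, la route est coupée"
candidates = {  # stand-ins for Wikidata entry descriptions
    "Q22978": "Trèbes, commune de l'Aude, France",
    "Q999999": "Trèbes (rivière), description fictive d'un homonyme",
}
tweet_vec = embed(tweet)
best = max(candidates, key=lambda q: torch.cosine_similarity(
    tweet_vec, embed(candidates[q]), dim=0))
print("linked entity:", best)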

The system's performance aligns with state-of-the-art models, effectively identifying entities that correspond to the spatiotemporal patterns of actual natural disasters. This suggests the system's potential for enhancing the situational awareness of crisis managers by providing timely and accurate geolocated information extracted from social media posts, and experimental observations conducted during the ReSoCIO project confirm the contribution of this disambiguation pipeline for French crisis managers.

How to cite: Montarnal, A., Gracianne, C., Caillaut, G., Sabouni, A., Adrot, A., Chave, S., Rigart, L., Faï, F., and Auclair, S.: ReSoCIO: Towards geospatial visualization of Social Media Data by AI-driven Disambiguation. Application  to Crisis Management in the French Context., EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-18732, https://doi.org/10.5194/egusphere-egu25-18732, 2025.

08:51–08:53 | EGU25-16268 | ECS | Virtual presentation
Alberto Previati, Valerio Silvestri, and Giovanni Crosta

The advent of extensive digital datasets, coupled with advancements in artificial intelligence (AI), is revolutionizing our ability to extract meaningful insights from complex patterns in the natural sciences. In this context, the targeted classification of textual descriptions, particularly those detailing the granulometry of unconsolidated sediments or the fracturing state of rock masses, by combining supervised deep learning and natural language processing (NLP), is a promising method for refining large-scale geological and hydrogeological models by enriching them with increased data volume.

Several databases are replete with qualitative geological data such as borehole logs, which, while abundant, are not readily assimilated into quantitative hydrogeological modeling due to the extensive time required to process the written descriptions into operationally significant units like hydrofacies. This conversion typically necessitates expert analysis of each report but can be expedited through the application of NLP techniques rooted in AI.

The primary objectives of this research are twofold: (i) to develop a robust classification model that leverages geological descriptions alongside grain size data, and (ii) to standardize a vast array of sparse and heterogeneous stratigraphic log data for integration into large-scale hydrogeological applications.

The Po River alluvial plain in northern Italy (45,700 km²) serves as the pilot area for this study due to its homogeneous shallow subsurface geology, dense borehole coverage, and the availability of a pre-labelled training set. This research demonstrates the conversion of qualitative geological information from a very large dataset of stratigraphic logs (encompassing 387,297 text descriptions from 39,265 boreholes) into a dataset of semi-quantitative information. This transformation, primed for hydrogeological modeling, is facilitated by an operational classification system that uses a deep learning-based NLP algorithm to categorize complex geological and lithostratigraphic text descriptions according to grain size-based hydrofacies. A supervised text classification algorithm, founded on a Long Short-Term Memory (LSTM) architecture, was meticulously developed, trained, and validated using 86,611 pre-labelled entries encompassing all sediment types within the study region. The word-embedding technique enhanced the model accuracy and learning efficiency by quantifying the semantic distances among geological terms.

The outcome of this work is a novel dataset of semi-quantitative hydrogeological information, boasting a classification model accuracy of 97.4%. This dataset was incorporated into expansive modeling frameworks, enabling the assignment of hydrogeological parameters based on grain size data, integrating the uncertainty stemming from misclassification. This has markedly increased the spatial density of available information from 0.34 data points/km² to 8.7 data points/km². The study findings align closely with the existing literature, offering a robust spatial reconstruction of hydrofacies at different scales. This has significant implications for groundwater research, particularly in the realm of quantitative modeling at a regional scale.
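As a minimal sketch of the kind of classifier described above, the snippet below wires a word-embedding layer and an LSTM into a Keras text classifier (the toy descriptions, binary label scheme, and layer sizes are our assumptions; the study used multi-class hydrofacies labels and a far larger training set):

# Minimal sketch of an LSTM text classifier for borehole-log descriptions
# (Keras; the strings, binary labels, and sizes are illustrative).
import tensorflow as tf

texts = ["sabbia fine con ghiaia", "argilla limosa compatta"]  # stand-in logs
labels = [1.0, 0.0]  # e.g. 1 = coarse-grained hydrofacies, 0 = fine-grained

vectorize = tf.keras.layers.TextVectorization(max_tokens=5000,
                                              output_sequence_length=16)
vectorize.adapt(texts)

model = tf.keras.Sequential([
    vectorize,                                                 # text -> token ids
    tf.keras.layers.Embedding(input_dim=5000, output_dim=32),  # word embeddings
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),            # hydrofacies label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(tf.constant(texts), tf.constant(labels), epochs=2, verbose=0)
print(model.predict(tf.constant(["ghiaia con sabbia"]), verbose=0))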

How to cite: Previati, A., Silvestri, V., and Crosta, G.: Leveraging Deep Learning and Natural Language Processing for hydrogeological insights from borehole logs, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-16268, https://doi.org/10.5194/egusphere-egu25-16268, 2025.

08:53–08:55 | PICO2.10 | EGU25-6762 | ECS | On-site presentation
Lina Stein, Birgit M. Pfitzmann, S. Karthik Mukkavilli, Ugur Ozturk, Peter W. J. Staar, Cesar Berrospi, Thomas Brunschwiler, and Thorsten Wagener

A natural hazard event that heavily impacts a society may trigger a wave of post-disaster research analysis that looks into the cause of the disaster, the types of impact, or lessons learned to prevent similar events in the future. In short, post-disaster research contains valuable knowledge that should be utilized in disaster risk management. However, over the past 70 years, the scientific community has published around 600,000 articles on hydro-hazards such as floods, droughts, and landslides. Finding articles that describe specific disaster events and synthesizing their knowledge is no longer humanly possible due to the near-exponentially increasing number of publications. Recent advancements in large language models, however, allow the analysis and extraction of the disaster events described in the scientific literature.

Here we make use of the Wealth over Woe scientific abstract dataset (Stein et al. 2024), whose abstracts were automatically annotated for hydro-hazards and geolocation. It allows us to track publication trends and to identify disaster events that triggered a wave of new research. We additionally use the large language model Llama 70B to extract the specific hazard events mentioned in each abstract (e.g., the 2003 summer drought in Europe, the 2010 Pakistan flood, the 2002 Elbe flood) as well as other details described surrounding each event.

While we know that hydro-hazard research is biased against low-income countries, exceptional disaster events can shift research priorities for several years, and the additional funding can support valuable local post-disaster research. Named-event recognition can therefore help us answer questions such as: What kinds of hydro-hazards are studied in detail, and where? What are the key research foci of post-disaster analysis? And are there regional differences in these answers?
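The event-extraction step could look roughly like the sketch below, which prompts a locally served Llama model through the Ollama Python client for structured JSON (the model tag, prompt, and output schema are assumptions, not the authors' setup):

# Minimal sketch: prompt a locally served Llama model for structured event
# extraction (Ollama client; model tag, prompt, and schema are assumptions).
import json

import ollama  # assumes a local Ollama server with a Llama model pulled

abstract = ("The 2002 Elbe flood caused widespread damage in Saxony; "
            "we analyse its precipitation drivers and levee failures.")

response = ollama.chat(
    model="llama3.1:70b",
    messages=[{
        "role": "user",
        "content": ("Extract named hazard events from the abstract below as a "
                    'JSON list [{"event": str, "year": int, "hazard": str}]. '
                    "Reply with JSON only.\n\n" + abstract),
    }],
)
# In practice the reply may need cleaning before parsing.
print(json.loads(response["message"]["content"]))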

How to cite: Stein, L., Pfitzmann, B. M., Mukkavilli, S. K., Ozturk, U., Staar, P. W. J., Berrospi, C., Brunschwiler, T., and Wagener, T.: Automated disaster event extraction to understand lessons learned: A large-scale text analysis on the scientific literature of floods, droughts, and landslides. , EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-6762, https://doi.org/10.5194/egusphere-egu25-6762, 2025.

08:55–10:15
Coffee break
Chairpersons: Lina Stein, Gabriele Messori, Ni Li
Text and other emerging data sources
10:45–10:47 | PICO2.1 | EGU25-3179 | ECS | On-site presentation
Jan Sodoge, Taís Maria Nunes Carvalho, and Mariana Madruga de Brito

An increasing volume of abstracts across the geosciences is presented annually at the EGU General Assembly (GA). To manage thousands of abstracts, the conference is structured into divisions, thematic sessions, and individual sessions. However, creating rigid organizational boundaries that separate research runs counter to the commonly demanded interdisciplinarity: researchers may be exposed only to ideas within their peer group, reinforcing existing perspectives. Such phenomena of filter bubbles and selective exposure to information have been observed in various contexts to limit creativity and innovation, yet they persist and remain underexplored in the context of large scientific conferences like the EGU GA.

In this contribution, we demonstrate how natural language processing can help break up scientific silos and encourage interdisciplinary interaction at the EGU GA. We use sentence embeddings (SBERT) to evaluate the semantic similarity between scientific abstracts and identify closely related ones. We analyzed 5,000 randomly selected abstracts per EGU GA, identifying the 10 most similar abstracts for each. The results show that participants who focus exclusively on abstracts within their thematic session potentially overlook 44% of the ten contributions most relevant to their research, underscoring the risk of missed interdisciplinary connections. Beyond these findings, we outline existing projects and plans for improving the conference experience and making geoscience research more interdisciplinary.
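The similarity analysis described above can be sketched with the sentence-transformers library as below (the model choice, toy abstracts, and top_k are illustrative; the study embedded 5,000 abstracts per assembly and used the ten nearest neighbours):

# Minimal sketch of finding the most similar abstracts with SBERT
# (sentence-transformers; model and example texts are illustrative).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

abstracts = [
    "Flood risk mapping with machine learning in alpine catchments.",
    "LLM-based extraction of landslide events from news reports.",
    "Sentinel-2 snow cover retrieval using random forests.",
]
embeddings = model.encode(abstracts, convert_to_tensor=True)

# For each abstract, retrieve the top-k semantically closest others.
hits = util.semantic_search(embeddings, embeddings, top_k=2)  # k=10 in the study
for i, ranked in enumerate(hits):
    neighbours = [h["corpus_id"] for h in ranked if h["corpus_id"] != i]
    print(f"abstract {i} -> most similar: {neighbours}")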

How to cite: Sodoge, J., Carvalho, T. M. N., and de Brito, M. M.: Encouraging interdisciplinary connections at EGU through text mining , EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-3179, https://doi.org/10.5194/egusphere-egu25-3179, 2025.

10:47–10:49 | PICO2.2 | EGU25-7690 | On-site presentation
Jens Klump, John Hille, Magda Guglielmo, and Brint Gardner

Among generative Artificial Intelligence (AI) systems, Retrieval Augmented Generation (RAG) has attracted a lot of attention for its ability to support natural language queries over large text corpora with the help of Large Language Models (LLMs). In a pilot project, we explored RAG and LLM fine-tuning as tools for exploring the abstracts of the EGU General Assembly as a text corpus.

To ingest the text corpus, we built a processing pipeline to convert the abstract corpus from XML to JSON in a structure that would make it easy to import the data into a vector storage system. For additional context, we added the association of an abstract with the scientific divisions of the EGU, including co-organisation between two or more divisions. This information was not available at the time of this project and had to be scraped from archived versions of the conference online programme.

The RAG system is designed to read various model formats, such as GGUF, GPTQ, and Transformers models. It also integrates with a vector storage solution to read and use conference abstracts to provide enriched responses. Its implementation uses Apptainer for containerised execution.

The first responses from the RAG system to natural language queries produced promising results. The inclusion of links to the source materials allowed us to compare the query response with the information in the source materials. However, evaluating generative AI models is not trivial since one query can produce multiple results. Using a well-understood text corpus and being able to trace the probable origin texts of the results allows us to evaluate the quality of the results and better understand the origin of deficient RAG responses.
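As a minimal sketch of the ingestion-and-query step under stated assumptions (ChromaDB as the vector store, an invented JSON record layout), the snippet below loads abstracts with their division metadata and runs a natural-language query against them:

# Minimal sketch of ingesting abstracts into a vector store and querying it
# (ChromaDB; the JSON structure and query are illustrative assumptions).
import chromadb

records = [  # stand-ins for the XML-to-JSON converted abstracts
    {"id": "EGU25-0001", "text": "Drought propagation in karst aquifers ...",
     "division": "HS"},
    {"id": "EGU25-0002", "text": "RAG over geoscience abstracts ...",
     "division": "ESSI"},
]

client = chromadb.Client()
collection = client.create_collection(name="egu_abstracts")
collection.add(
    ids=[r["id"] for r in records],
    documents=[r["text"] for r in records],
    metadatas=[{"division": r["division"]} for r in records],  # extra context
)

result = collection.query(query_texts=["groundwater drought"], n_results=1)
print(result["ids"], result["distances"])  # traceable sources for each answer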

How to cite: Klump, J., Hille, J., Guglielmo, M., and Gardner, B.: Building a RAG system for querying a large corpus of conference abstracts, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7690, https://doi.org/10.5194/egusphere-egu25-7690, 2025.

10:49–10:51 | PICO2.3 | EGU25-578 | ECS | Highlight | On-site presentation
Livia Stein Freitas, Theo Carr, Tessa Giacoppo, Timothy Walker, and Caroline Ummenhofer

During oceanic expeditions, pre-modern sailors meticulously recorded information about their longitude and latitude, the local wind conditions, and the state of the sea. For a long time, prior to precision instrumentation, sailors provided qualitative rather than quantitative recordings of wind speed (e.g., "light breeze" instead of 5 m/s). For that reason, this textual data requires additional processing before it can be compared with modern instrumental data or reanalysis products. In particular, the phrases used in wind descriptions can be classified using the Beaufort Wind Force Scale (codified in 1805), which consists of thirteen base wind force levels, each assigned a numerical value. Manually categorizing all the distinct and unique variations of the wind information can be ambiguous and time-consuming. Because of historical weather data's importance for climate science, we investigated whether machine learning could speed up this process while producing accurate results.

Using a novel dataset of >100,000 (sub)daily maritime weather recordings from historical whaling ship logbooks housed across New England archives and covering the period 1820–1890, we show that k-means nearest-neighbour and density-based spatial clustering models, while efficient, generate outputs with reduced accuracy compared to data classified by humans. However, there is a noticeable improvement in the quality of the clustering when we introduce the Beaufort Wind Force Scale's thirteen categories as starting centroids. These results show that machine learning can be a useful tool for wind-term processing and that well-placed human input improves the accuracy of outcomes. Cross-validation methods are therefore employed to aid the interpretability of the machine models utilized. Additionally, various neural network clustering models are evaluated for their efficacy, such as a two-sliding-windows text GNN-based (TSW-GNN) model, whose graph-based approach has demonstrated improved accuracy in classifying textual data compared to language representation models.
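The seeded-clustering idea can be sketched with scikit-learn as below: the thirteen Beaufort descriptors are vectorized and passed to KMeans as starting centroids (the TF-IDF features and logbook phrases are our stand-ins, not the study's preprocessing):

# Minimal sketch of seeding k-means with the thirteen Beaufort descriptors as
# starting centroids (scikit-learn; features and phrases are stand-ins).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

beaufort = ["calm", "light air", "light breeze", "gentle breeze",
            "moderate breeze", "fresh breeze", "strong breeze", "near gale",
            "gale", "strong gale", "storm", "violent storm", "hurricane"]
logbook = ["fresh breezes", "light airs", "heavy gales", "calm", "squally",
           "moderate breezes", "strong gales", "light winds", "hard gales",
           "gentle breezes", "brisk gales", "light breezes", "fresh gales",
           "violent squalls"]

vec = TfidfVectorizer().fit(beaufort + logbook)
seeds = vec.transform(beaufort).toarray()   # one starting centroid per force 0-12
X = vec.transform(logbook).toarray()

km = KMeans(n_clusters=13, init=seeds, n_init=1).fit(X)
print(dict(zip(logbook, km.labels_)))       # cluster index ~ Beaufort force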

How to cite: Stein Freitas, L., Carr, T., Giacoppo, T., Walker, T., and Ummenhofer, C.: “Old Texts, New Tech, Better Theory”: Applying Machine Learning to Textual Weather Data from Historical Ship Logbooks , EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-578, https://doi.org/10.5194/egusphere-egu25-578, 2025.

10:51–10:53 | PICO2.4 | EGU25-6548 | ECS | On-site presentation
Bram Valkenborg, Olivier Dewitte, and Benoît Smets

Efforts worldwide aim to collect detailed information on the spatial and temporal distribution of natural hazards to improve our understanding of their occurrence and ultimately prevent their impacts. However, data on the location, timing, and impact of hazards remain scarce in many regions, even the most exposed ones. Data collection methods are usually framed around earth observation approaches, sometimes combined with citizen science. Such approaches can be time-consuming and resource-intensive, and may fall short of data needs, especially at large scales. Combining these methods with complementary approaches could better address these challenges. We introduce a multilingual tool that uses natural language processing techniques to extract information on geo-hydrological hazards from online news articles. The tool was developed in a worldwide application in which we processed ~5.8 million articles published between 2017 and 2023 across 58 languages. The articles were extracted from GDELT (Global Database of Events, Language, and Tone), a global database monitoring events through online news articles. Using large language models, the tool analyzes articles at the paragraph level in three major steps: (1) filtering paragraphs for relevance, (2) extracting information on location (down to street level), timing, and impact, and (3) clustering the information into events. This multilingual approach enabled the tool to extract and analyze 12,438 flood events, 1,312 landslide events, and 1,086 flash flood events globally for 2023 alone, providing ~20 times more data than current disaster databases and improving coverage worldwide. In regions such as South and Central America, Europe, and Asia, where English is not the primary reporting language, non-English texts were the most important source of information, especially in South and Central America, where non-English (primarily Spanish and Portuguese) paragraphs outnumbered English paragraphs by a factor of five. The proposed tool provides a new way to extract an unprecedented level of data on geo-hydrological hazards, forming a complementary source of information to existing methodologies. Beyond geo-hydrological hazards, the tool can be used to document other hazards, including earthquakes, wildfires, and volcanic activity. In addition, with this specific application, we provide a new extensive global dataset on impactful geo-hydrological hazards, which offers new opportunities for improving our understanding of these processes and their impact at continental to global scales.
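Step (3) of the pipeline, clustering paragraph-level mentions into events, can be sketched as below (the greedy rule and the distance/time thresholds are illustrative assumptions, not the tool's actual logic):

# Minimal sketch of clustering extracted mentions into events by shared
# hazard type, nearby location, and date (thresholds are illustrative).
from datetime import date

mentions = [  # stand-ins for step-(2) output from individual paragraphs
    {"hazard": "flood", "lat": 50.9, "lon": 4.4, "day": date(2023, 6, 1)},
    {"hazard": "flood", "lat": 50.8, "lon": 4.5, "day": date(2023, 6, 2)},
    {"hazard": "landslide", "lat": -1.9, "lon": 30.1, "day": date(2023, 6, 1)},
]

def same_event(a: dict, b: dict) -> bool:
    close = abs(a["lat"] - b["lat"]) < 0.5 and abs(a["lon"] - b["lon"]) < 0.5
    near_in_time = abs((a["day"] - b["day"]).days) <= 3
    return a["hazard"] == b["hazard"] and close and near_in_time

events: list[list[dict]] = []
for m in mentions:  # greedy single-pass clustering
    for ev in events:
        if same_event(ev[0], m):
            ev.append(m)
            break
    else:
        events.append([m])
print(f"{len(events)} events from {len(mentions)} mentions")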

How to cite: Valkenborg, B., Dewitte, O., and Smets, B.: A multilingual tool for the documentation of impactful geo-hydrological hazards using online news articles: a worldwide application, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-6548, https://doi.org/10.5194/egusphere-egu25-6548, 2025.

10:53–10:55 | PICO2.5 | EGU25-19995 | ECS | On-site presentation
Applying AI to Classify Hazards from Text (withdrawn)
Hamish Patten, Fiona Johnson, and Alexander Bayer
10:55–10:57 | PICO2.6 | EGU25-15719 | ECS | On-site presentation
Carlo Guzzon, Raül Marcos Matamoros, Dimitri Marinelli, Montserrat Llasat-Botija, and Maria Carmen Llasat-Botija

Spain and the Mediterranean coast are largely affected by flash floods, which are generated by intense, localized storms within smaller basins (Gaume et al., 2016). In Spain, floods are the country's primary recurring natural disaster, accounting for nearly 70% of the compensation amounts issued by the Consorcio de Compensación de Seguros (CCS, 2021). Improving early warning systems is crucial to reducing the risks associated with floods, and comprehensive, up-to-date databases of past flood events are essential tools for developing such systems.

This study presents the implementation of an AI-based text-mining tool designed to automate the creation and updating of flood event databases using information extracted from newspapers. This tool is tailored to enhance and expand INUNGAMA, an impact database of flood events in the Catalonia region (Barnolas and Llasat, 2007), by extracting data from ‘La Vanguardia’, a major Catalan newspaper. The text-mining tool involves several steps, starting with the retrieval of potentially relevant news through keyword-based queries on the newspaper’s online archive. To eliminate irrelevant news, a natural language processing (NLP) model filters the initial dataset. Impact data of flood events are extracted by analyzing the newspaper text with an advanced NLP model; the extracted information is saved in a machine-readable and consistent format. Finally, the tool integrates the extracted data with the pre-existing INUNGAMA database, either by merging new information with existing events or by creating entries for previously undocumented events.

The tool was calibrated and tested using the INUNGAMA database. Its ability to download and filter relevant articles was assessed over six non-consecutive months, demonstrating excellent performance in identifying and distinguishing flood events. Furthermore, the AI model exhibited high accuracy in extracting impact data from the text when tested over one year of newspaper data.

The proposed AI-based tool offers a powerful solution for automating the creation and updating of flood impact databases, providing a solid foundation for developing early warning systems aimed at risk reduction. The text-mining tool is designed to complete the INUNGAMA database and to update it up to the present. Moreover, it can be adapted for creating new databases in other regions using different newspaper sources.
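The final integration step could be sketched with pandas as below, enriching known INUNGAMA events and flagging undocumented ones (the column names and exact matching rule are our assumptions):

# Minimal sketch of merging newly extracted flood records into an existing
# event database (pandas; column names and matching rule are assumptions).
import pandas as pd

existing = pd.DataFrame(
    [{"event_id": 101, "date": "2020-01-21", "municipality": "Girona"}]
)
extracted = pd.DataFrame([
    {"date": "2020-01-21", "municipality": "Girona", "damage": "roads flooded"},
    {"date": "2021-09-01", "municipality": "Lleida", "damage": "basements flooded"},
])

merged = extracted.merge(existing, on=["date", "municipality"], how="left")
updates = merged[merged["event_id"].notna()]    # enrich known events
new_events = merged[merged["event_id"].isna()]  # create new entries
print(len(updates), "updates;", len(new_events), "new events")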

 

This research has been carried out in the framework of the Flood2Now project, Grant PLEC2022-009403 funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR, and the I-CHANGE project from the European Union's Horizon 2020 Research and Innovation programme under grant agreement 101037193.

How to cite: Guzzon, C., Marcos Matamoros, R., Marinelli, D., Llasat-Botija, M., and Llasat-Botija, M. C.: An AI-Based Text-Mining Tool for flood impact data extraction from newspaper information, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-15719, https://doi.org/10.5194/egusphere-egu25-15719, 2025.

10:57–10:59 | PICO2.7 | EGU25-14874 | ECS | On-site presentation
Taeyong Kim and Minjune Yang

Since the mid-20th century, geology in South Korea has expanded rapidly, driven by interdisciplinary research. This study explores the key themes and historical trends of geological research in South Korea, analyzing interconnections among topics using a dataset of 10,380 research publications from 10 geological journals (1964–2024). Latent Dirichlet Allocation (LDA) identified 18 distinct topics, categorized into emerging (n = 10), classic (n = 3), and stable (n = 5) topics. Additionally, the scope of the research topics was analyzed, revealing broad (n = 14) and narrow (n = 4) topics. Topics were grouped into four clusters ("Engineering group", "Environment group", "Field survey group", and "Chemistry group") based on Euclidean distance, and network analysis visualized their relationships and interaction strengths. The study revealed shifts in research focus: "Economic geology", "Petrology", and "Stratigraphy" dominated before 1996, while "Environmental geology" and "Hydrogeology" gained prominence afterward. Among the clusters, the "Engineering group" showed the strongest connections (mean weight = 5.18). These findings highlight the evolving focus of geological research in South Korea, providing insights into opportunities for interdisciplinary collaboration and future research directions.
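For illustration, the core topic-modeling step can be sketched with scikit-learn's LDA as below (a toy corpus and two topics stand in for the study's 10,380 publications and 18 topics):

# Minimal sketch of LDA topic modeling (scikit-learn; toy corpus and
# topic count are illustrative).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["groundwater contamination transport aquifer",
        "granite petrology geochemistry pluton",
        "slope stability landslide monitoring rainfall"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # document-term counts
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for k, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {k}: {top_words}")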

Acknowledgment: This research was supported by the Global Learning & Academic Research Institution for Master's and PhD Students and Postdocs (LAMP) Program of the National Research Foundation of Korea (NRF), funded by the Ministry of Education (No. RS-2023-00301702).

How to cite: Kim, T. and Yang, M.: Exploring Geological Research Themes and Trends in South Korea Using Topic Modeling, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-14874, https://doi.org/10.5194/egusphere-egu25-14874, 2025.

10:59–11:01 | PICO2.8 | EGU25-7072 | ECS | On-site presentation
Chee Hui Lai and Jianshi Zhao

Water governance systems in many river basins require improvement to adapt to changes in environmental and socioeconomic landscapes. However, water governance reform is a complex and challenging process. In particular, policymakers and water managers need a comprehensive understanding of the fundamental components that form current water governance systems; only then can new rules be introduced to alter the governance characteristics of these systems. This process is especially challenging in the case of interstate rivers that flow across multiple states, where governance systems are characterized by complex interstate water agreements and/or laws that cover various cross-state water management affairs and regulate stakeholders from different states. We use the institutional grammar (IG) to parse water agreements and laws, generating text-based data for assessing the institutional characteristics of interstate water governance systems. The IG decomposes the written statements in these documents into syntactic components. Based on these components, the functions of the statements can be identified and categorized into one of seven types of institutional rules, as defined by the rule concepts of the institutional analysis and development (IAD) framework. By analyzing these findings with indicators of governance characteristics, we can assess the allocation of water governance responsibilities and the degree of coordination within a water governance system to identify its institutional characteristics. We applied this method to analyze the water-related laws that form the governance systems of the Yellow River Basin (YRB) in China. The findings reveal that the YRB's water governance system has undergone five major stages of structural evolution since 1987. During this process, the basin's focus in water governance has shifted from flow regulation to water consumption governance, and its governance scope has expanded to include interstate water administration and drought management. Currently, the YRB's water governance system is dominated by centralized governance structures characterized by the centralization of water governance responsibilities and a high degree of stakeholder coordination. The method demonstrates that text-based data generated by parsing water agreements and laws can support systematic analysis of the complex institutional characteristics of water governance systems. This research contributes to the advancement of text-based methods for water governance analysis.
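As a toy illustration of IG-style parsing, the sketch below splits a written statement into an attribute (actor), a deontic, and an aim, and maps the deontic to a rule type (real IG coding distinguishes many more components and the seven IAD rule types; the heuristic and example are our assumptions):

# Toy sketch of tagging deontic components in institutional statements
# (illustrative heuristic only; real IG coding is far more detailed).
import re

DEONTICS = {"must": "obligation", "shall": "obligation",
            "may": "permission", "must not": "prohibition"}

def parse_statement(stmt: str) -> dict:
    """Split a written rule into attribute (actor), deontic, and aim."""
    pattern = r"^(?P<attribute>.+?)\s+(?P<deontic>must not|must|shall|may)\s+(?P<aim>.+)$"
    m = re.match(pattern, stmt, flags=re.IGNORECASE)
    if not m:
        return {"statement": stmt, "deontic_type": "unclassified"}
    parts = m.groupdict()
    parts["deontic_type"] = DEONTICS[parts["deontic"].lower()]
    return parts

print(parse_statement("Provinces must report annual water withdrawals to the basin authority"))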

How to cite: Lai, C. H. and Zhao, J.: Institutional grammar as a text-based method for water governance analysis , EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7072, https://doi.org/10.5194/egusphere-egu25-7072, 2025.

11:01–11:03 | PICO2.9 | EGU25-4468 | On-site presentation
Andreas Niekler, Taís Maria Nunes Carvalho, and Mariana Madruga de Brito

The increasing use of text as data in environmental research offers valuable opportunities, but the inherent biases within textual sources like news, social media, or disaster reports necessitate moving beyond purely descriptive analyses. While NLP techniques like topic modeling and categorical annotation can identify emergent patterns, they often fail to elucidate the underlying causal mechanisms driving observed phenomena, especially within the complex interplay of anthropogenic activities, societal structures, and environmental outcomes. The reductionist tendencies of NLP, especially when dealing with complex social phenomena, often neglect the nuances of language and context, leading to potentially trivial or superficial findings when results are merely validated post hoc against existing literature. This highlights the missed chance to leverage the extensive existing literature on climate research, for instance, to inform the a priori development of theoretical frameworks that could guide the research process. Such frameworks not only validate the variable constructs but also avoid validating and discovering findings solely on the basis of detected patterns; instead, the analysis explicitly searches for patterns that are relevant and address the research question. In a way, it tests what is expected or unexpected, minimizing blind spots and positivist statements. This approach does not hinder exploratory work that yields new hypotheses; rather, it meaningfully combines it with the actual research question.

To address this, a theoretically grounded approach is crucial, moving from describing "what" to explaining "why." This entails embedding the research question within a robust theoretical framework, operationalizing key concepts into measurable variables, and developing a coding scheme that links these variables to their manifestations in the text. This coding scheme is not just an arbitrary set of labels, but a theoretically grounded codebook that ensures the validity of subsequent analyses. NLP then serves as an annotation tool, generating data that reflects these operationalized variables, with rigorous validation ensuring the annotations' accuracy. Instead of simply describing the distributional properties of these annotations, statistical modeling techniques can be used to test a priori hypotheses derived from the theoretical framework. By comparing models based on both statistical fit and theoretical plausibility, researchers can identify the most probable explanation for the observed relationships, thereby uncovering the causal mechanisms at play.

In this contribution, we exemplify this approach by utilizing LLM-based information extraction to annotate disaster impacts from scientific papers based on predefined and well-understood classes and their textual representation. We employ Structural Equation Models, Exploratory Factor Analysis, and regression to test models derived from the literature and compare their probability given the data, demonstrating how this method produces robust, explainable results that go beyond the surface-level findings of exploratory approaches and move towards a deeper understanding of complex environmental phenomena. This integrated approach allows researchers not just to identify patterns in large textual datasets, but to understand the reasons behind them and generate valid and reliable insights in the field of environmental research.
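As a minimal sketch of the hypothesis-testing step, the snippet below fits a regression with statsmodels on synthetic stand-ins for LLM-annotated variables and checks the estimated effects against their theoretically expected signs (the variable names and data are invented):

# Minimal sketch of testing a theory-derived hypothesis on annotated
# variables with a regression model (statsmodels; data are synthetic).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "hazard_severity": rng.normal(size=n),  # would come from text annotation
    "exposure": rng.normal(size=n),         # would come from text annotation
})
# Hypothesized mechanism: reported impacts rise with severity and exposure.
df["reported_impacts"] = (1.5 * df["hazard_severity"] + 0.8 * df["exposure"]
                          + rng.normal(scale=0.5, size=n))

model = smf.ols("reported_impacts ~ hazard_severity + exposure", data=df).fit()
print(model.params, model.pvalues, sep="\n")  # compare against a priori signs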

How to cite: Niekler, A., Carvalho, T. M. N., and de Brito, M. M.: Theoretical-Deductive Content Analysis of Text as Data in Environmental Research, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-4468, https://doi.org/10.5194/egusphere-egu25-4468, 2025.

11:03–11:05 | PICO2.10 | EGU25-5669 | On-site presentation
Yang Song, Yan Yang, and Zhenhan Wu

Advancements in map visualization technology offer innovative approaches for presenting geological information. Geographic data services like DataV Atlas enable users to generate professional geographic outputs through straightforward SQL queries, facilitating the integration, real-time updates, and analysis of multi-source data. This visualization not only deepens users' understanding of geological map data but also enhances the efficacy of data analysis.

In summary, the diverse data types within geological map databases and their applications across modern technological platforms provide critical support and innovative opportunities for geological research and resource management. As technology evolves, the utilization of geological data is expected to become even more varied, injecting new vitality into scientific inquiry and practical applications.

Furthermore, the Global Layer platform, a key component of the IUGS Deep-time Digital Earth (DDE) program, offers a comprehensive suite of online resources for exploring and analyzing Earth's geological history. This initiative empowers participants with skills to navigate extensive geological datasets, conduct online analyses, and engage in meaningful scientific research. It also highlights the impact of advancements in artificial intelligence, cloud computing, and other technologies in enhancing data-driven geoscientific investigations.

Central to the Global Layer platform (https://globallayer.deep-time.org/) is a globally significant geological map at a scale of 1:5 million, encompassing various geological attributes, including chronostratigraphic units, structural features, and seafloor morphology. The platform encourages public engagement through functionalities like data retrieval, interactive browsing, and image generation, facilitating a seamless user experience. During its implementation, extensive data sourcing on the DDE platform was conducted, tracing the provenance of global geographic data and acquiring supplementary geological maps and databases. This effort aimed to enrich geological and geophysical datasets for oceanic islands while optimizing vectorization processes to ensure data accuracy and integrity.

As the Global Layer platform promotes the digital dissemination of geological maps, it significantly enhances public awareness of geological and geographic sciences while encouraging environmental stewardship. This initiative is crucial for advancing societal progress and empowering the DDE community to embrace the future of geographic spatial analysis, unlocking the rich geological heritage of our planet.

How to cite: Song, Y., Yang, Y., and Wu, Z.: Advances in the Utilization of Geological Map Databases with Diverse Data Types on Global Layer Platforms, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-5669, https://doi.org/10.5194/egusphere-egu25-5669, 2025.

11:05–11:07 | PICO2.11 | EGU25-19884 | On-site presentation
Georgia Destouni

Water is both a key resource and a source of risks for society. Societal risks are posed, for example, by waterborne pollutant spreading with related water and environmental quality impacts and by weather extremes of floods and droughts. In its continuous movement through the landscape, the flowing water links the world's hydrological systems with the human-social systems that use the water and interact with it. The interactions are social-hydrological and imply important water resource and risk impacts and feedbacks. However, research has not yet comprehensively, in integrated quantitative and qualitative ways, studied the social-hydrological system coupling and the roles it plays for sustainable development across various world regions with different climate, societal and environmental conditions. This presentation outlines some key needs and linkage pathways for qualitative social perception and prioritization data along with quantitative data and modeling toward such research integration and big-picture science for the world's water system on land, its social-hydrological interactions, and the roles they play for local to global sustainability.

How to cite: Destouni, G.: Social perception and prioritization data for integrated big-picture science of water environmental change and sustainability, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-19884, https://doi.org/10.5194/egusphere-egu25-19884, 2025.

11:07–12:30