Multi-agent Geochemical Literature Data Mining System

Tianyu Yang; Karim Elezabawy; Daniel Kurzawe; Leander Kallas; Marie Traun; Bärbel Sarbas; Adrian Sturm; Stefan Möller-McNett; Matthias Willbold; Gerhard Wörner

doi:https://doi.org/10.5194/egusphere-egu26-14415

Session GI2.1

EGU26-14415, updated on 14 Mar 2026

EGU General Assembly 2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Tianyu Yang¹, Karim Elezabawy², Daniel Kurzawe¹, Leander Kallas³, Marie Traun³, Bärbel Sarbas³, Adrian Sturm⁴, Stefan Möller-McNett³, Matthias Willbold³, and Gerhard Wörner³

¹Department for Research and Development, Göttingen State and University Library, Göttingen, Germany
²University of Göttingen, Göttingen, Germany
³Geoscience Center Göttingen, University of Göttingen, Göttingen, Germany
⁴Department for Software and Service Development, Göttingen State and University Library, Göttingen, Germany

The increasing volume and complexity of geochemical literature pose major challenges for the sustainable curation of domain-specific databases such as GEOROC (Geochemistry of Rocks of the Oceans and Continents), the world’s largest repository of geochemical and isotopic data from igneous and metamorphic rocks and minerals, which aggregates more than 41 million values from over 23,000 publications. Although GEOROC underpins a wide range of geoscientific research, the extraction and harmonization of metadata from publications still rely heavily on manual effort, which significantly limits scalability.

In this contribution, we present a novel information extraction architecture that moves beyond linear processing pipelines toward a Large Language Model (LLM)-based multi-agent system combining document layout analysis, schema-driven reasoning, and modality-aware extraction. Unlike generic LLM approaches that treat documents as continuous text streams, our architecture adopts a "Visual-First" strategy. We use a layout-aware backbone, MinerU (Niu et al., 2025), to decompose PDF manuscripts into a sequence of geometrically grounded primitive blocks, each representing a localized document region with associated visual and typographic features. This preserves the spatial structure that is essential for interpreting complex data tables. A routing agent subsequently validates and refines the initial layout classification, dynamically dispatching blocks to specialized downstream agents for text, table, or figure processing. This adaptive routing strategy improves robustness against layout variability across journals, publication years, and formatting styles.
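The routing step can be illustrated with a minimal Python sketch. It is not the system's implementation: the `Block` fields, the `refine_label` heuristic, and the handler names are hypothetical, and a simple numeric-token rule stands in for the LLM-based routing agent. The sketch only shows the control flow of validating an initial layout label and dispatching each block to a modality-specific handler.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """A layout block as a parser might emit it (hypothetical field names)."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) on the page
    layout_label: str  # initial class: "text", "table", or "figure"
    content: str       # recognized text of the region


def _is_number(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False


def refine_label(block: Block) -> str:
    """Toy stand-in for the routing agent's validation step.

    Assumption: a multi-line "text" block in which at least half of the
    tokens are numeric is more likely a misclassified table.
    """
    if block.layout_label != "text":
        return block.layout_label
    tokens = block.content.split()
    if len(block.content.splitlines()) >= 2 and tokens:
        numeric_share = sum(_is_number(t) for t in tokens) / len(tokens)
        if numeric_share >= 0.5:
            return "table"
    return "text"


def route(blocks: List[Block],
          handlers: Dict[str, Callable[[Block], str]]) -> List[str]:
    """Dispatch every block to the handler chosen by the refined label."""
    return [handlers[refine_label(b)](b) for b in blocks]


# Placeholder handlers; real agents would return structured candidate values.
HANDLERS: Dict[str, Callable[[Block], str]] = {
    "text": lambda b: "text-agent",
    "table": lambda b: "table-agent",
    "figure": lambda b: "figure-agent",
}

blocks = [
    Block(1, (50, 700, 550, 760), "text",
          "Basalt samples were collected from the rift zone."),
    Block(3, (50, 100, 550, 400), "text",
          "SiO2 51.2 49.8 50.3\nMgO 7.4 8.1 6.9"),
    Block(4, (50, 420, 550, 700), "figure", "Fig. 2 TAS diagram"),
]
routed = route(blocks, HANDLERS)
print(routed)
```

In this sketch the second block arrives labeled as text but is rerouted to the table handler, which is the kind of correction the routing agent is described as performing.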

Central to the framework is an active schema agent that operationalizes the GEOROC metadata model. Rather than treating the database schema as a static template, this agent continuously provides extraction targets, normalization rules, unit standards, and conflict-resolution policies that guide all subsequent processing steps. Text blocks are handled by an information extraction agent driven by Optical Character Recognition (OCR), table blocks by a table parsing agent capable of reconstructing complex table structures, and figure blocks by a visual reasoning agent designed to interpret diagrams and digitize plotted values. Each agent produces structured candidate values enriched with confidence estimates and fine-grained provenance, including page-level and bounding-box references to the original document.
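As a rough illustration of a candidate value with provenance and of schema-driven unit normalization, consider the Python sketch below. The `Candidate` fields, the `SCHEMA` entries, and the conversion table are assumptions made for this example and do not reproduce the actual GEOROC metadata model.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Candidate:
    """One extracted value plus its provenance (hypothetical field names)."""
    sample: str
    field: str
    value: float
    unit: str
    confidence: float  # agent's own estimate in [0, 1]
    source: str        # "text", "table", or "figure"
    page: int
    bbox: Tuple[float, float, float, float]


# Assumed schema excerpt: canonical unit per field and multipliers that
# convert an accepted input unit into that canonical unit.
SCHEMA: Dict[str, dict] = {
    "SiO2": {"unit": "wt%", "to_canonical": {"wt%": 1.0}},
    "Sr": {"unit": "ppm",
           "to_canonical": {"ppm": 1.0, "ppb": 0.001, "wt%": 10000.0}},
}


def normalize(candidate: Candidate) -> Candidate:
    """Return a copy of the candidate expressed in the schema's unit."""
    rule = SCHEMA.get(candidate.field)
    if rule is None:
        raise KeyError(f"field not in schema: {candidate.field}")
    factor = rule["to_canonical"].get(candidate.unit)
    if factor is None:
        raise ValueError(
            f"no conversion from {candidate.unit} for {candidate.field}")
    # Provenance (page, bbox, source, confidence) is carried over unchanged.
    return replace(candidate, value=candidate.value * factor,
                   unit=rule["unit"])


raw = Candidate("S-01", "Sr", 0.035, "wt%", 0.9, "table",
                5, (60.0, 120.0, 540.0, 380.0))
normalized = normalize(raw)
print(normalized)
```

Keeping the candidate immutable and returning a converted copy means the original reading and its bounding box remain available for later auditing.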

The outputs of these modality-specific agents are consolidated by a merge-and-judge agent, which goes beyond simple aggregation. This agent performs cross-modal arbitration, unit harmonization, and deduplication, resolving conflicts between heterogeneous sources according to schema-defined priorities and data-quality criteria. The final result is a machine-readable JSON representation that preserves both extracted values and their evidential context.
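The arbitration logic might look roughly like the following Python sketch. The source priority order (table over text over figure), the tie-break on confidence, and the JSON layout are assumptions chosen for illustration; the sketch also assumes that units have already been harmonized before merging.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple

# Assumed schema-defined priority: lower number wins a conflict.
SOURCE_PRIORITY: Dict[str, int] = {"table": 0, "text": 1, "figure": 2}


@dataclass(frozen=True)
class Candidate:
    """One extracted value plus its provenance (hypothetical field names)."""
    sample: str
    field: str
    value: float
    unit: str
    confidence: float
    source: str
    page: int
    bbox: Tuple[float, float, float, float]


def merge_and_judge(candidates: List[Candidate]) -> List[dict]:
    """Deduplicate, group by (sample, field), and pick one value per group."""
    groups: Dict[Tuple[str, str], List[Candidate]] = {}
    # dict.fromkeys drops exact duplicates while preserving input order.
    for c in dict.fromkeys(candidates):
        groups.setdefault((c.sample, c.field), []).append(c)

    records = []
    for (sample, field), group in sorted(groups.items()):
        # Prefer the higher-priority source, then the higher confidence.
        winner = min(group,
                     key=lambda c: (SOURCE_PRIORITY[c.source], -c.confidence))
        records.append({
            "sample": sample,
            "field": field,
            "value": winner.value,
            "unit": winner.unit,
            # All competing readings stay attached as evidence.
            "evidence": [asdict(c) for c in group],
        })
    return records


candidates = [
    Candidate("S-01", "SiO2", 51.2, "wt%", 0.80, "table",
              3, (50.0, 100.0, 550.0, 400.0)),
    Candidate("S-01", "SiO2", 51.0, "wt%", 0.95, "figure",
              4, (50.0, 420.0, 550.0, 700.0)),
    Candidate("S-01", "SiO2", 51.2, "wt%", 0.80, "table",
              3, (50.0, 100.0, 550.0, 400.0)),  # exact duplicate
]
records = merge_and_judge(candidates)
print(json.dumps(records, indent=2))
```

Here the tabulated value is selected over the value digitized from a figure despite the latter's higher confidence, because the assumed policy ranks sources first, and both readings are retained as evidence in the output.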

By combining layout grounding, adaptive routing, schema-driven reasoning, and judgment-based integration, this system delivers a robust and extensible approach to large-scale metadata extraction. The framework supports the curation process by reducing its reliance on manual extraction, and it strengthens GEOROC’s role as a FAIR-compliant reference infrastructure by enabling more efficient reuse of published geochemical data in future research.

References:

Niu, J., Liu, Z., Gu, Z., Wang, B., Ouyang, L., Zhao, Z., ... & He, C. (2025). MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186.

How to cite: Yang, T., Elezabawy, K., Kurzawe, D., Kallas, L., Traun, M., Sarbas, B., Sturm, A., Möller-McNett, S., Willbold, M., and Wörner, G.: Multi-agent Geochemical Literature Data Mining System, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-14415, https://doi.org/10.5194/egusphere-egu26-14415, 2026.