- 1Department for Research and Development, Göttingen State and University Library, Göttingen, Germany
- 2University of Göttingen, Göttingen, Germany
- 3Geoscience Center Göttingen, University of Göttingen, Göttingen, Germany
- 4Department for Software and Service Development, Göttingen State and University Library, Göttingen, Germany
The increasing volume and complexity of geochemical literature pose major challenges for the sustainable curation of domain-specific databases such as GEOROC (Geochemistry of Rocks of the Oceans and Continents). GEOROC is the world’s largest repository of geochemical and isotopic data from igneous and metamorphic rocks and minerals, aggregating more than 41 million values from over 23,000 publications. Although GEOROC underpins a wide range of geoscientific research, the extraction and harmonization of metadata from publications still rely heavily on manual effort, which significantly limits scalability.
In this contribution, we present a novel information extraction architecture that moves beyond linear processing pipelines toward a Large Language Model (LLM)-based multi-agent system combining document layout analysis, schema-driven reasoning, and modality-aware extraction. Unlike generic LLM approaches that treat documents as continuous text streams, our architecture adopts a "Visual-First" strategy. We use a layout-aware backbone (MinerU, Niu et al., 2025) to decompose PDF manuscripts into a sequence of primitive blocks, each representing a localized document region with associated visual and typographic features. This preserves the geometric grounding essential for interpreting complex data tables. A routing agent subsequently validates and refines the initial layout classification, dynamically dispatching blocks to specialized downstream agents for text, table, or figure processing. This adaptive routing strategy improves robustness against layout variability across journals, publication years, and formatting styles.
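The routing step can be illustrated with a minimal sketch. It is not the system's implementation: the `Block` structure, the tab-count re-labeling rule, and the handler names are hypothetical stand-ins for the layout backbone's output and the LLM-based agents.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block: a page region plus the backbone's label."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates
    layout_label: str                        # label proposed by the layout backbone
    content: str


def refine_label(block: Block) -> str:
    """Toy validation rule (assumption): text with several tabs is a table row."""
    if block.layout_label == "text" and block.content.count("\t") >= 2:
        return "table"
    return block.layout_label


def route(blocks: List[Block], handlers: Dict[str, Callable[[Block], str]]) -> List[str]:
    """Dispatch each block to the handler for its refined label."""
    results = []
    for block in blocks:
        label = refine_label(block)
        # Unknown labels fall back to the text handler.
        handler = handlers.get(label, handlers["text"])
        results.append(handler(block))
    return results


# Placeholder handlers standing in for the text, table, and figure agents.
handlers = {
    "text": lambda b: f"text-agent:p{b.page}",
    "table": lambda b: f"table-agent:p{b.page}",
    "figure": lambda b: f"figure-agent:p{b.page}",
}

blocks = [
    Block(1, (50.0, 700.0, 550.0, 760.0), "text", "Samples were collected along the ridge axis."),
    Block(3, (50.0, 100.0, 550.0, 400.0), "text", "SiO2\t49.8\t50.1\t51.3"),
    Block(4, (60.0, 120.0, 540.0, 420.0), "figure", "TAS diagram"),
]
routed = route(blocks, handlers)
print(routed)
```

In this sketch the second block is labeled as text by the backbone but is re-routed to the table handler, which mirrors the validate-and-refine role described for the routing agent.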
Central to the framework is an active schema agent that operationalizes the GEOROC metadata model. Rather than treating the database schema as a static template, this agent continuously provides extraction targets, normalization rules, unit standards, and conflict-resolution policies that guide all subsequent processing steps. Text blocks are handled by an information extraction agent driven by Optical Character Recognition (OCR), table blocks by a table parsing agent capable of reconstructing complex table structures, and figure blocks by a visual reasoning agent designed to interpret diagrams and digitize plotted values. Each agent produces structured candidate values enriched with confidence estimates and fine-grained provenance, including page-level and bounding-box references to the original document.
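As a rough illustration of a candidate value carrying confidence and provenance, and of a schema-supplied unit rule being applied to it, consider the sketch below. The field names, the `SCHEMA` dictionary, and the conversion table are assumptions made for this example and do not reproduce the GEOROC metadata model.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate value with confidence and provenance."""
    field: str
    value: float
    unit: str
    confidence: float
    modality: str                            # "text", "table", or "figure"
    page: int
    bbox: Tuple[float, float, float, float]  # region the value was read from


# Assumed schema fragment: the target unit for each field.
SCHEMA = {"SiO2": {"unit": "wt%"}}

# Divisors that convert a source unit to wt% (1 wt% = 10,000 ppm).
DIVISOR_TO_WT_PERCENT = {"wt%": 1.0, "ppm": 10_000.0}


def normalize(candidate: Candidate) -> Candidate:
    """Convert a candidate to the unit the schema requests, keeping provenance."""
    target = SCHEMA[candidate.field]["unit"]
    if target != "wt%":
        raise ValueError(f"no conversion rule for target unit {target!r}")
    divisor = DIVISOR_TO_WT_PERCENT[candidate.unit]
    return replace(candidate, value=candidate.value / divisor, unit=target)


raw = Candidate("SiO2", 498_000.0, "ppm", 0.92, "table", 3, (50.0, 100.0, 550.0, 400.0))
normalized = normalize(raw)
print(normalized)
```

Only the value and unit change during normalization; the page and bounding box are carried through unchanged so the normalized value can still be traced back to its source region.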
The outputs of these modality-specific agents are consolidated by a merge-and-judge agent, which goes beyond simple aggregation. This agent performs cross-modal arbitration, unit harmonization, and deduplication, resolving conflicts between heterogeneous sources according to schema-defined priorities and data-quality criteria. The final result is a machine-readable JSON representation that preserves both extracted values and their evidential context.
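A minimal sketch of the arbitration and deduplication step follows. The priority order (table over text over figure), the tie-break on confidence, and all field names are assumptions chosen for illustration; the abstract does not specify the actual schema-defined priorities.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate value, assumed to be unit-harmonized already."""
    sample: str
    field: str
    value: float
    unit: str
    confidence: float
    modality: str
    page: int
    bbox: Tuple[float, float, float, float]


# Assumed priority: lower number wins when sources disagree.
PRIORITY = {"table": 0, "text": 1, "figure": 2}


def rank(c: Candidate) -> Tuple[int, float]:
    """Sort key: modality priority first, then higher confidence."""
    return (PRIORITY[c.modality], -c.confidence)


def merge(candidates: List[Candidate]) -> List[dict]:
    """Keep one winning candidate per (sample, field) pair."""
    best: Dict[Tuple[str, str], Candidate] = {}
    for c in candidates:
        key = (c.sample, c.field)
        if key not in best or rank(c) < rank(best[key]):
            best[key] = c
    return [asdict(c) for c in best.values()]


candidates = [
    Candidate("S-01", "SiO2", 50.0, "wt%", 0.95, "figure", 5, (60.0, 120.0, 540.0, 420.0)),
    Candidate("S-01", "SiO2", 49.8, "wt%", 0.90, "table", 3, (50.0, 100.0, 550.0, 400.0)),
    Candidate("S-01", "SiO2", 49.8, "wt%", 0.80, "text", 2, (50.0, 600.0, 550.0, 640.0)),
]
merged = merge(candidates)
print(json.dumps(merged, indent=2))
```

Here the table value is kept even though the figure-derived value has a higher confidence, because modality priority is checked first. The JSON output retains the page and bounding box of the winning candidate as its evidential context.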
By combining layout grounding, adaptive routing, schema-driven reasoning, and judgment-based integration, this system delivers a robust and extensible approach to large-scale metadata extraction. The framework substantially supports the curation process and strengthens GEOROC’s role as a FAIR-compliant reference infrastructure by enabling more efficient reuse of published geochemical data in future research.
References:
Niu, J., Liu, Z., Gu, Z., Wang, B., Ouyang, L., Zhao, Z., ... & He, C. (2025). MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186.
How to cite: Yang, T., Elezabawy, K., Kurzawe, D., Kallas, L., Traun, M., Sarbas, B., Sturm, A., Möller-McNett, S., Willbold, M., and Wörner, G.: Multi-agent Geochemical Literature Data Mining System, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-14415, https://doi.org/10.5194/egusphere-egu26-14415, 2026.