- 1 Mineral Resources, CSIRO, Kensington, Australia (jens.klump@csiro.au)
- 2 Information Management & Technology, CSIRO, Eveleigh, Australia
- 3 Information Management & Technology, CSIRO, Clayton, Australia
Among generative Artificial Intelligence (AI) systems, Retrieval Augmented Generation (RAG) has attracted considerable attention for its ability to support natural language queries over large text corpora with the help of Large Language Models (LLMs). In a pilot project, we explored RAG and LLM fine-tuning as tools for exploring the abstracts of the EGU General Assembly as a text corpus.
To ingest the text corpus, we built a processing pipeline that converts the abstracts from XML to JSON, structured so that the data can be imported easily into a vector storage system. For additional context, we added each abstract's association with the scientific divisions of the EGU, including co-organisation between two or more divisions. This information was not available at the time of this project and had to be scraped from archived versions of the conference online programme.
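A minimal sketch of such a conversion step is given below. The XML element names (`abstract`, `title`, `text`), the `id` attribute, and the `DIVISIONS` lookup standing in for the scraped programme data are all hypothetical, since the actual schema and pipeline are not described here; the sketch only illustrates flattening XML records into JSON with division metadata attached.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real corpus schema may differ.
SAMPLE_XML = """<abstracts>
  <abstract id="EGU25-0001">
    <title>Example title</title>
    <text>Example abstract text.</text>
  </abstract>
</abstracts>"""

# Hypothetical lookup standing in for division data scraped from the programme.
DIVISIONS = {"EGU25-0001": ["ESSI", "GI"]}


def abstracts_to_records(xml_text, divisions):
    """Convert abstract XML into a list of flat, JSON-serialisable records."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.iter("abstract"):
        abstract_id = node.get("id")
        divs = divisions.get(abstract_id, [])
        records.append({
            "id": abstract_id,
            "title": (node.findtext("title") or "").strip(),
            "text": (node.findtext("text") or "").strip(),
            "divisions": divs,
            # More than one division means the abstract was co-organised.
            "co_organised": len(divs) > 1,
        })
    return records


records = abstracts_to_records(SAMPLE_XML, DIVISIONS)
print(json.dumps(records, indent=2))
```

Keeping the division list and a co-organisation flag as plain fields means a vector store could expose them as filterable metadata alongside the embedded text.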
The RAG system is designed to load models in various formats, such as GGUF, GPTQ, and Transformers models. It also integrates with a vector storage solution, from which it retrieves conference abstracts to enrich its responses. The implementation uses Apptainer for containerised execution.
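The retrieval step of a RAG system can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the project: the word-count "embedding" is a toy stand-in for a trained embedding model, the in-memory list stands in for the vector store, and the corpus entries, URLs, and function names are hypothetical. The assembled prompt would then be passed to an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, records, k=2):
    """Return the k records whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, embed(r["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, hits):
    """Assemble an LLM prompt that carries the retrieved texts and their source links."""
    context = "\n".join(f"[{h['id']}] {h['text']} (source: {h['url']})" for h in hits)
    return (
        "Answer using only the context below and cite abstract IDs.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical mini-corpus; IDs and URLs are placeholders.
CORPUS = [
    {"id": "A1", "text": "Glacier mass balance in the Alps from satellite data",
     "url": "https://example.org/A1"},
    {"id": "A2", "text": "Seismic tomography of the mantle beneath Australia",
     "url": "https://example.org/A2"},
    {"id": "A3", "text": "Retrieval augmented generation for geoscience text corpora",
     "url": "https://example.org/A3"},
]

query = "glacier mass balance satellite"
hits = retrieve(query, CORPUS, k=1)
prompt = build_prompt(query, hits)
print(prompt)
```

Carrying the source identifier and link through to the prompt is what makes it possible to trace a response back to the abstracts it was probably drawn from.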
The first responses of the RAG system to natural language queries were promising. Because the responses included links to the source materials, we could compare each response with the information in its sources. However, evaluating generative AI models is not trivial, since a single query can produce multiple different results. Using a well-understood text corpus and being able to trace the probable origin texts of the results allows us to evaluate the quality of the results and to better understand the origin of deficient RAG responses.
How to cite: Klump, J., Hille, J., Guglielmo, M., and Gardner, B.: Building a RAG system for querying a large corpus of conference abstracts, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7690, https://doi.org/10.5194/egusphere-egu25-7690, 2025.