- 52°North Spatial Information Research GmbH, Münster, Germany.
Spatial Data Infrastructures (SDIs) contain large volumes of spatial data from various organizations and data producers. Metadata is intended to make this data discoverable, yet finding the relevant data can be challenging. The challenges include rigid keyword search, complex search interfaces in geoportals, map-based search that requires some geographic knowledge, and language differences between user queries and the metadata.
The development of Large Language Models (LLMs) offers new opportunities to improve spatial data discovery. LLMs demonstrate strong language understanding and generation capabilities and have been used in information retrieval tasks. They can bridge semantic differences and language barriers between user queries and the needed information. However, their internal knowledge is limited and they are prone to hallucinations. Moreover, LLMs with internet search tools cannot find the datasets in SDIs unless the datasets, or the web pages describing them, are indexed by search engines.
Retrieval-Augmented Generation (RAG) addresses these knowledge limitations by connecting an LLM to an external, up-to-date knowledge base. However, RAG mainly works in the textual domain, where it excels at retrieving external information that is semantically relevant to a user query. Queries for geographic data also have a spatial aspect, yet the spatial reasoning capabilities of LLMs are limited. For a query like “forest data for Vienna”, RAG can identify the relevant forest data from a pool of metadata, regardless of the language or words used to describe the data. Identifying the datasets that meet the spatial intent, however, is a problem. DCAT, a widely used metadata standard, describes the spatial extent of a dataset either as bounding box coordinates or as links to gazetteers. Naive RAG relies on semantic similarity. An LLM can identify “Vienna” as a location, but would struggle to identify the datasets relevant to that location, as there is little semantic similarity between a location name and coordinate digits or gazetteer links. There is thus a need to incorporate spatial indexing techniques for improved spatial reasoning.
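To illustrate the gap described above, the following is a minimal sketch of one possible way to pair a spatial filter with semantic ranking. It is not the implementation presented in this contribution. All names (`Record`, `GAZETTEER`, `spatial_rag_candidates`), the approximate Vienna bounding box, the example records, and the precomputed `semantic_score` values (stand-ins for embedding similarities to the query "forest data") are assumptions made for illustration.

```python
from dataclasses import dataclass

# Bounding box as (min_lon, min_lat, max_lon, max_lat) in WGS84 degrees.
BBox = tuple[float, float, float, float]


@dataclass
class Record:
    """Hypothetical metadata record with a spatial extent."""
    title: str
    bbox: BBox
    semantic_score: float  # stand-in for an embedding similarity to the query


def intersects(a: BBox, b: BBox) -> bool:
    """Return True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical gazetteer lookup; the coordinates are only approximate.
GAZETTEER: dict[str, BBox] = {"vienna": (16.18, 48.12, 16.58, 48.32)}


def spatial_rag_candidates(place: str, records: list[Record], top_k: int = 3) -> list[Record]:
    """Keep records whose extent intersects the place, then rank by semantic score."""
    query_bbox = GAZETTEER[place.lower()]
    hits = [r for r in records if intersects(r.bbox, query_bbox)]
    return sorted(hits, key=lambda r: r.semantic_score, reverse=True)[:top_k]


# Illustrative records only; titles, extents, and scores are made up.
records = [
    Record("Waldflächen Wien", (16.18, 48.12, 16.58, 48.32), 0.91),
    Record("Forest cover Tyrol", (10.0, 46.6, 12.9, 47.7), 0.93),
    Record("Austria land cover", (9.5, 46.4, 17.2, 49.0), 0.75),
    Record("Vienna tram stops", (16.2, 48.1, 16.6, 48.3), 0.12),
]

results = spatial_rag_candidates("Vienna", records)
for r in results:
    print(f"{r.semantic_score:.2f}  {r.title}")
```

In this sketch the Tyrol record has the highest semantic score but is dropped because its extent does not intersect the query area, which is the behaviour that purely semantic retrieval cannot guarantee. A production system would likely replace the linear scan with a spatial index (for example an R-tree) and would also apply a semantic score threshold so that spatially matching but thematically irrelevant records are excluded.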
In this contribution, we present an approach that combines LLMs, RAG, and spatial indexing techniques to overcome these challenges and to improve the discovery of spatial data in SDIs through natural language queries.
How to cite: Ondieki, J. O., Rieke, M., and Jirka, S.: Using Large Language Models to Enhance Spatial Data Discovery in Spatial Data Infrastructures, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-20221, https://doi.org/10.5194/egusphere-egu26-20221, 2026.