- 1SISTEMA GmbH, Vienna, Austria (natali@sistema.at)
- 2ESA ESRIN, Frascati, Italy (Florian.Widmer@esa.int)
The combined use of satellite data, model output, and other geospatial information layers requires a broad set of multidisciplinary skills that are rarely found together among scientists whose training focuses on, for example, understanding, managing, or responding to natural and climate change-related events. There is therefore a need for tools that enable non-specialists to access and exploit the growing capabilities that emerge from fusing Earth Observation (EO)-based data with other geospatial data.
In response to this, the TheDe project aims to create a new type of data dissemination service that enables the automatic generation of thematic datacubes on demand. It integrates Earth Observation (e.g., Copernicus products) and other geospatial environmental data with Large Language Models (LLMs) and semantic interpretation, transforming diverse datasets into accessible, meaningful information for both domain experts and a broader audience.
TheDe acts as an AI assistant that, through a chatbot interface, receives a natural-language query related to a specific EO task and provides the corresponding data, metadata, and descriptions, ready for download in user-specified formats. Specifically, the query is processed by a tailored LLM framework that translates natural language into complex geospatial queries, mapping high-level EO tasks to concrete data requests. The system then identifies the relevant geospatial datasets and calls the appropriate APIs (e.g., Copernicus CDSE/CDS/ADS, NASA FIRMS, ESA Open Access Hub). Once the datasets are obtained, the LLM uses their metadata to generate context-rich descriptions that offer practical guidance to the user; these descriptions are delivered together with the corresponding datasets.
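The flow from query to data request described above can be illustrated with a minimal sketch. It is not the TheDe implementation: the catalogue entries, the `DataRequest` structure, and the functions `identify_task`, `plan_requests`, and `describe` are hypothetical names introduced here for illustration, the LLM step is replaced by a simple keyword match, and no provider API is actually called.

```python
from dataclasses import dataclass

# Hypothetical task-to-dataset mapping; entries are illustrative only
# and do not reflect the catalogue used by TheDe.
TASK_CATALOGUE = {
    "wildfire": [
        ("NASA FIRMS", "active fire detections"),
        ("Copernicus CDS", "2m temperature"),
    ],
    "air quality": [
        ("Copernicus ADS", "nitrogen dioxide"),
    ],
}


@dataclass(frozen=True)
class DataRequest:
    """One concrete request that would be sent to a data provider."""
    provider: str
    variable: str
    bbox: tuple  # (west, south, east, north) in degrees
    start: str   # ISO date
    end: str     # ISO date


def identify_task(query: str) -> str:
    """Keyword stand-in for the LLM step that maps a query to an EO task."""
    lowered = query.lower()
    for task in TASK_CATALOGUE:
        if task in lowered:
            return task
    raise ValueError(f"no known EO task found in query: {query!r}")


def plan_requests(query: str, bbox: tuple, start: str, end: str) -> list:
    """Turn a high-level query into one request per relevant dataset."""
    task = identify_task(query)
    return [
        DataRequest(provider, variable, bbox, start, end)
        for provider, variable in TASK_CATALOGUE[task]
    ]


def describe(requests: list) -> str:
    """Build a plain-text summary from the request metadata."""
    return "\n".join(
        f"- {r.variable} from {r.provider}, {r.start} to {r.end}"
        for r in requests
    )


if __name__ == "__main__":
    plan = plan_requests(
        "Show wildfire activity in Greece",
        bbox=(19.0, 34.5, 29.0, 42.0),
        start="2023-07-01",
        end="2023-08-31",
    )
    print(describe(plan))
```

In a real system, the keyword match would be replaced by the LLM framework, and each `DataRequest` would be dispatched to the corresponding provider API before the description is generated from the returned metadata.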
During the system architecture design, a detailed study of the state of the art was conducted, focusing on evaluating the performance of open-source LLMs for EO reasoning through dedicated benchmarks. In parallel, different system architectures were explored, with particular attention to agentic frameworks. Specific techniques such as Retrieval-Augmented Generation (RAG), fine-tuning, and prompt engineering were analysed to enhance the specialization of the various components. Building on these studies, an innovative model is proposed for EO data discovery and exploitation.
The preliminary outcomes show promising alignment with current sector needs and developments. TheDe introduces the capability to access not only widely used EO data but also their combination with other heterogeneous data sources, facilitating interoperability and scalability.
Finally, TheDe aims to bridge the gap between data systems to support advanced data mining activities beyond traditional Earth Observation services. To this end, new types of use cases are proposed, representing innovative EO applications that, in the long term, can leverage the potential of TheDe.
How to cite: Natali, S., Bellotto, E. K., El Azami El Adli, W., Widmer, F., and Van Bemmelen, J.: Thematic DataCubes on-Demand (TheDe): Leveraging Large Language Models (LLMs) for Earth Observation Data Discovery and Exploitation, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-5722, https://doi.org/10.5194/egusphere-egu26-5722, 2026.