- 1. Instituto Pirenaico de Ecología (IPE-CSIC), Zaragoza, Spain
- 2. Departamento de Física, Universidad de Extremadura, Badajoz, Spain
- 3. Estación Experimental Aula Dei (EEAD-CSIC), Zaragoza, Spain
Newspapers are a highly valuable documentary source for studying extreme climate events, their impacts, and the measures societies have taken for mitigation or adaptation. However, the huge volume of published newspapers and the diversity of topics they cover make manual extraction of this information extremely time-consuming and costly. Large Language Models (LLMs) have shown strong capabilities for information retrieval and extraction from digital newspapers. Nevertheless, only a limited number of studies have evaluated their performance on historical archives available exclusively in paper form and later digitized through scanning and Optical Character Recognition (OCR). In these cases, the resulting text layer often contains multiple types of errors, e.g., character-level mistakes (confused letters/numbers, missing accents), broken or merged words, and loss of document structure (incorrect reading order, irregular line breaks, hyphenation artifacts, or disordered tables), which can strongly affect extraction performance.
In this study, we evaluate the ability of LLMs to extract drought impacts from the historical archives of two Spanish newspapers, Hoy and El Periódico de Extremadura, covering the period 1923–1993 (995,558 pages). Manual annotation was carried out to identify both drought-related news and their impacts on water resources, energy, agriculture, and livestock.
We use the CienaLLM framework, which provides configurable prompt pipelines designed for the structured extraction of climatic events and their impacts from news articles. The framework orchestrates prompt-engineering strategies such as Chain-of-Thought reasoning, structured output generation, self-criticism, and optional summarization steps.
We assess six open-source LLMs: qwen2.5:3b, qwen2.5:7b, qwen2.5:72b, qwen3:8b, qwen3:30b, and deepseek-r1:8b. Each model is tested under three configurations: no-summary, summary, and expert-summary. In the no-summary configuration, extraction is performed directly from the OCR text. In the other two, the model first summarizes the page before extraction: summary produces a general summary, while expert-summary focuses specifically on drought-related information.
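As an illustration, the sketch below shows how such a two-step pipeline could be wired for the three configurations. CienaLLM's actual interface is not described in this abstract, so every name here (SUMMARY_PROMPTS, EXTRACTION_PROMPT, run_pipeline, the prompt wording) is a hypothetical placeholder rather than the framework's API.

```python
# Hypothetical sketch of a two-step prompt pipeline: an optional
# summarization pass followed by structured extraction. None of these
# names come from CienaLLM; they only illustrate the three configurations.
from typing import Callable, Optional

SUMMARY_PROMPTS = {
    "no-summary": None,  # extract directly from the OCR text
    "summary": "Summarize the following newspaper page:\n\n{page}",
    "expert-summary": (
        "Summarize only the drought-related information "
        "(impacts on water resources, energy, agriculture, livestock) "
        "in the following newspaper page:\n\n{page}"
    ),
}

EXTRACTION_PROMPT = (
    "From the text below, extract drought impacts as JSON with the keys "
    '"water_resources", "energy", "agriculture", "livestock":\n\n{text}'
)

def run_pipeline(ocr_page: str, config: str, llm: Callable[[str], str]) -> str:
    """Run one of the three configurations on a single OCR page."""
    summary_prompt: Optional[str] = SUMMARY_PROMPTS[config]
    text = ocr_page
    if summary_prompt is not None:
        # First pass: condense the page before extraction.
        text = llm(summary_prompt.format(page=ocr_page))
    # Second pass: structured extraction from the (possibly summarized) text.
    return llm(EXTRACTION_PROMPT.format(text=text))
```

In the full pipeline, the extraction step would additionally apply the Chain-of-Thought, structured-output, and self-criticism prompts mentioned above; they are omitted here for brevity.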
Results show that larger models generally achieve better performance, and that adding a prior summarization step does not lead to significant improvements for these models. As expected, text quality is a key factor controlling extraction success. To quantify this, we propose the Unknown Words Ratio as a proxy indicator of text quality and compute the minimum threshold values required to ensure successful information extraction.
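The abstract does not give the exact definition of the Unknown Words Ratio; a natural reading, sketched below, is the fraction of tokens on a page that are not found in a reference dictionary. The tokenization rule and dictionary source here are illustrative assumptions, not the definition used in the study.

```python
import re

def unknown_words_ratio(text: str, dictionary: set[str]) -> float:
    """Fraction of alphabetic tokens not found in a reference dictionary.

    Higher values suggest more OCR damage (broken/merged words, confused
    characters). Tokenization and dictionary choice are assumptions made
    for this sketch only.
    """
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in dictionary)
    return unknown / len(tokens)

# Example: with a toy Spanish word list, an OCR-damaged token raises the ratio.
lexicon = {"la", "sequía", "afecta", "al", "campo"}
print(unknown_words_ratio("La sequía afeeta al campo", lexicon))  # 0.2
```

A minimum threshold on this ratio could then be used to flag pages whose OCR layer is too degraded for reliable extraction, in the spirit of the threshold values mentioned above.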
How to cite: Vilas, D., Vaquero, J. M., Díaz Codiño, L., Latorre, B., and Domínguez-Castro, F.: Benchmarking Open-Source LLMs for Drought Impact Extraction in Historical Newspapers, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-23012, https://doi.org/10.5194/egusphere-egu26-23012, 2026.