EGU26-2509, updated on 13 Mar 2026
https://doi.org/10.5194/egusphere-egu26-2509
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Tuesday, 05 May, 14:45–14:55 (CEST)
 
Room 0.96/97
Using Multimodal LLMs for Digitising Handwritten Climate Records
Marlies van der Schee¹, Kirien Whan¹, Teun Peeters², Yuliya Shapovalova², Jacco van Ekris¹, Irene Garcia Marti¹, Bert Bergman¹, and Karlijn Zaanen¹
  • ¹Royal Netherlands Meteorological Institute (KNMI), De Bilt, The Netherlands
  • ²Institute for Computing and Information Sciences, Radboud University, Nijmegen, The Netherlands

Millions of historical handwritten weather observations remain locked in paper records, leaving this valuable information inaccessible for analysis and at risk of permanent loss. Manual transcription of these records is highly accurate but time-consuming and costly, making this a task where AI could play a pivotal role. Traditional optical character recognition (OCR) methods struggle with the irregularities of historical handwriting and tabular layouts. This study proposes a novel automated digitisation pipeline that combines multimodal large language models (MLLMs) with table structure recognition (TSR) and OCR techniques to transcribe handwritten climate records efficiently and accurately.

First, we compare two MLLMs and find that guiding the MLLM with structured prompts and validating its outputs against physical relationships between meteorological variables yields a transcription precision of up to 97%. This rivals human accuracy, though at the cost of a lower inclusion rate due to strict filtering. Second, we link MLLM outputs to detected table structures to generate training data for fine-tuning a pretrained OCR model. Fine-tuning significantly enhances transcription quality, raising it from 19% to 81% on unseen data. Challenges remain because TSR is difficult on historical documents, which reduces the quality of our training data. Despite these limitations, our research establishes a viable framework for scaling data rescue efforts, bringing us one step closer to unlocking centuries of climate data for scientific analysis.
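
The trade-off between precision and inclusion rate under strict physical filtering can be illustrated with a minimal sketch. It is not the authors' implementation: the `Row` fields, the single check t_min <= t_mean <= t_max, and the exact-match definitions of precision and inclusion rate are all assumptions chosen for illustration, since the abstract does not specify which physical relationships or metric definitions were used.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    """One transcribed table row. Field names are hypothetical."""
    t_min: Optional[float]   # daily minimum temperature, deg C
    t_mean: Optional[float]  # daily mean temperature, deg C
    t_max: Optional[float]   # daily maximum temperature, deg C


def passes_physical_check(row: Row) -> bool:
    """Keep a row only if every value is present and t_min <= t_mean <= t_max.

    This ordering is an assumed example of a physical relationship
    between variables; a real pipeline would likely use several checks.
    """
    if row.t_min is None or row.t_mean is None or row.t_max is None:
        return False
    return row.t_min <= row.t_mean <= row.t_max


def precision_and_inclusion(predicted: List[Row], truth: List[Row]) -> Tuple[float, float]:
    """Return (precision, inclusion_rate) after filtering.

    Assumed definitions:
      precision      = kept rows that exactly match the reference / kept rows
      inclusion_rate = kept rows / all rows
    """
    kept = [(p, t) for p, t in zip(predicted, truth) if passes_physical_check(p)]
    inclusion = len(kept) / len(predicted) if predicted else 0.0
    precision = sum(p == t for p, t in kept) / len(kept) if kept else 0.0
    return precision, inclusion
```

In this sketch, a row with an implausible value (for example a mean above the maximum) is discarded rather than corrected, which is why stricter checks tend to raise precision while lowering the share of rows that are kept.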

How to cite: van der Schee, M., Whan, K., Peeters, T., Shapovalova, Y., van Ekris, J., Garcia Marti, I., Bergman, B., and Zaanen, K.: Using Multimodal LLMs for Digitising Handwritten Climate Records, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-2509, https://doi.org/10.5194/egusphere-egu26-2509, 2026.