EGU25-578, updated on 14 Mar 2025
https://doi.org/10.5194/egusphere-egu25-578
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
PICO | Thursday, 01 May, 10:49–10:51 (CEST)
 
PICO spot 2, PICO2.3
“Old Texts, New Tech, Better Theory”: Applying Machine Learning to Textual Weather Data from Historical Ship Logbooks 
Livia Stein Freitas1,2, Theo Carr1,3, Tessa Giacoppo1,4, Timothy Walker1,5, and Caroline Ummenhofer1
Livia Stein Freitas et al.
  • 1Department of Physical Oceanography, Woods Hole Oceanographic Institution, Woods Hole, MA, USA
  • 2Department of Computer Science, Grinnell College, Grinnell, IA, USA
  • 3Massachusetts Institute of Technology–Woods Hole Oceanographic Institution Joint Program in Oceanography/Applied Ocean Science and Engineering, Cambridge and Woods Hole, MA, USA
  • 4Department of Earth and Environmental Science, Dalhousie University, Halifax, NS, CA
  • 5Department of History, University of Massachusetts Dartmouth, Dartmouth, MA, USA

During oceanic expeditions, pre-modern sailors meticulously recorded their longitude and latitude, the local wind conditions, and the state of the sea. For a long time, before precision instrumentation was available, sailors recorded wind speed qualitatively rather than quantitatively (e.g., “light breeze” instead of 5 meters/second). This textual data therefore requires additional processing before it can be compared with modern instrumental data or reanalysis products. In particular, the phrases used in wind descriptions can be classified using the Beaufort Wind Force Scale (codified in 1805), which consists of thirteen base wind force levels, each assigned a numerical value. Manually categorizing all the distinct variations in the wind information can be ambiguous and time-consuming. Because of the importance of historical weather data for climate science, we investigated whether machine learning could speed up this process while still producing accurate results.
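
As a rough illustration of why manual categorization struggles with phrase variations, the sketch below maps wind phrases to Beaufort numbers by exact lookup after light normalization. It is not the authors' pipeline: the term list uses the modern Beaufort names for forces 0 to 12, the helper names `normalize` and `beaufort_force` are hypothetical, and the example phrases are invented for illustration rather than taken from the logbook dataset.

```python
import re

# Modern Beaufort names for forces 0-12 (thirteen levels).
# Assumption: historical logbooks may use older or variant wording.
BEAUFORT = {
    "calm": 0, "light air": 1, "light breeze": 2, "gentle breeze": 3,
    "moderate breeze": 4, "fresh breeze": 5, "strong breeze": 6,
    "near gale": 7, "gale": 8, "strong gale": 9, "storm": 10,
    "violent storm": 11, "hurricane": 12,
}


def normalize(phrase):
    """Lowercase, replace non-letters with spaces, collapse whitespace."""
    phrase = re.sub(r"[^a-z\s]", " ", phrase.lower())
    return " ".join(phrase.split())


def beaufort_force(phrase):
    """Return the Beaufort number for an exact term match, else None."""
    return BEAUFORT.get(normalize(phrase))


# Invented example phrases (not from the dataset).
examples = ["Light breeze", "fresh  Breeze.", "light airs and baffling"]
results = {p: beaufort_force(p) for p in examples}
```

Exact lookup handles capitalization and punctuation but returns `None` for any variant wording, such as the third phrase, which is the gap that manual or automated classification has to fill.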

Using a novel dataset of >100,000 (sub)daily maritime weather recordings from historical whaling ship logbooks housed across New England archives and covering the period 1820–1890, here we show that k-means nearest neighbors and density-based spatial clustering models, while efficient, generate outputs with reduced accuracy when compared to the data classified by humans. However, the quality of the clustering improves noticeably when we introduce the Beaufort Wind Force Scale’s thirteen categories as starting centroids. These results show that machine learning could be a useful tool for wind term processing and that well-placed human input improves the accuracy of outcomes. Cross-validation methods are also employed to help with the interpretability of the machine learning models used. Additionally, various neural network clustering models are evaluated for their efficacy, such as a two sliding windows text GNN-based (TSW-GNN) model, since its graph-based approach has demonstrated improved accuracy in classifying textual data compared with language representation models.
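
The idea of using the thirteen Beaufort categories as starting centroids can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how phrases are vectorized, so the sketch assumes character-trigram counts and a simplified k-means-style loop that assigns by cosine similarity. The function names and the example phrases are hypothetical.

```python
import math
from collections import Counter

# Seed terms ordered so that list index == Beaufort number (0-12).
BEAUFORT_TERMS = [
    "calm", "light air", "light breeze", "gentle breeze", "moderate breeze",
    "fresh breeze", "strong breeze", "near gale", "gale", "strong gale",
    "storm", "violent storm", "hurricane",
]


def trigrams(text):
    """Character-trigram counts (assumed featurization, not from the source)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {k: c / len(vectors) for k, c in total.items()}


def seeded_kmeans(phrases, seeds, iterations=10):
    """K-means-style clustering whose initial centroids are the seed terms."""
    vectors = [trigrams(p) for p in phrases]
    centroids = [dict(trigrams(s)) for s in seeds]
    labels = None
    for _ in range(iterations):
        # Assignment step: most similar centroid for each phrase.
        new_labels = [
            max(range(len(centroids)), key=lambda j: cosine(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: empty clusters keep their seed centroid.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels


# Invented example phrases (not from the dataset).
logbook_phrases = ["light breeze", "light breezes", "strong gales", "calm"]
labels = seeded_kmeans(logbook_phrases, BEAUFORT_TERMS)
```

Because the seeds are ordered by force, each returned label can be read directly as a Beaufort number. Clusters that attract no phrases simply retain their seed centroid, so all thirteen categories remain available in later iterations.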

How to cite: Stein Freitas, L., Carr, T., Giacoppo, T., Walker, T., and Ummenhofer, C.: “Old Texts, New Tech, Better Theory”: Applying Machine Learning to Textual Weather Data from Historical Ship Logbooks, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-578, https://doi.org/10.5194/egusphere-egu25-578, 2025.