EGU25-16268, updated on 15 Mar 2025
https://doi.org/10.5194/egusphere-egu25-16268
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
PICO | Thursday, 01 May, 08:51–08:53 (CEST)
 
PICO spot 2
Leveraging Deep Learning and Natural Language Processing for hydrogeological insights from borehole logs
Alberto Previati, Valerio Silvestri, and Giovanni Crosta
  • University of Milan - Bicocca, Department of Earth and Environmental Sciences, Milano, Italy (alberto.previati@unimib.it)

The advent of extensive digital datasets, coupled with advances in artificial intelligence (AI), is revolutionizing our ability to extract meaningful insights from complex patterns in the natural sciences. In this context, the targeted classification of textual descriptions, particularly those detailing the granulometry of unconsolidated sediments or the fracturing state of rock masses, by combining supervised deep learning with natural language processing (NLP) is a promising way to refine large-scale geological and hydrogeological models by increasing the volume of data available to them.

Several databases are replete with qualitative geological data such as borehole logs, which, while abundant, are not readily assimilated into quantitative hydrogeological modeling due to the extensive time required to process the written descriptions into operationally significant units like hydrofacies. This conversion typically necessitates expert analysis of each report but can be expedited through the application of NLP techniques rooted in AI.

The primary objectives of this research are twofold: (i) to develop a robust classification model that leverages geological descriptions alongside grain size data, and (ii) to standardize a vast array of sparse and heterogeneous stratigraphic log data for integration into large-scale hydrogeological applications.

The Po River alluvial plain in northern Italy (45,700 km²) serves as the pilot area for this study because of its homogeneous shallow subsurface geology, its dense borehole coverage and the availability of a pre-labelled training set. This research demonstrates the conversion of qualitative geological information from a very large dataset of stratigraphic logs (387,297 text descriptions from 39,265 boreholes) into a dataset of semi-quantitative information. This transformation, intended to support hydrogeological modeling, relies on an operational classification system in which a deep learning-based NLP algorithm categorizes complex geological and lithostratigraphic text descriptions according to grain size-based hydrofacies. A supervised text classification algorithm, founded on a Long Short-Term Memory (LSTM) architecture, was developed, trained and validated using 86,611 pre-labelled entries encompassing all sediment types within the study region. A word embedding technique improved the model accuracy and learning efficiency by quantifying the semantic distances among geological terms.
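
As a rough illustration of the pipeline described above (text description, word embedding, LSTM, hydrofacies class), the following minimal sketch runs the forward pass of an embedding-plus-LSTM classifier in plain Python. It is not the authors' implementation: the three class names, the English example descriptions, the layer sizes and the random, untrained weights are all assumptions made for illustration, and a real model would be trained on the pre-labelled entries with a deep learning library.

```python
import math
import random

# Hypothetical hydrofacies labels; the abstract does not list the actual classes.
CLASSES = ["gravel", "sand", "silt_clay"]

def tokenize(text):
    """Lowercase the text and split it on non-alphabetic characters."""
    word, tokens = "", []
    for ch in text.lower():
        if ch.isalpha():
            word += ch
        elif word:
            tokens.append(word)
            word = ""
    if word:
        tokens.append(word)
    return tokens

def build_vocab(descriptions):
    """Map each distinct token to an integer id; id 0 is reserved for unknown words."""
    vocab = {}
    for text in descriptions:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class TinyLSTMClassifier:
    """Embedding -> single LSTM layer -> softmax, with random (untrained) weights."""

    def __init__(self, vocab_size, embed_dim=8, hidden=6, n_classes=3, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        # One embedding vector per token id (row 0 is the unknown-word vector).
        self.embed = rand_matrix(vocab_size + 1, embed_dim, rng)
        # One weight matrix per gate (input, forget, output, candidate),
        # each acting on the concatenation [h_prev, x].
        self.gates = {g: rand_matrix(hidden, hidden + embed_dim, rng) for g in "ifoc"}
        self.bias = {g: [0.0] * hidden for g in "ifoc"}
        self.out = rand_matrix(n_classes, hidden, rng)

    def step(self, x, h, c):
        """One LSTM time step: returns the new hidden state and cell state."""
        z = h + x  # list concatenation [h_prev, x]
        pre = {g: [a + b for a, b in zip(matvec(self.gates[g], z), self.bias[g])]
               for g in "ifoc"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["c"]]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, cand)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def predict_proba(self, ids):
        """Read the token ids in order and return softmax class probabilities."""
        h = [0.0] * self.hidden
        c = [0.0] * self.hidden
        for idx in ids:
            h, c = self.step(self.embed[idx], h, c)
        logits = matvec(self.out, h)
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

# Made-up example descriptions, used only to build a toy vocabulary.
descriptions = ["Coarse gravel with sand", "Grey silty clay"]
vocab = build_vocab(descriptions)
model = TinyLSTMClassifier(vocab_size=len(vocab), n_classes=len(CLASSES))
ids = [vocab.get(tok, 0) for tok in tokenize("sandy gravel")]
probs = model.predict_proba(ids)
```

Because the weights are untrained, `probs` carries no geological meaning here; the sketch only shows how a description is tokenized, mapped to embedding vectors and read token by token by the LSTM before a softmax layer scores each class.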

The outcome of this work is a novel dataset of semi-quantitative hydrogeological information, produced by a classification model with an accuracy of 97.4%. This dataset was incorporated into large-scale modeling frameworks, enabling the assignment of hydrogeological parameters based on grain size data while integrating the uncertainty stemming from misclassification. The classified logs markedly increased the spatial density of available information, from 0.34 data points/km² to 8.7 data points/km². The study findings align closely with the existing literature, offering a robust spatial reconstruction of hydrofacies at different scales. This has significant implications for groundwater research, particularly for quantitative modeling at the regional scale.
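
The abstract does not specify how misclassification uncertainty is carried into the hydrogeological parameters. One possible approach, sketched below under stated assumptions, is to weight a per-class parameter by the classifier's class probabilities and report the resulting mean and spread. The class names and the log10 hydraulic conductivity values are illustrative placeholders, not values from the study.

```python
import math

# Placeholder log10 hydraulic conductivities (m/s) per hypothetical hydrofacies.
# These are order-of-magnitude illustrations, not values from the study.
LOG10_K = {"gravel": -3.0, "sand": -4.0, "silt_clay": -8.0}

def expected_log10_k(probs):
    """Return the mean and standard deviation of log10(K) under class probabilities."""
    total = sum(probs.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError("class probabilities must sum to 1")
    mean = sum(p * LOG10_K[c] for c, p in probs.items())
    var = sum(p * (LOG10_K[c] - mean) ** 2 for c, p in probs.items())
    return mean, math.sqrt(var)

# A description the classifier cannot separate between gravel and sand.
mean, std = expected_log10_k({"gravel": 0.5, "sand": 0.5, "silt_clay": 0.0})
```

A confidently classified interval yields a standard deviation near zero, while an ambiguous description spreads the estimate across the candidate classes, which is one way the classifier's uncertainty could be passed on to a groundwater model.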

How to cite: Previati, A., Silvestri, V., and Crosta, G.: Leveraging Deep Learning and Natural Language Processing for hydrogeological insights from borehole logs, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-16268, https://doi.org/10.5194/egusphere-egu25-16268, 2025.