EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

GEOTEK: Extracting Marine Geological Data from Publications

Muhammad Asif Suryani1,2, Christian Beth1, Klaus Wallmann2, and Matthias Renz1
  • 1Institute of Informatik, Christian-Albrecht University of Kiel, Kiel, Germany, mas,cbe,
  • 2FE Marine Geosysteme, GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany, msuryani,

In marine geology, scientists carry out extensive experiments to measure diverse features across the globe and thereby estimate environmental changes. For example, Mass Accumulation Rate (MAR) and Sedimentation Rate (SR) are measured by marine geologists at various oceanographic locations and are widely reported in research publications, but they have not been compiled in any central database. Furthermore, every MAR and SR observation normally carries i) exact locational information (longitude and latitude), ii) the method of measurement (e.g. stratigraphy, 210Pb), iii) a numerical value and unit (e.g. 2.4 g/m2/yr), and iv) a temporal feature (e.g. hundred years ago). The contextual information attached to MAR and SR observations is heterogeneous, and manual approaches to information extraction from text are infeasible. It is also worth mentioning that MAR and SR are not reported in standard SI units.
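The four components of an observation listed above can be pictured as a single record. The following is only a minimal sketch of such a record, not the schema used by the authors: the class name `Observation`, its field names, the range checks, and the example coordinates are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """One MAR or SR observation with its context (hypothetical schema)."""
    quantity: str                   # "MAR" or "SR"
    value: float                    # numerical value, e.g. 2.4
    unit: str                       # unit as reported, e.g. "g/m2/yr"
    latitude: float                 # decimal degrees, -90..90
    longitude: float                # decimal degrees, -180..180
    method: Optional[str] = None    # e.g. "210Pb" or "stratigraphy"
    age: Optional[str] = None       # temporal feature as reported in the text

    def __post_init__(self):
        # Reject records that cannot be placed on a map or that name
        # a quantity other than the two discussed here.
        if self.quantity not in ("MAR", "SR"):
            raise ValueError(f"unknown quantity: {self.quantity!r}")
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")


# Illustrative values only; the coordinates are made up for this example.
obs = Observation("MAR", 2.4, "g/m2/yr",
                  latitude=54.33, longitude=10.18, method="210Pb")
```

Keeping the unit as reported, rather than converting it on entry, reflects the point made above that MAR and SR appear in many non-SI unit variants.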

We propose GEOTEK (Geological Text to Knowledge), a comprehensive end-to-end framework to extract targeted information from marine geology publications. The framework comprises three modules. The first module combines a document relevance model, which filters relevant sources using metadata, with a PDF extractor, which extracts text, tables, and metadata. The second module comprises two information extractors, Geo-Quantities and Geo-Spacy, both trained on text from the marine geology domain. Geo-Quantities extracts relevant numerical information from the text and covers more than 100 unit variants for MAR and SR, while Geo-Spacy extracts a set of relevant named entities as well as locational entities, which are further processed to obtain their respective geocode boundaries. The third module, the Heterogeneous Information Linking (HIL) module, processes exact spatial information from tables and captions and links it to the previously extracted measurements. Finally, all linked information is displayed in an interactive map view.
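To make the quantity-extraction step more concrete, the sketch below pulls numerical values with MAR- or SR-like units out of a sentence using a regular expression. It is not the Geo-Quantities implementation, which is trained on domain text and covers more than 100 unit variants. The function name `extract_quantities`, the handful of unit spellings, and the sample sentence are assumptions made for illustration.

```python
import re

# Hypothetical, very small subset of unit spellings.
NUMBER = r"(?P<value>\d+(?:\.\d+)?)"
TIME = r"(?:yr|kyr)"
MAR_UNIT = rf"(?:mg|g|kg)\s*/\s*(?:cm|m)\^?2\s*/\s*{TIME}"   # mass / area / time
SR_UNIT = rf"(?:mm|cm|m)\s*/\s*{TIME}"                       # length / time
PATTERN = re.compile(rf"{NUMBER}\s*(?P<unit>{MAR_UNIT}|{SR_UNIT})\b")


def extract_quantities(text):
    """Return (value, unit) pairs found in text, with whitespace removed from units."""
    return [
        (float(m.group("value")), re.sub(r"\s+", "", m.group("unit")))
        for m in PATTERN.finditer(text)
    ]


# Made-up sentence in the style of a marine geology publication.
sentence = ("The 210Pb-derived MAR was 2.4 g/m2/yr, and the "
            "sedimentation rate was 0.35 cm/kyr at the site.")
print(extract_quantities(sentence))
# [(2.4, 'g/m2/yr'), (0.35, 'cm/kyr')]
```

Note that the isotope label 210Pb is not mistaken for a measurement, because the pattern requires a recognised unit directly after the number. A hand-written pattern like this breaks down quickly as unit spellings multiply, which is the motivation for a dedicated extractor.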

How to cite: Suryani, M. A., Beth, C., Wallmann, K., and Renz, M.: GEOTEK: Extracting Marine Geological Data from Publications, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-16252, 2023.