EGU24-19139, updated on 09 Apr 2024
https://doi.org/10.5194/egusphere-egu24-19139
EGU General Assembly 2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Data Bridges: Modeling Marine Science Information to Heterogeneous Information Network for Research Data Management

Muhammad Asif Suryani1,2,3, Ewa Burwicz-Galerne4, Klaus Wallmann2, and Matthias Renz1
  • 1Christian-Albrecht University of Kiel, Institute of Informatik, Kiel, Germany (asifsuryani@gmail.com)
  • 2GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany
  • 3GESIS - Leibniz Institute for the Social Sciences, Köln, Germany
  • 4MARUM - Center for Marine Environmental Sciences, University of Bremen, Germany

Research Data Management (RDM) in the natural sciences provides a structured foundation for organizing and preserving scientific data. Effective management of, and access to, these diverse data sources is crucial for supporting domain scientists in future knowledge discovery. Scientific publications, a primary data source often presented in Portable Document Format (PDF), are rich in information: they contain text, tables, figures, and metadata. These components convey information individually or in combination and open up promising research directions. Exploiting them, however, requires acquiring the individual data components from the publications and applying information extraction suited to each component. Furthermore, modeling the extracted information as a Heterogeneous Information Network of publications enhances accessibility, collaboration, and information harvesting within the natural sciences domain.

We developed a comprehensive framework, designed for user accessibility and broad applicability, that models diverse information from marine science publications as a Heterogeneous Information Network (HIN). The framework comprises three modules: Data Acquisition, Information Extraction, and Information Modeling. The Data Acquisition (DA) module extracts the data components from relevant publications and transforms them into machine-readable formats. The Information Extraction (IE) module includes two sub-modules: Named Entity Recognition (NER) modules, trained on annotated marine science text, which extract eight types of entities from plain text; and an information parser, which extracts quantitative information from tabular data. The parser first detects and then extracts scientific measurements, relevant spatial information, and other available characteristics. Finally, the Information Modeling module represents the information extracted from the data components and links it across components. As a result, the information is structured into an HIN of scientific publications, which delivers diverse information to domain experts effectively and supports the Research Data Management initiative.
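
To make the information modeling step concrete, the following is a minimal sketch of how extracted entities and table measurements might be linked into a network with typed nodes and typed edges. It is not the authors' implementation. The class names (Node, HIN), the edge labels (mentions, reports, measured_at), the record layout, the entity types, and the sample DOIs and values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A typed node (hypothetical schema); frozen so it can be a dict key."""
    node_type: str  # e.g. "publication", "substance", "location", "measurement"
    key: str        # identifier that is unique within its node type


@dataclass
class HIN:
    """A tiny heterogeneous information network: typed nodes, typed edges."""
    # adjacency list: node -> list of (edge_type, neighboring node)
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def link(self, source: Node, edge_type: str, target: Node) -> None:
        # Store both directions so the network can be traversed either way.
        self.edges[source].append((edge_type, target))
        self.edges[target].append((f"inverse:{edge_type}", source))

    def neighbors(self, node: Node, node_type: str) -> list:
        # Neighbors of the requested type, in insertion order.
        return [n for _, n in self.edges.get(node, []) if n.node_type == node_type]


def build_hin(records: list) -> HIN:
    """Link NER output and parsed table measurements to their publication."""
    hin = HIN()
    for rec in records:
        pub = Node("publication", rec["doi"])
        for entity_type, text in rec["entities"]:
            hin.link(pub, "mentions", Node(entity_type, text))
        for m in rec["measurements"]:
            label = f'{m["quantity"]}={m["value"]} {m["unit"]}'
            meas = Node("measurement", label)
            hin.link(pub, "reports", meas)
            hin.link(meas, "measured_at", Node("location", m["location"]))
    return hin


# Hypothetical extraction output; DOIs, entities, and values are made up.
records = [
    {
        "doi": "10.0000/example-a",
        "entities": [("substance", "methane"), ("location", "Black Sea")],
        "measurements": [
            {"quantity": "sulfate", "value": 28.0, "unit": "mM",
             "location": "Black Sea"},
        ],
    },
    {
        "doi": "10.0000/example-b",
        "entities": [("substance", "methane")],
        "measurements": [],
    },
]

hin = build_hin(records)

# Publications connected through a shared entity node.
shared = hin.neighbors(Node("substance", "methane"), "publication")
print([p.key for p in shared])
```

Because nodes are compared by type and key, an entity found in two publications becomes a single shared node. This is what lets a query move from one publication to related ones through the entities, measurements, or locations they have in common.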

How to cite: Suryani, M. A., Burwicz-Galerne, E., Wallmann, K., and Renz, M.: Data Bridges: Modeling Marine Science Information to Heterogeneous Information Network for Research Data Management, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-19139, https://doi.org/10.5194/egusphere-egu24-19139, 2024.