EGU24-13715, updated on 09 Mar 2024
https://doi.org/10.5194/egusphere-egu24-13715
EGU General Assembly 2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Accelerating Geoscience Research: An Advanced Platform for Efficient Multimodal Data Integration from Geoscience Literature

Zhixin Guo, Jianping Zhou, Guanjie Zheng, Xinbing Wang, and Chenghu Zhou
  • Shanghai Jiao Tong University

In the era of big data science, geoscience has undergone a significant paradigm shift towards data-driven scientific discovery. This shift brings a considerable challenge: geoscience data are scattered across many sources, which complicates data collection, collation, and database construction. To address this issue, we introduce a comprehensive, publicly accessible platform that facilitates the extraction of multimodal data from geoscience literature, covering text, images, and tables. The platform also streamlines the search for targeted data and enables effective knowledge fusion. A distinctive feature is its ability to generalize Deep-Time Digital Earth data processing across domains by customizing standardized target data and keyword mapping vocabularies for each specific domain. This approach overcomes the constraints that the need for domain-specific knowledge typically imposes on data processing. The platform has been applied to diverse data sets, including mountain disaster data, global orogenic belt isotope data, and environmental pollutant data. These applications have supported substantial academic research: the development of knowledge graphs based on mountain disaster data, the establishment of a global Sm-Nd isotope database, and the detection and analysis of environmental pollutants. The platform's utility is further enhanced by its network of models, which offers a cohesive multimodal understanding of text, images, and tabular data. This functionality enables researchers to curate and regularly update their databases more efficiently. To demonstrate the platform's practical application, we highlight a case study in which Sm-Nd isotope data are compiled into a specialized database and then analyzed geographically.
The compilation process in this case study comprises PDF pre-processing, recognition of target elements, human-in-the-loop annotation, and integration of multimodal knowledge. The results consistently mirror patterns found in manually compiled data, supporting the reliability and accuracy of our automated data processing tool. As a core component of the Deep-Time Digital Earth (DDE) program, the platform has supported forty geoscience research teams and processed over 40,000 documents. This demonstrates the platform's capacity for handling large-scale data and its role in advancing geoscience research in the age of big data.
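
As an illustration of the keyword mapping vocabularies mentioned above, the following minimal Python sketch shows one way a domain-specific vocabulary could map header variants found in extracted tables onto standardized target fields. It is not the platform's implementation: the vocabulary `SM_ND_VOCABULARY`, the field names, the synonym lists, and the functions `normalize`, `build_lookup`, and `map_headers` are all hypothetical and chosen only for this example.

```python
# Hypothetical sketch: the field names and synonyms below are illustrative
# assumptions, not the platform's actual vocabulary or API.
import re

# Domain-specific vocabulary: standardized target field -> known header variants.
SM_ND_VOCABULARY = {
    "sample_id": ["sample", "sample no", "sample id"],
    "sm_ppm": ["sm", "sm (ppm)", "sm ppm"],
    "nd_ppm": ["nd", "nd (ppm)", "nd ppm"],
    "nd143_nd144": ["143nd/144nd", "143nd/144nd measured"],
}


def normalize(header):
    """Lowercase a header and collapse whitespace so variants compare equal."""
    return re.sub(r"\s+", " ", header.strip().lower())


def build_lookup(vocabulary):
    """Invert the vocabulary into a normalized-variant -> target-field table."""
    lookup = {}
    for field, variants in vocabulary.items():
        for variant in variants:
            lookup[normalize(variant)] = field
    return lookup


def map_headers(headers, vocabulary):
    """Map each raw header to its standardized field, or None if unknown."""
    lookup = build_lookup(vocabulary)
    return {header: lookup.get(normalize(header)) for header in headers}


if __name__ == "__main__":
    raw_headers = ["Sample No", "Sm  (ppm)", "143Nd/144Nd", "Rb"]
    for raw, field in map_headers(raw_headers, SM_ND_VOCABULARY).items():
        print(f"{raw!r} -> {field}")
```

In this sketch, headers that match no vocabulary entry map to `None`. In a workflow like the one described, such cases could be passed to the human-in-the-loop annotation step, and adapting the tool to another domain would amount to supplying a different vocabulary.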

How to cite: Guo, Z., Zhou, J., Zheng, G., Wang, X., and Zhou, C.: Accelerating Geoscience Research: An Advanced Platform for Efficient Multimodal Data Integration from Geoscience Literature, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-13715, https://doi.org/10.5194/egusphere-egu24-13715, 2024.