ESSI1.2

Spatio-temporal Data Science: Theoretical Advances and Applications in AI and ML

Co-organized by GI2/NP4, co-sponsored by AGU

Convener: Christopher Kadow (ECS) | Co-conveners: Jens Klump, Luigi Lombardo, Federico Amato (ECS), Ge Peng

Presentations: Tue, 24 May, 10:20–11:50 (CEST) | Room 0.31/32

Chairpersons: Christopher Kadow, Luigi Lombardo, Ge Peng

10:20–10:26

EGU22-10799

On-site presentation

Design Considerations for the 3rd Spatial Dimension of the Spatiotemporal Adaptive Resolution Encoding (STARE)

Michael Rilee and Kwo-Sen Kuo

The real world does not live on a regular grid. The observations with the best spatiotemporal resolution are generally irregularly distributed over space and time, even though as data they are generally stored in arrays in files. Storing the diverse data types of Earth science, including grid, swath, and point based spatiotemporal distributions, in separate files leads to computer-native array layouts on disk or working memory having little or no connection with the spatiotemporal layout of the observations themselves. For integrative analysis, data must be co-aligned both spatiotemporally and in computer memory, a process called data harmonization. For data harmonization to be scalable in both diversity and volume, data movement must be minimized. The SpatioTemporal Adaptive Resolution Encoding (STARE) is a hierarchical, recursively subdivided indexing scheme for harmonizing diverse data at scale.

STARE indices are integers embedded with spatiotemporal attributes key to efficient spatiotemporal analysis. As a more computationally efficient alternative to conventional floating-point spatiotemporal references, STARE indices apply uniformly to all spatiotemporal data regardless of their geometric layouts. Through this unified reference, STARE harmonizes diverse data in their native states to enable integrative analysis without requiring homogenization of the data by interpolating them to a common grid first.
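
The idea of a hierarchical integer index can be illustrated with a toy example. The sketch below is not STARE's actual encoding: it assumes a plain longitude-latitude quadtree, spends two bits per subdivision level, and keeps the resolution level in the low five bits. The function names `quadtree_index` and `contains` are hypothetical. It only shows why containment between cells of different resolutions reduces to integer shifts and comparisons instead of floating-point geometry.

```python
def quadtree_index(lon, lat, level):
    """Toy hierarchical index (NOT the STARE encoding).

    Two bits per subdivision level record the quadrant path;
    the low 5 bits store the resolution level (0..31).
    """
    west, east, south, north = -180.0, 180.0, -90.0, 90.0
    path = 0
    for _ in range(level):
        mid_lon = (west + east) / 2.0
        mid_lat = (south + north) / 2.0
        quadrant = 0
        if lon >= mid_lon:
            quadrant |= 1
            west = mid_lon
        else:
            east = mid_lon
        if lat >= mid_lat:
            quadrant |= 2
            south = mid_lat
        else:
            north = mid_lat
        path = (path << 2) | quadrant
    return (path << 5) | level


def contains(coarse, fine):
    """True if the cell `coarse` contains the cell `fine`."""
    coarse_level, fine_level = coarse & 31, fine & 31
    if coarse_level > fine_level:
        return False
    # Drop the extra levels of the finer path and compare prefixes.
    return (fine >> 5) >> (2 * (fine_level - coarse_level)) == (coarse >> 5)


region = quadtree_index(16.37, 48.21, 4)    # coarse cell around Vienna
station = quadtree_index(16.37, 48.21, 10)  # finer cell at the same point
elsewhere = quadtree_index(-120.0, -30.0, 10)
print(contains(region, station), contains(region, elsewhere))
```

Because both data sets would carry the same kind of integer, a grid cell, a swath footprint, and a point observation could be matched by this prefix test without first interpolating to a common grid.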

The current implementation of STARE supports solid angle indexing, i.e. longitude-latitude, and time. To fully support Earth science applications, STARE must be extended to index the radial dimension for full 4D spatiotemporal indexing. As STARE’s scalability is based on having a universal encoding scheme mapping spatiotemporal volumes to integers, the variety of existing approaches to encoding the radial dimension in Earth science raises complex design issues for applying STARE’s principles. For example, the radial dimension can be usefully expressed via length (altitude) or pressure coordinates. Both length and pressure raise the question of which reference surface should be used. As STARE’s goal is to harmonize different kinds of data, we must determine whether it is better to have separate radial scale encodings for length and pressure, or a single radial encoding accompanied by tools for translating between the various radial coordinate systems. The questions become more complex when we consider the wide range of Earth science data and applications, including, for example, model simulation output, lidar point clouds, spacecraft swath data, aircraft in-situ measurements, vertical or oblique parameter retrievals, and earthquake-induced movement detection.

In this work, we will review STARE’s unifying principle and the unique nature of the radial dimension. We will discuss the challenges of enabling scalable Earth science data harmonization in both diversity and volume, particularly in the context of detection, cataloging, and statistical study of fully 4D hierarchical phenomena such as extratropical cyclones. With the twin challenges of exascale computing and increasing model simulation resolutions opening new views into physical processes, scalable methods for bringing best-resolution observations and simulations together, like STARE, are becoming increasingly important.

How to cite: Rilee, M. and Kuo, K.-S.: Design Considerations for the 3rd Spatial Dimension of the Spatiotemporal Adaptive Resolution Encoding (STARE), EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10799, https://doi.org/10.5194/egusphere-egu22-10799, 2022.

10:26–10:32

EGU22-1346

On-site presentation

Enhance pluvial flood risk assessment using spatio-temporal machine learning models

Andrea Critto, Marco Zanetti, Elena Allegri, Anna Sperotto, and Silvia Torresan

Extreme weather events (e.g., heavy rainfall) are natural hazards that pose increasing threats to many sectors and across sub-regions worldwide (IPCC, 2014), exposing people and assets to damaging effects. In order to predict pluvial flood risks under different spatio-temporal conditions, three generalized Machine Learning models were developed and applied to the Metropolitan City of Venice: Logistic Regression, Neural Networks and Random Forest. The models considered 60 historical pluvial flood events that occurred between 1995 and 2020. The historical events helped to identify and prioritize sub-areas that are more likely to be affected by pluvial flood risk due to heavy precipitation. In addition, while developing the models, 13 triggering factors were selected and assessed: aspect, curvature, distance to river, distance to road, distance to sea, elevation, land use, NDVI, permeability, precipitation, slope, soil, and texture. A forward feature selection method based on the AUC score was applied to understand which features best limit spatio-temporal overfitting in pluvial flood prediction. Results of the analysis showed that the most accurate models were obtained with the Logistic Regression approach, which was used to provide pluvial flood risk maps for each of the 60 major historical events that occurred in the case study area. The model showed high accuracy, and most of the events that occurred in the Metropolitan City of Venice were properly predicted, demonstrating that Machine Learning could substantially improve and speed up disaster risk assessment and mapping, helping to overcome the most common bottlenecks of physically-based simulations, such as computational complexity and the need for large datasets of high-resolution information.
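
Forward feature selection, as mentioned above, can be sketched as a greedy loop. This is a minimal illustration, not the authors' implementation: `forward_selection` is a hypothetical helper, and the `score` callable stands in for whatever validation score is used (for example, an AUC obtained from spatio-temporal cross-validation). The toy score at the bottom is invented purely to make the example runnable.

```python
def forward_selection(candidates, score, min_gain=0.0):
    """Greedy forward feature selection.

    candidates: iterable of feature names.
    score: callable taking a list of feature names and returning a
           validation score where higher is better (e.g. a cross-validated AUC).
    Stops when adding any remaining feature improves the score by
    no more than `min_gain`.
    """
    selected = []
    best = float("-inf")
    remaining = list(candidates)
    while remaining:
        trial = {name: score(selected + [name]) for name in remaining}
        top = max(trial, key=trial.get)
        if selected and trial[top] - best <= min_gain:
            break
        selected.append(top)
        best = trial[top]
        remaining.remove(top)
    return selected, best


# Invented toy score: each feature adds a fixed gain, each extra
# feature costs a small penalty (a stand-in for overfitting).
GAIN = {"precipitation": 0.20, "elevation": 0.10, "slope": 0.05, "aspect": 0.0}

def toy_score(features):
    return 0.5 + sum(GAIN[f] for f in features) - 0.01 * len(features)

chosen, final_score = forward_selection(GAIN, toy_score)
print(chosen, round(final_score, 2))
```

Scoring each candidate set on held-out locations and periods, rather than on random splits, is what would let such a loop reject features that only help the model memorise the training events.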

How to cite: Critto, A., Zanetti, M., Allegri, E., Sperotto, A., and Torresan, S.: Enhance pluvial flood risk assessment using spatio-temporal machine learning models, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-1346, https://doi.org/10.5194/egusphere-egu22-1346, 2022.

10:32–10:38

EGU22-10823

ECS

Virtual presentation

Scalable Feature Extraction and Tracking (SCAFET): A general framework for feature extraction from large climate datasets

Arjun Nellikkattil, June-Yi Lee, and Axel Timmermann

The study describes a generalized framework to extract and track features from large climate datasets. Unlike other feature extraction algorithms, Scalable Feature Extraction and Tracking (SCAFET) is independent of any physical thresholds, making it more suitable for comparing features from different datasets. Features of interest are extracted by segmenting the data on the basis of a scale-independent bounded variable called the shape index (Si). Si gives a quantitative measurement of the local shape of the field with respect to its surroundings. To illustrate the capabilities of the method, we have employed it in the extraction of different types of features. Cyclones and atmospheric rivers are extracted from the ERA5 reanalysis dataset to show how the algorithm extracts points as well as surfaces from climate datasets. Extraction of sea surface temperature fronts shows how SCAFET handles unstructured grids. Lastly, the 3D structures of jet streams are extracted to demonstrate that the algorithm can extract 3D features too. The detection algorithm is implemented as a Jupyter notebook (https://colab.research.google.com/drive/1D0rWNQZrIfLEmeUYshzqyqiR7QNS0Hm-?usp=sharing), accessible to anyone who wants to test the algorithm.
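
A bounded, scale-independent shape measure of this kind can be sketched as follows. The abstract does not give the formula, so the code assumes the shape index of Koenderink and van Doorn, computed from the two eigenvalues of the local Hessian of the field; whether SCAFET uses exactly this form and sign convention is an assumption. The function name `shape_index` is hypothetical.

```python
import math

def shape_index(k1, k2):
    """Shape index from the two Hessian eigenvalues (assumed definition).

    Si = (2 / pi) * arctan((lo + hi) / (lo - hi)), with hi >= lo.
    Returns a value in [-1, 1], or None where the field is locally flat.
    With this convention: +1 dome (maximum), +0.5 ridge, 0 saddle,
    -0.5 valley, -1 cup (minimum).
    """
    hi, lo = max(k1, k2), min(k1, k2)
    if hi == 0 and lo == 0:
        return None  # flat: the shape index is undefined
    # atan2 handles hi == lo (infinite ratio) without a division by zero.
    return (2.0 / math.pi) * math.atan2(-(hi + lo), hi - lo)


for name, eigenvalues in [("dome", (-1.0, -1.0)), ("ridge", (0.0, -1.0)),
                          ("saddle", (1.0, -1.0)), ("valley", (1.0, 0.0)),
                          ("cup", (1.0, 1.0))]:
    print(name, shape_index(*eigenvalues))
```

Multiplying both eigenvalues by the same positive factor leaves the result unchanged, which is why a segmentation on Si needs no physical threshold tied to the units or amplitude of a particular dataset.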

How to cite: Nellikkattil, A., Lee, J.-Y., and Timmermann, A.: Scalable Feature Extraction and Tracking (SCAFET): A general framework for feature extraction from large climate datasets, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10823, https://doi.org/10.5194/egusphere-egu22-10823, 2022.

10:38–10:44

EGU22-3855

Virtual presentation

CGC: an open-source Python module for geospatial data clustering

Ou Ku, Francesco Nattino, Meiert Grootes, Emma Izquierdo-Verdiguier, Serkan Girgin, and Raul Zurita-Milla

10:44–10:50

EGU22-3940

Virtual presentation

The Analysis of the Aftershock Sequence of the Recent Mainshock in Arkalochori, Crete Island Greece

Alexandra Moshou, Antonios Konstantaras, and Panagiotis Argyrakis

Forecasting the evolution of natural hazards is a critical problem in the natural sciences. Earthquake forecasting is one such example and is a difficult task due to the complexity of earthquake occurrence. To date, earthquake prediction has relied on the period before the main earthquake and mainly on empirical methods, specifically on the seismic history of a given area. The analysis and processing of an area's seismicity play a critical role in modern statistical seismology. In this work, a first attempt is made to study the seismic sequence and draw safe conclusions regarding its prediction, using appropriate statistical methods such as Bayesian predictive methods and taking into account the uncertainties of the model parameters. This approach was applied to the recent seismic sequence in the area of Arkalochori, Crete Island, Greece (2021, M_w 6.0). The rich seismic sequence that followed immediately after the main 5.6R earthquake, with approximately 4,000 events of magnitude M_L > 1 in the next three months, made it possible to calculate the probability of the largest expected earthquake occurring within a given time, as well as the probability that the largest aftershock following a major earthquake exceeds a certain magnitude.

How to cite: Moshou, A., Konstantaras, A., and Argyrakis, P.: The Analysis of the Aftershock Sequence of the Recent Mainshock in Arkalochori, Crete Island Greece, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-3940, https://doi.org/10.5194/egusphere-egu22-3940, 2022.

10:50–10:56

EGU22-5487

Virtual presentation

3D Mapping of Active Underground Faults Enabled by Heterogeneous Parallel Processing Spatio-Temporal Proximity and Clustering Algorithms

Alexandra Moshou, Antonios Konstantaras, Nikitas Menounos, and Panagiotis Argyrakis

Underground faults act as storage elements for the strain energy accumulated in the border areas of active tectonic plates. Particularly along the southern front of the Hellenic seismic arc, strain energy accumulates at a steady yearly rate owing to the constant rate of motion at which the African plate subducts beneath the Eurasian plate. Partial release of the stored energy from a particular underground fault manifests as an earthquake once it reaches the surface of the Earth’s crust. The information obtained for each recorded earthquake includes, among others, the surface location and the estimated hypocentre depth. Considering that hundreds of thousands of earthquakes have been recorded in that particular area, the accumulated hypocentre depths provide a most valuable source of information regarding the in-depth extent of the seismically active parts of the underground faults. This research work applies expert-knowledge spatio-temporal clustering to previously reported distinct seismic cluster zones, aiming to associate each individual main earthquake, along with its recorded foreshocks and aftershocks, with a single underground fault in existing two-dimensional mappings. This process is enabled by heterogeneous parallel processing algorithms encompassing both proximity and agglomerative density-based clustering algorithms, applied only to the main seismic events to be mapped. Once a main earthquake has been associated with a particular known underground fault, the point on that fault closest to the earthquake’s hypocentre has its location parameters extended, adding the dimension of depth to the initial planar dimensions of latitude and longitude. The ranges of depth variation provide a notable indication of the in-depth extent of the seismically active part(s) of underground faults, enabling their 3D model mapping.

Indexing terms: spatio-temporal proximity and clustering algorithms, heterogeneous parallel processing, CUDA, 3D underground faults’ mapping

How to cite: Moshou, A., Konstantaras, A., Menounos, N., and Argyrakis, P.: 3D Mapping of Active Underground Faults Enabled by Heterogeneous Parallel Processing Spatio-Temporal Proximity and Clustering Algorithms, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-5487, https://doi.org/10.5194/egusphere-egu22-5487, 2022.

10:56–11:02

EGU22-6955

On-site presentation

Novel approaches to model assessment and interpretation in geospatial machine learning

Alexander Brenning

As the interpretability and explainability of artificial intelligence decisions have been gaining attention, novel approaches are needed to develop diagnostic tools that account for the unique challenges of geospatial and environmental data, including spatial dependence and high dimensionality, which are addressed in this contribution. Building upon the geostatistical tradition of distance-based measures, spatial prediction error profiles (SPEPs) and spatial variable importance profiles (SVIPs) are introduced as novel model-agnostic assessment and interpretation tools that explore the behavior of models at different prediction horizons. Moreover, to address the challenges of interpreting the joint effects of strongly correlated or high-dimensional features, often found in environmental modeling and remote sensing, a model-agnostic approach is developed that distills aggregated relationships from complex models. The utility of these techniques is demonstrated in two case studies representing a regionalization task in an environmental-science context, and a classification task from multitemporal remote sensing of land use. In these case studies, SPEPs and SVIPs successfully highlight differences and surprising similarities of geostatistical methods, linear models, random forest, and hybrid algorithms. With 64 correlated features in the remote-sensing case study, the transformation-based interpretation approach successfully summarizes high-dimensional relationships in a small number of diagrams.

The novel diagnostic tools enrich the toolkit of geospatial data science, and may improve machine-learning model interpretation, selection, and design.

How to cite: Brenning, A.: Novel approaches to model assessment and interpretation in geospatial machine learning, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-6955, https://doi.org/10.5194/egusphere-egu22-6955, 2022.

11:02–11:08

EGU22-8648

ECS

Presentation form not yet defined

A graph-based fractality index to characterize complexity of urban form using deep graph convolutional neural networks

Lei Ma, Stefan Seipel, S. Anders Brandt, and Ding Ma

11:08–11:14

EGU22-7529

ECS

Presentation form not yet defined

Global maps from local data: Towards globally applicable spatial prediction models

Marvin Ludwig, Álvaro Moreno Martínez, Norbert Hölzel, Edzer Pebesma, and Hanna Meyer

Global-scale maps are an important tool to provide ecologically relevant environmental variables to researchers and decision makers. Usually, these maps are created by training a machine learning algorithm on field-sampled reference data and applying the resulting model to associated information from satellite imagery or globally available environmental predictors. However, field samples are often sparse and clustered in geographic space, representing only parts of the global environment. Machine learning models are therefore prone to overfit to the specific environments they are trained on, especially when a large set of predictor variables is utilized. Consequently, model validations have to include an analysis of the models' transferability to regions where no training samples are available, e.g. by computing the Area of Applicability (AOA, Meyer and Pebesma 2021) of the prediction models.

Here we reproduce three recently published global environmental maps (soil nematode abundances, potential tree cover and specific leaf area) and assess their AOA. We then present a workflow to increase the AOA (i.e. transferability) of the machine learning models. The workflow utilizes spatial variable selection in order to train generalized models which include only the predictors that are most suitable for predictions in regions without training samples. We compared the results to the three original studies in terms of prediction performance and AOA. Results indicate that reducing the predictors to those relevant for spatial prediction leads to a significant increase in model transferability without a significant decrease in prediction quality in areas with high sampling density.

Meyer, H. & Pebesma, E. Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution 2041–210X.13650 (2021) doi:10.1111/2041-210X.13650.

How to cite: Ludwig, M., Moreno Martínez, Á., Hölzel, N., Pebesma, E., and Meyer, H.: Global maps from local data: Towards globally applicable spatial prediction models, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7529, https://doi.org/10.5194/egusphere-egu22-7529, 2022.

11:14–11:20

EGU22-8891

ECS

Presentation form not yet defined

Infilling Spatial Precipitation Recordings with a Memory-Assisted CNN

Johannes Meuer, Laurens Bouwer, Étienne Plésiat, Roman Lehmann, Markus Hoffmann, Thomas Ludwig, Wolfgang Karl, and Christopher Kadow

Missing climate data is a widespread problem in climate science and leads to uncertainty in prediction models that rely on these data resources. So far, existing approaches for infilling missing precipitation data are mostly numerical or statistical techniques that require considerable computational resources and are not suitable for large regions with missing data. Most recently, there have been several approaches to infill missing climate data with machine learning methods such as convolutional neural networks or generative adversarial networks. They have proven to perform well on infilling missing temperature or satellite data. However, these techniques consider only spatial variability in the data, whereas precipitation data is much more variable in both space and time. Rainfall extremes with high amplitudes play an important role. We propose a convolutional inpainting network that additionally considers a memory module. One approach investigates the temporal variability in the missing data regions using a long short-term memory (LSTM). An attention-based module has also been added to the technology to consider further atmospheric variables provided by reanalysis data. The model was trained and evaluated on the RADOLAN data set, which is based on radar precipitation recordings and weather station measurements. With this method, we are able to fill gaps in this high-quality, highly resolved spatial precipitation data set over Germany. In conclusion, we compare our approach to statistical techniques for infilling precipitation data as well as other state-of-the-art machine learning techniques. This well-combined technology of computer and atmospheric research components will be presented as a dedicated climate service component and data set.

How to cite: Meuer, J., Bouwer, L., Plésiat, É., Lehmann, R., Hoffmann, M., Ludwig, T., Karl, W., and Kadow, C.: Infilling Spatial Precipitation Recordings with a Memory-Assisted CNN, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-8891, https://doi.org/10.5194/egusphere-egu22-8891, 2022.

11:20–11:26

EGU22-13102

ECS

Virtual presentation

Reading Between the (Shore)Lines: Real-Time Analytical Processing to Monitor Coastal Erosion

Zach Anthis

With the far-reaching impact of Artificial Intelligence (AI) increasingly acknowledged across various dimensions and industries, the Geomatics scientific community has reasonably turned to automated (in some cases, autonomous) solutions to efficiently extract and communicate patterns in high-dimensional geographic data. This, in turn, has led to a range of AI platforms providing grounds for cutting-edge technologies such as data mining, image processing and predictive/prescriptive modelling. Meanwhile, coastal management bodies around the world are striving to harness the power of AI and Machine Learning (ML) applications to act upon the wealth of coastal information emanating from disparate data sources (e.g., geodesy, hydrography, bathymetry, mapping, remote sensing, and photogrammetry). The cross-disciplinarity of stakeholder engagement calls for thorough risk assessment and coastal defence strategies (e.g., erosion/flooding control), consistent with the emerging need for participatory and integrated policy analyses. This paper addresses the issue of seeking techno-centric solutions in human-understandable language for holistic knowledge engineering (from acquisition to dissemination) in a spatiotemporal context; namely, it presents the benefits of setting up a unified Visual Analytics (VA) system that allows for real-time monitoring and on-demand Online Analytical Processing (OLAP) operations via role-based access. Working from an all-encompassing data model could form seamlessly collaborative workspaces that support multiple programming languages (packaging ML libraries designed to interoperate) and enable heterogeneous user communities to visualize Big Data at different granularities, as well as perform task-specific queries with little or no programming skill.

The proposed solution is an integrated coastal management dashboard, built natively for the cloud (i.e., leveraging batch and stream processing), to dynamically host live Key Performance Indicators (KPIs) whilst ensuring wide adoption and sustainable operation. The results reflect the value of effectively collecting and consolidating coastal (meta-)data into open repositories to jointly produce actionable insight in an efficient manner.

How to cite: Anthis, Z.: Reading Between the (Shore)Lines: Real-Time Analytical Processing to Monitor Coastal Erosion, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13102, https://doi.org/10.5194/egusphere-egu22-13102, 2022.

11:26–11:32

EGU22-8323

Virtual presentation

Multi-attribute geolocation inference from tweets

Umair Qazi, Ferda Ofli, and Muhammad Imran

Geotagged social media messages, especially from Twitter, can have a substantial impact on decision-making processes during natural hazards and disasters. For example, such geolocation information can be used to enhance natural hazard detection systems where real-time geolocated tweets can help identify the critical human-centric hotspots of an emergency where urgent help is required.

Our work can extract geolocation information from tweets by making use of five meta-data attributes provided by Twitter. Three of these are free-form text, namely tweet text, user profile description, and user location. The other two attributes are GPS coordinates and place tags.

Tweet text may or may not contain information relevant for extracting geolocation. Where location information is available within the tweet text, we extract toponyms from the text using Named Entity Recognition and Classification (NERC). The extracted toponyms are then used to obtain geolocation information using Nominatim (the open-source geocoding software that powers OpenStreetMap) at various levels such as country, state, county, and city.

A similar process is followed for the user profile description, where only location toponyms identified by NERC are stored and then geocoded using Nominatim at various levels.

The user location field, which is also free-form text, can mention multiple locations, such as USA, UK. To extract a location from this field, a heuristic algorithm based on a ranking mechanism is adopted, which resolves the field to a single point location that can then be mapped at various levels such as country, state, county, and city.

GPS coordinates provide the exact longitude and latitude of the device's location. We perform reverse geocoding to obtain additional location details, e.g., street, city, or country the GPS coordinates belong to. For this purpose, we use Nominatim’s reverse API endpoint to extract city, county, state, and country information.

The place tag provides a bounding box, an exact longitude and latitude, or name information for the location tagged by the user. The place field data contains several location attributes. We extract location information from the different location attributes within the place using different algorithms. We use Nominatim’s search API endpoint to extract city, county, state, and country names from the Nominatim response, if available.

Our geo-inference pipeline is designed to be used as a plug-in component. The system spans an Elasticsearch cluster with six nodes for efficient and fast querying and insertion of records. It has already been tested on geolocating more than two billion COVID-related tweets. The system is able to handle high insertion and query loads. We have implemented smart caching mechanisms to avoid repetitive Nominatim calls, since these are expensive operations. The caches are available both for free-form text (Nominatim’s search API) and for exact latitude and longitude (Nominatim’s reverse API). These caches help reduce the load on Nominatim and give quick access to the most commonly queried terms.
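
The caching idea described above can be sketched with the Python standard library. This is an illustration only, not the pipeline's actual cache: `make_cached_geocoder` is a hypothetical helper, the `geocode` argument stands in for any function that calls a geocoding service, and the whitespace and case normalisation of the key is an assumption. The stub below replaces the real network call so that the example is self-contained.

```python
from functools import lru_cache

def make_cached_geocoder(geocode, maxsize=100_000):
    """Wrap an expensive geocoding callable with an in-memory LRU cache.

    geocode: callable taking a query string and returning a result
             (hypothetical stand-in for a call to a geocoding service).
    """
    @lru_cache(maxsize=maxsize)
    def _lookup(normalised):
        return geocode(normalised)

    def cached(text):
        # Normalise case and whitespace so trivially different spellings
        # of the same query share one cache entry (an assumption).
        return _lookup(" ".join(text.lower().split()))

    cached.cache_info = _lookup.cache_info
    return cached


# Stub in place of a real service call, counting how often it is hit.
calls = []

def fake_geocode(query):
    calls.append(query)
    return ("Austria", "Vienna")

lookup = make_cached_geocoder(fake_geocode)
lookup("Vienna, Austria")
lookup("  vienna,   AUSTRIA ")
print(len(calls), lookup.cache_info().hits)
```

A production cache for billions of records would more plausibly live in a shared store than in process memory, but the principle is the same: repeated queries for popular place names never reach the geocoder.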

With this effort, we hope to provide the necessary means for researchers and practitioners who intend to explore social media data for geo-applications.

How to cite: Qazi, U., Ofli, F., and Imran, M.: Multi-attribute geolocation inference from tweets, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-8323, https://doi.org/10.5194/egusphere-egu22-8323, 2022.

11:32–11:38

EGU22-3131

Presentation form not yet defined

Language model for Earth science for semantic search

Rahul Ramachandran, Muthukumaran Muthukumaran Ramasubramanian, Prasanna Koirala, Iksha Gurung, and Manil Maskey

Recent advances in technology have transformed the Natural Language Technology (NLT) landscape, specifically through the use of transformers to build language models such as BERT and GPT-3. Furthermore, it has been shown that the quality and the domain-specificity of the input corpus to language models can improve downstream application results. However, few efforts in Earth science research have focused on building and using a domain-specific language model.

We utilize a transfer learning solution that uses an existing language model trained for general science (SciBERT) and fine-tune it using abstracts and full text extracted from various Earth science journals to create BERT-E (BERT for Earth Science). The training process utilized the input of 270k+ Earth science articles with almost 6 million paragraphs. We used Masked Language Modeling (MLM) to train the transformer model. MLM works by masking random words in the paragraph and optimizing the model for predicting the right masked word. BERT-E was evaluated by performing a downstream keyword classification task, and the performance was compared against classification results using the original SciBERT Language Model. The SciBERT-based model attained an accuracy of 89.99, whereas the BERT-E-based model attained an accuracy of 92.18, showing an improvement in overall performance.
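
The masking step of MLM can be sketched in a few lines. This is a simplified illustration rather than the BERT-E training code: it masks whole whitespace-separated words instead of subword tokens, uses a 15% masking rate by default (the rate in the original BERT recipe, assumed here), and omits BERT's additional random-replacement and keep-as-is cases. The function name `mask_tokens` is hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Randomly mask tokens for a masked-language-modelling objective.

    Returns (masked, labels): `masked` is the model input, and `labels`
    holds the original token at masked positions and None elsewhere,
    so the loss is computed only where a token was hidden.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = mask_token
            labels[i] = token
    return masked, labels


sentence = "sea surface temperature fronts are extracted from reanalysis data".split()
model_input, targets = mask_tokens(sentence, mask_prob=0.3, seed=7)
print(model_input)
print(targets)
```

During fine-tuning, the model would be optimised to predict each entry of `targets` that is not None from the surrounding unmasked context, which is how domain vocabulary from the Earth science corpus is absorbed.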

We investigate employing language models to provide new semantic search capabilities for unstructured text such as papers. This search capability requires utilizing a knowledge graph generated from Earth science corpora, together with a language model and graph convolutions, to surface latent and related sentences for a natural language query. The sentences in the papers are modeled in the graph as nodes, and these nodes are connected through entities. The language model is used to give sentences a numeric representation. Graph convolutions are then applied to the sentence embeddings to obtain a vector representation of each sentence combined with a representation of the surrounding graph structure. This approach utilizes both the power of adjacency inherently encoded in graph structures and the latent knowledge captured in the language model. Our initial proof-of-concept prototype used the SimCSE training algorithm (and the TinyBERT architecture) as the embedding model. This framework has demonstrated an improved ability to surface relevant, latent information based on the input query. We plan to show new results using the domain-specific BERT-E model.
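
The neighbourhood-aggregation step at the heart of a graph convolution can be sketched as follows. This is a minimal illustration under several assumptions, not the prototype's model: sentences that share an entity are linked directly by an edge, each node simply averages its own embedding with those of its neighbours, and the learned weight matrix and non-linearity of a real graph convolutional layer are left out. The function name `mean_graph_convolution` and the toy vectors are hypothetical.

```python
def mean_graph_convolution(embeddings, edges):
    """One aggregation step: each node becomes the mean of itself
    and its neighbours (self-loops included, no learned weights).

    embeddings: list of equal-length vectors, one per sentence node.
    edges: iterable of (i, j) index pairs, treated as undirected.
    """
    n = len(embeddings)
    neighbours = [{i} for i in range(n)]  # start with self-loops
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    dim = len(embeddings[0])
    return [
        [sum(embeddings[j][d] for j in neighbours[i]) / len(neighbours[i])
         for d in range(dim)]
        for i in range(n)
    ]


# Toy 2-D "sentence embeddings"; sentences 0 and 1 share an entity.
vectors = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
print(mean_graph_convolution(vectors, [(0, 1)]))
```

After aggregation, the two linked sentences move towards each other in embedding space while the isolated one is unchanged, which is how graph structure can surface sentences that are related through shared entities even when their wording differs.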

How to cite: Ramachandran, R., Muthukumaran Ramasubramanian, M., Koirala, P., Gurung, I., and Maskey, M.: Language model for Earth science for semantic search, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-3131, https://doi.org/10.5194/egusphere-egu22-3131, 2022.

11:38–11:50

Plenary Discussion