Big Data and AI in the Earth Sciences

Artificial Intelligence (AI), especially Machine Learning (ML), has proven a powerful and effective tool for gaining insights from Earth data that would otherwise be hard to attain. However, deeper insights require AI/ML to operate on unprecedented volumes of data; fusing AI/ML with Big Data technologies is therefore a cornerstone for further progress in these domains. While Big Data volume and velocity are being addressed, variety and veracity remain open challenges.

This session aims to bring together researchers working with big data sets generated from monitoring networks, extensive observational campaigns, and detailed modeling efforts across the geosciences. Topics of this session include the identification and handling of specific problems arising from the need to analyze such large-scale data sets, together with methodological approaches towards semi- or fully automated inference of relevant patterns in time and space, aided by techniques from computer science. Among others, this session shall address approaches from the following fields:

* Big Data and AI in the Earth sciences
* Machine Learning, Deep Learning, and Data Mining applications in the Earth sciences
* Visualization and visual analytics of Big, multi-, and high-dimensional Data
* Dimensionality and complexity of Big Data sets
* Emerging Big Data paradigms, such as Datacubes
* Computer and Data Science aspects in the Earth sciences

Co-sponsored by IEEE GRSS
Convener: Peter Baumann | Co-conveners: Otoniel José Campos Escobar, Sandro Fiore, Mikhail Kanevski, Kwo-Sen Kuo
vPICO presentations | Thu, 29 Apr, 15:30–17:00 (CEST)

vPICO presentations: Thu, 29 Apr

Chairpersons: Peter Baumann, Otoniel José Campos Escobar
Luis Angel Vega Ramirez, Ronald Michael Spelz Madero, Juan Contreras Perez, David Caress, David A. Clague, and Jennifer B. Paduan

The mapping of faults and fractures is a problem of high relevance in the Earth sciences. However, their identification in digital elevation models is a time-consuming task given the fractal nature of the resulting networks. The effort is especially challenging in submarine environments, given their inaccessibility and the difficulty of collecting direct observations. Here, we propose a semi-automated method, based on linear discriminant analysis (LDA), for detecting faults in high-resolution gridded bathymetry data (~1 m horizontal and ~0.2 m vertical) of the Pescadero Basin in the southern Gulf of California, collected by MBARI's D. Allan B. autonomous underwater vehicle. This problem is well suited to machine learning and deep learning methods. The method learns from a model trained to recognize fault-line scarps, based on key morphological attributes, in the neighboring Alarcón Rise. We use the product of the mass diffusion coefficient with time, the scarp height, and the root-mean-square error as training attributes. The method projects the attributes from a three-dimensional space onto a one-dimensional space, in which normal probability density functions are generated to classify faults. Results of the LDA implementation on various cross-sectional profiles along the Pescadero Basin show that the proposed method can detect fault-line scarps of different sizes and degradation stages. Moreover, the method is robust to moderate amounts of noise (i.e., random topography and data-collection artifacts) and correctly handles different fault dip angles. Experiments show that both isolated and linked fault configurations are detected and tracked reliably.
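A minimal sketch of the classification idea, not the authors' implementation: scikit-learn's LDA projects three morphological attributes onto a single discriminant axis and separates scarp from background using class-conditional normal densities. All attribute values below are made-up stand-ins for the diffusion-age, scarp-height, and RMS-misfit attributes described above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three training attributes:
# diffusion coefficient x time, scarp height, RMS misfit.
n = 200
fault = np.column_stack([
    rng.normal(5.0, 1.0, n),   # kappa * t (hypothetical units)
    rng.normal(8.0, 2.0, n),   # scarp height (m)
    rng.normal(0.3, 0.1, n),   # RMS misfit
])
background = np.column_stack([
    rng.normal(2.0, 1.0, n),
    rng.normal(1.0, 0.5, n),
    rng.normal(0.8, 0.2, n),
])
X = np.vstack([fault, background])
y = np.array([1] * n + [0] * n)  # 1 = fault-line scarp, 0 = background

# LDA projects the 3-D attribute space onto one discriminant axis;
# classification then follows from per-class normal densities.
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
z = lda.transform(X)             # 1-D projection of the attributes
labels = lda.predict(X)
print("training accuracy:", (labels == y).mean())
```

On real bathymetry the attributes would come from profile fitting, and the model would be trained on Alarcón Rise scarps before being applied to the Pescadero Basin.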

How to cite: Vega Ramirez, L. A., Spelz Madero, R. M., Contreras Perez, J., Caress, D., Clague, D. A., and Paduan, J. B.: A new Method for Fault-Scarp Detection Using Linear Discriminant Analysis (LDA) in High-Resolution Bathymetry Data From the Alarcón Rise and Pescadero Basin, Gulf of California, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-417, 2021.

Antonios Konstantaras, Theofanis Frantzeskakis, Emmanouel Maravelakis, Alexandra Moshou, and Panagiotis Argyrakis

This research aims to depict ontological findings related to topical seismic phenomena within the Hellenic Seismic Arc via deep data mining of the existing big seismological dataset, encompassing a deep-learning neural network model for pattern recognition along with heterogeneous parallel-processing-enabled interactive big-data visualization. Using software built on the R language, seismic data were plotted in a 3D Cartesian point-cloud viewer for further investigation of the three-dimensional morphology they form. As a means of mining information from seismic big data, a deep neural network was trained and refined for pattern recognition of the occurrence and manifestation attributes of seismic events with magnitudes greater than Ms 4.0. The deep neural network comprises an input layer with six neurons for year, month, day, latitude, longitude, and depth, followed by six hidden layers of one hundred neurons each, and one output layer for the estimated magnitude level. This approach was conceived to investigate topical patterns in time yielding minor, interim, and strong seismic activity, such as the pattern, depicted by the deep learning neural network, observed over the past ten years in the region between Syrna and Kandelioussa. The area's coordinates are around 36.4 degrees latitude and 26.7 degrees longitude, with the deep learning neural network achieving low error rates, possibly indicating a pattern in seismic activity.
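A sketch of the stated architecture (six inputs, six hidden layers of one hundred neurons, one output) using scikit-learn as a stand-in for the authors' implementation; the catalogue values below are synthetic placeholders, not Hellenic Arc data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Toy stand-in for the catalogue: year, month, day, latitude, longitude, depth.
m = 500
X = np.column_stack([
    rng.integers(2011, 2021, m),    # year
    rng.integers(1, 13, m),         # month
    rng.integers(1, 29, m),         # day
    rng.uniform(36.0, 36.8, m),     # latitude (deg)
    rng.uniform(26.3, 27.1, m),     # longitude (deg)
    rng.uniform(5.0, 150.0, m),     # depth (km)
]).astype(float)
y_mag = rng.uniform(4.0, 6.5, m)    # magnitudes > Ms 4.0

# Architecture mirroring the abstract: 6 input neurons, six hidden
# layers of 100 neurons each, one output neuron for the magnitude.
model = MLPRegressor(hidden_layer_sizes=(100,) * 6,
                     activation="relu", max_iter=100, random_state=0)
model.fit(X, y_mag)
print("layers (input + hidden + output):", model.n_layers_)
```

With random targets the fit is meaningless; the point is only the layer layout described in the abstract.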



How to cite: Konstantaras, A., Frantzeskakis, T., Maravelakis, E., Moshou, A., and Argyrakis, P.: Heterogeneous Parallel Processing Enabled Deep Learning Pattern Recognition of Seismic Big Data in Syrna and Kandelioussa, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-501, 2021.

Hristos Tyralis, Georgia Papacharalampous, and Andreas Langousis

Random forests is a supervised machine learning algorithm that has recently witnessed an exponential increase in its implementation in water resources. However, existing implementations have been restricted to applications of Breiman's (2001) original algorithm for regression and classification, while numerous later developments could also be useful for solving diverse practical problems. Here we popularize random forests for the practicing hydrologist and present alternative random-forest-based algorithms and related concepts and techniques that are underappreciated in hydrology. We review random forests applications in water resources and provide guidelines for fully exploiting the potential of the algorithm and its variants. Relevant implementations of random-forests-related software in the R programming language are also presented.
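The abstract points to R software; as a language-neutral illustration of Breiman's original regression use case, here is a scikit-learn sketch on synthetic hydrological data (rainfall, temperature, and soil moisture as invented predictors of streamflow).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy hydrological regression: predict streamflow from rainfall,
# temperature and antecedent soil moisture (all synthetic).
n = 1000
rain = rng.gamma(2.0, 10.0, n)
temp = rng.normal(15.0, 8.0, n)
moisture = rng.uniform(0.1, 0.9, n)
flow = 0.6 * rain + 5.0 * moisture - 0.2 * temp + rng.normal(0, 2.0, n)

X = np.column_stack([rain, temp, moisture])
X_tr, X_te, y_tr, y_te = train_test_split(X, flow, random_state=0)

# An ensemble of bootstrapped regression trees, as in Breiman (2001).
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("test R^2:", rf.score(X_te, y_te))
print("feature importances:", rf.feature_importances_)
```

The variants the abstract alludes to (e.g., quantile regression forests) follow the same fit/predict pattern.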

How to cite: Tyralis, H., Papacharalampous, G., and Langousis, A.: Random forests in water resources, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-2105, 2021.

Mikhail Kanevski

Nowadays a wide range of methods and tools to study and forecast time series is available. An important problem in forecasting concerns the embedding of time series, i.e., the construction of a high-dimensional space in which forecasting is treated as a regression task. There are several basic linear and nonlinear approaches to constructing such a space by defining an optimal delay vector using different theoretical concepts. Another way is to consider this space as an input feature space (IFS) and to apply machine learning feature selection (FS) algorithms to optimize the IFS according to the problem under study (analysis, modelling, or forecasting). Such an approach is empirical: it is based on data and depends on the FS algorithms applied. In machine learning, features are generally classified as relevant, redundant, or irrelevant. This opens rich possibilities for advanced multivariate time-series exploration and the development of interpretable predictive models.

Therefore, in the present research different FS algorithms are used to analyze fundamental properties of time series from an empirical point of view. Linear and nonlinear simulated time series are studied in detail to understand the advantages and drawbacks of the proposed approach. Real-data case studies deal with air pollution and wind speed time series. Preliminary results are quite promising, and more research is in progress.
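A toy version of the idea, not the study's actual pipeline: delay coordinates of a simulated AR(2) series form the IFS, and a generic FS score (mutual information, via scikit-learn) is used to screen which lags are relevant. The embedding dimension and model coefficients are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(7)

# Simulated AR(2) series: only lags 1 and 2 enter the true dynamics.
n = 2000
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal(0, 1.0)

# Build the input feature space (IFS) from delay coordinates x[t-1..t-m].
m = 8  # embedding dimension to screen
X = np.column_stack([x[m - k:n - k] for k in range(1, m + 1)])
y = x[m:]

# Score each lag; relevant lags should stand out from irrelevant ones.
scores = mutual_info_regression(X, y, random_state=0)
for lag, s in enumerate(scores, start=1):
    print(f"lag {lag}: MI = {s:.3f}")
```

Any FS algorithm (wrapper, embedded, or filter) can be substituted for the mutual-information filter used here.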

How to cite: Kanevski, M.: Empirical analysis of time series using feature selection algorithms, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-6697, 2021.

Otoniel José Campos Escobar and Peter Baumann

Multi-dimensional arrays (also known as raster data, gridded data, or datacubes) are key, if not essential, in many science and engineering domains. In the Earth sciences, a significant share of the data produced falls into this category, and the daily volumes are huge, making it hard for researchers to analyze the data and retrieve valuable insights. 1-D sensor data, 2-D satellite imagery, 3-D x/y/t image time series and x/y/z subsurface voxel data, and 4-D x/y/z/t atmospheric and ocean data often amount to dozens of terabytes every day, and the rate is only expected to increase. In response, Array Database systems were designed and built specifically to provide modeling, storage, and processing support for multi-dimensional arrays. They offer a declarative query language for flexible data retrieval, and some, e.g., rasdaman, provide federation processing and standards-based query capabilities compliant with OGC standards such as WCS, WCPS, and WMS. Despite these advances, however, the gap between efficient information retrieval and the actual application of this data remains broad, especially in the domains of artificial intelligence (AI) and machine learning (ML).

In this contribution, we present the state of the art in performing ML through Array Databases. First, a motivating example is introduced from the Deep Rain project, which aims at enhancing rainfall-prediction accuracy in mountainous areas by implementing ML code on top of an Array Database. Deep Rain also explores novel methods for training prediction models by implementing server-side ML processing inside the database. A brief introduction to the Array Database rasdaman used in this project is also provided, featuring its standards-based query capabilities and the scalable federation-processing features required for rainfall data processing. Next, the workflow approach for combining ML and Array Databases employed in the Deep Rain project is described in detail, listing the benefits of using an Array Database with a declarative query language in the machine learning pipeline. A concrete use case illustrates step by step how these tools integrate. An alternative approach is then presented in which ML is performed inside the Array Database using user-defined functions (UDFs). Finally, a detailed comparison between the UDF and workflow approaches is presented, explaining the challenges and benefits of each.
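To give a flavour of the declarative, server-side style of access that the workflow approach builds on, the snippet below constructs a WCPS query that asks the server for a time-averaged aggregate so only a small result crosses the network. The coverage name, the `ansi` time-axis label, and the endpoint are assumptions for illustration, not Deep Rain specifics.

```python
# Hypothetical endpoint and coverage; real deployments differ.
WCPS_ENDPOINT = "https://example.org/rasdaman/ows"
COVERAGE = "Precipitation_3D"   # assumed x/y/t rainfall datacube

def monthly_mean_query(coverage, t_start, t_end):
    """Build a WCPS query averaging a datacube over a time slice,
    so only the scalar aggregate leaves the server."""
    return (
        f'for $c in ({coverage}) '
        f'return avg($c[ansi("{t_start}":"{t_end}")])'
    )

query = monthly_mean_query(COVERAGE, "2020-01-01", "2020-01-31")
print(query)

# Sending it would be a single HTTP GET against a WCS 2.0 endpoint, e.g.:
# from urllib.request import urlopen
# from urllib.parse import urlencode
# urlopen(WCPS_ENDPOINT + "?" + urlencode({
#     "service": "WCS", "version": "2.0.1",
#     "request": "ProcessCoverages", "query": query}))
```

In the workflow approach, results of such queries feed the ML pipeline; in the UDF approach, the ML code itself would run inside the database.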

How to cite: Campos Escobar, O. J. and Baumann, P.: Towards AI in Array Databases, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7409, 2021.

Andreas Gerhardus and Jakob Runge

The quest to understand cause and effect relationships is at the basis of the scientific enterprise. In cases where the classical approach of controlled experimentation is not feasible, methods from the modern framework of causal discovery provide an alternative way to learn about cause and effect from observational, i.e., non-experimental data. Recent years have seen an increasing interest in these methods from various scientific fields, for example in the climate and Earth system sciences (where large-scale experimentation is often infeasible) as well as machine learning and artificial intelligence (where models based on an understanding of cause and effect promise to be more robust under changing conditions).

In this contribution we present the novel LPCMCI algorithm for learning the cause and effect relationships in multivariate time series. The algorithm is specifically adapted to several challenges that are prevalent in time series considered in the climate and Earth system sciences, for example strong autocorrelations, combinations of time lagged and contemporaneous causal relationships, as well as nonlinearities. It moreover allows for the existence of latent confounders, i.e., it allows for unobserved common causes. While this complication is faced in most realistic scenarios, especially when investigating a system as complex as Earth's climate system, it is nevertheless assumed away in many existing algorithms. We demonstrate applications of LPCMCI to examples from a climate context and compare its performance to competing methods.
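The following is not LPCMCI itself (which additionally handles latent confounders); it is a toy illustration of the core difficulty the algorithm addresses: detecting a lagged causal link when both series are strongly autocorrelated. Conditioning on each variable's own past recovers the true direction; all coefficients are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy system with autocorrelation and a lagged causal link X -> Y:
#   X_t = 0.7 X_{t-1} + e_t,   Y_t = 0.7 Y_{t-1} + 0.8 X_{t-1} + n_t
n = 5000
X = np.zeros(n)
Y = np.zeros(n)
for t in range(1, n):
    X[t] = 0.7 * X[t - 1] + rng.normal()
    Y[t] = 0.7 * Y[t - 1] + 0.8 * X[t - 1] + rng.normal()

def partial_corr(a, b, cond):
    """Correlation of a and b after regressing out the conditioning set."""
    C = np.column_stack([np.ones_like(cond), cond])
    ra = a - C @ np.linalg.lstsq(C, a, rcond=None)[0]
    rb = b - C @ np.linalg.lstsq(C, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

# Condition on each series' own past to control for autocorrelation.
x_to_y = partial_corr(Y[1:], X[:-1], Y[:-1])   # clearly nonzero
y_to_x = partial_corr(X[1:], Y[:-1], X[:-1])   # close to zero
print(f"X->Y partial corr: {x_to_y:.2f}, Y->X partial corr: {y_to_x:.2f}")
```

LPCMCI generalizes this logic to many variables, contemporaneous links, nonlinearities, and unobserved common causes.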

Related reference:
Gerhardus, Andreas and Runge, Jakob (2020). High-recall causal discovery for autocorrelated time series with latent confounders. In Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020). 

How to cite: Gerhardus, A. and Runge, J.: LPCMCI: Causal Discovery in Time Series with Latent Confounders, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8259, 2021.

Christoph Käding and Jakob Runge

The Earth’s climate is a highly complex and dynamical system. To better understand and robustly predict it, knowledge about its underlying dynamics and causal dependency structure is required. Since controlled experiments are infeasible in the climate system, observational data-driven approaches are needed. Observational causal inference is a very active research topic and a plethora of methods have been proposed. Each of these approaches comes with inherent strengths, weaknesses, and assumptions about the data generating process as well as further constraints.
In this work, we focus on the fundamental case of bivariate causal discovery: given two data samples X and Y, the task is to detect whether X causes Y or Y causes X. We present a large-scale benchmark that covers combinations of various characteristics of data-generating processes and sample sizes. By comparing most of the current state-of-the-art methods, we aim to shed light on the real-world performance of the evaluated methods. Since we employ synthetic data, we can precisely control the data characteristics and unveil the behavior of methods when their underlying assumptions are met or violated. Further, we complete our evaluation with a comparison on a set of real-world data with known causal relations.
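One simple family of methods such a benchmark would evaluate are additive-noise direction tests. A sketch under a linear non-Gaussian data-generating process (which makes the direction identifiable): fit both directions by least squares and prefer the one whose residual is closer to independent of the candidate cause. This is an illustrative heuristic, not a method from the benchmark itself.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(5)

# Ground truth: X -> Y, linear mechanism with non-Gaussian (uniform) noise.
n = 3000
X = rng.uniform(-1, 1, n)
Y = X + rng.uniform(-1, 1, n)

def residual_dependence(cause, effect):
    """OLS-fit effect on cause, then measure remaining dependence of the
    residual on the candidate cause (lower = more plausible direction)."""
    slope, intercept = np.polyfit(cause, effect, 1)
    r = effect - (slope * cause + intercept)
    return mutual_info_regression(cause.reshape(-1, 1), r, random_state=0)[0]

d_xy = residual_dependence(X, Y)  # true direction: residual ~ independent
d_yx = residual_dependence(Y, X)  # wrong direction: residual stays dependent
print(f"dependence X->Y: {d_xy:.3f}, Y->X: {d_yx:.3f}")
print("inferred direction:", "X -> Y" if d_xy < d_yx else "Y -> X")
```

With Gaussian noise the two directions become symmetric and the test fails, which is exactly the kind of assumption violation the benchmark is designed to expose.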

How to cite: Käding, C. and Runge, J.: A Benchmark for Bivariate Causal Discovery Methods, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8584, 2021.

Vadim Rezvov, Mikhail Krinitskiy, Alexander Gavrikov, and Sergey Gulev

Surface winds — both wind speed and vector wind components — are fields of fundamental climatic importance. The character of surface winds greatly influences (and is influenced by) surface exchanges of momentum, energy, and matter. These wind fields are of interest in their own right, particularly for the characterization of wind power density and wind extremes. Surface winds are influenced by small-scale features such as local topography and thermal contrasts, which is why accurate high-resolution prediction of near-surface wind fields is of central interest in various fields of science and industry. Statistical downscaling is a way of inferring information on physical quantities at a local scale from available low-resolution data, and one way to avoid costly high-resolution simulations. It connects variability across scales using statistical prediction models. This approach is fundamentally data-driven and can only be applied in locations where observations have been taken for long enough to establish the statistical relationship. Our study considers statistical downscaling of surface winds in the North Atlantic. Deep learning methods are among the most outstanding examples of state-of-the-art machine learning techniques for approximating sophisticated nonlinear functions. We applied various approaches involving artificial neural networks for the statistical downscaling of near-surface wind vector fields, using the ERA-Interim reanalysis as low-resolution input and the RAS-NAAD dynamical downscaling product (14 km grid resolution) as the high-resolution target. We compared the statistical downscaling results with bilinear and bicubic interpolation baselines and investigated how network complexity affects downscaling performance.
We will demonstrate the preliminary results of this comparison and propose an outlook for the further development of our methods.

This work was undertaken with financial support by the Russian Science Foundation grant № 17-77-20112-P.

How to cite: Rezvov, V., Krinitskiy, M., Gavrikov, A., and Gulev, S.: Comparison of AI-based approaches for statistical downscaling of surface wind fields in the North Atlantic, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8844, 2021.

Benjamin Kellenberger, Thor Veen, Eelke Folmer, and Devis Tuia

Recently, Unmanned Aerial Vehicles (UAVs) equipped with high-resolution imaging sensors have become a viable alternative to foot surveys for ecologists conducting wildlife censuses. They cause less disturbance by sensing remotely, provide coverage of otherwise inaccessible areas, and their images can be reviewed and double-checked in controlled screening sessions. However, the amount of data they generate often makes the photo-interpretation stage prohibitively time-consuming.

In this work, we automate the detection process with deep learning [4]. We focus on counting coastal seabirds on sand islands off the West African coast, where species like the African Royal Tern are at the top of the food chain [5]. Monitoring their abundance provides invaluable insights into biodiversity in this area [7]. In a first step, we obtained orthomosaics with 1 cm resolution from nadir-looking UAVs over six sand islands. We then fully labelled one of them with points for four seabird species, which took five annotators three weeks and yielded over 21,000 individuals. Next, we labelled the other five orthomosaics in an incomplete manner, aiming for a low number of only 200 points per species. These points, together with a few background polygons, served as training data for our ResNet-based [2] detection model. The low number of points required multiple strategies to obtain stable predictions, including curriculum learning [1] and post-processing with a Markov random field [6]. In the end, our model accurately predicted the 21,000 birds of the test image with 90% precision at 90% recall (Fig. 1) [3]. Furthermore, the model required a mere 4.5 hours from creating the training data to the final prediction, a fraction of the three weeks needed for manual labelling. Inference time is only a few minutes, so the model scales favourably to many more islands. In sum, the combination of UAVs and machine-learning-based detectors provides census possibilities with unprecedentedly high accuracy and comparably minuscule execution time.
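Since the birds are annotated as points, detection quality is naturally scored by matching predicted points to ground-truth points within a pixel radius. A self-contained sketch of such an evaluation (the matching radius and coordinates are illustrative, not the authors' protocol):

```python
import numpy as np

def match_points(pred, gt, radius=5.0):
    """Greedily match predicted bird locations to ground-truth points
    within a pixel radius; returns (precision, recall)."""
    pred = np.asarray(pred, float)
    gt = np.asarray(gt, float)
    unmatched = set(range(len(gt)))
    tp = 0
    for p in pred:
        if not unmatched:
            break
        idx = list(unmatched)
        d = np.hypot(*(gt[idx] - p).T)   # distances to unmatched truths
        j = int(np.argmin(d))
        if d[j] <= radius:
            unmatched.discard(idx[j])
            tp += 1
    precision = tp / len(pred) if len(pred) else 0.0
    recall = tp / len(gt) if len(gt) else 0.0
    return precision, recall

gt = [(10, 10), (50, 50), (80, 20)]          # annotated birds (pixels)
pred = [(12, 11), (49, 52), (200, 200)]      # model detections
p, r = match_points(pred, gt)
print(f"precision={p:.2f} recall={r:.2f}")
```

Applied over a fully labelled test orthomosaic, this yields the precision/recall figures the abstract reports.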