This session aims to bring together researchers working with big data sets generated from monitoring networks, extensive observational campaigns and detailed modeling efforts across various fields of geosciences. Topics of this session will include the identification and handling of specific problems arising from the need to analyze such large-scale data sets, together with methodological approaches towards semi- or fully automated inference of relevant patterns in time and space aided by computer science-inspired techniques. Among others, this session shall address approaches from the following fields:
• Dimensionality and complexity of big data sets
• Data mining in Earth sciences
• Machine learning, deep learning and Artificial Intelligence applications in geosciences
• Visualization and visual analytics of big and high-dimensional data
• Informatics and data science
• Emerging big data paradigms, such as datacubes

Co-organized by AS5/CL5/ESSI2/G6/GD10/HS3/SM1
Convener: Mikhail Kanevski | Co-conveners: Peter Baumann, Sandro Fiore, Kwo-Sen Kuo, Nicolas Younan
| Thu, 07 May, 08:30–12:30 (CEST), Thu, 07 May, 14:00–15:45 (CEST)

Files for download

Download all presentations (294MB)

Chat time: Thursday, 7 May 2020, 08:30–10:15

Chairperson: M. Kanevski, S. Fiore
D2382 |
Jon Seddon and Ag Stephens

The PRIMAVERA project aims to develop a new generation of advanced and well evaluated high-resolution global climate models. An integral component of PRIMAVERA is a new set of simulations at standard and high-resolution from seven different European climate models. The expected data volume is 1.6 petabytes, which is comparable to the total volume of data in CMIP5.  

A comprehensive Data Management Plan (DMP) was developed to allow the distributed group of scientists to produce and analyse this volume of data during the project’s limited time duration. The DMP uses the approach of taking the analysis to the data. The simulations were run on HPCs across Europe and the data was transferred to the JASMIN super-data-cluster at the Rutherford Appleton Laboratory. A Data Management Tool (DMT) was developed to catalogue the available data and allow users to search through it using an intuitive web-based interface. The DMT allows users to request that the data they require is restored from tape to disk. The users are then able to perform all their analyses at JASMIN. The DMT also controls the publication of the data to the Earth System Grid Federation, making it available to the global community. 

Here we introduce JASMIN and the PRIMAVERA data management plan. We describe how the DMT allowed the project’s scientists to analyse this multi-model dataset. We describe how the tools and techniques developed can help future projects.

How to cite: Seddon, J. and Stephens, A.: Data management and analysis of the high-resolution multi-model climate dataset from the PRIMAVERA project, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-14241, https://doi.org/10.5194/egusphere-egu2020-14241, 2020

D2383 |
Nanzhe Wang and Haibin Chang

Subsurface flow problems usually involve some degree of uncertainty, and data assimilation is usually necessary to reduce the uncertainty of subsurface flow predictions. Because data assimilation is time consuming, a surrogate model of the subsurface flow problem may be utilized to improve its efficiency. In this work, a physics-informed neural network (PINN) based surrogate model is proposed for subsurface flow with uncertain model parameters. Training data generated by solving stochastic partial differential equations (SPDEs) are utilized to train the neural network. Besides the data mismatch term, a term that incorporates the physics laws is added to the loss function. The trained neural network can predict solutions of the subsurface flow problem with new stochastic parameters, and can thus serve as a surrogate approximating the relationship between model output and model input. By incorporating physics laws, the PINN can achieve high accuracy. An iterative ensemble smoother (ES) is then introduced to carry out the data assimilation task based on the PINN surrogate. Several subsurface flow cases are designed to test the performance of the proposed paradigm. The results show that the PINN surrogate can significantly improve the efficiency of the data assimilation task while maintaining high accuracy.
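The loss construction described above can be illustrated with a minimal sketch (not the authors' code): a data-mismatch term at observation points plus a finite-difference PDE residual, here for 1-D steady single-phase flow; the grid, conductivity field and weight `lam` are hypothetical.

```python
import numpy as np

# Sketch of a physics-informed loss for d/dx( K dh/dx ) = 0 on a uniform
# grid: data mismatch at observation points plus a PDE-residual term
# evaluated by finite differences (illustrative setup, not the paper's).

def pinn_style_loss(h, K, dx, obs_idx, obs_val, lam=1.0):
    """h: candidate head field; K: hydraulic conductivity on the same grid."""
    data_term = np.mean((h[obs_idx] - np.asarray(obs_val)) ** 2)
    flux = K[:-1] * np.diff(h) / dx        # K dh/dx at cell interfaces
    residual = np.diff(flux) / dx          # d/dx(K dh/dx) at interior nodes
    physics_term = np.mean(residual ** 2)
    return data_term + lam * physics_term

x = np.linspace(0.0, 1.0, 51)
K = np.ones_like(x)
h_linear = 1.0 - x                         # exact solution for constant K
loss = pinn_style_loss(h_linear, K, x[1] - x[0], [0, 50], [1.0, 0.0])
# Both terms vanish (up to round-off) for the exact solution
```

In the actual PINN, this combined loss is minimised over network weights rather than evaluated on a fixed field, and the residual is obtained by automatic differentiation rather than finite differences.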

How to cite: Wang, N. and Chang, H.: Data assimilation of subsurface flow via iterative ensemble smoother and physics-informed neural network, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-12224, https://doi.org/10.5194/egusphere-egu2020-12224, 2020

D2384 |
How to improve 3-m resolution land cover mapping from an imperfect 10-m resolution land cover mapping product?
Runmin Dong and Haohuan Fu
D2385 |
Daniel Lee, Rodrigo Romero, Peter Miu, Fernando Jose Pereda Garcimartin, and Oscar Perez Navarro

EUMETSAT hosts a large collection of geophysical data sets that have been produced by over 35 years of operational meteorological satellites. This trove of remote sensing data products is complex, featuring observations from multiple generations of polar and geostationary satellites. Each mission has different primary objectives, resulting in different instrument payloads, resolutions, and variables observed. As EUMETSAT's next-generation core missions are launched and joined by smaller missions with narrower foci, both the size and complexity of these data will increase exponentially.

The data alone are a valuable resource for the geosciences, but the value that can be extracted from them increases greatly when they are combined with data from other disciplines. As EUMETSAT's primary missions are focused on observational meteorology, the potential synergies with e.g. numerical weather prediction data are readily apparent. However, EUMETSAT data is increasingly used in applications from other domains, e.g. oceanography, agriculture, and atmospheric composition, to name just a few.

New solutions are being implemented to unlock the potential of EUMETSAT's data, particularly in combination with data from other disciplines and leveraging emerging data-driven approaches such as data mining and machine learning. A particular challenge in this regard is the heterogeneity of the individual data products, each of which is optimised to accurately describe the observed variable and quality information associated with the observing instrument and platform. A further challenge is the heterogeneity of the potential users, all of whom have preferred toolsets and processing chains.

The EUMETSAT Data Tailor is part of a larger initiative at EUMETSAT to support users in taking full advantage of our data holdings. It addresses the problem that there is no single "best format" for all users by allowing users to tailor data products to fit their needs. With it, users can extract the data that is relevant for them by selecting by geospatial and spectral criteria, resample into the projection and resolution that they require, and reformat the data into a variety of popular formats. Tailoring workflows can be created graphically or written by hand in YAML and saved in a given Data Tailor deployment.

The Data Tailor is cloud-native, exposing its functionality as a microservice, via a web UI, on the command line, and as a Python package. Support for additional functions can be added easily via its plug-in architecture, which allows dynamically adding and removing functionality to an installation. It is released under an Apache v2 license, making it easy to deploy the software in any context. Whether data is in flight or at rest, the Data Tailor offers users easy access to EUMETSAT products in the format of their choice.

This presentation will showcase the Data Tailor and briefly address other exciting developments at EUMETSAT with which it is integrated, which will support big data workflows with EUMETSAT's past, present, and future data.

How to cite: Lee, D., Romero, R., Miu, P., Jose Pereda Garcimartin, F., and Perez Navarro, O.: Data Tailor: Integrate EUMETSAT's data into your datacube, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-18670, https://doi.org/10.5194/egusphere-egu2020-18670, 2020

D2386 |
Julia Wagemann, Stephan Siemen, Jörg Bendix, and Bernhard Seeger

The European Commission’s Earth Observation programme Copernicus produces an unprecedented amount of openly available multi-dimensional environmental data. However, data ‘accessibility’ remains one of the biggest obstacles for users of open Big Earth Data and hinders full data exploitation. Data services have to evolve from pure download services towards easier, more on-demand data access. Different concepts are currently being explored to make Big Earth Data more accessible to users, e.g. virtual research infrastructures, data cube technologies, standardised web services and cloud processing services, such as the Google Earth Engine or the Copernicus Climate Data Store Toolbox. Each offering provides different types of data, tools and functionalities, and data services are often developed solely to satisfy specific user requirements and needs.

For this reason, we conducted a user requirements survey between November 2018 and June 2019 among users of Big Earth Data (including Earth observation data, meteorological and environmental forecasts and other geospatial data) to better understand their requirements. To reach an active data-user community for this survey, we partnered with ECMWF, which has 40 years of experience in providing data services for weather forecast data and environmental data sets of the Copernicus Programme.

We asked which datasets users currently use, which datasets they would like to use in the future, and why they have not yet explored certain datasets. We also asked about the tools and software they use to process the data and the challenges they face in accessing and handling Big Earth Data. A further part focused on future (cloud-based) data services: here, we were interested in the users’ motivation to migrate their data processing tasks to cloud-based data services and which aspects of these services they consider important.

While preliminary results of the study were released last year, this year the final study results are presented. A specific focus will be put on users’ expectation of future (cloud-based) data services aligned with recommendations for data users and data providers alike to ensure the full exploitation of Big Earth Data in the future.

How to cite: Wagemann, J., Siemen, S., Bendix, J., and Seeger, B.: Bridging the gap between Big Earth data users and future (cloud-based) data systems - Towards a better understanding of user requirements of cloud-based data systems, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-10029, https://doi.org/10.5194/egusphere-egu2020-10029, 2020

D2387 |
Simon Baillarin, Pierre-Marie Brunet, Pierre Lassalle, Gwenael Souille, Laurent Gabet, Gilles Foulon, Gaelle Romeyer, Cedrik Ferrero, Thanh-Long Huynh, Antoine Masse, and Stan Assier

The availability of 3D Geospatial information is a key stake for many expanding sectors such as autonomous vehicles, business intelligence and urban planning.

The availability of huge volumes of satellite, airborne and in-situ data now makes this production feasible on a large scale. Nonetheless, it still requires skilled manual intervention to secure a sufficient level of quality, which prevents mass production.

New artificial intelligence and big data technologies are key in lifting these obstacles.

The AI4GEO project aims at developing an automatic solution for producing 3D geospatial information and offer new value-added services leveraging innovative methods adapted to 3D imagery.

The AI4GEO consortium consists of institutional partners (CNES, IGN, ONERA) and industrial groups (CS-SI, AIRBUS, CLS, GEOSAT, QWANT, QUANTCUBE) covering the whole value chain of Geospatial Information.

With a four-year timeline, the project is structured around two R&D axes which progress simultaneously and feed each other.

The first axis consists in developing a set of technological bricks allowing the automatic production of qualified 3D maps composed of 3D objects and associated semantics. This collaborative work benefits from the latest research from all partners in the field of AI and Big Data technologies as well as from an unprecedented database (satellite and airborne data (optics, radars, lidars) combined with cartographic and in-situ data).

The second axis consists in deriving from these technological bricks a variety of services for different fields: 3D semantic mapping of cities, macroeconomic indicators, decision support for water management, autonomous transport, consumer search engine.

Started in 2019, the first axis of the project has already produced very promising results. A first version of the platform and technological bricks are now available.

This paper will first introduce the AI4GEO initiative: its context and overall objectives.

It will then present the current status of the project, focusing in particular on the innovative approach to handling big 3D datasets for analytics needs, and will present the first results of 3D semantic segmentation on various test sites together with the associated perspectives.

How to cite: Baillarin, S., Brunet, P.-M., Lassalle, P., Souille, G., Gabet, L., Foulon, G., Romeyer, G., Ferrero, C., Huynh, T.-L., Masse, A., and Assier, S.: AI4GEO: An automatic 3D geospatial information capability, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-11559, https://doi.org/10.5194/egusphere-egu2020-11559, 2020

D2388 |
Elisabeth Lambert, Jean-Michel Zigna, Thomas Zilio, and Flavien Gouillon

The volume of data in the Earth observation domain is growing considerably, especially with the emergence of new generations of satellites providing much more precise measurements and thus voluminous data and files. The ‘big data’ field provides solutions for storing and processing huge amounts of data. However, there is no established consensus, in either the industrial market or the open-source community, on big data solutions adapted to the Earth observation domain. The main difficulty is that these multi-dimensional data are not naturally scalable. CNES and CLS, driven by CLS business needs, carried out a study to address this difficulty.

Two complementary use cases at different points in the value chain have been identified: 1) the development of an altimetry processing chain storing low-level altimetric measurements from multiple satellite missions, and 2) the extraction of oceanographic environmental data along animal and ship tracks. The original data format of these environmental variables is netCDF. We will first show the state of the art of big data technologies adapted to this problem and their limitations. Then, we will describe the prototypes behind both use cases and in particular how the data are split into independent chunks that can then be processed in parallel. The storage format chosen is Apache Parquet. In the first use case, the data are manipulated with the xarray library while all the parallel processes are implemented with the Dask framework; an implementation using the Zarr library instead of Parquet has also been developed, and its results will be shown as well. In the second use case, the enrichment of tracks with METOC (meteorological/oceanographic) data is developed using the Spark framework. Finally, results of this second use case, which runs operationally today for the extraction of oceanographic data along tracks, will be shown. This second solution is an alternative to the Pangeo solution in the world of industrial and Java development. It extends the traditional THREDDS subsetter, delivered by the open-source Unidata community, to a big data implementation. This Parquet storage and its associated service implement a smooth transition of gridded data into big data infrastructures.
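Independently of the Parquet/Zarr storage choice, the core idea above — splitting gridded data into independent chunks that are processed in parallel — can be sketched with the standard library alone (the prototypes themselves use xarray, Dask and Spark; the tile sizes here are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor
import itertools

# Stdlib-only sketch of chunked parallel processing of a gridded field:
# the grid is cut into independent tiles and each tile is reduced in parallel.

def iter_chunks(grid, rows, cols):
    """Yield rows x cols tiles covering a 2-D nested-list grid."""
    n_r, n_c = len(grid), len(grid[0])
    for r, c in itertools.product(range(0, n_r, rows), range(0, n_c, cols)):
        yield [row[c:c + cols] for row in grid[r:r + rows]]

def chunk_mean(chunk):
    values = [v for row in chunk for v in row]
    return sum(values) / len(values)

grid = [[float(r * 10 + c) for c in range(10)] for r in range(10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    means = list(pool.map(chunk_mean, iter_chunks(grid, 5, 5)))
# Four independent 5x5 tiles -> four chunk means
```

In the prototypes, each such tile corresponds to a Parquet row group or Zarr chunk, so workers can read and process their piece without touching the rest of the dataset.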

How to cite: Lambert, E., Zigna, J., Zilio, T., and Gouillon, F.: Parquet Cube to store and process gridded data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-14325, https://doi.org/10.5194/egusphere-egu2020-14325, 2020

D2389 |
Zhou Chen, Yue Deng, and Jing-Song Wang

Total electron content (TEC) is a very important ionospheric parameter, commonly used for studying various ionospheric physical mechanisms and ionosphere-related technologies (e.g. global positioning). However, global TEC is very dynamic, and its spatiotemporal variation is extremely complicated. In this paper, we build a novel global ionospheric TEC prediction model based on two deep learning algorithms: the generative adversarial network (GAN) and long short-term memory (LSTM). The training data come from 10 years of IGS TEC data, which provide plenty of data for the GAN and LSTM algorithms to learn the spatial and temporal variation of TEC, respectively. The prediction accuracy of the model has been evaluated under different levels of geomagnetic activity. The statistical results suggest that the proposed ionospheric model can be used as an efficient tool for short-term ionospheric TEC prediction.
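As a hedged illustration of one of the two building blocks named above, here is a plain-numpy forward pass of a single LSTM cell; the layer sizes and random weights are arbitrary, not those of TGAN-TEC.

```python
import numpy as np

# Illustrative forward pass of one LSTM cell: gated updates of the cell
# state c and hidden state h, which is what lets the model carry temporal
# context across a TEC time series.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,)."""
    z = W @ x + U @ h + b
    H = h.size
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input/forget/output gates
    c_new = f * c + i * np.tanh(g)                 # cell-state update
    h_new = o * np.tanh(c_new)                     # new hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 3, 4                                        # input dim, hidden dim
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h = c = np.zeros(H)
for t in range(5):                                 # unroll over a short series
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```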

How to cite: Chen, Z., Deng, Y., and Wang, J.-S.: Applying LSTM and GAN to build a deep learning model (TGAN-TEC) for global ionospheric TEC, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-6688, https://doi.org/10.5194/egusphere-egu2020-6688, 2020

D2390 |
Petrina Papazek and Irene Schicker

In this study, we present a deep learning-based method to provide short-range point forecasts (1-2 days ahead) of the 10-meter wind speed for complex terrain. Gridded data with different horizontal resolutions from numerical weather prediction (NWP) models, gridded observations, and point data are used. An artificial neural network (ANN), able to process several differently structured inputs simultaneously, is developed.
The heterogeneous structure of the inputs is handled by combining convolutional, long short-term memory (LSTM), fully connected (FC), and other layers within a common network. While best known for image processing tasks, convolutional layers are applicable to any gridded data source. An LSTM layer models recurrent steps in the ANN and is thus useful for time series such as meteorological observations. Further key objectives of this research are to consider different spatial and temporal resolutions and different topographic characteristics of the selected sites.

Data from the Austrian TAWES system (Teilautomatische Wetterstationen, meteorological observations in 10-minute intervals), INCA's (Integrated Nowcasting through Comprehensive Analysis) gridded observation fields, and NWP data from the ECMWF IFS (European Centre for Medium-Range Weather Forecasts’ Integrated Forecasting System) model are used in this study. Hourly runs for 12 test locations (selected TAWES sites representing different topographic characteristics in Austria) and different seasons are conducted.
The ANN yields, in general, high forecast skill (MAE = 1.13 m/s, RMSE = 1.72 m/s), indicating successful learning from the training data. Different numbers of input-field grid points centred on the target sites were investigated. A small number of ECMWF IFS grid points (e.g. 5x5) and a higher number of INCA grid points (e.g. 15x15) resulted in the best-performing forecasts; the different numbers of grid points are directly related to the models' resolutions. However, within the nowcasting range, adding NWP data does not increase the model performance; thus, for nowcasting, a stronger weighting towards the observations is important. Beyond the nowcasting range, the deep learning-based ANN model outperforms more basic machine learning algorithms as well as other alternative models.
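For reference, the two skill scores quoted above are computed as follows; the forecast/observation pairs are made-up values in m/s, purely for illustration.

```python
import numpy as np

# Mean absolute error and root-mean-square error between point forecasts
# and observations (illustrative values, not the study's data).

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

obs = np.array([3.2, 5.1, 4.0, 6.3])    # observed 10-m wind speed, m/s
fcst = np.array([2.9, 5.6, 3.5, 6.0])   # point forecasts, m/s
scores = mae(obs, fcst), rmse(obs, fcst)
```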

How to cite: Papazek, P. and Schicker, I.: A Deep Learning Method for Short-Range Point Forecasts of Wind Speed, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-4434, https://doi.org/10.5194/egusphere-egu2020-4434, 2020

D2391 |
Ashleigh Massam, Ashley Barnes, Siân Lane, Robert Platt, and David Wood

JBA Risk Management (JBA) uses JFlow®, a two-dimensional hydraulic model, to simulate surface water, fluvial, and dam-break flood risk. National flood maps are generated on a computer cluster that parallelises up to 20,000 model simulations, covering an area of up to 320,000 km² and creating up to 10 GB of data per day.

JBA uses machine-learning models to identify artefacts in the flood simulations. The ability of machine-learning models to quickly process and detect these artefacts, combined with the use of an automated control system, means that hydraulic modelling throughput can be maximised with little user intervention. However, continual retraining of the model and application of software updates introduce the risk of a significant decrease in performance. This necessitates the use of a system to monitor the performance of the machine-learning model to ensure that a sufficient level of quality is maintained, and to allow drops in quality to be investigated.

We present an approach used to develop performance checks on a machine-learning model that identifies artificial depth differences between hydraulic model simulations. Performance checks are centred on the use of control charts, an approach commonly used in manufacturing processes to monitor the proportion of items produced with defects. In order to develop this approach for a geoscientific context, JBA has (i) built a database of randomly-sampled hydraulic model outputs currently totalling 200 GB of data; (ii) developed metrics to summarise key features across a modelled region, including geomorphology and hydrology; (iii) used a random forest regression model to identify feature dominance to determine the most robust relationships that contribute to depth differences in the flood map; and (iv) developed the performance check in an automated system that tests every nth hydraulic modelling output against data sampled based on common features.
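As an illustration of the control-chart idea at the heart of this approach, a minimal p-chart sketch: given a baseline artefact proportion p estimated from past model outputs, a batch of n checked outputs is flagged when its defect rate leaves the three-sigma limits. All numbers below are hypothetical.

```python
import math

# p-chart control limits: batches whose defect proportion falls outside
# p +/- k*sqrt(p*(1-p)/n) signal a possible drop in model quality,
# e.g. after a retraining cycle or software update.

def p_chart_limits(p, n, k=3.0):
    half_width = k * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

lcl, ucl = p_chart_limits(p=0.02, n=500)   # 2% baseline, batches of 500
batch_defect_rate = 9 / 500                # 1.8% artefacts in one batch
in_control = lcl <= batch_defect_rate <= ucl
```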

The implementation of the performance checks allows JBA to assess potential changes in the quality of artificial feature identification following a training cycle in a development environment prior to release in a production environment.

How to cite: Massam, A., Barnes, A., Lane, S., Platt, R., and Wood, D.: Developing performance checks on machine-learning models in an automated system for developing hazard maps, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-5372, https://doi.org/10.5194/egusphere-egu2020-5372, 2020

D2392 |
| Highlight
Susanne Pfeifer, Katharina Bülow, and Lennart Marien

Within the Hamburg Cooperation project „HYBRIDS – Chances and Challenges of New Genomic Combinations“ (https://www.biologie.uni-hamburg.de/en/forschung/verbundvorhaben/hybride-mehr-infos.html), one subproject deals with the problem of finding relations between the existence of hybrid plant species and the climate and its variability at the same location. For this, biological and climatic data are brought together, and statistical and machine learning techniques are applied to derive climatic differences between regions where both parent species but no hybrid species are found and regions where both parent species and the hybrid species are found.

Both the climate data (here, daily gridded E-OBS temperature (mean, min, max) and precipitation at ~10 km grid resolution for the period 1970 to 2006 (Haylock et al., 2008; Cornes et al., 2018)) and the plant data (Hybrid Flora of the British Isles, 700 taxa, 6 112 847 lines of data (Stace et al., 2015)) can be considered „big data“. However, the peculiarities of the two datasets are very different, and so are the issues to be considered when tackling them.

We will present the first results of this interdisciplinary effort, discuss the methodological issues and elaborate on the chances and challenges of interpreting the findings.
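One simple statistical technique that fits the comparison described above is a permutation test on a climate variable between the two groups of regions; the sketch below uses synthetic placeholder values, not the E-OBS or hybrid-flora records.

```python
import random
import statistics

# Permutation test on the difference in mean temperature between grid cells
# where the hybrid occurs and cells where only the parents occur.

def permutation_test(a, b, n_perm=2000, seed=42):
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled, n_a = a + b, len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            hits += 1
    return hits / n_perm                     # two-sided p-value

with_hybrid = [9.1, 9.4, 9.8, 10.2, 9.9, 10.0]    # synthetic mean temp., degC
parents_only = [8.2, 8.6, 8.4, 8.9, 8.1, 8.7]     # synthetic mean temp., degC
p_value = permutation_test(with_hybrid, parents_only)
```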


Cornes, R., G. van der Schrier, E.J.M. van den Besselaar, and P.D. Jones. 2018: An Ensemble Version of the E-OBS Temperature and Precipitation Datasets, J. Geophys. Res. Atmos., 123.

Haylock, M. R., Hofstra, N., Klein Tank, A. M. G., Klok, E. J., Jones, P. D. & M. New (2008): A European daily high-resolution gridded data set of surface temperature and precipitation for 1950-2006. Journal of Geophysical Research Atmospheres, 113(20). https://doi.org/10.1029/2008JD010201

Stace, C.A., Preston, C.D. & D.A. Pearman (2015): Hybrid flora of the British Isles. Botanical Society of Britain & Ireland. 501pp.

How to cite: Pfeifer, S., Bülow, K., and Marien, L.: Connecting big data from climate and biology using statistics and machine learning techniques, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-7125, https://doi.org/10.5194/egusphere-egu2020-7125, 2020

D2393 |
Muhammad Rizwan Asif*, Thue Sylvester Bording, Adrian S. Barfod, Bo Zhang, Jakob Juul Larsen, and Esben Auken

Inversion of large-scale time-domain electromagnetic surveys is a time-consuming and computationally expensive task. Probabilistic or deterministic methodologies, such as Monte Carlo inversion or Gauss-Newton methods, require repeated calculation of forward responses, and, depending on the methodology and survey size, the number of forward responses can reach from thousands to millions. In this study, we propose a machine learning based forward modelling approach in order to significantly decrease the time required to calculate the forward responses, and thus also the inversion time. We employ a fully-connected feed-forward neural network to approximate the forward modelling process. For training of the network, we generated 93,500 forward responses using AarhusInv with resistivity models derived from 9 surveys at different locations in Denmark, representing a Quaternary geological setting. The resistivity models are discretized into 30 layers with logarithmically increasing thicknesses down to 300 m, with resistivities ranging from 1 to 1,000 Ω·m. The forward responses were modelled with 14 gates/decade from 10⁻⁷ s to 10⁻² s. To ensure better network convergence, the input resistivity models are normalized after logarithmically transforming them. Furthermore, the network target outputs, i.e. the forward responses, are globally normalized, where each gate is normalized in relation to the maximum and minimum values of that gate across the training set. This ensures each gate is prioritized equally.
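The normalisation scheme described above can be sketched as follows; array shapes and values are illustrative only.

```python
import numpy as np

# Input resistivity models are log-transformed then min-max scaled; each
# time gate of the forward response is min-max normalised per gate (column)
# across the training set, so all gates carry comparable weight in the loss.

def normalise_resistivity(models):
    """models: (n_samples, n_layers), resistivities in ohm-m."""
    logs = np.log10(models)
    return (logs - logs.min()) / (logs.max() - logs.min())

def normalise_gates(responses):
    """responses: (n_samples, n_gates); per-gate min-max scaling."""
    lo = responses.min(axis=0, keepdims=True)
    hi = responses.max(axis=0, keepdims=True)
    return (responses - lo) / (hi - lo)

models = np.array([[1.0, 10.0, 100.0], [5.0, 50.0, 1000.0]])
norm_models = normalise_resistivity(models)
responses = np.array([[1e-6, 2e-9], [3e-6, 8e-9], [2e-6, 5e-9]])
norm_gates = normalise_gates(responses)
# Each gate (column) now spans exactly [0, 1]
```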

The network performance is evaluated on a test set derived from a separate survey containing 5,978 resistivity models, by directly comparing the neural-network-based forward responses to the AarhusInv forward responses. The performance is exceptionally good, with 99.32% of all gates accurate to within 3% relative error, which is comparable to data uncertainty. The time derivatives of the generated forward models, dB/dt, are also computed by convolving with a transmitter waveform. The dB/dt accuracy is 86.2%, but improves to 98.02% within 3% error after post-processing the forward responses with a local smoothing algorithm: the low dynamic range of the target outputs induces rounding/truncation errors, which lead to jagged responses and thus increase the error when the waveform is applied to the unprocessed forward responses. However, the 1.98% of gates that exceed the 3% error after post-processing lie within typical data uncertainty, ensuring suitability for use in inversion schemes.

The proposed forward modelling strategy is up to 17 times faster than commonly used accurate modelling methods, and may be incorporated into either deterministic or probabilistic inversion algorithms, allowing for significantly faster inversion of large datasets.  

A TEM system having a 40 m × 40 m central loop configuration was selected for this study. However, in principle, any geometry can be applied. Additionally, the proposed scheme can be extended for other systems, such as airborne EM systems by considering the altitude as an extra input parameter.

How to cite: Asif*, M. R., Bording, T. S., Barfod, A. S., Zhang, B., Larsen, J. J., and Auken, E.: Improving computational efficiency of forward modelling for ground-based time-domain electromagnetic data using neural networks, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-7067, https://doi.org/10.5194/egusphere-egu2020-7067, 2020

D2394 |
Xi Chen, Ruyi Yu, Sajid Ullah, Dianming Wu, Min Liu, Yonggui Huang, Hongkai Gao, Jie Jiang, and Ning Nie

Wind speed forecasting is very important for many real-life applications, especially for the control and monitoring of wind power plants. Owing to the non-linearity of wind speed time series, it is hard to improve forecasting accuracy, especially several days ahead. In order to improve forecasting performance, many forecasting models have been proposed. Recently, deep learning models have received great attention, since they outperform conventional machine learning models. The majority of existing deep learning models take the mean squared error (MSE) loss as the loss function for forecasting. The MSE loss weights all samples equally. Consequently, it hinders further improvement of forecasting performance on non-linear wind speed time series data.
In this work, we propose a new weighted MSE loss function for wind speed forecasting based on deep learning. As is well known, the training procedure is dominated by easy-training samples, which makes the computation ineffective and inefficient. In the new weighted MSE loss function, the loss weights of easy-training samples are automatically reduced according to their contribution, so that the total loss mainly focuses on hard-training samples. To verify the new loss function, the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) have been used as base models.
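The abstract does not give the exact weighting formula; the sketch below shows one plausible error-dependent weighted MSE in this spirit, with a hypothetical focusing exponent `gamma` that down-weights easy (small-error) samples.

```python
import numpy as np

# Plausible sketch of an error-dependent weighted MSE (not the authors'
# exact formula): per-sample weights grow with the absolute error, so
# easy-training samples contribute less and hard samples dominate the loss.

def weighted_mse(y_true, y_pred, gamma=1.0):
    err = np.abs(y_true - y_pred)
    weights = err ** gamma                        # hypothetical weighting
    weights = weights / (weights.sum() + 1e-12)   # normalise to sum to 1
    return float(np.sum(weights * (y_true - y_pred) ** 2))

y_true = np.array([4.0, 5.0, 6.0, 7.0])
easy = np.array([4.1, 5.1, 6.1, 7.1])             # uniformly small errors
hard = np.array([4.1, 5.1, 6.1, 9.0])             # one hard sample
loss_easy = weighted_mse(y_true, easy)
loss_hard = weighted_mse(y_true, hard)            # dominated by the hard sample
```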
A number of experiments have been carried out using open wind speed time series data collected from China and the United States to demonstrate the effectiveness of the new loss function with the three popular models. The performance of the models has been evaluated through statistical error measures such as the mean absolute error (MAE). The MAE with the proposed weighted MSE loss is up to 55% lower than with the traditional MSE loss. The experimental results indicate that the new weighted loss function can outperform the popular MSE loss function in wind speed forecasting.

How to cite: Chen, X., Yu, R., Ullah, S., Wu, D., Liu, M., Huang, Y., Gao, H., Jiang, J., and Nie, N.: A new weighted MSE loss for wind speed forecasting based on deep learning models, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-12899, https://doi.org/10.5194/egusphere-egu2020-12899, 2020

D2395 |
Mikhail Krinitskiy, Svyatoslav Elizarov, Alexander Gavrikov, and Sergey Gulev

Accurate simulation of the physics of the Earth's atmosphere involves solving partial differential equations with a number of closures handling subgrid processes. In some cases, the parameterizations may approximate the physics well. However, there is always room for improvement, which is often known to be computationally expensive. Thus, at the moment, modeling of the atmosphere is a theatre for a number of compromises between the accuracy of physics representation and its computational costs.

At the same time, some of the parameterizations are naturally empirical. They can be improved further based on the data-driven approach, which may provide increased approximation quality, given the same or even lower computational costs. In this perspective, a statistical model that learns a data distribution may deliver exceptional results. Recently, Generative Adversarial Networks (GANs) were shown to be a very flexible model type for approximating distributions of hidden representations in the case of two-dimensional visual scenes, a.k.a. images. The same approach may provide an opportunity for the data-driven approximation of subgrid processes in case of atmosphere modeling.

In our study, we present a novel approach for approximating subgrid processes based on conditional GANs. As a proof of concept, we present preliminary results on the downscaling of surface wind over the North Atlantic Ocean. We explore the potential of the presented approach in terms of the speedup of the downscaling procedure compared to dynamical simulations such as WRF model runs. We also study the potential of additional regularizations applied to improve the cGAN learning procedure, as well as the resulting generalization ability and accuracy.
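
As an illustration of the conditioning mechanism only (the layer sizes, weights and single-layer networks below are toy assumptions, not the authors' model), both the generator and the discriminator of a cGAN receive the conditioning field concatenated to their main input:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, cond, W):
    """Toy conditional generator: concatenate latent noise with the
    conditioning field (e.g. a flattened coarse-resolution wind patch)
    and apply a single linear + tanh layer."""
    x = np.concatenate([z, cond])
    return np.tanh(W @ x)

def discriminator(sample, cond, w):
    """Toy conditional discriminator: probability that the pair
    (sample, cond) comes from the real data distribution."""
    x = np.concatenate([sample, cond])
    return 1.0 / (1.0 + np.exp(-w @ x))

z = rng.normal(size=4)         # latent noise vector
cond = rng.normal(size=8)      # coarse wind field (flattened condition)
W = rng.normal(size=(16, 12))  # maps 4 + 8 inputs to a 16-pixel "fine" field
w = rng.normal(size=24)        # discriminator weights over 16 + 8 inputs

fake = generator(z, cond, W)
score = discriminator(fake, cond, w)
```

During adversarial training the generator would be updated to push `score` towards 1 while the discriminator learns to separate real from generated fine-scale fields given the same condition.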

How to cite: Krinitskiy, M., Elizarov, S., Gavrikov, A., and Gulev, S.: Downscaling of surface wind speed over the North Atlantic using conditional Generative Adversarial Networks, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13686, https://doi.org/10.5194/egusphere-egu2020-13686, 2020

D2396 |
Gustau Camps-Valls, Daniel Svendsen, Luca Martino, Adrian Pérez-Suay, Maria Piles, and Jordi Muñoz-Marí

Earth observation from remote sensing satellites allows us to monitor the processes occurring on the land cover, water bodies and the atmosphere, as well as their interactions. In the last decade machine learning has impacted the field enormously due to the unprecedented data deluge and the emergence of complex problems that need to be tackled (semi)automatically. One of the main problems is to perform estimation of bio-geo-physical parameters from remote sensing observations. In this model inversion setting, Gaussian processes (GPs) are one of the preferred choices for model inversion, emulation, gap filling and data assimilation. GPs not only provide accurate predictions but also allow for feature ranking, derivation of confidence intervals, and error propagation and uncertainty quantification in a principled Bayesian inference framework.

Here we introduce GPs for data analysis in general and to address the forward-inverse problem posed in remote sensing in particular. GPs are typically used for inverse modelling based on concurrent observations and in situ measurements only, or to invert model simulations. We often rely on a forward radiative transfer model (RTM) encoding the well-understood physical relations to either perform model inversion with machine learning, or to replace the RTM model with machine learning models, a process known as emulation. We review four novel GP models that respect and learn the physics, and deploy useful machine learning models for remote sensing parameter retrieval and model emulation tasks. First, we will introduce a Joint GP (JGP) model that combines in situ measurements and simulated data in a single GP model for inversion. Second, we present a latent force model (LFM) for GP modelling that encodes ordinary differential equations to blend data and physical models of the system. The LFM performs multi-output regression, can cope with missing data in the time series, and provides explicit latent functions that allow system analysis, evaluation and understanding. Third, we present an Automatic Gaussian Process Emulator (AGAPE) that approximates the forward physical model via interpolation, reducing the number of necessary nodes. Finally, we introduce a new GP model for data-driven regression that respects fundamental laws of physics via dependence-regularization, and provides consistency estimates. All models attain data-driven physics-aware modeling. Empirical evidence of the performance of these models will be presented through illustrative examples of vegetation/land monitoring involving multispectral (Landsat, MODIS) and passive microwave (SMOS, SMAP) observations, as well as blending data with radiative transfer models, such as PROSAIL, SCOPE and MODTRAN.
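
For readers unfamiliar with GP regression, a minimal sketch of the standard posterior equations with an RBF kernel (the generic building block, not the specialised JGP, LFM or AGAPE models above) is:

```python
import numpy as np

def rbf(X1, X2, ell=1.0, sigma=1.0):
    """Squared-exponential (RBF) kernel for 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sigma**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xs, noise=1e-3):
    """GP posterior mean and variance at test inputs Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))   # training covariance + noise
    Ks = rbf(X, Xs)                          # train-test cross covariance
    Kss = rbf(Xs, Xs)                        # test covariance
    alpha = np.linalg.solve(K, y)
    mean = Ks.T @ alpha
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)

# Toy 1-D example: fit three noisy observations of a smooth function.
X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
mean, var = gp_posterior(X, y, np.array([1.0]))
```

The posterior variance is what gives GPs their principled confidence intervals and uncertainty propagation, as stressed in the abstract.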


"A Perspective on Gaussian Processes for Earth Observation". Camps-Valls et al. National Science Review 6 (4) :616-618, 2019

"Physics-aware Gaussian processes in remote sensing". Camps-Valls et al. Applied Soft Computing 68 :69-82, 2018

"A Survey on Gaussian Processes for Earth Observation Data Analysis: A Comprehensive Investigation". Camps-Valls et al. IEEE Geoscience and Remote Sensing Magazine 2016


How to cite: Camps-Valls, G., Svendsen, D., Martino, L., Pérez-Suay, A., Piles, M., and Muñoz-Marí, J.: Advances in Gaussian Processes for Earth Sciences: Physics-aware, interpretability and consistency, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-14677, https://doi.org/10.5194/egusphere-egu2020-14677, 2020

D2397 |
| Highlight
Laura Mansfield, Peer Nowack, Matt Kasoar, Richard Everitt, William Collins, and Apostolos Voularakis

Furthering our understanding of regional climate change responses to different greenhouse gas and aerosol emission scenarios is pivotal to inform societal adaptation and mitigation measures. However, complex General Circulation Models (GCMs) used for decadal to centennial climate change projections are computationally expensive. Here we have utilised a unique dataset of existing global climate model simulations to show that a novel machine learning approach can learn relationships between short-term and long-term temperature responses to different climate forcings, which in turn can accelerate climate change projections. This approach could reduce the costs of additional scenario computations and uncover consistent early indicators of long-term climate responses.

We have explored several statistical techniques for this supervised learning task and here we present predictions made with Ridge regression and Gaussian process regression. We have compared the results to pattern scaling as a standard simplified approach for estimating regional surface temperature responses under varying climate forcing scenarios. In this research, we highlight key challenges and opportunities for data-driven climate model emulation, especially with regards to the use of even larger model datasets and different climate variables. We demonstrate the potential to apply our method for gaining new insights into how and where ongoing climate change can be best detected and extrapolated; proposing this as a blueprint for future studies and encouraging data collaborations among research institutes in order to build ever more accurate climate response emulators.
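
As a schematic of the supervised setup (with synthetic stand-in data; the real predictors and targets are short-term and long-term regional temperature responses from the GCM simulations), closed-form ridge regression maps one response matrix to the other:

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam I)^(-1) X^T Y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ Y)

# Hypothetical setup: rows = forcing scenarios, columns = grid cells.
# X holds short-term regional temperature responses, Y the long-term
# responses to be predicted. Values below are synthetic stand-ins.
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 5))
true_W = rng.normal(size=(5, 5))
Y = X @ true_W + 0.01 * rng.normal(size=(20, 5))

W = ridge_fit(X, Y, lam=0.1)
Y_hat = X @ W
```

The regularisation term `lam` controls the bias-variance trade-off, which matters here because the number of available GCM scenarios is small relative to the number of grid cells.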

How to cite: Mansfield, L., Nowack, P., Kasoar, M., Everitt, R., Collins, W., and Voularakis, A.: Can we predict global patterns of long-term climate change from short-term simulations?, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13963, https://doi.org/10.5194/egusphere-egu2020-13963, 2020

D2398 |
Yueling Ma, Carsten Montzka, Bagher Bayat, and Stefan Kollet

Groundwater is the dominant source of fresh water in many European countries. However, due to a lack of near-real-time water table depth (wtd) observations, monitoring of groundwater resources is not feasible at the continental scale. Thus, an alternative approach is required to produce wtd data from other available observations in near-real-time. In this study, we propose Long Short-Term Memory (LSTM) networks to model monthly wtd anomalies over Europe utilizing monthly precipitation anomalies as input. LSTM networks are a special type of artificial neural network, showing great promise in exploiting long-term dependencies in time series, which are expected in the response of groundwater to precipitation. To establish the methodology, spatially and temporally continuous data from terrestrial simulations at the continental scale were applied with a spatial resolution of 0.11°, spanning the years 1996 to 2016 (Furusho-Percot et al., 2019). They were divided into a training set (1996–2012), a validation set (2012–2014) and a testing set (2015–2016) to construct local models on selected pixels over eight PRUDENCE regions. The outputs of the LSTM networks showed good agreement with the simulation results in locations with a shallow wtd (~3 m). It is important to note that the quality of the models was strongly affected by the amount of snow cover. Moreover, with the introduction of monthly evapotranspiration anomalies as additional input, pronounced improvements of the network performances were only obtained in more arid regions (i.e., the Iberian Peninsula and the Mediterranean). Our results demonstrate the potential of LSTM networks to produce high-quality wtd anomalies from hydrometeorological variables that are monitored at the large scale and are part of operational forecasting systems, potentially facilitating the implementation of an efficient groundwater monitoring system over Europe.
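
The gating mechanism that lets an LSTM retain the long-term dependencies mentioned above, such as the slow groundwater response to precipitation, can be sketched in a few lines (a single cell step with random toy weights, not the trained networks of the study):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; W, U, b stack the four gates row-wise."""
    n = h.size
    z = W @ x + U @ h + b
    i = sigmoid(z[0:n])        # input gate
    f = sigmoid(z[n:2*n])      # forget gate: controls long-term memory
    o = sigmoid(z[2*n:3*n])    # output gate
    g = np.tanh(z[3*n:4*n])    # candidate cell state
    c_new = f * c + i * g      # cell state carries long-range information
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Toy run: one input feature (precipitation anomaly), hidden size 3.
rng = np.random.default_rng(0)
W = rng.normal(size=(12, 1))
U = rng.normal(size=(12, 3))
b = np.zeros(12)
h, c = np.zeros(3), np.zeros(3)
for p in [0.2, -0.1, 0.4]:     # three monthly precipitation anomalies
    h, c = lstm_step(np.array([p]), h, c, W, U, b)
```

The cell state `c` is only multiplicatively decayed by the forget gate, which is what allows information from precipitation months far in the past to influence the predicted wtd anomaly.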


Furusho-Percot, C., Goergen, K., Hartick, C., Kulkarni, K., Keune, J. and Kollet, S. (2019). Pan-European groundwater to atmosphere terrestrial systems climatology from a physically consistent simulation. Scientific Data, 6(1).

How to cite: Ma, Y., Montzka, C., Bayat, B., and Kollet, S.: Modeling of groundwater table depth anomalies using Long Short-Term Memory networks over Europe, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-5367, https://doi.org/10.5194/egusphere-egu2020-5367, 2020

D2399 |
Siddhant Agarwal, Nicola Tosi, Doris Breuer, Sebastiano Padovan, and Pan Kessel

The parameters and initial conditions governing mantle convection in terrestrial planets like Mars are poorly known, meaning that one often needs to randomly vary several parameters to test which ones satisfy observational constraints. However, running forward models in 2D or 3D is computationally intensive to the point that it might prohibit a thorough scan of the entire parameter space. We propose using Machine Learning to find a low-dimensional mapping from input parameters to outputs. We use about 10,000 thermal evolution simulations with Mars-like parameters run on a 2D quarter cylindrical grid to train a fully-connected Neural Network (NN). We use the code GAIA (Hüttig et al., 2013) to solve the conservation equations of mantle convection for a fluid with Newtonian rheology and infinite Prandtl number under the Extended Boussinesq Approximation. The viscosity is calculated according to the Arrhenius law of diffusion creep (Hirth & Kohlstedt, 2003). The model also considers the effects of partial melting on the energy balance, including mantle depletion of heat-producing elements (Padovan et al., 2017), as well as major phase transitions in the olivine system. 

To generate the dataset, we randomly vary five different parameters with respect to each other: the thermal Rayleigh number, the internal heating Rayleigh number, the activation energy, the activation volume and a depletion factor for heat-producing elements in the mantle. In order to train in time, we take the simplest possible approach, i.e., we treat time as another variable in our input vector. 80% of the dataset is used to train our NN, 10% is used to test different architectures and to avoid over-fitting, and the remaining 10% is used as a test set to evaluate the error of the predictions. For given values of the five parameters, our NN can predict the resulting horizontally-averaged temperature profile at any time in the evolution, spanning 4.5 Ga, with an average error under 0.3% on the test set. Tests indicate that with as few as 5% of the training samples (= simulations × time steps), one can achieve a test error below 0.5%, suggesting that for this setup, one can potentially learn the mapping from fewer simulations. 
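
The treat-time-as-an-input construction and the 80/10/10 split can be sketched as follows (array sizes and the random stand-in targets are illustrative, not the GAIA outputs):

```python
import numpy as np

# Hypothetical dataset construction: each sample is the five varied
# parameters plus time; the target is the horizontally averaged
# temperature profile at that time (reduced here to 4 depth nodes).
rng = np.random.default_rng(1)
n_sims, n_steps, n_depth = 100, 50, 4

params = rng.uniform(size=(n_sims, 5))           # Ra, Ra_H, E, V, depletion
times = np.linspace(0.0, 4.5, n_steps)           # Ga
profiles = rng.uniform(size=(n_sims, n_steps, n_depth))  # stand-in targets

X = np.array([np.concatenate([params[s], [t]])
              for s in range(n_sims) for t in times])
y = profiles.reshape(n_sims * n_steps, n_depth)

# 80 / 10 / 10 split, as in the abstract.
n = len(X)
idx = rng.permutation(n)
train, val, test = np.split(idx, [int(0.8 * n), int(0.9 * n)])
```

Treating time as a sixth input lets a single feed-forward network interpolate the whole evolution instead of requiring a recurrent architecture.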

Finally, we ran a fourth batch of GAIA simulations and compared them to the output of our NN. In almost all cases, the instantaneous predictions of the 1D temperature profiles from the NN match those of the computationally expensive simulations extremely well, with an error below 0.5%.

How to cite: Agarwal, S., Tosi, N., Breuer, D., Padovan, S., and Kessel, P.: Mars’ thermal evolution from machine-learning-based 1D surrogate modelling , EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-16162, https://doi.org/10.5194/egusphere-egu2020-16162, 2020

D2400 |
Verena Bessenbacher, Lukas Gudmundsson, and Sonia I. Seneviratne

The past decades have seen massive advances in generating Earth System observations. A plethora of instruments is, at any point in time, taking remote measurements of the Earth's surface aboard satellites. This bird's-eye view of the land surface has become invaluable to the climate science and hydrology communities. However, the same variable is often observed by several platforms with contrasting results, and satellite observations have non-trivial patterns of missing values. Consequently, usually only one remote sensing product is used at a time. This, together with the inherent missingness of the datasets, has led to a fragmentation of the observational record that limits the widespread use of remotely sensed land observations. We aim towards a generalized framework for gap-filling global, high-resolution remote sensing measurements relevant for the terrestrial water cycle, focusing on ESA microwave soil moisture, land surface temperature and GPM precipitation. To this end, we explore statistical imputation methods and benchmark them using a "perfect dataset approach", in which we apply the missingness pattern of the remote sensing datasets onto their matching variables in the ERA5 reanalysis data. Original and imputed values are subsequently compared for benchmarking. Our highly modular approach iteratively produces estimates for the missing values and fits a model to the whole dataset, in an expectation-maximisation-like fashion. This procedure is repeated until the estimates for the missing data points converge. The method harnesses the highly structured nature of gridded, covarying observation datasets within the flexible function-learning toolbox of data-driven approaches. The imputation utilises (1) the temporal autocorrelation and spatial neighborhood within one variable or dataset and (2) the different missingness patterns across different variables or datasets, i.e. the fact that if one variable at a given point in space and time is missing, another covarying variable might be observed and their local covariance could be learned. A method based on simple ridge regression has been shown to perform best in terms of both results and computational cost, and is able to outperform simple "ad-hoc" gap-filling procedures. This model, once thoroughly tested, will be applied to gap-fill real satellite data and create an inherently consistent dataset that is based exclusively on observations.
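
A minimal sketch of such an expectation-maximisation-like ridge imputation loop (simplified to a plain data matrix of covarying columns, ignoring the spatio-temporal feature construction described above) is:

```python
import numpy as np

def iterative_ridge_impute(X, mask, lam=1.0, n_iter=10):
    """EM-like imputation: initialise missing entries (mask == True) with
    column means, then repeatedly regress each gappy column on the others
    with ridge regression and refresh the missing entries."""
    X = X.copy()
    col_means = np.nanmean(np.where(mask, np.nan, X), axis=0)
    for j in range(X.shape[1]):
        X[mask[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            miss = mask[:, j]
            if not miss.any():
                continue
            others = np.delete(X, j, axis=1)
            A, y = others[~miss], X[~miss, j]
            w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
            X[miss, j] = others[miss] @ w
    return X

# Toy demo: three strongly covarying "variables" with gaps in the first.
rng = np.random.default_rng(0)
t = rng.normal(size=50)
X_true = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(50, 3))
mask = np.zeros((50, 3), dtype=bool)
mask[:5, 0] = True                 # first five values of column 0 missing
X_obs = np.where(mask, np.nan, X_true)
X_imp = iterative_ridge_impute(X_obs, mask, lam=0.1)
```

The loop exploits exactly the property named in the abstract: where one column is missing, covarying columns are observed and their learned relation fills the gap far better than a column mean would.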

How to cite: Bessenbacher, V., Gudmundsson, L., and Seneviratne, S. I.: Towards a generalized framework for missing value imputation of fragmented Earth observation data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-8823, https://doi.org/10.5194/egusphere-egu2020-8823, 2020

D2401 |
Tianle Yuan

Marine low clouds display rich mesoscale morphological types, i.e., distinct spatial patterns of cloud fields. Being able to differentiate low cloud morphology offers the research community a tool to go one step beyond bulk cloud statistics such as cloud fraction and to advance the understanding of low clouds. Here we report the progress of a NASA-funded project that aims to create an observational record of low cloud mesoscale morphology at a near-global (60S-60N) scale. First, a training set is created by our team members manually labeling thousands of mesoscale (128x128) MODIS scenes into six different categories: stratus, closed cellular convection, disorganized convection, open cellular convection, clustered cumulus convection, and suppressed cumulus convection. Then we train a deep convolutional neural network model using this training set to classify individual MODIS scenes at 128x128 resolution, and test it on a test set. The trained model achieves a cross-type average precision of about 93%. We apply the trained model to 16 years of data over the Southeast Pacific. The resulting climatological distribution of low cloud morphology types shows both expected and unexpected features and suggests promising potential for low cloud studies as a data product.
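
For clarity, the quoted metric can be read as the mean of per-class precisions over the morphology types; a plain-Python sketch with made-up labels is:

```python
def cross_type_average_precision(y_true, y_pred, classes):
    """Mean of per-class precision over the cloud morphology types
    (a reading of the metric quoted above; labels below are made up)."""
    precisions = []
    for c in classes:
        predicted = [t for t, p in zip(y_true, y_pred) if p == c]
        if predicted:  # only classes the model actually predicted
            tp = sum(1 for t in predicted if t == c)
            precisions.append(tp / len(predicted))
    return sum(precisions) / len(precisions)

# Toy example with two of the six types.
y_true = ["stratus", "stratus", "open", "open"]
y_pred = ["stratus", "open", "open", "open"]
score = cross_type_average_precision(y_true, y_pred, ["stratus", "open"])
```

Averaging per class rather than per scene prevents the abundant types from dominating the headline number.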

How to cite: Yuan, T.: Classifying Global Low-Cloud Morphology with a Deep Learning Model: Results and Potential Use, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-20208, https://doi.org/10.5194/egusphere-egu2020-20208, 2020

D2402 |
| Highlight
Jonathan Rizzi, Ingvild Nystuen, Misganu Debella-Gilo, and Nils Egil Søvde

Recent years have seen an exponential increase of remote sensing datasets coming from different sources (satellites, airplanes, UAVs) at different resolutions (up to a few cm) based on different sensors (single-band sensors, hyperspectral cameras, LIDAR, ...). At the same time, IT developments are allowing for the storage of very large datasets (up to petabytes) and their efficient processing (through HPC, distributed computing, use of GPUs). This has allowed for the development and diffusion of many libraries and packages implementing machine learning algorithms in a very efficient way. It has therefore become possible to apply machine learning (including deep learning methods such as convolutional neural networks) to spatial datasets with the aim of increasing the level of automation in the creation of new maps or the updating of existing maps. 

Within this context, the Norwegian Institute of Bioeconomy Research (NIBIO), has started a project to test and apply big data methods and tools to support research activity transversally across its divisions.  NIBIO is a research-based knowledge institution that utilizes its expertise and professional breadth for the development of the bioeconomy in Norway. Its social mission entails a national responsibility in the bioeconomy sector, focusing on several societal challenges including: i) Climate (emission reductions, carbon uptake and climate adaptation); ii) Sustainability (environment, resource management and production within nature and society's tolerance limits); iii) Transformation (circular economy, resource efficient production systems, innovation and technology development); iv) food; and v) economy.

The presentation will show results obtained with a focus on land cover mapping using different methods and different datasets, including satellite images and airborne hyperspectral images. Further, the presentation will focus on the criticalities related to automatic mapping from remote sensing datasets and the importance of the availability of large training datasets.

How to cite: Rizzi, J., Nystuen, I., Debella-Gilo, M., and Søvde, N. E.: From remote sensing to bioeconomy: how big data can improve automatic map generation, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-20655, https://doi.org/10.5194/egusphere-egu2020-20655, 2020

D2403 |
Aleksandra Wolanin, Gonzalo Mateo-García, Gustau Camps-Valls, Luis Gómez-Chova, Michele Meroni, Gregory Duveiller, You Liangzhi, and Luis Guanter

Estimating crop yields is becoming increasingly relevant under the current context of an expanding world population accompanied by rising incomes in a changing climate. Crop growth, crop development, and final grain yield are all determined by environmental conditions in a complex nonlinear manner. Machine learning (ML), and deep learning (DL) methods in particular, can account for such nonlinear relations between yield and its drivers. However, they typically lack transparency and interpretability, which in the context of yield forecasting is of great relevance. Here, we explore how to benefit from the increased predictive performance of DL methods without compromising the ability to interpret how the models achieve their results for an example of the wheat yield in the Indian Wheat Belt.

We applied a convolutional neural network to multivariate time series of meteorological and satellite-derived vegetation variables at a daily resolution to estimate the wheat yield in the Indian Wheat Belt. Afterwards, the features and yield drivers learned by the model were visualized and analyzed with the use of regression activation maps. The learned features were primarily related to the length of the growing season, temperature, and light conditions during the growing season. Our analysis showed that high yields in 2012 were associated with low temperatures accompanied by sunny conditions during the growing period. The proposed methodology can be used for other crops and regions in order to facilitate application of DL models in agriculture.
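
The regression activation map idea can be sketched for a CNN whose head is global average pooling followed by a linear output (shapes and weights below are stand-ins, not the trained wheat-yield model):

```python
import numpy as np

# Toy regression activation map (RAM): for a CNN ending in global average
# pooling plus a single linear output, the map is the last-layer feature
# maps weighted by the output weights. All values here are stand-ins.
rng = np.random.default_rng(0)
feature_maps = rng.random(size=(8, 30, 30))   # last conv layer: 8 channels
w_out = rng.normal(size=8)                    # linear regression head weights

# Global-average-pooling head: the predicted yield.
yield_pred = w_out @ feature_maps.mean(axis=(1, 2))

# Activation map: channel-wise weighted sum of the feature maps.
ram = np.tensordot(w_out, feature_maps, axes=(0, 0))
```

Because pooling and the weighted sum commute, the spatial mean of the map equals the prediction, so high-valued regions of `ram` literally show where the estimated yield comes from.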



Wolanin A., Mateo-García G., Camps-Valls G., Gómez-Chova L., Meroni M., Duveiller G., You L., Guanter L. (2020) Estimating and Understanding Crop Yields with Explainable Deep Learning in the Indian Wheat Belt. Environmental Research Letters.

How to cite: Wolanin, A., Mateo-García, G., Camps-Valls, G., Gómez-Chova, L., Meroni, M., Duveiller, G., Liangzhi, Y., and Guanter, L.: Explainable deep learning to predict and understand crop yield estimates, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-18163, https://doi.org/10.5194/egusphere-egu2020-18163, 2020

D2404 |
Maximilian Nölscher, Hartmut Häntze, Stefan Broda, Lena Jäger, Paul Prasse, and Silvia Makowski

The temporal prediction of groundwater levels plays an important role in groundwater management, such as the estimation of anthropogenic impacts as well as consequences of climatic changes. Therefore, the modeling of groundwater levels using physics-based approaches is an integral part of hydrogeology. However, data-driven approaches have only recently been used, in particular for the prediction of groundwater levels using machine learning techniques (e.g., Random Forest and Neural Networks). For this purpose, one model per observation well or time series is always set up. In order to further develop this, an approach is presented which uses a single model for the prediction of groundwater levels at several observation wells (n > 200). The model is a three-dimensional Convolutional Neural Net (CNN).
In addition to the time series of groundwater levels, meteorological data on precipitation (P) and temperature (T) serve as additional input channels. The CNN "sees" not only the P- or T-value of the grid cell in which the observation well lies, but also the surrounding values. This has the advantage that influences of meteorological patterns in the spatial vicinity of the observation well on the groundwater level can also be learned. The forecasts are calculated for periods of up to six months. In addition to the comparison with the measured values, the error averaged over all observation wells is compared to a baseline model for validation. To further improve predictability, the hyperparameters are optimized and other areal data (e.g., geology, soil properties, land use) are used as input. This model is intended to form the basis for a regionalized forecast of groundwater levels.
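
The idea of feeding the CNN a spatial neighbourhood of P and T around each well, rather than a single grid cell, can be sketched as a patch-extraction step (grids and indices here are stand-ins):

```python
import numpy as np

def patch(grid, row, col, half=2):
    """Extract a (2*half+1)^2 neighbourhood of a meteorological grid around
    the cell containing an observation well (edge-padded at the boundary)."""
    padded = np.pad(grid, half, mode="edge")
    return padded[row:row + 2 * half + 1, col:col + 2 * half + 1]

precip = np.arange(100.0).reshape(10, 10)   # stand-in precipitation grid
temp = np.ones((10, 10)) * 8.5              # stand-in temperature grid

# Stack the P and T patches as input channels for the CNN.
x = np.stack([patch(precip, 4, 4), patch(temp, 4, 4)])
```

The convolution layers can then learn how meteorological patterns around the well, not just at its own cell, drive the groundwater level.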

How to cite: Nölscher, M., Häntze, H., Broda, S., Jäger, L., Prasse, P., and Makowski, S.: Using Convolutional Neural Networks for the prediction of groundwater levels, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-2874, https://doi.org/10.5194/egusphere-egu2020-2874, 2020

D2405 |
Anouar Romdhane, Scott Bunting, Jo Eidsvik, Susan Anyosa, and Per Bergmo

With the increasingly visible effects of climate change and a growing awareness of its possible consequences, Carbon Capture and Storage (CCS) technologies are gaining momentum. Currently, preparations are being made in Norway for a full-scale CCS project in which CO2 will be stored in a deep saline aquifer. A possible candidate for such storage is Smeaheia, located in the North Sea.

One of the main risks related to large-scale storage projects is leakage of CO2 out of the storage complex. It is important to design measurement, monitoring and verification (MMV) plans addressing leakage risk together with other risks related to conformance and containment verification. In general, geophysical monitoring represents a significant part of storage monitoring costs. Tailored and cost-effective geophysical monitoring programs that consider the trade-off between value and cost are therefore required. A risk-based approach can be adopted to plan the monitoring, but another, more quantitative approach coming from decision analysis is that of value of information (VOI) analysis. In such an analysis one can define a decision problem and measure the value of information as the additional value obtained by purchasing information before making the decision.

In this work, we study the VOI of seismic data in the context of CO2 storage decision making. Our goal is to evaluate when a seismic survey has the highest value when it comes to detecting a potential leakage of CO2, in a dynamic decision problem where we can either stop or continue the injection. We describe the proposed workflow and illustrate it through a constructed case study using a simplified Smeaheia model. We combine Monte Carlo and statistical regression techniques to estimate the VOI at different times. In a first stage, we define the decision problem. We then efficiently generate 10000 possible distributions of CO2 saturation using a reduced-order reservoir simulation tool. We consider both leaking and non-leaking scenarios and account for uncertainties in petrophysical properties (porosity and permeability distributions). From the simulated saturations of CO2, we derive distributions of geophysical properties and model the corresponding seismic data. We then regress those values on the reference seismic data to estimate the VOI. We evaluate the use of two machine learning based regression techniques: k-nearest neighbours regression with principal components, and a convolutional neural network (CNN). Both results are compared. We observe that VOI estimates obtained using the k-nearest neighbours regression were consistently lower than the estimates obtained using the CNN. Through bootstrapping, we show that the k-nearest neighbours approach produced more stable VOI estimates compared to the neural network method. We analyse possible reasons for the high variability observed with neural networks and suggest means to mitigate them.
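
The decision-analytic core of VOI can be sketched with a toy Monte Carlo experiment (all probabilities and payoffs below are invented for illustration, and perfect information is assumed; the study instead estimates the posterior value by regression on simulated seismic data):

```python
import numpy as np

# Monte Carlo sketch of value of information for a stop/continue decision.
rng = np.random.default_rng(7)
n = 10000
leak = rng.random(n) < 0.1                  # assumed 10% leakage probability
v_continue = np.where(leak, -100.0, 10.0)   # toy payoffs of continuing
v_stop = 0.0                                # toy payoff of stopping

# Prior value: commit to the single best action without any data.
prior_value = max(v_continue.mean(), v_stop)

# Perfect-information value: pick the best action in each scenario.
posterior_value = np.maximum(v_continue, v_stop).mean()

voi = posterior_value - prior_value
```

Because the mean of a maximum is never below the maximum of a mean, VOI is non-negative; a seismic survey is worth acquiring when its VOI exceeds its cost.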


This publication has been produced with support from the NCCS Centre (NFR project number 257579/E20).

How to cite: Romdhane, A., Bunting, S., Eidsvik, J., Anyosa, S., and Bergmo, P.: A machine learning based monitoring framework for CO2 storage, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-11450, https://doi.org/10.5194/egusphere-egu2020-11450, 2020

D2406 |
Andrés Bell, Carlos Roberto Del-Blanco, Fernando Jaureguizar, Narciso García, and María José Jurado

Minerals are key resources for several industries, such as the manufacturing of high-performance components and the latest electronic devices. For the purpose of finding new mineral deposits, mineral interpretation is a task of great relevance in the mining and metallurgy sectors. However, it is usually a long, costly, laborious, and manual procedure. It involves the characterization of mineral samples in laboratories far from the mineral deposits and it is subject to human interpretation mistakes. To address the previous problems, an automatic mineral recognition system is proposed that analyzes, in real time, hyperspectral imagery acquired in different spectral ranges: VN-SWIR (Visible, Near and Short Wave Infrared) and LWIR (Long Wave Infrared). Thus, more efficient, faster, and more economic explorations are performed by analyzing mineral deposits in situ in the subsurface, instead of in laboratories. The developed system is based on a deep learning technique that implements a semantic segmentation neural network that considers spatial and spectral correlations. Two different databases composed of scanned drilled mineral cores from different mineral deposits have been used to evaluate the mineral interpretation capability. The first database contains hyperspectral images in the VN-SWIR range and the second one in the LWIR range. The obtained results show that the mineral recognition for the first database (VN-SWIR band) achieves 86% accuracy considering the following mineral classes: actinolite, amphibole, biotite-chlorite, carbonate, epidote, saponite, whitemica and whitemica-chlorite. For the second database (LWIR band), 90% accuracy has been obtained with the following mineral classes: albite, amphibole, apatite, carbonate, clinopyroxene, epidote, microcline, quartz, quartz-clay-feldspar and sulphide-oxide. The mineral recognition capability has also been compared between both spectral bands considering the common minerals in both databases. The results show a higher recognition performance in the LWIR band, which achieves 96% accuracy, than in the VN-SWIR bands, which achieve 85%. However, hyperspectral cameras covering the VN-SWIR range are significantly more economic than those covering the LWIR range, making them a very interesting option for low-budget systems while still offering good mineral recognition performance. On the other hand, there is a better recognition capability for those mineral categories with a higher number of samples in the databases, as expected.

Acknowledgement: This research was funded by the EIT Raw Materials through the Innovative geophysical logging tools for mineral exploration - 16350 InnoLOG Upscaling Project.

How to cite: Bell, A., Del-Blanco, C. R., Jaureguizar, F., García, N., and Jurado, M. J.: Mineral interpretation results using deep learning with hyperspectral imagery, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-19667, https://doi.org/10.5194/egusphere-egu2020-19667, 2020

D2407 |
Stephen Obrochta, Szilárd Fazekas, and Jan Morén

Imaging the split surface of sediment cores is a standard procedure across a range of geoscience fields. However, obtaining high-resolution, continuous images with very little distortion has traditionally required expensive and fragile line-scanning systems that may be difficult or impossible to transport into the field. Thus, many researchers take photographs of entire core sections, which may result in distortion, particularly at the upper and lower edges. Using computer vision techniques, we developed a set of open source tools for seamlessly stitching together a series of photographs, taken with any camera, of the split surface of a sediment core. The resulting composite image contains less distortion than a single photograph of the entire core section, particularly when combined with a simple camera sliding mechanism. The method allows for detection of and correction for variable camera tilt and rotation between adjacent pairs of images. We trained a deep neural network to post-process the images to automate the tedious task of segmenting the sediment core from the background, while also detecting the location of the accompanying scale bar and of cracks or other areas of coring-induced disturbance. A color reflectance record is then generated from the isolated core image, ignoring variations from, e.g., cracks and voids.
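
The stitching step rests on estimating the overlap between adjacent photographs; a 1-D sketch using normalised cross-correlation on a single intensity row (the released tools operate on full 2-D images and also handle tilt and rotation) is:

```python
import numpy as np

def overlap_offset(left, right, min_shift=8, max_shift=50):
    """Estimate how many pixels two adjacent core photos overlap by sliding
    the start of `right` along the end of `left` and maximising the
    normalised correlation. Overlaps below min_shift are ignored as
    degenerate. 1-D sketch over a single intensity row."""
    best, best_s = -np.inf, min_shift
    for s in range(min_shift, max_shift + 1):
        a = left[-s:] - left[-s:].mean()
        b = right[:s] - right[:s].mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0 and np.dot(a, b) / denom > best:
            best, best_s = np.dot(a, b) / denom, s
    return best_s

rng = np.random.default_rng(3)
row = rng.normal(size=200)            # stand-in brightness profile
left, right = row[:120], row[100:]    # photos overlapping by 20 pixels
shift = overlap_offset(left, right)
```

Once the per-pair offsets are known, the photographs can be composited into one continuous, low-distortion image of the core surface.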

How to cite: Obrochta, S., Fazekas, S., and Morén, J.: Using computer vision and deep learning for acquisition and processing of low-distortion sediment core images, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-20954, https://doi.org/10.5194/egusphere-egu2020-20954, 2020

D2408 |
Tree Species Identification in a Northern Temperate Forest in the United States Using Manifold Learning and Transfer Learning from Hyperspectral Data
Donghui Ma and Yun Shi
D2409 |
Fabian Romahn, Athina Argyrouli, Ronny Lutz, Diego Loyola, and Victor Molina Garcia

The satellites of the Copernicus program show the increasing relevance of properly handling the huge amounts of Earth observation data that are nowadays common in remote sensing. This is even more challenging if the processed data have to be provided in near real time (NRT), as for the cloud product from TROPOMI / Sentinel-5 Precursor (S5P) or the upcoming Sentinel-4 (S4) mission.

In order to solve the inverse problems that arise in the retrieval of cloud products, as well as in similar remote sensing problems, usually complex radiative transfer models (RTMs) are used. These are very accurate but also computationally very expensive, and therefore often not feasible under NRT requirements. With the recent significant breakthroughs in machine learning, and its easier application through better software and more powerful hardware, the methods of this field have become very interesting as a way to improve classical remote sensing algorithms.

In this presentation we show how artificial neural networks (ANNs) can be used to replace the original RTM in the ROCINN (Retrieval Of Cloud Information using Neural Networks) algorithm with sufficient accuracy while increasing the computational performance at the same time by several orders of magnitude.

We developed a general procedure which consists of smart sampling, generation and scaling of the training data, as well as training, validation and finally deployment of the ANN into the operational processor. In order to minimize manual work, the procedure is highly automated and uses recent technologies such as TensorFlow. It is applicable to any kind of RTM and can thus be used for many retrieval algorithms, as is already done for ROCINN in S5P and will soon be done for ROCINN in the context of S4. Regarding the final performance of the generated ANN, there are several critical parameters with a high impact (e.g. the structure of the ANN); these will be evaluated in detail. Furthermore, we also show general limitations of ANNs in comparison with RTMs, how these can lead to unexpected side effects, and ways to cope with these issues.
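The emulator workflow described above (sample, scale, split, train, validate) can be sketched in a few lines. This is a minimal illustration, not the operational ROCINN code: the "RTM" here is an invented toy function, and a small hand-written numpy network stands in for the TensorFlow model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an expensive radiative transfer model (RTM).
def toy_rtm(x):
    return np.sin(2 * x[:, :1]) + 0.5 * x[:, 1:2] ** 2

# 1) Sample the input space (here: plain uniform sampling).
X = rng.uniform(-1, 1, size=(2000, 2))
y = toy_rtm(X)

# 2) Scale inputs and outputs to zero mean, unit variance.
Xs = (X - X.mean(0)) / X.std(0)
ys = (y - y.mean(0)) / y.std(0)

# 3) Train/validation split.
X_tr, X_va = Xs[:1600], Xs[1600:]
y_tr, y_va = ys[:1600], ys[1600:]

# 4) One-hidden-layer MLP trained by full-batch gradient descent on MSE.
W1 = rng.normal(0, 0.5, (2, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 1)); b2 = np.zeros(1)
lr = 0.1
for epoch in range(5000):
    H = np.tanh(X_tr @ W1 + b1)                      # forward pass
    err = (H @ W2 + b2) - y_tr
    gW2 = H.T @ err / len(X_tr); gb2 = err.mean(0)   # backward pass
    dH = (err @ W2.T) * (1 - H ** 2)
    gW1 = X_tr.T @ dH / len(X_tr); gb1 = dH.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# 5) Validate the emulator against held-out "RTM" outputs.
val_pred = np.tanh(X_va @ W1 + b1) @ W2 + b2
rmse = np.sqrt(np.mean((val_pred - y_va) ** 2))
```

Once trained, evaluating the network costs a few matrix products per sample, which is what yields the orders-of-magnitude speed-up over the full RTM.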

With the example of ROCINN, as part of the operational S5P and upcoming S4 cloud product, we show the great potential of machine learning techniques in improving the performance of classical retrieval algorithms and thus increasing their capability to deal with much larger data quantities. However, we also highlight the importance of a proper configuration and possible limitations.

How to cite: Romahn, F., Argyrouli, A., Lutz, R., Loyola, D., and Molina Garcia, V.: Using Machine Learning for processing Big Data of Copernicus Satellite Sensors at the Example of the TROPOMI / Sentinel-5 Precursor and Sentinel-4 Cloud Product, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-10472, https://doi.org/10.5194/egusphere-egu2020-10472, 2020

D2410 |
Pavan Kumar Jonnakuti and Udaya Bhaskar Tata Venkata Sai

Sea surface temperature (SST) is a key variable of the global ocean that affects air-sea interaction processes. Forecasts based on statistics and conventional machine learning techniques have not succeeded in capturing the spatial and temporal relationships of time-series data. Therefore, to achieve precision in SST prediction, we propose a deep-learning-based model that produces a more realistic and accurate account of SST behaviour by focusing on both space and time. Our hybrid CNN-LSTM model uses multiple processing layers to learn hierarchical representations, implementing 3D and 2D convolutional neural networks as a way to better capture spatial features, while an LSTM examines the temporal sequence of relations in SST time-series satellite data. Extensive experiments, based on historical satellite datasets for the Indian Ocean region spanning from 1980 to the present, show that our proposed deep-learning-based CNN-LSTM model is highly capable of accurate short- and mid-term daily SST prediction, as judged by the error estimates (obtained from the LSTM) of the forecast datasets.

Keywords: Deep Learning, Sea Surface Temperature, CNN, LSTM, Prediction.


How to cite: Jonnakuti, P. K. and Tata Venkata Sai, U. B.: A hybrid CNN-LSTM based model for the prediction of sea surface temperature using time-series satellite data., EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-817, https://doi.org/10.5194/egusphere-egu2020-817, 2019

D2411 |
Tao Yan and Bo Chen

Establishing a reasonable and reliable dam deformation monitoring model is of great significance for effective analysis of dam deformation monitoring data and accurate assessment of dam working conditions. First, the dam deformation is decomposed by the EEMD algorithm to obtain IMF components representing different characteristic scales, and different influencing factors are selected for the different IMF components. Second, each IMF component is used as an ELM training sample to analyze, fit and predict the corresponding dam deformation component. Finally, the predictions of the IMF components are summed to obtain the dam deformation prediction. Taking a roller-compacted concrete gravity dam as an example, the EEMD-ELM model is used to predict the deformation of the dam and is compared with the predictions of a BPNN model and a plain ELM model. The mean square error of the EEMD-ELM model is 0.566, which is 54% and 14.8% lower than those of the BPNN and ELM models respectively, indicating that the EEMD-ELM model has higher prediction accuracy and practical application value.

Key words: dam deformation; prediction model; ensemble empirical mode decomposition; extreme learning machine
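An extreme learning machine is a single-hidden-layer network whose hidden weights are fixed at random and whose output weights are solved in closed form by least squares, which is why it trains so much faster than a BPNN. A minimal numpy sketch (the EEMD step, which in practice would come from a package such as PyEMD, is replaced here by one invented toy component):

```python
import numpy as np

rng = np.random.default_rng(42)

class ELM:
    """Extreme learning machine: random hidden layer + least-squares output."""
    def __init__(self, n_in, n_hidden=50):
        self.W = rng.normal(size=(n_in, n_hidden))
        self.b = rng.normal(size=n_hidden)
        self.beta = None

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        H = self._hidden(X)
        # Solve H @ beta ~= y in the least-squares sense: no iterative training.
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Toy "IMF component": lagged values of a smooth series predict the next value.
t = np.arange(400)
imf = np.sin(0.1 * t) + 0.2 * np.sin(0.35 * t)
lags = 5
X = np.column_stack([imf[i:i - lags] for i in range(lags)])
y = imf[lags:]

elm = ELM(n_in=lags).fit(X[:300], y[:300])
rmse = np.sqrt(np.mean((elm.predict(X[300:]) - y[300:]) ** 2))
```

In the EEMD-ELM model described above, one such ELM would be fitted per IMF component and the component predictions summed to give the deformation forecast.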

How to cite: Yan, T. and Chen, B.: Dam Deformation Prediction Based on EEMD-ELM Model, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-1220, https://doi.org/10.5194/egusphere-egu2020-1220, 2019

D2412 |
Nikolaos Ioannis Bountos, Melanie Brandmeier, and Mark Günter

Urban landscapes are among the fastest changing areas on the planet. However, regular monitoring of larger areas is not feasible using UAVs or costly airborne data. In these situations, satellite data with a high temporal resolution and large field of view are more appropriate but suffer from lower spatial resolution (tens of metres). In the present study we show that, using freely available Sentinel-2 data from the Copernicus program, we can extract anthropogenic features such as roads, railways and building footprints that are partly or completely at a sub-pixel level in this kind of data. Additionally, we propose a new metric for evaluating our methods on sub-pixel objects. This metric measures the performance of object detection while penalizing false positive classifications. Given that our training samples contain one class, we define two thresholds that represent the lower bound of accuracy for the object to be classified and for the background. We thus avoid a good score in cases where we classify our object correctly but a wide area of the background has been included in the prediction. We investigate the performance of different deep-learning architectures for sub-pixel classification of the different infrastructure elements based on Sentinel-2 multispectral data and labels derived from the UAV data. Our study area is located in the Rhone valley in Switzerland, where very high-resolution UAV data were available from the University of Applied Sciences. Highly accurate labels for the respective classes were digitized in ArcGIS Pro and used as ground truth for the Sentinel data. We trained different deep learning models based on state-of-the-art architectures for semantic segmentation, such as DeepLab and U-Net. Our approach focuses on exploiting the multispectral information to increase the performance over the RGB channels alone.
For that purpose, we make use of the NIR and SWIR 10 m and 20 m bands of the Sentinel-2 data. We investigate early and late fusion approaches and the behavior and contribution of each multispectral band to improving performance in comparison to only using the RGB channels. In the early fusion approach, we stack nine (RGB, NIR, SWIR) Sentinel-2 bands together, pass them through two convolutions followed by batch normalization and ReLU layers, and then feed the tiles to DeepLab. In the late fusion approach, we create a CNN with two branches, the first branch processing the RGB channels and the second branch the NIR/SWIR bands. We use modified DeepLab layers for the two branches and then concatenate the outputs into a total output of 512 feature maps. We then reduce the dimensionality of the result, in two convolution layers, to a final output equal to the number of classes. We experiment with different settings for all of the mentioned architectures. In the best-case scenario, we achieve 89% overall accuracy. Moreover, we measure 60% building accuracy, 60% street accuracy, 73% railway accuracy, 92% river accuracy and 94% background accuracy.

How to cite: Bountos, N. I., Brandmeier, M., and Günter, M.: Sub-pixel classification of anthropogenic features using deep-learning on Sentinel-2 data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-1242, https://doi.org/10.5194/egusphere-egu2020-1242, 2019

Chat time: Thursday, 7 May 2020, 10:45–12:30

Chairperson: S. Fiore, M. Kanevski
D2413 |
Derek Koehl, Carson Davis, Rahul Ramachandran, Udaysankar Nair, and Manil Maskey

Word embeddings are numeric representations of text which capture meanings and semantic relationships. Embeddings can be constructed using different methods such as one-hot encoding, frequency-based or prediction-based approaches. Prediction-based approaches such as Word2Vec can be used to generate word embeddings that capture the underlying semantics and word relationships in a corpus. Word2Vec embeddings generated from a domain-specific corpus have been shown in studies to both predict relationships and augment word vectors to improve classifications. We describe results from two different experiments utilizing word embeddings for Earth science, constructed from a corpus of over 20,000 journal papers using Word2Vec.

The first experiment explores the analogy prediction performance of word embeddings built from the Earth science journal corpus and trained using domain-specific vocabulary. Our results demonstrate that the accuracy of domain-specific word embeddings in predicting Earth science analogy questions outperforms the ability of general-corpus embeddings to predict general analogy questions. While the results are as anticipated, the substantial increase in accuracy, particularly in the lexicographical domain, was encouraging. The results point to the need for developing a comprehensive Earth science analogy test set that covers the full breadth of lexicographical and encyclopedic categories for validating word embeddings.
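The analogy task ("A is to B as C is to ?") is conventionally solved by vector arithmetic followed by a cosine-similarity search over the vocabulary. A toy illustration with invented 3-D "embeddings" (a real evaluation would use the Word2Vec vectors trained on the journal corpus):

```python
import numpy as np

# Toy embedding table; the vectors are invented for illustration only.
emb = {
    "stalactite": np.array([ 1.0,  0.9, 0.1]),
    "ceiling":    np.array([ 0.9,  1.0, 0.0]),
    "stalagmite": np.array([ 1.0, -0.9, 0.1]),
    "floor":      np.array([ 0.9, -1.0, 0.0]),
}

def analogy(a, b, c, emb):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b, c."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -2.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("stalactite", "ceiling", "stalagmite", emb))  # "floor"
```

A domain-specific test set would consist of many such quadruples drawn from Earth science vocabulary, scored by whether the top-ranked word matches the expected answer.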

The second experiment utilizes the word embeddings to augment metadata keyword classifications. Metadata describing NASA datasets have science keywords that are manually assigned, which can lead to errors and inconsistencies. These science keywords are a controlled vocabulary and are used to aid data discovery via faceted search and relevancy ranking. Given the small number of metadata records with proper descriptions and keywords, word embeddings were used for augmentation. A fully connected neural network was trained to suggest keywords given a description text. This approach provided the best accuracy, at ~76%, compared to the other methods tested.

How to cite: Koehl, D., Davis, C., Ramachandran, R., Nair, U., and Maskey, M.: Exploring Earth Science Applications using Word Embeddings, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-9966, https://doi.org/10.5194/egusphere-egu2020-9966, 2020

D2414 |
Xinping Bai, Zhongliang Lv, and Hui Wang

The Marine Weather Bulletin is the main weather service product of the China Central Meteorological Observatory. Based on five-kilometre grid forecast data, it comprehensively describes forecast information on wind force, wind direction, sea fog level and visibility in eighteen offshore areas of China, and is issued three times a day. In the traditional production process, the forecaster manually interprets the massive amount of information in the grid data, manually writes natural-language descriptions, including combined descriptions to highlight the overall trend, and finally edits the document manually, inserting graphics and formatting. This results in low writing efficiency and uneven quality that cannot meet requirements for timeliness, refinement and diversity. The automatic generation of marine weather bulletins has therefore become an urgent business need.

This paper proposes a method that uses GIS technology and natural language processing to develop a text feature extraction model for sea gales and sea fog, and finally uses Aspose technology to automatically generate marine weather bulletins based on custom templates.

First, GIS technology is used to extract the spatiotemporal characteristics of meteorological information. This includes converting grid data into vector area data and performing GIS spatial overlay and fusion analysis on the multi-level marine meteorological areas and Chinese sea areas, to extract information on the scale, influence area and time frequency of gales and fog in different geographic areas.

Next, natural language processing, an important method of artificial intelligence, is applied to the spatiotemporal information on marine weather elements, here mainly based on statistical machine learning. Through data mining of more than 1000 historical bulletins, content planning focuses on feeding large numbers of marine weather element words and cohesive words into automatic word segmentation, part-of-speech statistics and word extraction, then creating preliminarily classified text description templates for the different elements. Through long machine learning processes, sentence planning refines sea area filtering and merging rules, wind force and wind direction merging rules, sea fog visibility description rules, merging rules for different areas of the same sea area, merging rules for multiple forecast texts, etc. Based on these rules, omission, referencing and merging methods are used to make the descriptions more smooth, natural and refined.

Finally, based on Aspose technology, a custom template is used to automatically generate marine weather bulletins. Through file conversion, data mining, data filtering and noise removal of historical bulletins, a document template is established in which the constant and variable domains are separated and general formats are customized. The Aspose tool then loads the template, fills its variable fields with actual information, and finally exports an actual document.
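The fill-the-variable-fields step can be illustrated with Python's standard string.Template (the operational system uses Aspose and Word documents; the field names and values below are invented):

```python
from string import Template

# Invented bulletin template: constant text with variable fields.
bulletin = Template(
    "Marine Weather Bulletin, issued $issue_time.\n"
    "$area: $wind_dir winds force $wind_force, "
    "sea fog $fog_level, visibility $visibility km."
)

# Values as they would come from the GIS/NLP feature-extraction stage.
fields = {
    "issue_time": "07 May 2020 08:00",
    "area": "Bohai Sea",
    "wind_dir": "northeasterly",
    "wind_force": "6-7",
    "fog_level": "moderate",
    "visibility": "4-10",
}

print(bulletin.substitute(fields))
```

Separating the constant domain (template text) from the variable domain (extracted features) is what lets the same layout be reused three times a day without manual editing.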

Results show that the automatically generated text has precise spatial descriptions, accurate merging and no missed scales; the sentences are smooth, semantically and grammatically correct, and conform to forecasters' writing habits. The automatically generated bulletin effectively avoids common mistakes in manual editing and removes much tedious manual labor. This system has been put into operation at the China Central Meteorological Observatory, where it has greatly improved the efficiency of marine weather services.

How to cite: Bai, X., Lv, Z., and Wang, H.: Application of GIS technology and natural language processing technology in automatic generation of marine weather bulletin, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-6189, https://doi.org/10.5194/egusphere-egu2020-6189, 2020

D2415 |
James Hawkes, Nicolau Manubens, Emanuele Danovaro, John Hanley, Stephan Siemen, Baudouin Raoult, and Tiago Quintino
Every day, ECMWF produces ~120 TiB of raw weather data, represented as a six-dimensional dataset. This data is used to produce approximately 30 TiB of user-defined products, which are disseminated worldwide. The raw data is also stored in the world's largest meteorological archive (MARS), currently holding over 300 PiB of primary data, which is likewise served around the world on demand. As the resolution of ECMWF's global weather models increases over the next few years, the amount of raw data produced per day will grow into the petabytes, and distributing products and archived data in full becomes impossible. In-situ, on-the-fly data extraction and processing are required to sustain and increase the accessibility of ECMWF's big weather data.
To meet these requirements, ECMWF is developing Polytope, an open-source service which allows users to request arbitrary n-dimensional stencils ("polytopes") of data from highly structured n-dimensional datasets. The data extraction is performed server-side (collocated with the data), allowing for a large reduction in data volume prior to transmission and less complexity for the user. For example, a user could request a polytope describing a flight path, simultaneously crossing temporal and spatial axes. Polytope will return just a few bytes of data rather than large structured arrays of geospatial data which must be further post-processed by the user.
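The core idea, returning only the points along a path through a structured n-dimensional dataset rather than shipping whole fields, can be sketched in a few lines. This is an illustration with a toy 4-D array, not Polytope's actual API:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dataset: a field on a tiny (time, level, lat, lon) grid.
field = rng.normal(size=(24, 10, 90, 180))

# A "flight path" crossing time and space simultaneously:
# one (time, level, lat, lon) index tuple per waypoint.
path = [(0, 9, 45, 10), (1, 8, 46, 12), (2, 7, 47, 14), (3, 7, 48, 16)]

# Server-side extraction: fancy indexing returns only the values on the path.
idx = tuple(np.array(axis) for axis in zip(*path))
samples = field[idx]   # shape (4,): a few values instead of the whole field
```

Performing this indexing where the data lives is what turns a multi-gigabyte field transfer into a few bytes on the wire.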
Polytope is being partly developed under LEXIS, an EU-funded Horizon 2020 project which focuses on large-scale HPC & cloud workflows. The emphasis of LEXIS is on how HPC and cloud systems interact; how they can share data; and methods to compose workflows of tasks running on both cloud and HPC systems. Polytope will be used to provide a cross-centre weather and climate data API which connects to multiple high-performance data sources across Europe, and serves multiple cloud environments with this data.
This poster will present the early developments and future vision of Polytope. It will also illustrate how it is used within the LEXIS project to enable complex weather and climate workflows, involving global forecasts, regional forecasts and cloud-based simulations.

How to cite: Hawkes, J., Manubens, N., Danovaro, E., Hanley, J., Siemen, S., Raoult, B., and Quintino, T.: Polytope: Serving ECMWFs Big Weather Data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-15048, https://doi.org/10.5194/egusphere-egu2020-15048, 2020

D2416 |
Alexander Ivanov, Timophey Samsonov, Natalia Frolova, Maria Kireeva, and Elena Povalishnikova

Hydrological regime classification of the rivers of the Russian Plain has traditionally been done by hand, using subjective analysis of various characteristics of seasonal runoff. The last update to this classification was made in the early 1990s.

In this work we attempt to use different machine learning methods for objective classification. Both clustering (DBSCAN, K-Means) and classification (XGBoost) methods were used to establish 1) whether the established runoff types can be inferred from the data using a supervised approach and 2) whether similar clusters can be inferred from the data (unsupervised approach). Monthly runoff data for 237 rivers of the Russian Plain from 1945 to 2016 were used as the dataset.

In a first step, the dataset was divided into the periods 1945-1977 and 1978-2016 in an attempt to detect changes in river water regimes due to climate change. The monthly data were transformed into the following features: annual and seasonal runoff, runoff levels for different seasons, minimum and maximum values of monthly runoff, ratios of minimum and maximum runoff to the yearly average, and others. Supervised classification using the XGBoost method achieved 90% accuracy in water regime type identification for the 1945-1977 period. Shifts in water regime types for southern rivers of the Russian Plain in the Don region were identified by this classifier.

The DBSCAN clustering algorithm was able to identify six major clusters corresponding to existing water regime types: the Kola Peninsula, the north-east of the Russian Plain and the polar Urals, Central Russia, Southern Russia, the arid south-east, and the foothills and, separately, the higher altitudes of the Caucasus. Nonetheless, a better approach was sought because the clusters intersect owing to the continuous nature of the data. A cosine similarity metric was used as an alternative way to separate river runoff types, this time for each year. The yearly cutoff also allows us to construct a timeline of water regime changes over the course of 70 years. By using it as an objective ground truth, we plan to redo the earlier classification and clustering and establish an automated way to classify changes in water regime over time.
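Cosine similarity compares the shape of two annual hydrographs independently of their absolute discharge, which is why it suits regime typing. A toy sketch with invented 12-month runoff vectors (the real analysis uses the 237-river monthly dataset):

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented monthly runoff regime prototypes (Jan..Dec, arbitrary units).
snowmelt = np.array([1, 1, 2, 8, 10, 4, 2, 1, 1, 1, 1, 1], float)  # spring flood
rainfed  = np.array([3, 3, 3, 3, 2, 2, 2, 2, 3, 4, 4, 3], float)   # even, autumn rain

# One observed year: a spring-flood river at twice the discharge.
year = 2.0 * snowmelt + 0.1

sims = {"snowmelt": cos_sim(year, snowmelt), "rainfed": cos_sim(year, rainfed)}
regime = max(sims, key=sims.get)
print(regime)  # "snowmelt"
```

Because the metric ignores magnitude, the doubled-discharge year still matches its regime prototype almost perfectly, so each year of each river can be assigned to the nearest prototype.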

As a result, the following conclusions can be made:

  1. It is possible to train an accurate classifier based on established water regime types and apply it to detect changes in water regime types over the course of time
  2. By applying the classifier to different periods of time, we can detect a shift to the “southern” type of water regime in the central area of the Russian Plain
  3. Despite the highly continuous nature of the data, it seems possible to use a cosine similarity metric to separate water regime types into zones corresponding to the established ones

The study was supported by the Russian Science Foundation (grant No. 19-77-10032) for methods and by the Russian Foundation for Basic Research (grant No. 18-05-60021) for analyses in the Arctic region.

How to cite: Ivanov, A., Samsonov, T., Frolova, N., Kireeva, M., and Povalishnikova, E.: Objective classification of changes in water regime types of the Russian Plain rivers utilizing machine learning approaches, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-11553, https://doi.org/10.5194/egusphere-egu2020-11553, 2020

D2417 |
Tom Rowan and Adrian Butler

In order to enable community groups and other interested parties to evaluate the effects of flood management, water conservation and other hydrological issues, better localised mapping is required. Although some maps are publicly available, many are behind paywalls, especially those with three-dimensional features. In this study London is used as a test case to evaluate machine learning and rules-based approaches with open-source maps and LiDAR data to create more accurate representations (LOD2) of small-scale areas. Machine learning is particularly well suited to the recognition of local repetitive features like building roofs and trees, while roads can be identified and mapped best using a faster rules-based approach.

In order to create a useful LOD2 representation, a user interface, processing-rules manipulation and an assumption editor have all been incorporated. Features like randomly assigning sub-terrain features (basements) using Monte Carlo methods, and artificial sewage representation, enable the user to grow these models from open-source data into useful model inputs. This project is aimed at local-scale hydrological modelling, rainfall-runoff analysis and other local planning applications.
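Randomly assigning basements with a Monte Carlo method amounts to drawing each building from an assumed prior probability; the probability value and record attributes below are invented for illustration:

```python
import random

random.seed(7)

# Invented building records; in practice these come from open-source maps.
buildings = [{"id": i, "footprint_m2": random.uniform(40, 200)}
             for i in range(1000)]

P_BASEMENT = 0.35  # assumed prior probability that a building has a basement

def assign_basements(buildings, p=P_BASEMENT):
    """Monte Carlo draw: each building gets a basement with probability p."""
    for b in buildings:
        b["has_basement"] = random.random() < p
    return buildings

assign_basements(buildings)
share = sum(b["has_basement"] for b in buildings) / len(buildings)
```

Repeating such draws gives an ensemble of plausible sub-terrain configurations, so downstream hydrological results can be reported with an uncertainty range rather than a single guess.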


The goal is to provide turn-key data processing for small-scale modelling, which should help advance the installation of SuDS and other water management solutions, as well as having broader uses. The method is designed to enable fast and accurate representations of small-scale features (1 hectare to 1 km2), with larger-scale applications planned for future work. This work forms part of the CAMELLIA project (Community Water Management for a Liveable London) and aims to provide useful tools for local-scale modellers and possibly larger-scale industry and scientific users.

How to cite: Rowan, T. and Butler, A.: Towards opensource LOD2 modelling of urban spaces using an optimised machine learning and rules-based approach., EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-21911, https://doi.org/10.5194/egusphere-egu2020-21911, 2020

D2418 |
Yuwen Chen and Xiaomeng Huang

Statistical approaches have been used for decades to augment and interpret numerical weather forecasts. The emergence of artificial intelligence algorithms has provided new perspectives in this field, but extending algorithms developed for station networks with rich historical records to newly built stations remains a challenge. To address this, we design a framework that combines two machine learning methods: temperature prediction based on an ensemble of multiple machine learning models, and transfer learning for newly built stations. We then evaluate this framework by post-processing temperature forecasts provided by a leading weather forecast centre and observations from 301 weather stations in China. Station clustering reduces forecast errors by 24.4% on average, while transfer learning improves predictions by 13.4% for recently built sites with only one year of data available. This work demonstrates how ensemble learning and transfer learning can be used to supplement weather forecasting.
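The ensemble part of such a post-processing framework can be as simple as weighting member models by their historical skill at a station. A numpy sketch with synthetic forecasts (the member biases, noise levels and weighting scheme are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

truth = 15 + 5 * np.sin(np.linspace(0, 6, 200))      # "observed" temperature
# Three imperfect member models: truth plus a bias and noise each.
models = np.stack([truth + b + rng.normal(0, s, 200)
                   for b, s in [(1.0, 1.0), (-0.5, 1.5), (0.2, 0.7)]])

# Skill-based weights from a training window: inverse mean-squared error.
train = slice(0, 100)
mse = ((models[:, train] - truth[train]) ** 2).mean(axis=1)
w = (1 / mse) / (1 / mse).sum()

ensemble = w @ models                                # weighted combination
err_ens = np.abs(ensemble[100:] - truth[100:]).mean()
```

For a newly built station, transfer learning replaces the long training window: weights (or model parameters) learned at data-rich stations are reused and only lightly adjusted with the single year of local data.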

How to cite: Chen, Y. and Huang, X.: Numerical Weather Forecast Post-processing with Ensemble Learning and Transfer Learning, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-3885, https://doi.org/10.5194/egusphere-egu2020-3885, 2020

D2419 |
Pu-Yun Kow, Li-Chiu Chang, and Fi-John Chang

As living standards have improved, people have become increasingly concerned about air pollution. Taiwan faces the same problem, especially in the southern region. It is therefore a crucial task to rapidly provide reliable air quality information. This study intends to classify air quality images into, for example, “high pollution”, “moderate pollution” or “low pollution” categories in areas of interest, and we aim at a finer classification of air quality, i.e., more categories (5-6). To achieve this goal, we propose a hybrid model (CNN-FC) that integrates a convolutional neural network (CNN) and a fully-connected neural network for classifying the concentrations of PM2.5 and PM10 as well as the air quality index (AQI). Although implemented in many fields, regression classification has rarely been applied to air pollution problems. Image regression classification is useful for air pollution research, especially when some of the (more sophisticated) air quality detectors are malfunctioning. The hourly air quality datasets collected at Station Linyuan of Kaohsiung City in southern Taiwan form the case study for evaluating the applicability and reliability of the proposed CNN-FC approach. A total of 3549 datasets containing images (photos) and monitored data of PM2.5, PM10 and AQI are used to train and validate the constructed model. The proposed CNN-FC approach performs image regression classification by extracting important characteristics from images. The results demonstrate that the proposed CNN-FC model provides a practical and reliable approach to accurate image regression classification. The main breakthrough of this study is the image classification of several pollutants using only a single shallow CNN-FC model.

Keywords: PM2.5 forecast; image classification; Deep learning; Convolutional neural network; Fully-connected neural network; Taiwan


How to cite: Kow, P.-Y., Chang, L.-C., and Chang, F.-J.: Image Regression Classification of Air Quality by Convolutional Neural Network, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-7275, https://doi.org/10.5194/egusphere-egu2020-7275, 2020

D2420 |
Wei Tang, Wen-fang Zhao, Runsheng Lin, and Yong Zhou

In order to improve the accuracy of the PM2.5 concentration forecast at the Beijing Meteorological Bureau, a deep learning prediction model based on a convolutional neural network (CNN) and a long short-term memory (LSTM) neural network is proposed. First, feature vectors were extracted using correlation analysis from meteorological data such as temperature, wind, relative humidity, precipitation, visibility and atmospheric pressure. Second, taking into account the fact that PM2.5 concentration is significantly affected by the surrounding meteorological conditions, meteorological grid analysis data were newly incorporated into the model, together with the historical PM2.5 concentration data and meteorological observations of the station itself. Spatio-temporal sequence data were generated from these data after integrated processing. High-level spatio-temporal features are extracted through the combination of the CNN and LSTM. Finally, the model produces a 24-hour prediction of PM2.5 concentration. The accuracy of this optimized model is compared with that of a support vector machine (SVM) and the existing PM2.5 forecast system. The results show that the proposed CNN-LSTM model performs better than the SVM and the current operational models at the Beijing Meteorological Bureau, effectively improving the prediction accuracy of PM2.5 concentration for the different prediction time scales within the next 24 hours.
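The first step, selecting predictors by their correlation with the target, can be sketched with numpy (synthetic data; the variable names and threshold are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Synthetic predictors and a PM2.5-like target driven by two of them.
features = {
    "temperature": rng.normal(size=n),
    "wind_speed": rng.normal(size=n),
    "humidity": rng.normal(size=n),
    "pressure": rng.normal(size=n),
}
pm25 = (2.0 * features["humidity"] - 1.5 * features["wind_speed"]
        + rng.normal(0, 0.5, n))

# Keep predictors whose absolute Pearson correlation with PM2.5 exceeds 0.3.
selected = [name for name, x in features.items()
            if abs(np.corrcoef(x, pm25)[0, 1]) > 0.3]
print(sorted(selected))  # ['humidity', 'wind_speed']
```

Only the informative variables then enter the spatio-temporal sequences fed to the CNN-LSTM, keeping the input dimensionality down.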

How to cite: Tang, W., Zhao, W., Lin, R., and Zhou, Y.: Forecasting Model of Short-term PM2.5 Concentration based on Deep Learning, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-5429, https://doi.org/10.5194/egusphere-egu2020-5429, 2020

D2421 |
Franck Albinet, Amelia Lee Zhi Yi, Petra Schmitter, Romina Torres Astorga, and Gerd Dercon


The use of mathematical models and mid-infrared (MIR) spectral databases to predict the elemental composition of soil allows rapid, high-throughput characterization of soil properties. Partial Least Squares Regression (PLSR) is a pervasive statistical method for such predictive models, owing to a large existing knowledge base paired with standardized best practices in model application. Despite its ability to transform data from a high-dimensional space (high spectral resolution) to a space of fewer dimensions that captures the correlation between the input space (spectra) and the response variables (elemental soil composition), this popular approach fails to capture non-linear patterns. Further, PLSR has poor prediction capacity for a wide range of soil analytes, such as potassium and phosphorus, to mention just a few. In addition, prediction is highly sensitive to pre-processing steps that can be tainted by human biases in the empirical selection of wavenumber regions. Thus, the use of PLSR as a methodology for elemental prediction of soil remains time-consuming and limited in scope.

With major breakthroughs in the area of Deep Learning (DL) in the past decade, soil science researchers are increasingly shifting their focus from traditional techniques such as PLSR to DL models such as convolutional neural networks. Promising results of this shift have been showcased, including increased prediction accuracy, reduced need for data pre-processing, and improved evaluation of explanatory factors. Increasingly, studies are also looking to expand beyond regional scope and support higher-resolution, more accurate databases for global modelling efforts. However, the setup of a DL model is notoriously data intensive and often said to be less applicable when limited data are available. While a MIR spectral database has recently been publicly released by the Kellogg Soil Survey Laboratory of the United States Department of Agriculture, such large-scale initiatives remain a niche and focus only on specific regions and/or ecosystem types.

This research is a first effort to apply DL techniques in a relatively data-scarce environment (approximately 1000 labelled spectra) using transfer learning and domain-specific data augmentation techniques. In particular, we assess the potential of unsupervised feature learning approaches as a key enabler for the broader applicability of DL techniques in the context of MIR spectroscopy and soil science. A better understanding of the potential of DL methods for soil composition prediction will greatly advance soil science and natural resource management. Improvements that overcome the associated challenges will be a step towards a universal soil modelling technique through reusable models and will contribute to a large worldwide soil MIR spectral database.
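Domain-specific augmentation for spectra typically perturbs each labelled spectrum with small instrument-like effects such as additive noise, baseline offset and multiplicative intensity drift, multiplying the effective training set size. A numpy sketch (the perturbation magnitudes and the synthetic spectrum are assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)

def augment_spectrum(spec, n_copies=5):
    """Make n_copies perturbed versions of one MIR spectrum (1-D array)."""
    out = []
    for _ in range(n_copies):
        noise = rng.normal(0, 0.005, spec.shape)   # additive instrument noise
        offset = rng.normal(0, 0.02)               # baseline shift
        scale = rng.normal(1.0, 0.03)              # multiplicative drift
        out.append(scale * spec + offset + noise)
    return np.stack(out)

# Synthetic "spectrum": one Gaussian absorption band on a wavenumber axis.
wavenumbers = np.linspace(600, 4000, 1700)
spectrum = np.exp(-0.5 * ((wavenumbers - 1430) / 40) ** 2)
augmented = augment_spectrum(spectrum)  # shape (5, 1700)
```

Each augmented copy keeps the same elemental-composition label as its parent spectrum, which is how roughly 1000 labelled spectra can be stretched to train a data-hungry network.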

How to cite: Albinet, F., Lee Zhi Yi, A., Schmitter, P., Torres Astorga, R., and Dercon, G.: Exploring the Applicability of Deep Learning Methods in Mid-infrared Spectroscopy for Soil Property Predictions, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-4659, https://doi.org/10.5194/egusphere-egu2020-4659, 2020

D2422 |
Magdalena Mittermeier, Émilie Bresson, Dominique Paquin, and Ralf Ludwig

Climate change is altering the Earth’s atmospheric circulation and the dynamic drivers of extreme events. Extreme weather events pose a great potential risk to infrastructure and human security. In Southern Québec, freezing rain is among the rare, yet high-impact events that remain particularly difficult to detect, describe or even predict.

Large climate model ensembles are instrumental for a profound analysis of extreme events, as they can be used to provide a sufficient number of model years. Due to the physical nature and the high spatiotemporal resolution of regional climate models (RCMs), large ensembles can not only be employed to investigate the intensity and frequency of extreme events, but also allow analysis of the synoptic drivers of freezing rain events and exploration of the respective dynamic alterations under climate change conditions. However, several challenges remain for the analysis of large RCM ensembles, mainly the high computational costs and the resulting data volume, which require novel statistical methods for efficient screening and analysis, such as deep neural networks (DNNs). Further, to date, only the Canadian Regional Climate Model version 5 (CRCM5) simulates freezing rain in-line using a diagnostic method. For the analysis of freezing rain in other RCMs, computationally intensive, off-line diagnostic schemes have to be applied to archived data. Another approach for freezing rain analysis focuses on the relation between synoptic drivers at 500 hPa and at sea level pressure and the occurrence of freezing rain in the study area of Montréal.

Here, we explore the capability of training a deep neural network on the detection of the synoptic patterns associated with the occurrence of freezing rain in Montréal. This climate pattern detection task is a visual image classification problem that is addressed with supervised machine learning. Labels for the training set are derived from CRCM5 in-line simulations of freezing rain. This study aims to provide a trained network, which can be applied to large multi-model ensembles over the North American domain of the Coordinated Regional Climate Downscaling Experiment (CORDEX) in order to efficiently filter the climate datasets for the current and future large-scale drivers of freezing rain.

We present the setup of the deep learning approach including the network architecture, the training set statistics and the optimization and regularization methods. Additionally, we present the classification results of the deep neural network in the form of a single-number evaluation metric as well as confusion matrices. Furthermore, we show analysis of our training set regarding false positives and false negatives.
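
The evaluation quantities mentioned here — a confusion matrix and a single-number score — can be computed as in the generic sketch below; the labels are made up, not the study's data, and F1 is used only as one common example of a single-number metric.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # rows: true class, columns: predicted class
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def f1_binary(cm):
    # one common single-number metric derived from the confusion matrix
    tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented labels: 1 = "freezing-rain pattern", 0 = "no pattern"
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])
cm = confusion_matrix(y_true, y_pred, 2)
print(cm)
print(f1_binary(cm))  # 0.75
```

False positives and false negatives, analysed in the abstract, are exactly the off-diagonal entries `cm[0, 1]` and `cm[1, 0]`.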

How to cite: Mittermeier, M., Bresson, É., Paquin, D., and Ludwig, R.: Detecting Synoptic Patterns related to Freezing Rain in Montréal using Deep Learning, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-7819, https://doi.org/10.5194/egusphere-egu2020-7819, 2020

D2423 |
An-Sheng Lee, Dirk Enters, Sofia Ya Hsuan Liou, and Bernd Zolitschka

Sediment facies provide vital information for the reconstruction of past environmental variability. Due to rising interest in paleoclimate data, sediment surveys are continually growing in importance, as is the amount of sediment to be discriminated into different facies. The conventional approach is to macroscopically determine sediment structure and colour and combine them with physical and chemical information - a time-consuming task heavily relying on the experience of the scientist in charge. Today, rapidly generated and high-resolution multiproxy sediment parameters are readily available from down-core scanning techniques and provide qualitative or even quantitative physical and chemical sediment properties. In 2016, the interdisciplinary research project WASA (Wadden Sea Archive) was launched to investigate palaeo-landscapes and environments of the Wadden Sea. The project has recovered 92 sediment cores, up to 5 m long, from the tidal flats, channels and offshore areas around the island of Norderney (East Frisian Wadden Sea, Germany). Their facies were classified by the conventional approach as glaciofluvial sands, moraine, peat, tidal deposits, shoreface sediments, etc. In this study, these sediments were scanned by a micro X-ray fluorescence (µ-XRF) core scanner to obtain high-resolution records of multi-elemental data (2000 µm) and optical images (47 µm). Here we propose a supervised machine-learning application for the discrimination of sediment facies using these scanning data. Thus, the invested time and the potential bias common to the conventional approach can be reduced considerably. We expect that our approach will contribute to developing a more comprehensive and time-efficient automatic sediment facies discrimination.
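
As a toy illustration of supervised facies discrimination from scanning data, the sketch below trains a nearest-centroid classifier — a minimal stand-in for the actual model — on invented element-intensity vectors labelled with facies names taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: element intensities (e.g. Ca, Fe, Si, S) per
# scan point, labelled with the facies from conventional core description
facies = ["peat", "tidal deposit", "glaciofluvial sand"]
centres = np.array([[0.2, 0.3, 0.1, 0.9],   # peat: high S, low Si (invented)
                    [0.6, 0.5, 0.4, 0.2],
                    [0.3, 0.2, 0.9, 0.1]])  # sand: high Si (invented)
X_train = np.vstack([c + 0.05 * rng.normal(size=(50, 4)) for c in centres])
y_train = np.repeat(np.arange(3), 50)

# Nearest-centroid classifier: a minimal supervised stand-in
centroids = np.array([X_train[y_train == k].mean(axis=0) for k in range(3)])

def classify(x):
    return facies[np.argmin(((centroids - x) ** 2).sum(axis=1))]

print(classify(np.array([0.3, 0.2, 0.85, 0.12])))  # "glaciofluvial sand"
```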

Keywords: the Wadden Sea, µ-XRF core scanning, machine-learning, sediment facies discrimination

How to cite: Lee, A.-S., Enters, D., Liou, S. Y. H., and Zolitschka, B.: Artificial intelligence for discrimination of sediment facies based on high-resolution elemental and colour data from coastal sediments of the East Frisian Wadden Sea, Germany, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-3984, https://doi.org/10.5194/egusphere-egu2020-3984, 2020

D2424 |
Daniel Galea, Bryan Lawrence, and Julian Kunkel

Finding and identifying important phenomena in large volumes of simulation data consumes time and resources. Deep Learning offers a route to improve speeds and costs. In this work we demonstrate the application of Deep Learning in identifying data which contains various classes of tropical cyclone. Our initial application is in re-analysis data, but the eventual goal is to use this system during numerical simulation to identify data of interest before writing it out.

A Deep Learning model has been developed to help identify data containing varying intensities of tropical cyclones. The model uses convolutional layers to build up the patterns to look for, and a fully-connected classifier to predict whether a tropical cyclone is present in the input. Other techniques such as batch normalization and dropout were also tested. The model was trained on a subset of the ERA-Interim dataset from the 1st of January 1979 until the 31st of July 2017, with the relevant labels obtained from the IBTrACS dataset. The model obtained an accuracy of 99.08% on a test set, which was a 20% subset of the original dataset.
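
The core operation of the convolutional layers mentioned above — a kernel sliding over a 2-D field — can be sketched in a few lines of NumPy. The "pressure field" and the hand-written kernel below are invented for illustration; in a trained network the kernels are learned, not designed.

```python
import numpy as np

def conv2d(image, kernel):
    # valid 2-D cross-correlation, the core operation of a convolutional layer
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy "pressure field" with a localised low (a crude cyclone stand-in)
field = np.ones((8, 8))
field[3:6, 3:6] = 0.0

# A kernel that responds to a local minimum surrounded by higher values
kernel = np.array([[ 1,  1,  1],
                   [ 1, -8,  1],
                   [ 1,  1,  1]], dtype=float)

response = np.maximum(conv2d(field, kernel), 0)   # ReLU activation
print(response.max() > 0)   # the feature map "lights up" near the low
```

A fully-connected classifier would then map such feature maps to a cyclone/no-cyclone decision.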

An advantage of this model is that it does not rely on thresholds set a priori, such as a minimum of sea level pressure, a maximum of vorticity or a measure of the depth and strength of deep convection, making it more objective than previous detection methods. Also, given that current methods follow non-trivial algorithms, the Deep Learning model is expected to have the advantage of being able to get the required prediction much quicker, making it viable to be implemented into an existing numerical simulation.

Most current methods also apply different thresholds for different basins (planetary regions). In principle, the globally trained model should avoid the necessity for such differences; however, it was found that while differing thresholds were not required, training data for specific regions were required to obtain similar accuracy when only individual basins were examined.

The existing version, with greater than 99% accuracy globally and around 91% when trained only on cases from the Western Pacific and Western Atlantic basins, has been trained on ERA-Interim data. The next steps with this work will involve assessing the suitability of the pre-trained model for different data, and deploying it within a running numerical simulation.

How to cite: Galea, D., Lawrence, B., and Kunkel, J.: Detecting Tropical Cyclones using Deep Learning Techniques , EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-9870, https://doi.org/10.5194/egusphere-egu2020-9870, 2020

D2425 |
Tai-Chen Chen, Li-Chiu Chang, and Fi-John Chang

The frequency of extreme hydrological events caused by climate change has increased in recent years. Moreover, most urban areas in various countries are located on low-lying and flood-prone alluvial plains, such that the severity of flooding disasters and the number of affected people increase significantly. Therefore, it is imperative to explore the spatio-temporal variation characteristics of regional floods and apply them to real-time flood forecasting. Flash floods are common and difficult to control in Taiwan due to several geo-hydro-meteorological factors including drastic changes in topography, steep rivers, short concentration time, and heavy rain. In recent decades, artificial intelligence (AI) and machine learning techniques have proven to be effective in tackling real-time climate-related disasters. This study combines an unsupervised and competitive neural network, the self-organizing map (SOM), with dynamic neural networks to make regional flood inundation forecasts. The SOM can be used to cluster high-dimensional historical flooding events and map the events onto a two-dimensional topological feature map. The topological structure displayed in the output space is helpful for exploring the characteristics of the spatio-temporal variation of different flood events in the investigated watershed. The dynamic neural networks are suitable for forecasting time-varying systems because their feedback mechanisms keep track of the most recent tendencies. The results demonstrate that the real-time regional flood inundation forecast model combining the SOM and dynamic neural networks can more quickly extract the characteristics of regional flood inundation and more accurately produce multi-step-ahead flood inundation forecasts than traditional methods. The proposed methodology can provide spatio-temporal information on flood inundation to decision makers and residents for taking precautionary measures against flooding.
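
A minimal SOM training loop, showing how high-dimensional event vectors are mapped onto a 2-D topological grid, might look as follows; the "flood event" features are synthetic and the grid size, learning rate and neighbourhood schedule are illustrative choices, not the study's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "flood event" feature vectors (e.g. rainfall and inundation descriptors)
events = rng.normal(size=(300, 5))

# A minimal self-organizing map on a 4x4 grid of prototype vectors
grid_h, grid_w, dim = 4, 4, events.shape[1]
weights = rng.normal(size=(grid_h, grid_w, dim))
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)

for epoch in range(20):
    lr = 0.5 * (1 - epoch / 20)            # decaying learning rate
    sigma = 2.0 * (1 - epoch / 20) + 0.5   # decaying neighbourhood radius
    for x in events:
        # best-matching unit: the prototype closest to the input
        d = ((weights - x) ** 2).sum(axis=-1)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # pull the BMU and its grid neighbours towards the input
        grid_dist = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
        h = np.exp(-grid_dist / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)

# Each event now maps to a cell of the 2-D topological map
cells = [np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=-1)),
                          (grid_h, grid_w)) for x in events]
print(len(set(cells)) > 1)   # events spread over several map cells
```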

Keywords: Artificial neural network (ANN); Self-organizing map (SOM); Dynamic neural networks; Regional flood; Spatio-temporal distribution

How to cite: Chen, T.-C., Chang, L.-C., and Chang, F.-J.: Regional flood forecasting based on the spatio-temporal variation characteristics using hybrid SOM and dynamic neural networks, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-7276, https://doi.org/10.5194/egusphere-egu2020-7276, 2020

D2426 |
John Hanley, Stephan Siemen, James Hawkes, Milana Vuckovic, Tiago Quintino, and Florian Pappenberger

Weather forecasts produced by ECMWF and environmental services by the Copernicus programme act as a vital input for many downstream simulations and applications. A variety of products, such as ECMWF reanalyses and archived forecasts, are additionally available to users via the MARS archive and the Copernicus data portal. Transferring, storing and locally modifying large volumes of such data prior to integration currently presents a significant challenge to users. The key aim for ECMWF within the H2020 HiDALGO project (https://hidalgo-project.eu/) is to migrate these tasks to the cloud, thereby facilitating fast and seamless application integration by enabling precise and efficient data delivery to the end-user. The required cloud infrastructure development will also feed into ECMWF's contribution to the European Weather Cloud pilot which is a collaborative cloud development project between ECMWF and EUMETSAT.

The HiDALGO project aims to implement a set of services and functionality to enable the simulation of complex global challenges which require massive high performance computing resources alongside state-of-the-art data analytics and visualization. The HiDALGO use-case workflows comprise four main components: pre-processing, numerical simulation, post-processing and visualization. The core simulations are ideally suited to running in a dedicated HPC environment, while the pre-/post-processing and visualisation tasks are generally well suited to running in a cloud environment. Enabling and efficiently managing and orchestrating the integration of both HPC and cloud environments to improve overall performance and functionality is the key goal of HiDALGO.

ECMWF's role in the project will be to enable seamless integration of two pilot applications with its meteorological data and services (such as data exploration, analysis and visualisation) delivered via ECMWF's cloud and orchestrated by bespoke HiDALGO workflows. The demonstrated workflows show the increased value which can be created from weather forecasts, as well as from derived air quality forecasts provided by the Copernicus Atmosphere Monitoring Service (CAMS).

This poster will give a general overview of HiDALGO project and its main aims and objectives. It will present the two test pilot applications which will be used for integration, and an overview of the general workflows and services within HiDALGO. In particular, it will focus on how ECMWF's cloud data and services will couple with the test pilot applications thereby improving overall workflow performance and enabling access to new data and products for the pilot users.

How to cite: Hanley, J., Siemen, S., Hawkes, J., Vuckovic, M., Quintino, T., and Pappenberger, F.: Building Cloud-Based Data Services to Enable Earth Science Workflows , EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-10297, https://doi.org/10.5194/egusphere-egu2020-10297, 2020

D2427 |
Yang Chong, Dongqing Zhao, Guorui Xiao, Minzhi Xiang, Linyang Li, and Zuoping Gong

The selection of the adaptive region of a geomagnetic map is an important factor affecting the positioning accuracy of geomagnetic navigation. An automatic recognition and classification method for adaptive regions of the geomagnetic background field based on Principal Component Analysis (PCA) and a GA-BP neural network is proposed. Firstly, PCA is used to analyze the geomagnetic characteristic parameters, and independent characteristic parameters containing the principal components are selected. Then, the GA-BP neural network model is constructed and the correspondence between the geomagnetic characteristic parameters and matching performance is established, so as to realize the recognition and classification of adaptive regions. Finally, simulation results show that the method is feasible and efficient, and that the positioning accuracy of geomagnetic navigation is improved.
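
The "GA" part of a GA-BP network — using a genetic algorithm to choose the weights of a small feed-forward network — can be sketched on a toy task as below. The PCA step and the actual geomagnetic parameters are omitted; the XOR problem, network size and GA settings are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "characteristic parameters -> matching suitability":
# a tiny 2-2-1 network whose weights are found by a genetic algorithm
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def forward(w, X):
    W1, b1, W2, b2 = w[:4].reshape(2, 2), w[4:6], w[6:8], w[8]
    h = np.tanh(X @ W1 + b1)
    return 1 / (1 + np.exp(-(h @ W2 + b2)))

def fitness(w):
    return -((forward(w, X) - y) ** 2).sum()   # higher is better

pop = rng.normal(size=(60, 9))                 # population of weight vectors
for gen in range(200):
    scores = np.array([fitness(w) for w in pop])
    elite = pop[np.argsort(scores)[-20:]]      # selection
    parents = elite[rng.integers(0, 20, size=(60, 2))]
    mask = rng.random((60, 9)) < 0.5           # uniform crossover
    pop = np.where(mask, parents[:, 0], parents[:, 1])
    pop += 0.1 * rng.normal(size=pop.shape)    # mutation
    pop[0] = elite[-1]                         # elitism: keep the best

best = max(pop, key=fitness)
print(np.round(forward(best, X)))              # target pattern: [0. 1. 1. 0.]
```

In GA-BP proper, the GA output typically seeds back-propagation training rather than replacing it.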

How to cite: Chong, Y., Zhao, D., Xiao, G., Xiang, M., Li, L., and Gong, Z.: The selection of adaptive region of geomagnetic map based on PCA and GA-BP neural network, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-8423, https://doi.org/10.5194/egusphere-egu2020-8423, 2020

D2428 |
Suwei Yang, Kuldeep S Meel, and Massimo Lupascu

Over the last decades, forest fires have increased due to deforestation and climate change. In Southeast Asia, tropical peatland forest fires are a major environmental issue, having a significant effect on the climate and causing extensive social, health and economic impacts. As a result, forest fire prediction has emerged as a key challenge in computational sustainability. Existing forest fire prediction systems, such as the Canadian Forest Fire Danger Rating System (Natural Resources Canada), are based on handcrafted features and use data from instruments on the ground. However, data from instruments on the ground may not always be available. In this work, we propose a novel machine learning approach that uses historical satellite images to predict forest fires in Indonesia. Our prediction model achieves more than 0.86 area under the receiver operating characteristic (ROC) curve. Further evaluations show that the model's prediction performance remains above 0.81 area under the ROC curve even with reduced data. The results support our claim that machine learning based approaches can lead to reliable and cost-effective forest fire prediction systems.
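
The area under the ROC curve, the metric reported above, can be computed directly from its rank interpretation; the labels and scores below are made up for illustration.

```python
import numpy as np

def roc_auc(y_true, scores):
    # AUC equals the probability that a randomly chosen positive is scored
    # higher than a randomly chosen negative (ties count one half)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Invented fire/no-fire labels and model scores
y = np.array([0, 0, 1, 1, 0, 1])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(roc_auc(y, s))   # 8 of 9 positive/negative pairs correctly ordered
```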

How to cite: Yang, S., Meel, K. S., and Lupascu, M.: Predicting forest fire in Indonesia using remote sensing data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13191, https://doi.org/10.5194/egusphere-egu2020-13191, 2020

D2429 |
Martin Hendrick, Cristina Pérez-Guillén, Alec van Herwijnen, and Jürg Schweizer

Assessing and forecasting avalanche hazard is crucial for the safety of people and infrastructure in mountain areas. Over 20 years of data covering snow precipitation, snowpack properties, weather, on-site observations, and avalanche danger have been collected in the context of operational avalanche forecasting for the Swiss Alps. The quality and breadth of this dataset make it suitable for machine learning techniques.

Forecasters mainly process a huge and redundant dataset "manually" to produce daily avalanche bulletins during the winter season. The purpose of this work is to provide forecasters with automated tools to support their work.

By combining clustering and classification algorithms, we are able to reduce the amount of information that needs to be processed and identify relevant weather and snow patterns that characterize a given avalanche situation.

How to cite: Hendrick, M., Pérez-Guillén, C., van Herwijnen, A., and Schweizer, J.: Machine learning as a tool for avalanche forecasting, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13419, https://doi.org/10.5194/egusphere-egu2020-13419, 2020

D2430 |
Youyue Sun, Yu Li, Jinhui Jeanne Huang, and Edward McBean

Chlorophyll-a (CHLA) and total phosphorus (TP) are key indicators of water quality and eutrophication in lakes. It would greatly help water management if CHLA and TP could be predicted with a certain lead time, so that water quality control measures could be implemented. Since eutrophication is the result of complex bio-chemical-physical processes involving pH, temperature, dissolved oxygen (DO) and many other water quality parameters, discovering their internal correlations and relationships may help in the prediction of CHLA and TP. In this study, a long-term (20 years) water quality dataset including CHLA, TP, total nitrogen (TN), turbidity (TB), sulphate, pH, and DO, collected in Lake Ontario by Environment and Climate Change Canada, was obtained. These data were analyzed using a group of Neural Network (NN) models, and ensemble strategies were evaluated. One particular ensemble of the following NN models was selected for its higher goodness of fit and robustness in model validation: back propagation, Kohonen, probabilistic neural network (PNN), generalized regression neural network (GRNN), and group method of data handling (GMDH). Compared with a single NN model, the ensemble model could provide more accurate predictions of CHLA and TP concentrations in Lake Ontario, and the prediction of CHLA and TP would be helpful in lake management, eco-restoration and public health risk assessment.
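
One ensemble member mentioned above, the GRNN, is essentially kernel-weighted (Nadaraya-Watson) regression. The sketch below uses synthetic water-quality data and averages two GRNN bandwidths as a toy ensemble; it is not the study's model, and the variable choices are invented.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    # Generalized regression neural network: a Gaussian-kernel-weighted
    # average of training targets (Nadaraya-Watson regression)
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(0)
# Hypothetical predictors (e.g. TP, TN, turbidity, DO) and a CHLA-like target
X = rng.uniform(size=(200, 4))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=200)

# A two-member toy ensemble: GRNNs with two bandwidths, predictions averaged
q = rng.uniform(size=(5, 4))
pred = 0.5 * (grnn_predict(X, y, q, 0.3) + grnn_predict(X, y, q, 0.6))
print(pred.shape)  # (5,)
```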

How to cite: Sun, Y., Li, Y., Huang, J. J., and McBean, E.: Prediction of Chlorophyll and Phosphorus in Lake Ontario by Ensemble of Neural Network Models, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-1870, https://doi.org/10.5194/egusphere-egu2020-1870, 2019

D2431 |
Junhwa Chi, Hyun-Cheol Kim, and Sung Jae Lee

Changes in Arctic sea ice cover represent one of the most visible indicators of climate change. While changes in sea ice extent affect the albedo, changes in sea ice volume explain changes in the heat budget and the exchange of fresh water between ice and the ocean. Global climate simulations predict that Arctic sea ice will exhibit a more significant change in volume than in extent. Satellite observations show a long-term negative trend in Arctic sea ice during all seasons, particularly in summer. Sea ice volume has been estimated by the ICESat and CryoSat-2 satellites, and NASA’s second-generation spaceborne lidar mission, ICESat-2, was successfully launched in 2018. Although these sensors can measure sea ice freeboard precisely, long revisit cycles and narrow swaths make it difficult to monitor freeboard effectively over the entire Arctic Ocean. Passive microwave sensors are widely used in the retrieval of sea ice concentration. Because of their high temporal resolution and wide swaths, these sensors can produce daily sea ice concentration maps over the entire Arctic Ocean. Brightness temperatures from passive microwave sensors are often used to estimate sea ice freeboard for first-year ice, but they are difficult to associate with the physical characteristics related to the height of multi-year ice. In the machine learning community, deep learning has gained attention and notable success in addressing complicated decision making using multiple hidden layers. In this study, we propose a deep learning based Arctic sea ice freeboard retrieval algorithm incorporating brightness temperature data from the AMSR2 passive microwave sensor and sea ice freeboard data from ICESat-2. The proposed retrieval algorithm makes it possible to estimate daily freeboard for both first- and multi-year ice over the entire Arctic Ocean. The estimated freeboard values from the AMSR2 are then quantitatively and qualitatively compared with other sea ice freeboard or thickness products.

How to cite: Chi, J., Kim, H.-C., and Lee, S. J.: Retrieval of Arctic sea ice freeboard from passive microwave data using deep neural network, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-2254, https://doi.org/10.5194/egusphere-egu2020-2254, 2020

D2432 |
Hitoshi Miyamoto, Takuya Sato, Akito Momose, and Shuji Iwami

This presentation examined a new method for classifying riverine land covers by applying machine learning to both satellite and UAV (Unmanned Aerial Vehicle) images of a Kurobe River channel.  The method used Random Forests (RF) for the classification, with RGB values and NDVIs (Normalized Difference Vegetation Index) of the images used in combination.  In the process, the high-resolution UAV images made it possible to create accurate training data for the land cover classification of the low-resolution satellite images.  The results indicated that combining the high- and low-resolution images in the machine learning could effectively detect waters, gravel/sand beds, trees, and grasses in the satellite images with a certain degree of accuracy.  In contrast, using only the low-resolution satellite images failed to detect the vegetation difference between trees and grasses.  These results support the effectiveness of the present machine learning method, combining satellite and UAV images, for grasping the most critical areas in riparian vegetation management.
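
The NDVI used alongside the RGB bands is a simple per-pixel normalized difference of near-infrared and red reflectance; a small sketch with invented reflectance values:

```python
import numpy as np

def ndvi(nir, red):
    # Normalized Difference Vegetation Index per pixel;
    # the small epsilon guards against division by zero
    return (nir - red) / (nir + red + 1e-9)

# Toy reflectance bands: vegetation reflects NIR strongly, water absorbs it
nir = np.array([[0.6, 0.05],
                [0.5, 0.04]])
red = np.array([[0.1, 0.04],
                [0.1, 0.05]])
v = ndvi(nir, red)
print(v > 0.3)   # True where vegetation-like
```

In the described method, per-pixel stacks of [R, G, B, NDVI] values would then be fed to the Random Forest classifier.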

How to cite: Miyamoto, H., Sato, T., Momose, A., and Iwami, S.: Riverscape classification by using machine learning in combination with satellite and UAV images, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-9428, https://doi.org/10.5194/egusphere-egu2020-9428, 2020

D2433 |
Stav Nahum, Shira Raveh-Rubin, Jonathan Shlomi, and Vered Silverman

Dry-air intrusions (DIs) descending from the upper troposphere toward the surface are often associated with abrupt modification of the atmospheric boundary layer, the air-sea interface, and high-impact weather events. Understanding the triggering mechanism of DIs is important to predict the likelihood of their occurrence in both weather forecasts and future climate projections.

The current identification method for DIs is based on a systematic, costly Lagrangian method that requires high vertical resolution of the wind field at sub-daily intervals. Therefore, the accurate prediction of surface weather conditions is potentially limited. Moreover, large case-to-case variability of these events makes it challenging to compose an objective algorithm for predicting the timing and location of their initiation.

Here we test the ability of deep neural networks, originally designed for computer vision purposes, to identify the DI phenomenon based on instantaneous 2-dimensional maps of commonly available atmospheric parameters. Our trained neural network is able to successfully predict DI origins using three instantaneous 2-D maps of geopotential heights.

Our results demonstrate how machine learning can be used to overcome the limitations of the traditional identification method, introducing the possibility to evaluate and quantify the occurrence of DIs instantaneously, avoiding costly computations and the need for high-resolution data that are not available for most atmospheric data sets. In particular, for the first time, it is possible to predict the occurrence of DI events up to two days before the actual descent is complete.

How to cite: Nahum, S., Raveh-Rubin, S., Shlomi, J., and Silverman, V.: Can we predict Dry Air Intrusions using an Artificial Neural Network?, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-20430, https://doi.org/10.5194/egusphere-egu2020-20430, 2020

D2434 |
Valentin Haselbeck, Jannes Kordilla, Florian Krause, and Martin Sauter

Growing datasets of inorganic hydrochemical analyses, together with large differences in the measured concentrations, raise the demand for data compression while maintaining critical information. The data should subsequently be displayed in an orderly and understandable way. Here, a type of artificial neural network, Kohonen’s self-organizing map (SOM), is trained on inorganic hydrochemical data. Based on this network, clusters are built and associated with salinity sources and their spatial distribution at a former potash mining site. This combined two-step clustering approach managed to assign the groundwater analyses automatically to five different clusters, three geogenic and two anthropogenic, according to their inorganic chemical composition. The spatial distribution of the SOM clusters helps to understand the large-scale hydrogeological context. This approach provides the hydrogeologist with a tool to quickly and automatically analyze large datasets and present them in a clear and comprehensible format.

How to cite: Haselbeck, V., Kordilla, J., Krause, F., and Sauter, M.: Hydrochemical Classification of Groundwater with Artificial Neural Networks, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-22148, https://doi.org/10.5194/egusphere-egu2020-22148, 2020

D2435 |
Veronique Michot, Helene Brogniez, Mathieu Vrac, Soulivanh Thao, Helene Chepfer, Pascal Yiou, and Christophe Dufour

The multi-scale interactions at the origin of the links between clouds and water vapour are essential for the Earth's energy balance and thus for the climate, from local to global scales. Knowledge of the distribution and variability of water vapour in the troposphere is indeed a major issue for understanding the atmospheric water cycle. At present, these interactions are poorly known at regional and local scales, i.e. within 100 km, and are therefore poorly represented in numerical climate models. This is why we have sought to predict cloud-scale relative humidity profiles in the intertropical zone, using a non-parametric statistical downscaling method called quantile regression forests. The procedure includes co-located data from three satellites: CALIPSO lidar and CloudSat radar data, used as predictors and providing cloud properties at 90 m and 1.4 km horizontal resolution respectively, and SAPHIR data, used as the predictand and providing relative humidity at an initial horizontal resolution of 10 km. Quantile regression forests were used to predict relative humidity profiles at the CALIPSO and CloudSat scales. These predictions are able to reproduce a relative humidity variability consistent with the cloud profiles, as confirmed by coefficients of determination greater than 0.7, relative to the observed relative humidity, and Continuous Ranked Probability Skill Scores between 0 and 1, relative to climatology. Lidar measurements from the NARVAL 1&2 campaigns and radiosondes from the EUREC4A campaign were also used to compare relative humidity profiles at the SAPHIR scale and at the scale of the quantile regression forest predictions.
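
The quantile-regression idea — keeping the whole conditional distribution of the target rather than only its mean — can be illustrated with a k-nearest-neighbour simplification; a quantile regression forest instead aggregates the target distributions of training points sharing forest leaves. All data below are synthetic.

```python
import numpy as np

def knn_quantiles(X_train, y_train, x_query, k=30, qs=(0.1, 0.5, 0.9)):
    # Keep the targets of the k most similar training cases and read off
    # conditional quantiles (a QRF does this with forest leaves instead)
    d = ((X_train - x_query) ** 2).sum(axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    return np.quantile(nearest, qs)

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))                  # toy cloud-property predictors
y = 60 * X[:, 0] + 10 * rng.normal(size=1000)    # toy relative humidity (%)

lo, med, hi = knn_quantiles(X, y, np.array([0.5, 0.5, 0.5]))
print(lo < med < hi)   # an ordered predictive interval, not just a mean
```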

How to cite: Michot, V., Brogniez, H., Vrac, M., Thao, S., Chepfer, H., Yiou, P., and Dufour, C.: Estimation of fine-scale relative humidity profiles: an issue for understanding the atmospheric water cycle, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-21947, https://doi.org/10.5194/egusphere-egu2020-21947, 2020

D2436 |
Samuel Jackson, Jeyarajan Thiyagalingam, and Caroline Cox

Clouds appear ubiquitously in the Earth's atmosphere and thus present a persistent problem for the accurate retrieval of remotely sensed information. The task of identifying which pixels are cloud, and which are not, is what we refer to as the cloud masking problem. It essentially boils down to assigning a binary label, representing either "cloud" or "clear", to each pixel.

Although this problem appears trivial, it is often complicated by a diverse set of issues that affect the imagery obtained from remote sensing instruments. For instance, snow, sea ice, dust, smoke, and sun glint can easily challenge the robustness and consistency of any cloud masking algorithm. The cloud masking problem is further complicated by geographic and seasonal variation in the acquired scenes.

In this work, we present a machine learning approach to the problem of cloud masking for the Sea and Land Surface Temperature Radiometer (SLSTR) on board the Sentinel-3 satellites. Our model uses Gradient Boosting Decision Trees (GBDTs) to perform pixel-wise segmentation of satellite images. The model is trained using a hand-labelled dataset of ~12,000 individual pixels covering both the spatial and temporal domains of the SLSTR instrument, and utilises the combined channels of the dual-view swaths. Pixel-level annotations, while lacking spatial context, have the advantage of being cheaper to obtain than fully labelled images, a major problem in applying machine learning to remote sensing imagery.
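
A GBDT classifier can be boiled down to boosted decision stumps fitted to the gradient of the logistic loss. The self-contained sketch below shows the mechanics on one invented feature (not SLSTR channels); real GBDT libraries add deeper trees, multiple features and regularization.

```python
import numpy as np

def fit_stump(x, residual):
    # best threshold split minimizing squared error on the residuals
    best_err, best = np.inf, None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        err = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if err < best_err:
            best_err, best = err, (t, left.mean(), right.mean())
    return best

def gbdt_train(x, y, n_rounds=20, lr=0.5):
    # gradient boosting on the logistic loss with stumps as weak learners
    f = np.zeros(len(y))                  # additive model in log-odds space
    stumps = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-f))
        t, vl, vr = fit_stump(x, y - p)   # y - p is the negative gradient
        stumps.append((t, vl, vr))
        f += lr * np.where(x <= t, vl, vr)
    return stumps

def gbdt_predict(stumps, x, lr=0.5):
    f = sum(lr * np.where(x <= t, vl, vr) for t, vl, vr in stumps)
    return (f > 0).astype(int)            # 1 = "cloud", 0 = "clear"

# Toy 1-D feature separating the two classes
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
preds = gbdt_predict(gbdt_train(x, y), x)
print(preds)   # [0 0 0 1 1 1]
```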

We validate the performance of our mask using cross validation and compare its performance with two baseline models provided in the SLSTR level 1 product. We show up to 10% improvement in binary classification accuracy compared with the baseline methods. Additionally, we show that our model has the ability to distinguish between different classes of cloud to reasonable accuracy.

How to cite: Jackson, S., Thiyagalingam, J., and Cox, C.: A Machine Learning Approach to Cloud Masking in Sentinel-3 SLSTR Data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-21593, https://doi.org/10.5194/egusphere-egu2020-21593, 2020

D2437 |
Taeyong Kim and Minjune Yang

Since the mid-twentieth century, geology in South Korea has considerably advanced as a scientific discipline. Over the past few decades, geology has interacted with physical and engineering viewpoints, so modern geology needs to be interpreted from an interdisciplinary perspective. This study aimed to classify the academic subdisciplines of geology in Korean journals and to analyze the evolutionary trend of each subdiscipline in South Korea over 54 years, from 1964 through 2018. In preprocessing, we collected 13,266 titles from 10 Korean geological journals and removed unnecessary words. We then classified the geologic subdisciplines by Latent Dirichlet Allocation (LDA), a well-suited tool for finding topics in text data. According to the results of this study, the optimal number of subdisciplines in the LDA was nine (mineralogy, petrology, sedimentology, economic geology, geotechnical engineering, engineering geology, environmental geology, geophysics, seismology). We then calculated the annual proportion of each subdiscipline to investigate evolutionary trends using polynomial regression. Results showed that the proportions of mineralogy, petrology, sedimentology, and economic geology increased in 1980, those of geotechnical engineering and engineering geology increased in 1990, and those of environmental geology, geophysics, and seismology increased in 1995. The results of this study fill an important gap in understanding the research trends of geologic subdisciplines in South Korea, showing their emergence, growth and diminution.
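
The trend-analysis step — fitting a polynomial to the annual proportion of a subdiscipline — can be sketched as follows. The proportions are simulated, not the study's data, and the LDA topic-modelling step itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1964, 2019)
# Simulated annual topic proportions with a rise after 1995
prop = (0.05 + 0.004 * np.clip(years - 1995, 0, None)
        + 0.01 * rng.normal(size=years.size))

# Fit a low-order polynomial trend (x is centred for numerical stability)
x = years - years.mean()
coeffs = np.polyfit(x, prop, deg=3)
trend = np.polyval(coeffs, x)

print(trend[-1] > trend[0])   # the fitted trend rises over the period
```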

How to cite: Kim, T. and Yang, M.: Analysis of research trends using Latent Dirichlet Allocation for geologic subdisciplines in South Korea, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-21204, https://doi.org/10.5194/egusphere-egu2020-21204, 2020

D2438 |
Giulia Cremonini, Giovanni Besio, Daniele Lagomarsino, and Agnese Seminara

Reliable forecasts of environmental variables are fundamental to managing the risk associated with hazard scenarios. In this work, we use state-of-the-art machine learning algorithms to build forecasting models and obtain accurate estimates of sea-wave conditions. We exploit multivariate time series of environmental variables, extracted either from a hindcast database (provided by the MeteOcean Group at DICCA) or from sparse buoy observations. In this way, future values of sea-wave height can be predicted in order to evaluate the risk associated with incoming scenarios. The aim is to provide new forecasting tools as an alternative to physically based models, which have a higher computational cost.
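
As a minimal sketch of data-driven wave forecasting, a least-squares autoregressive model fitted to a noiseless toy series; the project's actual models and hindcast/buoy inputs are far richer:

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of an AR(p) model y[t] = sum_j a_j * y[t-j]."""
    n = len(y)
    X = np.column_stack([y[p - j : n - j] for j in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

def forecast_next(y, coef):
    """One-step-ahead forecast from the last p observed values."""
    p = len(coef)
    return float(np.dot(coef, y[-1 : -p - 1 : -1]))

# Toy hourly significant-wave-height series: mean level plus a daily cycle.
t = np.arange(200)
hs = 2.0 + 0.8 * np.sin(2 * np.pi * t / 24)
coef = fit_ar(hs[:-1], p=3)          # train on all but the last hour
pred = forecast_next(hs[:-1], coef)  # predict the held-out hour
print(pred, hs[-1])
```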

How to cite: Cremonini, G., Besio, G., Lagomarsino, D., and Seminara, A.: A machine learning approach to achieve accurate time series forecast of sea-wave conditions, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-22666, https://doi.org/10.5194/egusphere-egu2020-22666, 2020

D2439 |
João Felipe Cardoso dos Santos, Dimitry Van Der Zande, and Nabil Youdjou

Earth Observation (EO) data availability is increasing drastically thanks to the Copernicus Sentinel missions. In 2014, Sentinel data volumes were approximately 200 TB (one operational mission); by 2019 they had risen to 12 PB (nine operational missions) and will increase further with the planned launch of new Sentinel satellites. Dealing with this big-data evolution has become an additional challenge in the development of downstream services, alongside algorithm development, product quality control, and data dissemination techniques.

The H2020 project ‘Data Cube Service for Copernicus (DCS4COP)’ addresses the downstream challenges of integrating big data from Copernicus services in a data cube system. A data cube is typically a four-dimensional object, with a parameter dimension and three shared dimensions (time, latitude, longitude). Traditional geographical map data are transformed into a data cube with user-defined spatial and temporal resolutions, using tools such as mathematical operations, sub-setting, resampling, or gap filling to obtain a set of consistent parameters.

This work describes how different EO datasets are integrated in a data cube system to monitor water quality in the Belgian Continental Shelf (BCS) for the period 2017 to 2019. The EO data sources are divided into four groups: 1) high-resolution data with low temporal coverage (i.e. Sentinel-2), 2) medium-resolution data with daily coverage (i.e. Sentinel-3), 3) low-resolution geostationary data with high coverage frequency (i.e. MSG-SEVIRI), and 4) merged EO data with different spatial and temporal information acquired from CMEMS. Each EO dataset from groups 1 to 3 has its own thematic processor responsible for the acquisition of Level 1 data, the application of atmospheric corrections and a first quality control (QC), resulting in a Level 2 quality-controlled remote sensing reflectance (Rrs) product. The Level 2 Rrs is the main product used to generate other ocean colour products such as chlorophyll-a and suspended particulate matter. Each product generated from the Rrs passes a second QC related to its characteristics and improvements (when applied) and is organized in a common, product- and sensor-specific data format and structure to facilitate direct integration. At the end of the process, these products are defined as quality-controlled analysis-ready data (ARD) and are ingested in the data cube system, enabling fast and easy access to these big volumes of multi-scale water quality products for further analysis (i.e. downstream services). The data cube system grants fast and straightforward access by converting netCDF data to Zarr and placing it on the server. In Zarr datasets, the object is divided into chunks and compressed, while the metadata are stored in lightweight JSON files. Zarr works well on both local filesystems and cloud-based object stores, which makes it usable through a variety of tools such as an interactive data viewer or Jupyter notebooks.
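
The Zarr layout described above (compressed chunks plus lightweight JSON metadata) can be mimicked with numpy and the standard library alone; this is an illustration of the storage concept, not the actual Zarr implementation:

```python
import json
import zlib
import numpy as np

def store_chunked(arr, chunk):
    """Store a 2-D array Zarr-style: compressed chunks plus JSON metadata."""
    chunks = {}
    for i in range(0, arr.shape[0], chunk):
        for j in range(0, arr.shape[1], chunk):
            block = np.ascontiguousarray(arr[i:i + chunk, j:j + chunk])
            chunks[f"{i // chunk}.{j // chunk}"] = zlib.compress(block.tobytes())
    meta = json.dumps({"shape": arr.shape, "chunks": [chunk, chunk],
                       "dtype": str(arr.dtype)})
    return meta, chunks

def load_chunk(meta, chunks, key):
    """Decompress and reshape a single (square, full-size) chunk."""
    m = json.loads(meta)
    buf = zlib.decompress(chunks[key])
    side = m["chunks"][0]
    return np.frombuffer(buf, dtype=m["dtype"]).reshape(side, side)

rs = np.arange(64, dtype="float64").reshape(8, 8)  # toy reflectance tile
meta, chunks = store_chunked(rs, chunk=4)
print(sorted(chunks))            # ['0.0', '0.1', '1.0', '1.1']
print(load_chunk(meta, chunks, "1.1"))
```

Only the requested chunk is decompressed, which is what makes chunked stores efficient for slicing big multi-year archives.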

How to cite: Cardoso dos Santos, J. F., Van Der Zande, D., and Youdjou, N.: The data cube system to EO datasets: the DCS4COP project, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-21634, https://doi.org/10.5194/egusphere-egu2020-21634, 2020

D2440 |
Andras Fabian, Carine Bruyninx, Juliette Legrand, and Anna Miglio

Global Navigation Satellite Systems (GNSS) are a widespread, cost-effective technique for geodetic applications and for monitoring the Earth’s atmosphere. Consequently, the density of GNSS networks has grown considerably over the last decade. Each of the networks collects huge amounts of data from permanently operating GNSS stations. The quality of the data is variable, depending on the evaluated time period and satellite system. Conventionally, the quality information is extracted from daily estimates of different types of GNSS parameters, such as the number of data gaps, multipath level, number of cycle slips, and number of dual-frequency observations with respect to the expected number, and from their combinations.

The EUREF Permanent GNSS Network Central Bureau (EPN CB, Bruyninx et al., 2019) operationally collects and analyses the quality of more than 300 GNSS stations and investigates the main reason for any quality degradation. EPN CB currently operates a semi-automatic (followed by a manual) data-monitoring tool to detect quality degradations and investigate the source of the problems. In the upcoming years, this data-monitoring tool will also be used to monitor the GNSS component of the European Plate Observing System (EPOS), expected to include more than 3000 GNSS stations. This anticipated growth in the number of GNSS stations to be monitored will make it increasingly challenging to select high-quality GNSS data. EPN CB’s current system requires time-consuming semi-automatic inspection of data quality and is not designed to handle the larger amounts of data. In addition, the current system does not exploit correlations between the daily data quality time series and the GNSS station metadata (such as equipment type and receiver firmware) often common to many stations.

In this poster, we will first present the currently used method of GNSS data quality checking and its limitations. Based on more than 20 years of GNSS observations collected in the EPN, we will show typical cases of correlations between the time series of data quality metrics and GNSS station metadata. We will then set out the requirements and design of a new GNSS data quality monitoring system capable of handling more than 3000 stations. Based on the collected EPN samples and the typical cases, we will introduce ongoing improvements taking advantage of artificial intelligence techniques, show a possible design of the neural network, and present the supervised training of the neural network.
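
One simple ingredient of such automated quality monitoring is flagging days whose metric deviates strongly from a trailing baseline. The rolling z-score below, with invented multipath numbers, is a hypothetical sketch, not EPN CB's actual tool:

```python
import numpy as np

def flag_degradation(metric, window=30, z_thresh=5.0):
    """Flag days whose quality metric deviates strongly (rolling z-score)
    from the trailing `window`-day baseline."""
    flags = np.zeros(len(metric), dtype=bool)
    for t in range(window, len(metric)):
        base = metric[t - window : t]
        sd = base.std()
        if sd > 0 and abs(metric[t] - base.mean()) > z_thresh * sd:
            flags[t] = True
    return flags

# Synthetic daily multipath level (m): stable, then a jump after a
# hypothetical antenna problem on day 120.
rng = np.random.default_rng(1)
mp = rng.normal(0.35, 0.02, 200)
mp[120:] += 0.2
flags = flag_degradation(mp)
print(np.flatnonzero(flags)[0])  # 120
```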

Bruyninx C., Legrand J., Fabian A., Pottiaux E. (2019) GNSS Metadata and Data Validation in the EUREF Permanent Network. GPS Sol., 23(4), https://doi.org/10.1007/s10291-019-0880-9

How to cite: Fabian, A., Bruyninx, C., Legrand, J., and Miglio, A.: GNSS data quality check in the EPN network, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-21489, https://doi.org/10.5194/egusphere-egu2020-21489, 2020

D2441 |
Jifeng Zhang, Bing Feng, and Dong Li

Artificial neural networks, an important part of artificial intelligence, have been widely used in fields such as information processing, automation and economics, and serve as efficient tools in geophysical data processing. However, applications to geophysical electromagnetic methods are still relatively rare. In this paper, a BP (backpropagation) neural network is combined with the airborne transient electromagnetic method to image subsurface geological structures.

We developed an artificial neural network code to map the distribution of geologic conductivity in the subsurface for the airborne transient electromagnetic method. It avoids a complex derivation of the electromagnetic field formula and only requires input and transfer functions to obtain the quasi-resistivity image section. First, a training sample set is formed from the airborne transient electromagnetic responses of homogeneous half-space models with different resistivities; the flight altitude and the time constant are taken as input variables of the network, and the pseudo-resistivity is taken as the output variable. Then, a double-hidden-layer BP neural network is established in accordance with the mapping relationship between quasi-resistivity and airborne transient electromagnetic response. By analyzing the mean square error curve, the training termination criterion of the BP neural network is defined. Next, the trained BP neural network is used to interpret the airborne transient electromagnetic responses of various typical layered geo-electric models, and the results are compared with those of the all-time apparent resistivity algorithm. After extensive testing, reasonable BP neural network parameters were selected, and the mapping from airborne TEM response to quasi-resistivity was realized. The results show that the resistivity image from the BP neural network approach is much closer to the true resistivity of the model, and the response to anomalous bodies is better than that of the all-time apparent resistivity numerical method. Finally, this imaging technique was used to process field data acquired with the airborne transient method in the Huayangchuan area. The quasi-resistivity depth sections calculated by the BP neural network and the all-time apparent resistivity method are in good agreement with the actual geological situation, which further verifies the effectiveness and practicability of this algorithm.
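
A minimal numpy version of a double-hidden-layer backpropagation network, trained on a synthetic smooth mapping as a stand-in for the (altitude, time constant) to quasi-resistivity relationship; layer sizes, learning rate and data are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs (stand-ins for flight altitude and decay time constant) and a
# smooth synthetic "quasi-resistivity" target -- not real AEM data.
X = rng.uniform(-1, 1, (256, 2))
y = np.sin(X[:, :1]) + 0.5 * X[:, 1:] ** 2

# Two hidden layers, as in the abstract's network design.
sizes = [2, 16, 16, 1]
W = [rng.normal(0, 0.5, (a, c)) for a, c in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((1, n)) for n in sizes[1:]]

def forward(inp):
    acts = [inp]
    for i in range(len(W)):
        z = acts[-1] @ W[i] + b[i]
        acts.append(np.tanh(z) if i < len(W) - 1 else z)  # linear output
    return acts

lr = 0.05
first_loss = None
for step in range(2000):
    acts = forward(X)
    err = acts[-1] - y
    loss = float((err ** 2).mean())
    if first_loss is None:
        first_loss = loss
    grad = 2 * err / len(X)                # dL/d(output)
    for i in reversed(range(len(W))):
        gW = acts[i].T @ grad
        gb = grad.sum(axis=0, keepdims=True)
        if i > 0:                          # backpropagate through tanh
            grad = (grad @ W[i].T) * (1 - acts[i] ** 2)
        W[i] -= lr * gW
        b[i] -= lr * gb
print(first_loss, loss)  # mean-squared error drops substantially
```

Monitoring this loss curve is exactly what the abstract's training-termination criterion is based on.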

How to cite: Zhang, J., Feng, B., and Li, D.: Resistivity-depth imaging of airborne transient electromagnetic method based on an artificial neural network, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-1729, https://doi.org/10.5194/egusphere-egu2020-1729, 2019

D2442 |
Yu Li, Youyue Sun, Jinhui Jeanne Huang, and Edward McBean

With increasingly prominent ecological and environmental problems in lakes, monitoring water quality in lakes by satellite remote sensing is in ever higher demand. Traditional water quality sampling is normally conducted manually and is time-consuming and labor-intensive. It cannot provide a full picture of a waterbody over time due to limited sampling points and low sampling frequency. A novel attempt is proposed to use hyperspectral remote sensing in conjunction with machine learning technologies to retrieve water quality parameters and to map these parameters across a lake. The retrieval of both optically active parameters, chlorophyll-a (CHLA) and dissolved oxygen concentration (DO), and non-optically active parameters, total phosphorus (TP), total nitrogen (TN), turbidity (TB), and pH, was studied in this research. A comparison of three machine learning algorithms, Random Forests (RF), Support Vector Regression (SVR) and Artificial Neural Networks (ANN), was conducted. Water quality parameters collected by Environment and Climate Change Canada over 20 years were used as the ground truth for model training and validation. Two sets of remote sensing data, from MODIS and Sentinel-2, were utilized and evaluated. This research proposes a new approach to retrieve both optically active and non-optically active parameters for a water body and provides a new strategy for water quality monitoring.
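
The model comparison can be sketched with scikit-learn; the band features and chlorophyll-like target below are synthetic stand-ins for the ECCC in-situ data and MODIS/Sentinel-2 bands (the ANN branch is omitted for brevity):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# Synthetic spectral features and a chlorophyll-a-like target.
rng = np.random.default_rng(0)
bands = rng.uniform(0, 1, (400, 4))
chla = 5 * bands[:, 0] + 3 * bands[:, 1] ** 2 + rng.normal(0, 0.3, 400)

Xtr, Xte, ytr, yte = train_test_split(bands, chla, random_state=0)
results = {}
for name, model in [("RF", RandomForestRegressor(random_state=0)),
                    ("SVR", SVR(C=10))]:
    results[name] = model.fit(Xtr, ytr).score(Xte, yte)  # held-out R^2
    print(name, round(results[name], 3))
```

The same train/test split is reused for both models so the R² scores are directly comparable.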

How to cite: Li, Y., Sun, Y., Huang, J. J., and McBean, E.: Retrieval of Water Quality Parameters in Lake Ontario Based on Hyperspectral Remote Sensing Data and Intelligent Algorithms, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-1869, https://doi.org/10.5194/egusphere-egu2020-1869, 2019

D2443 |
| Highlight
Christoph Käding and Jakob Runge

Unveiling causal structures, i.e., distinguishing cause from effect, from observational data plays a key role in climate science as well as in other fields like medicine or economics. Hence, a number of approaches have been developed to address this problem. Recent decades have seen methods like Granger causality or causal network learning algorithms, which are, however, not applicable in every scenario. Given two variables X and Y, it is still a challenging problem to decide whether X causes Y or Y causes X. Recently, there has been progress in the framework of structural causal models, which enable the discovery of causal relationships by making assumptions about functional dependencies (e.g., only linear) and noise models (e.g., only non-Gaussian noise). However, each of these methods comes with its own requirements and constraints. Since the corresponding conditions are usually unknown in real scenarios, it is hard to choose the right method for a given application.

The goal of this work is to evaluate and compare a number of state-of-the-art techniques in a joint benchmark. To do so, we employ synthetic data, where we can control the dataset conditions precisely and hence can reason in detail about the resulting performance of the individual methods given their underlying assumptions. Further, we utilize real-world data to shed light on their capabilities in actual applications in a comparative manner. We concentrate on the case of two univariate variables due to the large number of possible application scenarios. A thorough study comparing even the latest developments is, to the best of our knowledge, not yet available in the literature.
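
One family of the benchmarked methods, additive-noise-model tests for bivariate direction, can be sketched as follows: fit a regression in both directions and prefer the direction whose residuals are more independent of the putative cause. Here independence is measured with distance correlation, the regression is a simple polynomial, and the data are synthetic; real methods use more careful regressors and tests:

```python
import numpy as np

def dcor(a, b):
    """Sample distance correlation: near zero when a and b are independent."""
    def centered(x):
        d = np.abs(x[:, None] - x[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centered(a), centered(b)
    dcov2 = (A * B).mean()
    dvar2 = (A * A).mean() * (B * B).mean()
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))

def anm_score(cause, effect, deg=3):
    """Fit effect = f(cause) + residual; return dependence between residual
    and cause (smaller => more consistent with this causal direction)."""
    coef = np.polyfit(cause, effect, deg)
    resid = effect - np.polyval(coef, cause)
    return dcor(cause, resid)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 400)
y = x ** 3 + 0.1 * rng.uniform(-1, 1, 400)  # ground truth: X causes Y

fwd, bwd = anm_score(x, y), anm_score(y, x)
print(round(fwd, 3), round(bwd, 3))  # forward score is smaller
```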

How to cite: Käding, C. and Runge, J.: Comparing Causal Discovery Methods using Synthetic and Real Data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-9269, https://doi.org/10.5194/egusphere-egu2020-9269, 2020

Chat time: Thursday, 7 May 2020, 14:00–15:45

Chairperson: P. Baumann, S. Fiore
D2444 |
Xavier-Andoni Tibau, Christian Reimers, Veronika Eyring, Joachim Denzler, Markus Reichstein, and Jakob Runge

We propose a spatiotemporal model system to evaluate methods of causal discovery. The use of causal discovery to improve our understanding of the spatiotemporal complex system Earth has become widespread in recent years (Runge et al., Nature Comm. 2019). A widespread application example is the complex teleconnections among major climate modes of variability.

The challenges in estimating such causal teleconnection networks are given by (1) the requirement to reconstruct the climate modes from gridded climate fields (dimensionality reduction) and (2) general challenges for causal discovery, for instance, high dimensionality and nonlinearity. These challenges are currently being tackled independently: both dimensionality reduction methods and causal discovery have made strong progress in recent years, but the interaction between the two has received little attention so far. Thanks to projects like CMIP, a vast amount of climate data is available. In climate models, modes of variability emerge as macroscale features, and it is challenging to objectively benchmark both dimension reduction and causal discovery methods since there is no ground truth for such emergent properties.

We propose a spatiotemporal model system that encodes causal relationships among well-defined modes of variability. The model can be thought of as an extension of the vector-autoregressive models well known in time series analysis. It provides a framework for experimenting with causal discovery in large spatiotemporal systems. For example, researchers can analyze how the performance of an algorithm is affected by different methods of dimensionality reduction and algorithms for causal discovery. Challenging features such as non-stationarity and regime dependence can also be modelled and evaluated. Such a model will help the scientific community to improve methods of causal discovery for climate science.
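
A toy version of such a model system, assuming a lag-1 linear VAR among three latent modes projected onto a grid; all coefficients and spatial patterns are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth causal structure among 3 latent "climate modes":
# mode 0 drives mode 1 at lag 1; mode 2 is autonomous.
A = np.array([[0.5, 0.0, 0.0],
              [0.6, 0.4, 0.0],
              [0.0, 0.0, 0.7]])

T, n_modes = 2000, 3
modes = np.zeros((T, n_modes))
for t in range(1, T):
    modes[t] = A @ modes[t - 1] + rng.normal(0, 1, n_modes)

# Project the modes onto a 10x10 grid with fixed spatial patterns, producing
# a gridded field from which modes must be re-extracted (dimension reduction).
patterns = rng.normal(0, 1, (n_modes, 100))
field = (modes @ patterns + 0.1 * rng.normal(0, 1, (T, 100))).reshape(T, 10, 10)

# A benchmark user would now apply e.g. PCA + causal discovery to `field` and
# compare the estimated links against the known matrix A. As a sanity check,
# lag-1 least-squares regression on the true modes recovers A:
X, Y = modes[:-1], modes[1:]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
print(np.round(A_hat, 2))
```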

Runge, J., S. Bathiany, E. Bollt, G. Camps-Valls, D. Coumou, E. Deyle, C. Glymour, M. Kretschmer, M. D. Mahecha, J. Muñoz-Marı́, E. H. van Nes, J. Peters, R. Quax, M. Reichstein, M. Scheffer, B. Schölkopf, P. Spirtes, G. Sugihara, J. Sun, K. Zhang, and J. Zscheischler (2019). Inferring causation from time series in earth system sciences. Nature Communications 10 (1), 2553.

How to cite: Tibau, X.-A., Reimers, C., Eyring, V., Denzler, J., Reichstein, M., and Runge, J.: Spatiotemporal model for benchmarking causal discovery algorithms, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-9604, https://doi.org/10.5194/egusphere-egu2020-9604, 2020

D2445 |
Laurentiu Asimopolos, Alexandru Stanciu, Natalia-Silvia Asimopolos, Bogdan Balea, Andreea Dinu, and Adrian-Aristide Asimopolos

In this paper, we present the results obtained for the geomagnetic data acquired at the Surlari Observatory, located about 30 km north of Bucharest, Romania. The observatory database contains records from the last seven solar cycles, with different sampling rates.

We used AR, MA, ARMA and ARIMA (AutoRegressive Integrated Moving Average) models for time series forecasting and phenomenological extrapolation. The ARIMA model is a generalization of the autoregressive moving average (ARMA) model, fitted to time series data to predict future points in the series.

We performed spectral analysis using the Fourier transform, which gives a relevant picture of the frequency spectrum of the signal components but without locating them in time, whereas wavelet analysis provides information on the time of occurrence of these frequencies.

Wavelet analysis allows local analysis of the magnetic field components through variable-frequency windows. Windows spanning long time intervals allow us to extract low-frequency information, medium-sized intervals of different lengths lead to medium-frequency information, and very narrow windows highlight the high frequencies or details of the analysed signals.
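
The windowing idea can be illustrated with a single level of the Haar wavelet transform: pairwise averages capture the slow component, while pairwise differences localise a transient in time (toy signal, not observatory data):

```python
import numpy as np

def haar_step(signal):
    """One Haar wavelet level: pairwise averages (approximation, low
    frequency) and pairwise differences (detail, high frequency)."""
    s = signal.reshape(-1, 2)
    approx = (s[:, 0] + s[:, 1]) / np.sqrt(2)
    detail = (s[:, 0] - s[:, 1]) / np.sqrt(2)
    return approx, detail

def haar_inverse(approx, detail):
    """Exact reconstruction from one Haar level."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

# Toy "geomagnetic component": slow oscillation plus a short burst.
t = np.arange(64)
sig = np.sin(2 * np.pi * t / 32)
sig[41:44] += 1.0                      # transient, localised in time
approx, detail = haar_step(sig)
print(np.abs(detail).argmax())         # detail coefficients localise the burst
```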

We extend the study of geomagnetic data analysis and predictive modelling by implementing a Long Short-Term Memory (LSTM) recurrent neural network that is capable of modelling long-term dependencies and is suitable for time series forecasting. This method includes a Gaussian process (GP) model in order to obtain probabilistic forecasts based on the LSTM outputs. 

The evaluation of the proposed hybrid model is conducted using the Receiver Operating Characteristic (ROC) Curve that provides a probabilistic forecast of geomagnetic storm events. 
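
An ROC curve and its area can be computed from forecast scores with a short numpy routine; the storm labels and scores below are synthetic placeholders, not actual geomagnetic forecasts:

```python
import numpy as np

def roc_curve(scores, labels):
    """ROC points (FPR, TPR) obtained by sweeping a threshold over scores."""
    order = np.argsort(-scores)
    lab = labels[order]
    tpr = np.cumsum(lab) / lab.sum()
    fpr = np.cumsum(1 - lab) / (1 - lab).sum()
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auc(fpr, tpr):
    """Area under the ROC curve by trapezoidal integration."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Hypothetical storm probabilities from a forecast model vs. observed events.
rng = np.random.default_rng(0)
events = (rng.uniform(size=300) < 0.2).astype(int)
scores = 0.4 * events + 0.6 * rng.uniform(size=300)   # informative forecast
fpr, tpr = roc_curve(scores, events)
print(round(auc(fpr, tpr), 3))   # well above the 0.5 no-skill level
```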

In addition, reliability diagrams are provided in order to support the analysis of the probabilistic forecasting models.

The solution for predicting geomagnetic parameters is implemented in MATLAB, using the Deep Learning Toolbox, which provides a framework for the design and implementation of deep learning models.

In addition to the MATLAB environment, the solution can be accessed, modified, or improved in the Jupyter Notebook computing environment.

How to cite: Asimopolos, L., Stanciu, A., Asimopolos, N.-S., Balea, B., Dinu, A., and Asimopolos, A.-A.: Using AutoRegressive Integrated Moving Average and Gaussian Processes with LSTM neural networks to predict discrete geomagnetic signals, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-10385, https://doi.org/10.5194/egusphere-egu2020-10385, 2020

D2446 |
Alexandr Mansurov and Olga Majlingova

Linked data is a method for publishing structured data in a way that also expresses its semantics. The semantic description is implemented through vocabularies, which are usually specified by the W3C as web standards. However, anyone can create their own vocabulary and register it in an open catalogue like LOV.

There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations [1].

Given the dispersed nature of linked data, we want to infer relationships between Linked Open Data datasets based on their semantic description. In particular we are interested in geospatial relationships.

We present a generic approach for inferring relationships between semantic data cubes that share taxonomies or related dimensions, as well as through structured geographical datasets. Good results were achieved using structural geographical ontologies in combination with the generic approach for taxonomies.
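
The underlying triple-pattern querying can be sketched in plain Python; the identifiers below are invented examples loosely echoing the qb: vocabulary, not actual dataset URIs:

```python
# Data Cube-style observations as RDF-like (subject, predicate, object)
# triples, plus a small geographical ontology; all names are hypothetical.
triples = {
    ("obs1", "qb:dataSet", "ds:precip"),
    ("obs1", "dim:refArea", "geo:Prague"),
    ("obs1", "measure:value", "12.3"),
    ("obs2", "qb:dataSet", "ds:precip"),
    ("obs2", "dim:refArea", "geo:Brno"),
    ("geo:Prague", "geo:within", "geo:CzechRepublic"),
    ("geo:Brno", "geo:within", "geo:CzechRepublic"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which observations refer to areas inside the Czech Republic?
areas = {t[0] for t in match(p="geo:within", o="geo:CzechRepublic")}
obs = sorted({t[0] for t in match(p="dim:refArea") if t[2] in areas})
print(obs)  # ['obs1', 'obs2']
```

Chaining the geographical ontology with the cube's dimension triples is the same pattern used to infer geospatial relationships between datasets.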


[1] Cyganiak, Reynolds, Tennison: The RDF Data Cube Vocabulary, W3C Recommendation, 16 January 2014.

How to cite: Mansurov, A. and Majlingova, O.: Relationships in semantic data cubes, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-11799, https://doi.org/10.5194/egusphere-egu2020-11799, 2020

D2447 |
Sung Dae Kim and Sang Hwa Choi

A pilot machine learning (ML) program was developed to test ML techniques for the simulation of biochemical parameters in the coastal waters of Korea. Temperature, chlorophyll, solar radiation, daylight time, humidity and nutrient data were collected as a training dataset from the public domain and from in-house projects of KIOST (Korea Institute of Ocean Science & Technology). Daily satellite chlorophyll data from MODIS (Moderate Resolution Imaging Spectroradiometer) and GOCI (Geostationary Ocean Color Imager) were retrieved from public services. Daily SST (Sea Surface Temperature) data and ECMWF solar radiation data were retrieved from the GHRSST and Copernicus services. Meteorological and marine observation data were collected from KMA (Korea Meteorological Administration) and KIOST. The output of KIOST's marine biochemical numerical model was also prepared to validate the ML model. The ML program was configured using an LSTM network and TensorFlow. During data processing, some chlorophyll data were interpolated because many values were missing in the satellite dataset. ML training was conducted repeatedly under varying combinations of sequence length, learning rate, number of hidden layers and iterations. 75% of the dataset was used for training and 25% for prediction. The maximum correlation between training data and predicted data was 0.995 when model output data were used as the training dataset. When satellite and observation data were used, correlations were around 0.55. Though the latter correlation is relatively low, the model simulated the periodic variation well, with some differences at peak values. We conclude that the ML model can be applied to the simulation of chlorophyll data if sufficient reliable observation data can be prepared.
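
The gap-filling step can be sketched with linear interpolation over missing values (toy series; the study's actual interpolation scheme may differ):

```python
import numpy as np

def fill_gaps(series):
    """Linearly interpolate NaN gaps, e.g. cloud-masked satellite chlorophyll."""
    t = np.arange(len(series))
    ok = ~np.isnan(series)
    filled = series.copy()
    filled[~ok] = np.interp(t[~ok], t[ok], series[ok])
    return filled

chla = np.array([1.2, np.nan, np.nan, 2.4, 2.0, np.nan, 1.0])
print(fill_gaps(chla))  # [1.2, 1.6, 2.0, 2.4, 2.0, 1.5, 1.0]
```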

How to cite: Kim, S. D. and Choi, S. H.: A test development of a data driven model to simulate chlorophyll data at Tongyeong bay in Korea, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13035, https://doi.org/10.5194/egusphere-egu2020-13035, 2020

D2448 |
Salomon Eliasson, Martin Raspaud, and Adam Dybbroe

As Earth-observing (EO) satellite data volumes grow, servers struggle to keep up with the computational load needed to process even single segments of satellite data with reasonable performance. Pytroll is a suite of free and open-source Python tools, of which the Satpy package makes it easy to read, process and write EO satellite data efficiently. To obtain computational efficiency, Pytroll approaches the performance problem from multiple angles: optimized data processing, built-in support for out-of-memory computation (using the underlying Dask Python library), and distributed processing (using the Dask Distributed tools). In this work, we will show how large volumes of satellite data can be read, processed, resampled, and written swiftly and easily with the Pytroll/Satpy package in a cluster environment. Specifically, examples of efficiently processing Sentinel-1, Sentinel-2, and Himawari/AHI data will be shown, along with performance figures.
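
The out-of-memory idea that Dask provides can be mimicked with a numpy memmap: process a file-backed array one chunk at a time. This is a stdlib-only sketch of the concept, not Satpy's actual Dask machinery:

```python
import os
import tempfile
import numpy as np

# Write a "large" array to disk, then compute its mean one chunk at a time,
# never holding the full data in memory.
fd, path = tempfile.mkstemp(suffix=".dat")
os.close(fd)

full = np.memmap(path, dtype="float32", mode="w+", shape=(1000, 1000))
full[:] = np.arange(1_000_000, dtype="float32").reshape(1000, 1000)
full.flush()

data = np.memmap(path, dtype="float32", mode="r", shape=(1000, 1000))
chunk = 100
total, count = 0.0, 0
for i in range(0, data.shape[0], chunk):
    block = np.asarray(data[i:i + chunk])      # only one chunk in memory
    total += float(block.sum(dtype="float64"))  # accumulate in float64
    count += block.size
print(total / count)  # 499999.5
```

Dask generalises this pattern: it builds a task graph over the chunks and can execute it on threads, processes, or a cluster.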

How to cite: Eliasson, S., Raspaud, M., and Dybbroe, A.: Distributed Earth-Observation satellite data processing with Pytroll/Satpy, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13133, https://doi.org/10.5194/egusphere-egu2020-13133, 2020

D2449 |
JingJing Liu and JianChao Liu

In recent years, unconventional oil and gas exploration and development in China has advanced rapidly and entered a strategic breakthrough period. At the same time, tight sandstone reservoirs have become a highlight of unconventional oil and gas development in the Ordos Basin in China due to their industrial and strategic value. As a digital representation of storage capacity, reservoir evaluation is a vital component of tight-oil exploration and development. Previous work on reservoir evaluation indicated that achieving satisfactory results is difficult because of reservoir heterogeneity and the considerable risk of subjective or technical errors. In the data-driven era, this paper proposes a machine learning method for the quantitative evaluation of tight sandstone reservoirs, based on K-means clustering and random forests applied to high-pressure mercury-injection data. This method not only provides new ideas for reservoir evaluation but can also be used for prediction and evaluation in other areas of oil and gas exploration and production, providing a more comprehensive parameter basis for “intelligent oil fields”. The results show that the reservoirs can be divided into three types, and quantitative reservoir-evaluation criteria were established. The method has strong applicability, yields clear reservoir characteristics, and discriminates observably between reservoir types. These findings have practical implications for reservoirs with ultra-low permeability and complex pore structures.
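
The clustering stage can be sketched with a from-scratch k-means on invented pore-structure features; the study itself pairs K-means with random forests on real mercury-injection curves:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: assign points to the nearest centroid, then update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

# Invented pore-structure features (e.g. displacement pressure, median
# pore-throat radius) for two hypothetical reservoir-quality classes.
rng = np.random.default_rng(1)
good = rng.normal([0.5, 8.0], 0.3, (30, 2))
poor = rng.normal([3.0, 1.0], 0.3, (30, 2))
X = np.vstack([good, poor])
labels, _ = kmeans(X, k=2)
print(labels[:30], labels[30:])  # each block receives a single cluster label
```

A classifier such as a random forest can then be trained on the cluster labels to assign new samples to the established reservoir types.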

How to cite: Liu, J. and Liu, J.: Machine learning technique for the quantitative evaluation of tight sandstone reservoirs using high-pressure mercury-injection data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-17007, https://doi.org/10.5194/egusphere-egu2020-17007, 2020

D2450 |
Saed Asaly, Lee-Ad Gottlieb, and Yuval Reuveni

Ground- and space-based remote sensing technology is one of the most useful tools for near-space environment studies and space weather research. During the last decade, considerable effort in space weather research has been devoted to developing the ability to predict the exact time and location of space weather events such as solar flares and X-ray bursts. Although most of the natural drivers of such events can be modeled numerically, it is still challenging to produce accurate predictions due to insufficiently detailed real-time data. Hence, space weather scientists are trying to learn the patterns of previous data distributions using data mining and machine learning (ML) tools in order to accurately predict future space weather events. Here, we present a new methodology based on the support vector machine (SVM) approach applied to ionospheric Total Electron Content (TEC) data, derived from the worldwide GPS geodetic receiver network, to predict B-, C-, M- and X-class solar flare events. Experimental results indicate that the proposed method can predict X- and M-class solar flare events with 80–94% and 78–93% accuracy, respectively. However, it does not achieve similarly promising results for the smaller C- and B-class flares.
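
A hedged scikit-learn sketch of the classification setup; the features and flare labels below are entirely synthetic stand-ins, not TEC-derived data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Invented summary features of global TEC maps (e.g. mean level, gradient,
# anomaly amplitude) for flare vs. no-flare days.
rng = np.random.default_rng(0)
n = 400
flare = (rng.uniform(size=n) < 0.3).astype(int)
tec_features = rng.normal(0, 1, (n, 3)) + flare[:, None] * [1.5, 1.0, 0.5]

# RBF-kernel SVM; class_weight="balanced" compensates for the rarer flares.
clf = SVC(kernel="rbf", class_weight="balanced")
acc = cross_val_score(clf, tec_features, flare, cv=5).mean()
print(round(acc, 3))
```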

How to cite: Asaly, S., Gottlieb, L.-A., and Reuveni, Y.: Producing solar flare predictions using support vector machine (SVM) applied with ionospheric total electron content (TEC) global maps, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-17319, https://doi.org/10.5194/egusphere-egu2020-17319, 2020

D2451 |
Massimiliano Iurcev, Paolo Diviacco, Simone Scardapane, and Federico Muciaccia

Exploration seismics is the branch of geophysics that explores the underground using the propagation, reflection and refraction of elastic waves generated by artificial sources. Seismic signals cannot be read straightforwardly as geological layers and features but need to be interpreted by experienced analysts who contextualize the possible meaning of a signal within the geologic model under development. It goes without saying that interpretation is an activity biased by background and tacit knowledge, and by perceptive and even sociological factors. Applications of artificial intelligence in this field have gained ground especially within the oil Exploration and Production (E&P) industry, while less has been done in the academic sector. The main target in E&P is the detection of Direct Hydrocarbon Indicators (DHI), highlighted as anomalies in the attribute space using mainly Principal Component Analysis (PCA) and Self-Organizing Map (SOM) methods.
There are, however, seismic signals detectable in the image space that can be associated with specific geological features. Among these, we started to concentrate on the simplest forms, such as seismic diffractions, which can be associated with faults. A diffractor scatters energy in all directions and plots on a seismic section as a hyperbola. The diffraction hyperbola can be hard to detect, especially when the data are contaminated with noise or are not homogeneous, such as when they are integrated from different teams, practices or vintages.
To overcome these difficulties, a large compilation of data was gathered and submitted to experts in order to train a prediction system. Data were gathered from the SDLS (Antarctic Seismic Data Library System), a geoportal maintained by INOGS providing open access to a large collection of multichannel seismic reflection data collected south of 60°S. An interactive application (written in Processing for GUI, open-source and multi-platform requirements) allowed a pool of geophysical researchers to individually mark the hyperbolic features on the seismic traces by simple mouse dragging. Further processing of the collected information in Python, based on geometric algorithms, helped to build a rich training dataset of about 10,000 classified images.
As a first proof of concept, we leverage recent results in deep learning and neural networks to train a predictive model for the automatic detection of hyperbolas in the images. A convolutional neural network (CNN) is trained to map the small pictures extracted beforehand to a probability describing the presence of a hyperbola. We explore different designs for the CNN, following several state-of-the-art guidelines for its architecture, regularization, and optimization. Furthermore, we augment the original dataset in real time with noise and jittering to improve the overall performance. Using the trained CNN, we built heatmaps over a set of testing images, highlighting the regions with a high probability of containing a feature.
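
Generating synthetic training patches follows directly from the diffraction travel-time relation t(x) = sqrt(t0^2 + ((x - x0)/v)^2); the renderer below is an illustrative sketch, with hypothetical parameters and none of the noise/jitter augmentation:

```python
import numpy as np

def diffraction_hyperbola(nx=64, nt=64, x0=32, t0=12.0, v=1.5):
    """Render a point-diffractor travel-time curve
    t(x) = sqrt(t0^2 + ((x - x0) / v)^2) into a blank image patch --
    the kind of positive sample used to train a hyperbola detector."""
    img = np.zeros((nt, nx))
    x = np.arange(nx)
    t = np.sqrt(t0 ** 2 + ((x - x0) / v) ** 2)
    rows = np.round(t).astype(int)
    valid = rows < nt                  # clip branches leaving the patch
    img[rows[valid], x[valid]] = 1.0
    return img

patch = diffraction_hyperbola()
print(patch.shape, patch[12, 32])  # apex of the hyperbola at (t0, x0)
```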

How to cite: Iurcev, M., Diviacco, P., Scardapane, S., and Muciaccia, F.: Recognition of marine seismic data features using convolutional neural networks, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-17431, https://doi.org/10.5194/egusphere-egu2020-17431, 2020

D2452 |
Maxim Samarin, Monika Nagy-Huber, Lauren Zweifel, Katrin Meusburger, Christine Alewell, and Volker Roth

Understanding the occurrence of soil erosion phenomena is of vital importance for ecology and agriculture, especially under changing climate conditions. In Alpine grasslands, susceptibility to soil erosion is predominantly due to the prevailing geological, morphological and climate conditions, but it is also affected by anthropogenic aspects such as agricultural land use. Climate change is expected to have a relevant impact on the driving factors of soil erosion, such as strong precipitation events and altered snow dynamics. In order to assess spatial and temporal changes of soil erosion phenomena and investigate possible reasons for their occurrence, large-scale methods to identify different soil erosion sites and quantify their extent are desirable.

In the field of remote sensing, one such semi-automatic method for (semantic) image segmentation is Object-Based Image Analysis (OBIA), which makes use of the spectral and spatial properties of image objects. In a recent study (Zweifel et al., 2019), we successfully employed OBIA on high-resolution orthoimages (RGB spectral bands, 0.25 to 0.5 m pixel resolution) and derivatives of digital elevation models (DEM) of a study site in the Swiss Alps (Urseren Valley). The method provides high-quality segmentation results and shows an increasing trend (+156 ± 18%) in the total area affected by soil erosion over the period 2000 to 2016. However, OBIA requires expert knowledge and manual adjustments and is time-intensive if satisfying segmentation results are to be achieved. In addition, the parameter settings of the method cannot easily be transferred from one image to another.

To allow for large-scale semantic segmentation of erosion sites, we make use of fully convolutional neural networks (CNNs). In recent years, CNNs have proved to be very effective tools for a variety of image recognition tasks. While training CNNs can be time-demanding, predicting segmentations for new images and previously unseen regions is usually fast. For this study, we train a U-Net with DEM derivatives and high-quality segmentation masks provided by OBIA. The U-Net segmentation results are not only in good agreement with the OBIA results, but they also show a similar increasing trend in the total area affected by soil erosion.

To understand what in the input is “relevant” for the segmentation result, we make use of methods that highlight different regions of the input image, thereby providing a visually interpretable result. We identify these relevant regions with two approaches, based on perturbation of the input image and on relevance propagation of the output signal back to the input image. The former identifies the relevant regions by modifying the input image and considering the changes in the output; the latter tracks the dominant signal from the segmentation output back to the input image, highlighting the relevant regions. Although both approaches pursue the same goal, differences in the relevant regions they identify can be observed.
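The perturbation approach can be illustrated with a minimal occlusion-sensitivity sketch: each block of the input is masked in turn and the drop in the model's output is recorded. The `model` below is a toy scalar stand-in for the trained network, used only to make the sketch runnable:

```python
import numpy as np

def occlusion_map(model, image, patch=8, baseline=0.0):
    """Perturbation-based relevance: occlude each patch x patch block with
    a baseline value and record how much the model's scalar output drops."""
    h, w = image.shape
    ref = model(image)
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = image.copy()
            masked[i:i + patch, j:j + patch] = baseline
            heat[i // patch, j // patch] = ref - model(masked)
    return heat

# toy stand-in for the trained network: mean of the top-left quadrant
model = lambda im: im[:16, :16].mean()
img = np.ones((32, 32))
heat = occlusion_map(model, img)   # only the top-left blocks matter here
```

Blocks whose occlusion changes the output the most are marked as relevant; relevance propagation reaches a comparable map by tracing the output signal backwards through the network instead.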

Zweifel, L., Meusburger, K., and Alewell, C. Spatio-temporal pattern of soil degradation in a Swiss Alpine grassland catchment. Remote Sensing of Environment, 235, 2019.

How to cite: Samarin, M., Nagy-Huber, M., Zweifel, L., Meusburger, K., Alewell, C., and Roth, V.: Visual Understanding in Semantic Segmentation of Soil Erosion Sites in Swiss Alpine Grasslands , EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-17346, https://doi.org/10.5194/egusphere-egu2020-17346, 2020

D2453 |
Nguyen Ha Trang, Yago Diez, and Larry Lopez

The outbreak of fir bark beetles (Polygraphus proximus Blandford) in the natural Abies mariesii forest on Zao Mountain was reported in 2016. With the recent development of deep learning and drones, it is possible to automatically detect trees in both man-made and natural forests, including damaged trees. However, there are still some challenges in using deep learning and drones for sick-tree detection in mountainous areas that we want to address: (i) a mixed forest structure with overlapping canopies, (ii) a heterogeneous distribution of species between sites, (iii) the steep slopes of mountainous areas and (iv) the variation of mountain climate conditions. The current work can be summarized in three stages: data collection, data preparation and data processing. All data were collected by a DJI Mavic 2 Pro at 60–70 m flying height above the take-off point, with ground sampling distances (GSD) ranging from 1.23 cm to 2.54 cm depending on the slope of the site. To prepare the data for processing with a Convolutional Neural Network (CNN), all images were stitched together using Agisoft Metashape to create orthomosaics of the five study sites. Every site has a different percentage of fir according to the change of elevation. We then manually annotated all the mosaics with GIMP to categorize the forest cover into 6 classes: dead fir, sick fir, healthy fir, deciduous trees, grass and uncovered (pathways, buildings and soil). The mosaics are automatically divided by our algorithm into small patches with the assigned categories, with a first trial window size of 200 × 200 pixels, which we tentatively expect to cover medium-sized fir trees. We will also try different window sizes and evaluate how this parameter affects the results. The resulting patches are finally used as input to the CNN architecture to detect the damaged trees.
The work is still ongoing, and we expect the deep learning algorithm to achieve high classification accuracy, allowing us to build maps of the health status of all fir trees.
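The patch-extraction step can be sketched as follows, assuming the mosaic and its annotation mask are NumPy arrays; the majority-class labelling rule and the helper name are illustrative, not the authors' code:

```python
import numpy as np

def extract_patches(mosaic, mask, win=200):
    """Cut an orthomosaic (H, W, C) and its annotation mask (H, W) into
    non-overlapping win x win patches; label each patch with the majority
    class of its mask window."""
    patches, labels = [], []
    h, w = mask.shape
    for i in range(0, h - win + 1, win):
        for j in range(0, w - win + 1, win):
            patches.append(mosaic[i:i + win, j:j + win])
            block = mask[i:i + win, j:j + win]
            labels.append(np.bincount(block.ravel()).argmax())
    return np.array(patches), np.array(labels)

mosaic = np.zeros((400, 400, 3))     # toy mosaic: 2 x 2 windows
mask = np.zeros((400, 400), dtype=int)
mask[:200, :200] = 2                 # one window belongs to class 2
X, y = extract_patches(mosaic, mask)
```

Changing `win` reproduces the planned window-size experiment: smaller windows give more, purer patches at the cost of context around each crown.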


Keywords: Deep learning, CNN, drones, UAVs, tree detection, sick trees, insect damaged trees, forest


How to cite: Ha Trang, N., Diez, Y., and Lopez, L.: Insect Damaged Tree Detection with Drone Data and Deep Learning Technique, Case Study: Abies Mariesii Forest, Zao Mountain, Japan, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-17917, https://doi.org/10.5194/egusphere-egu2020-17917, 2020

D2454 |
Kirill Grashchenkov, Mikhail Krinitskiy, Polina Verezemskaya, Natalia Tilinina, and Sergey Gulev

Polar Lows (PLs) are intense atmospheric vortices that form mostly over the ocean. Due to their strong impact on deep ocean convection and on engineering infrastructure, their accurate detection and tracking is an important task demanded by industrial end-users as well as academic researchers in various fields. While a few PL detection algorithms exist, there are no examples of successful automatic PL tracking methods applicable to satellite mosaics or other data that represent PLs as reliably as remote sensing products. At present, the only reliable way to track PLs is manual tracking, which is highly time-consuming and requires exhaustive examination of the source data by an expert.

At the same time, visual object tracking (VOT) is a well-known problem in computer vision. In our study, we present a novel method for tracking PLs in satellite mosaics based on Deep Convolutional Neural Networks (DCNNs) of a specific architecture. Using the Southern Ocean Mesocyclones database gathered at the Shirshov Institute of Oceanology, we trained our model to perform the assignment task, an essential part of our tracking algorithm. As a proof of concept, we will present preliminary results of our approach for PL tracking over the summer period of 2004 in the Southern Ocean.
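The assignment task at the core of such a tracker can be illustrated with the Hungarian algorithm; here a plain Euclidean distance between detection positions stands in for the DCNN similarity cost used in the actual method:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(prev, curr):
    """Link detections in consecutive mosaics by solving the assignment
    problem (Hungarian algorithm) on a pairwise cost matrix. Euclidean
    distance stands in for the learned similarity cost."""
    cost = np.linalg.norm(prev[:, None, :] - curr[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

prev = np.array([[0.0, 0.0], [10.0, 10.0]])   # PL positions in mosaic t
curr = np.array([[10.5, 9.5], [0.5, 0.5]])    # PL positions in mosaic t+1
links = match_detections(prev, curr)
```

Each `(i, j)` pair extends track `i` with detection `j`; chaining these assignments across consecutive mosaics yields the tracks.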

How to cite: Grashchenkov, K., Krinitskiy, M., Verezemskaya, P., Tilinina, N., and Gulev, S.: Tracking of mesoscale atmospheric phenomena in satellite mosaics using deep neural networks, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-18696, https://doi.org/10.5194/egusphere-egu2020-18696, 2020

D2455 |
| Highlight
Peter Baumann

Datacubes form an accepted cornerstone for analysis- (and visualization-) ready spatio-temporal data offerings. Beyond the multi-dimensional data structure, the paradigm also suggests rich services, abstracting away from the intractable zillions of files and products: actionable datacubes, as established by Array Databases, enable users to ask "any query, any time" without programming. The principle of location-transparent federations establishes a single, coherent information space.

The EarthServer federation is a large, growing data center network offering Petabytes of a critical variety of data, such as radar and optical satellite data, atmospheric data, elevation data, and thematic cubes like global sea ice. Around CODE-DE and the DIASs, an ecosystem of data has been established that is available to users as a single pool, in particular for efficient distributed data fusion irrespective of data location.

In our talk we present technology, services, and governance of this unique intercontinental line-up of data centers. A live demo will show distributed datacube fusion.


How to cite: Baumann, P.: United in Variety: The EarthServer Datacube Federation, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-10849, https://doi.org/10.5194/egusphere-egu2020-10849, 2020

D2456 |
Yuelei Xu

As a transitional area between land and ocean systems, the coastal zone is sensitive to global change and concentrates two thirds of the global population and wealth. Against the background of coastal urbanization and ecological civilization construction in China, more attention has been paid to developing the coastal economy efficiently under strong interference from human activities. However, the lack of a suitable method to evaluate the coastal ecological environment affects the balance between utilization and protection in the coastal zone. This research compared present habitat quality with future habitat quality and used the comparison as an index of the impact of land use on coastal ecological security. The impact of land-use transformation on natural wetlands and on the quality of natural habitats was calculated from coastal land-use data since 1980 and from land use forecast for 2050 under the RCP 4.5 carbon dioxide emission scenario, simulated with the FLUS artificial-intelligence model. The results show that in the last 20 years there has been extensive reclamation along China's coast, especially in the Bohai Bay area, the Yangtze River Delta and the Pearl River Delta; from 1990 to 2010 the reclaimed areas expanded by 272.49 km2, 270.09 km2 and 50.57 km2, respectively. With the economic transformation and ecological priority of the southeast coastal areas in recent years, habitat restoration there is expected to be remarkable by 2050, while habitats in the Bohai Bay area and the Pearl River Delta show an obvious degradation trend. These results, including the 30-metre-resolution habitat quality, can serve as a reference for coastal ecological security maintenance and economic restructuring in different regions. This research will help to build a national ecological security evaluation system, inform future policies for coastal ecological environment protection, and accelerate China's economic transformation.

How to cite: Xu, Y.: Based on Artificial Intelligence Simulation Study on the Impact of Land Use on Coastal Ecological Security in China's Coastal Zone, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-2275, https://doi.org/10.5194/egusphere-egu2020-2275, 2020

D2457 |
Qian Zhang, Dawei Li, Min Niu, and Zhenzhen Wu

Based on the locations and types of past oil and gas fields, new discoveries can be predicted from the tectonic settings of the world's oil and gas fields. Geoscientists can characterize a field by the dominant geological event that influenced the structure's ability to trap and contain oil and gas in recoverable quantities, but in practice multiple factors determine the type of an oil and gas field. In this paper, a data mining approach was used to integrate the factors that determine field type. The factors are evaluated from quantified field data, including general field data, location, well statistics, cumulative production, reserves and reservoir properties. The method comprises four steps. First, a set of attributes is identified to describe the field characteristics. Second, principal component analysis and categorical principal components analysis reduce redundant data and noise by representing the main data variances with a few vector components in a transformed coordinate space. Third, clustering is performed on a proximity matrix between samples. Fourth, Euclidean distance definitions were tested in order to build a meaningful cluster tree. By applying this method to the world's oil and gas field data, we conclude that: (1) the world's fields can be clustered into six types according to the quantified field data; (2) over 20% of the world's fields cluster at top depths between 2000 and 2500 meters; (3) more attributes can be added to this clustering method, and their influence can be evaluated.
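The pipeline described above (standardised attributes, PCA, Euclidean proximity matrix, cluster tree) can be sketched on synthetic data; the group structure and attribute counts below are illustrative only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Synthetic stand-in for the quantified field attributes: two groups of
# fields with clearly different attribute profiles.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(6, 1, (20, 5))])
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardise attributes

# PCA via SVD: keep the first two principal components
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt[:2].T

# Euclidean proximity matrix and cluster tree (Ward linkage)
Z = linkage(pdist(scores), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the tree at six clusters instead of two (`t=6`) would mirror the six field types reported in the abstract.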

How to cite: Zhang, Q., Li, D., Niu, M., and Wu, Z.: A data mining method to identify field type in global oil and gas field case study, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-3856, https://doi.org/10.5194/egusphere-egu2020-3856, 2020

D2458 |
Adrian S. Barfod* and Jakob Juul Larsen

Exploring and studying the Earth system is becoming increasingly important as natural resources are slowly depleted. An important data source is geophysical data, collected worldwide. After collection, the data go through rigorous quality control, pre-processing, and inverse modelling procedures. Such procedures often have manual components and require a trained geophysicist who understands the data in order to translate them into useful information about the Earth system. The sheer amount of geophysical data collected today makes manual approaches impractical. Automating as much of the geophysical data workflow as possible would therefore enable novel opportunities such as fully automated geophysical monitoring systems, real-time modeling during data collection, and larger geophysical data sets.

Machine learning has been proposed as a tool for automating workflows related to geophysical data. The field of machine learning encompasses multiple tools, which can be applied in a wide range of geophysical workflows, such as pre-processing, inverse modeling, data exploration etc.

We present a study where machine learning is applied to automate the time-domain induced polarization workflow. Induced polarization data require pre-processing, which is manual in nature: one step is that a trained geophysicist inspects the data and removes so-called non-geologic signals, i.e. noise that does not represent geological variance. Specifically, a real-world case from Grindsted, Denmark, is presented, where a time-domain induced polarization survey containing seven profiles was conducted. Two profiles were manually processed and used for supervised training of an artificial neural network. The network then automatically processed the remaining profiles of the survey with satisfactory results. Afterwards, the processed data were inverted, yielding the induced polarization parameters of the Cole-Cole model. We discuss the limitations and optimization steps related to training such a classification network.
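A minimal sketch of such a supervised culling step, on synthetic decay curves; the network size and the form of the non-geologic disturbances (rectified noise added to otherwise smooth decays) are illustrative assumptions, not the study's setup:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic gates: class 0 are smooth "geologic" decays, class 1 the same
# decays contaminated with rectified noise bursts, standing in for the
# non-geologic signals that are normally removed by hand.
rng = np.random.default_rng(0)
t = np.linspace(0.01, 1.0, 20)                     # gate times, s
amps = rng.uniform(1, 5, 200)
clean = amps[:, None] * np.exp(-3 * t)[None, :]
noisy = clean + np.abs(rng.normal(0, 1.0, clean.shape))

X = np.vstack([clean, noisy])
y = np.array([0] * 200 + [1] * 200)                # 0 = keep, 1 = cull

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                    random_state=0).fit(X, y)
acc = clf.score(X, y)                              # training accuracy
```

In the real workflow the labels come from the two manually processed profiles, and the trained network is applied to the remaining five.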

How to cite: Barfod*, A. S. and Larsen, J. J.: Automating the pre-processing of time-domain induced polarization data using machine learning, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-6922, https://doi.org/10.5194/egusphere-egu2020-6922, 2020

D2459 |
Octavian Dumitru, Gottfried Schwarz, Dongyang Ao, Gabriel Dax, Vlad Andrei, Chandra Karmakar, and Mihai Datcu

During the last years, one could see a broad use of machine learning tools and applications. However, when we use these techniques for geophysical analyses, we must be sure that the obtained results are scientifically valid and allow us to derive quantitative outcomes that can be directly compared with other measurements.

Therefore, we set out to identify typical datasets that lend themselves well to geophysical data interpretation. To simplify this very general task, we concentrate in this contribution on multi-dimensional image data acquired by satellites with typical remote sensing instruments for Earth observation, used for the analysis of:

  • Atmospheric phenomena (cloud cover, cloud characteristics, smoke and plumes, strong winds, etc.)
  • Land cover and land use (open terrain, agriculture, forestry, settlements, buildings and streets, industrial and transportation facilities, mountains, etc.)
  • Sea and ocean surfaces (waves, currents, ships, icebergs, coastlines, etc.)
  • Ice and snow on land and water (ice fields, glaciers, etc.)
  • Image time series (dynamical phenomena, their occurrence and magnitude, mapping techniques)

Then we analyze important data characteristics for each type of instrument. One can see that most selected images are characterized by their type of imaging instrument (e.g., radar or optical images), their typical signal-to-noise figures, their preferred pixel sizes, their various spectral bands, etc.

As a third step, we select a number of established machine learning algorithms, available tools, software packages, required environments, published experiences, and specific caveats. The comparisons cover traditional “flat” as well as advanced “deep” techniques that have to be compared in detail before making any decision about their usefulness for geophysical applications. They range from simple thresholding to k-means, from multi-scale approaches to convolutional networks (with visible or hidden layers) and auto-encoders with sub-components from rectified linear units to adversarial networks.

Finally, we summarize our findings in several instrument / machine learning algorithm matrices (e.g., for active or passive instruments). These matrices also contain important features of the input data and their consequences, computational effort, attainable figures-of-merit, and necessary testing and verification steps (positive and negative examples). Typical examples are statistical similarities, characteristic scales, rotation invariance, target groupings, topic bagging and targeting (hashing) capabilities as well as local compression behavior.

How to cite: Dumitru, O., Schwarz, G., Ao, D., Dax, G., Andrei, V., Karmakar, C., and Datcu, M.: Selection of Reliable Machine Learning Algorithms for Geophysical Applications, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-7586, https://doi.org/10.5194/egusphere-egu2020-7586, 2020

D2460 |
Eric Petermann, Hanna Meyer, Madlene Nussbaum, and Peter Bossew

The radioactive gas radon (Rn) is considered an indoor air pollutant due to its detrimental effects on human health; it is the second most important cause of lung cancer after tobacco smoking. In most cases, the dominant source of indoor Rn is the ground beneath the building. Following the European Basic Safety Standards, all EU Member States are required to delineate Rn priority areas, i.e. areas with an increased risk of high indoor radon concentrations. One possibility to this end is the “geogenic Rn potential” (GRP), which quantifies the availability of geogenic Rn for infiltration into buildings and is defined as a function of the Rn concentration in soil gas and the soil gas permeability.

In this study we used > 4,000 point measurements across Germany in combination with ~50 environmental co-variables (predictors). We fitted machine learning regression models to the target variables Rn concentration in soil and soil gas permeability. Subsequently, the GRP is calculated from both quantities. We compared the performance of three algorithms: Multivariate Adaptive Regression Splines (MARS), Random Forest (RF) and Support Vector Machines (SVM). Potential candidate predictors are geological, hydrogeological and soil landscape units, soil physical properties, soil chemical properties, soil hydraulic properties, climatic data, tectonic fault data, and geomorphological parameters.

The identification of informative predictors, the tuning of model hyperparameters and the estimation of model performance were conducted using spatial 10-fold cross-validation, where the folds were split by spatial blocks of 40 × 40 km. This procedure counteracts the spatial autocorrelation of predictor and response data and is expected to ensure independence of training and test data. MARS, RF and SVM were evaluated in terms of their prediction accuracy and prediction variance. The results revealed that RF provided the most accurate predictions so far. The effect of the selected predictors on the final map was assessed quantitatively using partial dependence plots and spatial dependence maps. The RF models included 8 and 14 informative predictors for radon and permeability, respectively. The most important predictors were geological and hydrogeological units as well as field capacity for radon, and soil landscape, geological and hydrogeological units for soil gas permeability.
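The spatial block cross-validation can be sketched with scikit-learn's `GroupKFold`, assigning each sample the id of its 40 × 40 km block so that training and test folds never share a block; the data and model settings here are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold

# Synthetic survey: coordinates in metres, five co-variables, one target.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 400_000, size=(300, 2))      # easting / northing, m
X = rng.normal(size=(300, 5))                        # co-variables
y = X[:, 0] + rng.normal(scale=0.1, size=300)        # synthetic response

blocks = (coords // 40_000).astype(int)              # 40 km x 40 km blocks
groups = blocks[:, 0] * 1000 + blocks[:, 1]          # one id per block

scores = []
for train, test in GroupKFold(n_splits=10).split(X, y, groups=groups):
    model = RandomForestRegressor(n_estimators=25, random_state=0)
    model.fit(X[train], y[train])
    scores.append(model.score(X[test], y[test]))     # per-fold R^2
```

Because whole blocks are held out, nearby (autocorrelated) points cannot sit on both sides of a split, giving a less optimistic performance estimate than random k-fold.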

How to cite: Petermann, E., Meyer, H., Nussbaum, M., and Bossew, P.: Mapping the geogenic radon potential for Germany by machine learning, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-8501, https://doi.org/10.5194/egusphere-egu2020-8501, 2020

D2461 |
Jiyunting Sun, J.Pepijn Veefkind, Peter van Velthoven, and Pieternel.F Levelt

The environmental effects of absorbing aerosols are complex: they warm the surface and the atmosphere on a large scale, while locally they cool the surface; they also affect precipitation and cloud formation. A comprehensive understanding of aerosol absorption is important to reduce the uncertainties in aerosol radiative forcing assessments. The ultraviolet aerosol index (UVAI) is a qualitative measure of aerosol absorption provided by multiple satellite missions since 1978. UVAI is calculated directly from the difference between the measured and simulated spectral contrast in the near-UV channel, without assumptions on aerosol properties; this long-term, global, daily data set is advantageous for many applications. In previous work, we attempted to derive the single scattering albedo (SSA) from UVAI. In this work, we evaluate the UVAI derived from a chemistry transport model (CTM) against satellite observations. Conventionally, the UVAI corresponding to the model aerosol fields at a satellite footprint is simulated with a radiative transfer model, which requires assumptions on the spectral dependence of the aerosol optical properties. The lack of measurements and our poor knowledge of these properties may lead to large uncertainties in the simulated UVAI, and these uncertainties are difficult to quantify. Here we propose an alternative: simulating the UVAI with Machine Learning (ML) approaches. A training data set is constructed from independent measurements and/or model simulations with strict quality control. We simulate the UVAI from modelled aerosol properties, the Sun-satellite geometry and surface parameters; the discrepancy between the retrieved UVAI and the ML predictions can help us identify unrealistic inputs of aerosol absorption in the model.

How to cite: Sun, J., Veefkind, J. P., van Velthoven, P., and Levelt, P. F.: Evaluating Modelled Aerosol Absorption by Simulating the UV Aerosol Index using Machine Learning, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-8878, https://doi.org/10.5194/egusphere-egu2020-8878, 2020

D2462 |
Alex Hamer, Daniel Simms, and Toby Waine

Accurate mapping of agricultural area is essential for Afghanistan's annual opium poppy monitoring programme. Access to labelled data remains the main barrier to using deep learning on satellite imagery to automate land cover classification. In this study, we aim to transfer knowledge from historical labelled data of agricultural land, from work on poppy cultivation estimates undertaken between 2007 and 2010, to classify imagery from a range of sensors using deep learning. Fully Convolutional Networks (FCNs) have been used to learn the complex features of agriculture in southern Afghanistan from their inherent spatial and spectral characteristics in satellite imagery. FCNs are trained and validated on labelled Disaster Monitoring Constellation (DMC) data (32 m) to transfer knowledge of agricultural land to the classification of other imagery, such as Landsat (30 m). The dependency on spatial and spectral characteristics is explored using intensity, the Normalised Difference Vegetation Index (NDVI), top-of-atmosphere reflectance and the tasselled cap transformation. The underlying spatial features associated with agriculture are found to play a significant role in agricultural discrimination. High classification performance has been achieved, with over 92% overall accuracy and 0.58 intersection over union. The ability to transfer knowledge from historical datasets to new satellite sensors is an exciting prospect for future automated agricultural land discrimination in the United Nations Office on Drugs and Crime annual opium survey.

How to cite: Hamer, A., Simms, D., and Waine, T.: Using deep learning to transfer knowledge between satellite datasets for automated agricultural land discrimination in Afghanistan, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-9243, https://doi.org/10.5194/egusphere-egu2020-9243, 2020

D2463 |
Zhen Cheng and Qiaofeng Guo

Instruments based on light scattering used to measure total suspended particulate (TSP) concentrations have the advantages of fast response, small size and low cost compared with the gravimetric reference method. However, the relationship between scattering intensity and TSP mass concentration varies nonlinearly with both environmental conditions and particle properties, making corrections difficult. This study applied four machine learning models (support vector machine, random forest, gradient boosting regression trees and an artificial neural network) to correct scattering measurements for TSP mass concentration. A total of 1141 hourly records of collocated gravimetric and light scattering measurements taken at 17 urban sites in Shanghai, China were used for model training and validation. All four machine learning models improved the linear regressions between scattering and gravimetric mass, increasing the slopes from 0.4 to 0.9-1.1 and the coefficients of determination from 0.1 to 0.8-0.9. Partial dependence plots indicate that the TSP concentrations determined by light scattering instruments increased continuously over the PM2.5 concentration range of ~0-80 µg/m3, but levelled off above PM10 and TSP concentrations of ~60 and 200 µg/m3, respectively. The TSP mass concentrations determined by scattering grew exponentially once relative humidity exceeded 70%, in agreement with previous studies on the hygroscopic growth of fine particles. This study demonstrates that machine learning models can effectively improve the correlation between light scattering measurements and filter-based TSP mass concentrations. The interpretation analysis further provides scientific insight into the major factors (e.g., hygroscopic growth) that cause scattering measurements to deviate from TSP mass concentrations, besides other factors such as fluctuations in mass density and refractive index.

Figure 1. Comparison of TSP concentrations determined by light scattering and machine learning model outputs with those determined by gravimetric analyses. (a) LR: Linear Regression; (b) SVM: Support Vector Machine; (c) RF: Random Forest; (d) GBRT: Gradient Boosting Regression Tree; (e) ANN: Artificial Neural Network. y/x is the slope, R2 the coefficient of determination, and N the number of records in the dataset.
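A sketch of such a correction with one of the four models (gradient boosting), on synthetic data that mimic the humidity-driven deviation; the functional form of the hygroscopic growth below is an illustrative assumption, not a result of the study:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic collocated records: the raw scattering signal deviates from
# the gravimetric mass through an assumed humidity-dependent growth factor.
rng = np.random.default_rng(0)
n = 500
rh = rng.uniform(30, 95, n)                           # relative humidity, %
mass = rng.uniform(20, 300, n)                        # gravimetric TSP, ug/m3
growth = np.where(rh > 70, np.exp((rh - 70) / 15), 1.0)
scatter = 0.4 * mass * growth + rng.normal(0, 5, n)   # scattering reading

X = np.column_stack([scatter, rh])
model = GradientBoostingRegressor(random_state=0).fit(X, mass)
pred = model.predict(X)                               # corrected TSP mass
```

Feeding humidity alongside the raw reading lets the tree ensemble undo the nonlinear growth that a single slope-and-intercept correction cannot capture.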


How to cite: Cheng, Z. and Guo, Q.: Correction for the Measurements of Particulate Matter Sensors through Machine Learning, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-12943, https://doi.org/10.5194/egusphere-egu2020-12943, 2020

D2464 |
Big data analysis and achievements of global Petroleum exploration
Shiyun Mi, Zhenzhen Wu, and Qian Zhang
D2465 |
Hao Zhang, Jianguang Han, Heng Zhang, and Yi Zhang

Seismic waves exhibit various types of attenuation while propagating through the subsurface, strongly related to the complexity of the earth. Anelasticity of the subsurface medium, quantified by the quality factor Q, causes dissipation of seismic energy. Attenuation distorts the phase of the seismic data and decays the higher frequencies more than the lower ones. Strong attenuation caused by geology such as gas pockets is a notoriously challenging problem for high-resolution imaging, because it strongly reduces the amplitude and degrades the imaging quality of deeper events. To compensate for this attenuation, the attenuation model (Q) must first be estimated accurately. However, it is challenging to derive a laterally and vertically varying attenuation model in the depth domain directly from surface reflection seismic data. This paper proposes a method to derive the anomalous Q model of strongly attenuative media from marine reflection seismic data using a deep-learning approach, the convolutional neural network (CNN). We treat Q-anomaly detection as a semantic segmentation task and train an encoder-decoder CNN (U-Net) to predict, pixel by pixel, the probability that each pixel of the seismic section belongs to a given attenuation level, which helps to build the attenuation model. The proposed method uses a volume of marine 3D reflection seismic data for network training and validation; only a very small amount of data is needed as the training set thanks to the U-Net, a specific encoder-decoder CNN architecture for semantic segmentation. Finally, to evaluate the predicted heterogeneous Q model, we validate it using de-absorption pre-stack depth migration (Q-PSDM), obtaining a high-resolution depth image with reasonable compensation.

How to cite: Zhang, H., Han, J., Zhang, H., and Zhang, Y.: Deep learning Q inversion from reflection seismic data with strong attenuation using an encoder-decoder convolutional neural network: an example from South China Sea, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-3809, https://doi.org/10.5194/egusphere-egu2020-3809, 2020

D2466 |
Thomas Rieutord and Sylvain Aubert

Atmospheric boundary layer height (BLH) is a key parameter for air quality forecasting. A common way to measure it is with aerosol lidars: a strong decrease in the backscatter signal indicates the top of the boundary layer. This work explains and compares two machine learning methods to derive the BLH from backscatter profiles: the K-means algorithm and the AdaBoost algorithm. As K-means is unsupervised, it depends less on instrument settings and hence generalizes better. AdaBoost was used for binary classification (boundary layer/free atmosphere); it was trained on 2 days of hand-labelled data, so it generalizes less well but represents the diurnal cycle better. Both methods are compared with the lidar manufacturer's software and with the BLH derived from collocated radiosondes, the radiosondes being taken as the reference for all other methods. The comparison covers a 2-year period (2017-2018) at 2 sites (Trappes and Brest), with data from Météo-France's operational network. The code and data that produced these results will be released under a fully open-access licence, under the names KABL (K-means for Atmospheric Boundary Layer) and ADABL (AdaBoost for Atmospheric Boundary Layer).
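The unsupervised idea can be illustrated with a minimal 1-D K-means on a single profile: cluster the log-backscatter into boundary-layer and free-atmosphere groups and take the altitude of the first label change. This is a toy sketch under those assumptions, not the KABL implementation:

```python
import numpy as np

def blh_kmeans(altitude, backscatter, n_iter=50):
    """Cluster one log-backscatter profile into two groups (aerosol-laden
    boundary layer vs clear free atmosphere) and return the altitude of
    the first change of cluster label."""
    x = np.log10(backscatter)
    centers = np.array([x.min(), x.max()])        # init: clear air / aerosol
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() for j in range(2)])
    change = np.nonzero(np.diff(labels))[0]
    return altitude[change[0] + 1] if change.size else altitude[-1]

alt = np.arange(0, 3000, 15.0)                    # range gates, m
profile = np.where(alt < 900, 1e-5, 1e-6)         # sharp aerosol gradient
blh = blh_kmeans(alt, profile)
```

Because no labels are needed, the same clustering can be run on any instrument's profiles, which is the generalization advantage noted above.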

How to cite: Rieutord, T. and Aubert, S.: Mixing height derivation from aerosol lidar using machine learning: KABL and ADABL algorithms, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-19807, https://doi.org/10.5194/egusphere-egu2020-19807, 2020

D2467 |
Juha Kangasluoma, Yusheng Wu, Runlong Cai, Joel Kuula, Hilkka Timonen, Pasi Aalto, Markku Kulmala, and Tuukka Petäjä

Affiliations: Institute for Atmospheric and Earth System Research / Physics, Faculty of Science, University of Helsinki, Finland; Finnish Meteorological Institute, Erik Palménin aukio 1, 00560 Helsinki, Finland


Atmospheric particulate matter is a significant pollutant and causes millions of premature deaths yearly, especially in urban environments. To conduct epidemiological studies and quantify the role of sub-micron particles, and especially of ultrafine particles (<100 nm), in mortality caused by particulate matter, long-term monitoring of particle number, surface area, mass and chemical composition is needed. Such monitoring is currently done on a large scale only for particulate mass, namely PM2.5 (the mass of particulates smaller than 2.5 μm), while a large body of evidence suggests that ultrafine particles, which dominate the number concentration of the aerosol distribution, cause significant health effects that do not originate from particle mass.


The chicken-and-egg problem here is that authorities do not require monitoring of particle number or surface area, owing to the lack of both epidemiological evidence of harm and suitable instrumentation (although the car industry already voluntarily limits ultrafine particle number emissions), while those epidemiological studies are lacking precisely because of the lack of suitable data. Here we present a first step towards solving this lack-of-data issue by predicting aerosol particle size distributions from PM2.5, total particle number and meteorological measurements; from the predicted size distribution, the number, surface area and mass exposure can then be calculated.


We use bagged-tree supervised regression learning (from a MATLAB toolbox) to train an algorithm on one full year of data at 10-min time resolution from the SMEAR3 station in Helsinki during 2018. The response variable is the particle size distribution (each bin separately) and the training variables are PM2.5, particle number and meteorological parameters. The trained algorithm is then applied to the same training variables, but from 2019, to predict size distributions, which are compared directly to the size distributions measured by a differential mobility particle sizer.
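The MATLAB workflow described above can be sketched in Python with scikit-learn's bagged decision trees. Everything below (the predictor ranges and the synthetic response) is a hypothetical stand-in for the SMEAR3 data, not the authors' setup:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500
# Hypothetical predictors: PM2.5 (ug/m3), total particle number (1/cm3),
# and one meteorological variable (temperature, K).
X = rng.uniform([1.0, 1e3, 260.0], [30.0, 5e4, 300.0], size=(n, 3))
# Hypothetical response: number concentration in one size bin, a nonlinear
# function of the predictors plus noise (stand-in for DMPS observations).
y = 0.3 * X[:, 1] * np.exp(-X[:, 0] / 10.0) + rng.normal(0.0, 50.0, n)

# Bagged regression trees, one model per size bin in the real workflow.
model = BaggingRegressor(DecisionTreeRegressor(max_depth=6),
                         n_estimators=50, random_state=0).fit(X, y)
r2 = model.score(X, y)  # coefficient of determination on the training data
```

In the study, one such model is trained per size-distribution bin, and evaluation uses a held-out year (2019) rather than the training data as here.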


To check model performance, we divide the predicted distributions into three size bins, 3-25, 25-100 and 100-1000 nm, and calculate the coefficient of determination (r2) between the measured and predicted number concentrations at 10-min time resolution, obtaining 0.79, 0.60 and 0.50, respectively. We also calculate r2 between the measured and predicted number, surface area and mass exposures, obtaining 0.87, 0.79 and 0.74, respectively. Uncertainties in the prediction are mostly random, so the r2 values will increase at longer averaging times.


Our results show that an algorithm trained with particle size distribution, particle number, PM2.5 and meteorological data can predict particle size distributions and the number, surface area and mass exposures. In practice, these predictions could be realized, e.g., in air-pollution monitoring networks by installing a condensation particle counter at each site and circulating a differential mobility size spectrometer among the sites.



How to cite: Kangasluoma, J., Wu, Y., Cai, R., Kuula, J., Timonen, H., Aalto, P., Kulmala, M., and Petäjä, T.: Supervised regression learning for predictions of aerosol particle size distributions from PM2.5, total particle number and meteorological parameters at Helsinki SMEAR3 station, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13063, https://doi.org/10.5194/egusphere-egu2020-13063, 2020

D2468 |
Tiago G. Morais, Pedro Vilar, Marjan Jongen, Nuno R. Rodrigues, Ivo Gama, Tiago Domingos, and Ricardo F.M. Teixeira

In Portugal, beef cattle are commonly fed with a mixture of grazing and forages/concentrate feed. Sown biodiverse permanent pastures rich in legumes (SBP) were introduced to provide quality animal feed and offset concentrate consumption. SBP also sequester large amounts of carbon in soils. They use biodiversity to promote pasture productivity, supporting a more than doubling in sustainable stocking rate, with several potential environmental co-benefits besides carbon sequestration in soils.
Here, we develop and test a combination of remote sensing and machine-learning approaches to predict the most relevant plant and soil production parameters. For the plants, we included pasture yield, nitrogen and phosphorus content, and species composition (legumes, grasses and forbs); for the soil, organic matter, nitrogen and phosphorus content. For soils, hyperspectral data were obtained in the laboratory (in near-infrared wavelengths) from previously collected soil samples. Remotely sensed multispectral data were acquired from the Sentinel-2 satellite, and several vegetation indices were calculated. The machine-learning algorithms used were artificial neural networks and random forest regressions. We used data collected in late winter/spring from 14 farms (more than 150 data samples) in the Alentejo region, Portugal.
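As an illustration of the vegetation indices mentioned above, one of the most common, NDVI, can be computed from the Sentinel-2 near-infrared (B8) and red (B4) bands; the reflectance values below are illustrative, not from the study:

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalised Difference Vegetation Index from Sentinel-2 band
    reflectances: B8 (near-infrared) and B4 (red)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)  # eps guards against 0/0

# Dense green pasture reflects strongly in NIR and absorbs red:
print(ndvi(0.45, 0.05))  # ~0.8, typical of productive vegetation
```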
The models demonstrated good predictive capacity, with r-squared (r2) higher than 0.70 for most variables and both spectral datasets. The estimation error decreases with the proximity of the spectral data acquisition, i.e. the error is lower with the hyperspectral datasets than with Sentinel-2 data. Further, the results showed no systematic over- or underestimation. The fit is particularly accurate for yield and soil organic matter, with r2 higher than 0.80. Soil organic matter content has the lowest standard estimation error (3 g/kg soil, against an average SOM of 20 g/kg soil), while the legume fraction has the highest estimation error (20% legume fraction).
Results show that a move towards automated monitoring (combining proximal or remote sensing data and machine learning methods) can lead to expedited and low-cost methods for mapping and assessment of variables in sown biodiverse pastures.

How to cite: Morais, T. G., Vilar, P., Jongen, M., Rodrigues, N. R., Gama, I., Domingos, T., and Teixeira, R. F. M.: Characterizing sown biodiverse pastures using remote sensing data with machine learning, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-16142, https://doi.org/10.5194/egusphere-egu2020-16142, 2020

D2469 |
Rasmus Houborg and Giovanni Marchisio

Access to data is no longer a problem. The recent emergence of new observational paradigms combined with advances in conventional spaceborne sensing has resulted in a proliferation of satellite sensor data. This geospatial information revolution constitutes a game changer in the ability to derive time-critical and location-specific insights into dynamic land surface processes. 

However, it’s not easy to integrate all of the data that is available. Sensor interoperability issues and cross-calibration challenges present obstacles in realizing the full potential of these rich geospatial datasets.

The production of analysis ready, sensor-agnostic, and very high spatiotemporal resolution information feeds has an obvious role in advancing geospatial data analytics and machine learning applications at broad scales with potentially far reaching societal and economic benefits. 

At Planet, our mission is to make the world visible, accessible, and actionable. We are pioneering a methodology, the CubeSat-Enabled Spatio-Temporal Enhancement Method (CESTEM), to enhance, harmonize, inter-calibrate, and fuse cross-sensor data streams leveraging rigorously calibrated 'gold standard' satellites (i.e., Sentinel, Landsat, MODIS) in synergy with superior resolution CubeSats from Planet. The result is next generation analysis ready data, delivering clean (i.e., free from clouds and shadows), gap-filled (i.e., daily, 3 m), temporally consistent, radiometrically robust, and sensor agnostic surface reflectance feeds featuring and synergizing inputs from both public and private sensor sources. The enhanced data readiness, interoperability, and resolution offer unique opportunities for advancing big data analytics and positioning remote sensing as a trustworthy source for delivering usable and actionable insights.

How to cite: Houborg, R. and Marchisio, G.: Advanced harmonization and sensor fusion to transform data readiness and resolution for big data analytics , EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13473, https://doi.org/10.5194/egusphere-egu2020-13473, 2020

D2470 |
Erik Bollen, Brianna R. Pagán, Bart Kuijpers, Stijn Van Hoey, Nele Desmet, Rik Hendrix, Jef Dams, and Piet Seuntjens

Monitoring, analysing and forecasting water systems, such as rivers, lakes and seas, is an essential task for environmental agencies and governments. In the region of Flanders, in Belgium, different organisations have united to create the "Internet of Water" (IoW). During this project, 2500 wireless water-quality sensors will be deployed in rivers, canals and lakes all over Flanders. This network of sensors will support more accurate management of water systems by feeding in real-time data. Applications include monitoring real-time water flows, automated warnings and notifications to the appropriate organisations, tracing pollution and predicting salinisation.

Despite the diversity of these applications, most rely on a correct spatial representation and fast querying of the flow path: where does the water flow to, where can it come from, and when does it pass certain locations? In the specific case of Flanders, the human-influenced landscape adds complexity, with rivers, channels, barriers and even cycles. Numerous models and systems can answer the above questions, often very precisely, but they frequently cannot produce results quickly enough for the real-time applicability required in the IoW. Moreover, their rigid data representations make it impossible to integrate new data sources and data types, a particular problem in the IoW, where the data originate from vastly different backgrounds.

In this research, we focus on the performance of spatio-temporal queries, taking into account the spatial configuration of a strongly human-influenced water system and the real-time acquisition and processing of sensor data. Graph-database systems are compared with relational-database systems for storing topologies and executing recursive path-tracing queries. Not only storing and querying are considered: creating and updating the topologies is an essential part as well. Moreover, we investigate the advantages of a hybrid approach that integrates graph databases for spatial topologies with relational databases for temporal and water-system attributes. Fast querying of both upstream and downstream flow-path information is of great use in various applications (e.g., pollution tracking, alerting, relating sensor signals, …). By adding a wrapper library and creating a standardised result-graph representation, this complexity is abstracted away from the individual applications.
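The upstream flow-path query described above can be sketched, independently of any particular database engine, as a reverse breadth-first search over a "flows-to" topology; the node names are hypothetical, and the visited set handles the cycles that human-made networks introduce:

```python
from collections import defaultdict, deque

# Hypothetical "flows to" topology of river/canal segments (contains a cycle).
flows_to = {"a": ["b"], "b": ["c"], "d": ["c"], "c": ["e"], "e": ["b"]}

def upstream(topology, target):
    """All nodes whose water can reach `target`, found by breadth-first
    search over the reversed graph; the visited set makes this terminate
    on cyclic networks."""
    reverse = defaultdict(list)
    for src, dsts in topology.items():
        for dst in dsts:
            reverse[dst].append(src)
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for src in reverse[node]:
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen

print(sorted(upstream(flows_to, "c")))  # ['a', 'b', 'c', 'd', 'e'], the cycle makes 'c' upstream of itself
```

A graph database answers the same question with a recursive traversal; the sketch only shows why cycle handling must be built in.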

How to cite: Bollen, E., R. Pagán, B., Kuijpers, B., Van Hoey, S., Desmet, N., Hendrix, R., Dams, J., and Seuntjens, P.: Design of database systems for optimized spatio-temporal querying to facilitate monitoring, analysing and forecasting in the "Internet of Water", EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-19171, https://doi.org/10.5194/egusphere-egu2020-19171, 2020

D2471 |
Valentín Kivachuk Burdá and Michaël Zamo

Any software relies on data, and the meteorological field is no exception. Using correct and accurate data matters as much as using them efficiently. GRIB and NetCDF are the most popular file formats in meteorology, and exactly the same data can be stored in either. However, they differ in how they treat the data internally, and converting from GRIB (the simpler format) to NetCDF is not enough to ensure the best efficiency for final applications.

In this study, we improved the performance and storage of two projects, "ARPEGE cloud-cover forecast post-processing with convolutional neural networks" and "Precipitation nowcasting using deep neural networks" (proposed in other sessions of this EGU General Assembly). The data handling of both projects was examined and different NetCDF capabilities were applied, yielding significantly faster execution times (up to 60 times faster) and more efficient use of space.
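One NetCDF-4 capability relevant to such optimizations is per-variable compression combined with the HDF5 shuffle filter, which regroups the bytes of numeric values before deflation. Its effect can be illustrated in pure Python (a sketch of the filter's idea, not the project's actual configuration):

```python
import struct
import zlib

def shuffle(buf, itemsize):
    """HDF5-style shuffle filter: group the i-th byte of every element
    together, which makes smooth numeric data far more compressible."""
    n = len(buf) // itemsize
    return bytes(buf[j * itemsize + i]
                 for i in range(itemsize) for j in range(n))

def unshuffle(buf, itemsize):
    """Inverse of shuffle: restore the original element-major byte order."""
    n = len(buf) // itemsize
    return bytes(buf[i * n + j]
                 for j in range(n) for i in range(itemsize))

# A smooth field, like a cloud-cover forecast, stored as 64-bit floats:
data = struct.pack("<1000d", *[0.5 + 0.0001 * i for i in range(1000)])
plain = len(zlib.compress(data))
shuffled = len(zlib.compress(shuffle(data, 8)))
# For smooth data the shuffled stream compresses noticeably better, because
# the exponent and high mantissa bytes form long near-constant runs.
```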

How to cite: Kivachuk Burdá, V. and Zamo, M.: NetCDF: Performance and Storage Optimization of Meteorological Data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-21549, https://doi.org/10.5194/egusphere-egu2020-21549, 2020

D2472 |
Jose E. Adsuara, Adrián Pérez-Suay, Alvaro Moreno-Martínez, Anna Mateo-Sanchis, Maria Piles, Guido Kraemer, Markus Reichstein, Miguel D. Mahecha, and Gustau Camps-Valls

Modeling and understanding the Earth system is of paramount relevance. Modeling the complex interactions among variables in both space and time is a constant and challenging endeavour. When a clear mechanistic model of variable interaction and evolution is unavailable or uncertain, learning from data can be an alternative.

Currently, Earth observation (EO) remote sensing provides almost continuous sampling of the Earth system in space and time, which has been used to monitor our planet with advanced, semi-automatic algorithms able to classify, detect changes, and retrieve relevant biogeophysical parameters of interest. Despite great advances in classification and regression, learning from data remains an elusive problem in machine learning for the Earth sciences. The hardest part turns out to be extracting the relevant information and finding reliable models for summarizing, modeling and understanding the variables and parameters of interest.


We introduce the use of machine-learning techniques to bring systems of ordinary differential equations (ODEs) to light purely from data. Learning ODEs from stochastic variables is a challenging problem, and hence scarcely studied in the literature. Sparse regression algorithms allow us to explore the space of ODEs consistent with the data. Following Occam's razor, and exploiting additional physics-aware regularization, the presented method identifies the most expressive yet simplest ODEs explaining the data. From the learned ODE one not only obtains the underlying dynamical equation governing the system; standard analysis also reveals collapse, turning points, and stability regions. We illustrate the methodology on particular remote sensing datasets quantifying biosphere and vegetation status. The resulting analytical equations are self-explanatory models that may provide insight into these particular Earth subsystems.
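The sparse-regression idea can be sketched on a toy example: recover the logistic equation dx/dt = x - x^2 from a simulated trajectory, using a candidate library of polynomial terms and sequentially thresholded least squares (a simplified stand-in for the method described, without the physics-aware regularization):

```python
import numpy as np

# Synthetic data from logistic growth dx/dt = x - x^2 (the ODE to recover).
t = np.linspace(0.0, 5.0, 5001)
x = 0.1 * np.exp(t) / (1.0 + 0.1 * (np.exp(t) - 1.0))  # exact solution, x0 = 0.1
dxdt = np.gradient(x, t)                               # numerical derivative

# Candidate library of terms: [1, x, x^2, x^3].
library = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Sequentially thresholded least squares: fit, zero small coefficients, refit.
coef = np.linalg.lstsq(library, dxdt, rcond=None)[0]
for _ in range(5):
    small = np.abs(coef) < 0.05
    coef[small] = 0.0
    big = ~small
    coef[big] = np.linalg.lstsq(library[:, big], dxdt, rcond=None)[0]

print(np.round(coef, 2))  # close to [0, 1, -1, 0], i.e. dx/dt = x - x^2
```

Thresholding is what enforces sparsity: the surviving nonzero terms are the simplest ODE consistent with the trajectory.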

How to cite: Adsuara, J. E., Pérez-Suay, A., Moreno-Martínez, A., Mateo-Sanchis, A., Piles, M., Kraemer, G., Reichstein, M., Mahecha, M. D., and Camps-Valls, G.: Learning ordinary differential equations from remote sensing data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-19620, https://doi.org/10.5194/egusphere-egu2020-19620, 2020

D2473 |
Mateusz Norel, Krzysztof Krawiec, and Zbigniew Kundzewicz

Interpretation of flood hazard and its variability remains a major challenge for climatologists, hydrologists and water-management experts. This study investigates links between variability in high river discharge worldwide and inter-annual and inter-decadal climate oscillation indices: the El Niño-Southern Oscillation, North Atlantic Oscillation, Pacific Interdecadal Oscillation, and Atlantic Multidecadal Oscillation. The global river discharge data used here stem from the ERA-20CM-R reconstruction at 0.5-degree resolution and form a multidimensional time series, each observation being a spatial matrix of estimated discharge volume. Spatially aligned elements of these matrices form time series that were used to train dedicated predictive models with machine-learning tools, including multivariate regression (e.g. ARMA) and recurrent neural networks (RNNs), in particular the Long Short-Term Memory (LSTM) model, which has proved effective in many other application areas. The models are thoroughly tested and compared in hindcasting mode on a separate test set and scrutinized with respect to their statistical characteristics. We hope to contribute to an improved interpretation of flood-hazard variability and a reduction of uncertainty.
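A preprocessing step any such comparison needs, whether for ARMA-style regression or an LSTM, is turning each climate-index series into a matrix of lagged predictors; a minimal sketch with hypothetical values:

```python
import numpy as np

def lagged_matrix(series, n_lags):
    """Build a predictor matrix whose row t holds the n_lags previous
    values of the series, paired with the value at time t as the target."""
    series = np.asarray(series, dtype=float)
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

# Hypothetical monthly climate-index values:
idx = np.arange(10.0)
X, y = lagged_matrix(idx, 3)
print(X.shape, y.shape)  # (7, 3) (7,)
print(X[0], y[0])        # [0. 1. 2.] 3.0
```

The same rows can feed an ARMA-type regression directly or, reshaped into sequences, an LSTM.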

How to cite: Norel, M., Krawiec, K., and Kundzewicz, Z.: Learning recurrent transfer functions from data: From climate variability to high river discharge, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13151, https://doi.org/10.5194/egusphere-egu2020-13151, 2020