Free and Open Source Software (FOSS), Cloud-based Technologies and HPC to Facilitate Collaborative Science

Earth science research has become increasingly collaborative through shared code and shared platforms. Researchers work together on data, software and algorithms to answer cutting-edge research questions. Teams also share these data and software with other collaborators to refine and improve these products. This work is supported by Free and Open Source Software (FOSS) and by shared virtual research infrastructures utilising cloud and high-performance computing.
Software is critical to the success of science. Creating and using FOSS enhances collaboration and innovation in the scientific community, creates a peer-reviewed and consensus-oriented environment, and promotes the sustainability of science infrastructures.
This session will showcase solutions and applications based on Free and Open Source Software (FOSS), cloud-based architectures and high-performance computing that support information sharing, scientific collaboration, and large-scale data analytics.

Co-organized by GI2, co-sponsored by AGU
Convener: Jens Klump | Co-conveners: Paolo Diviacco, Kaylin Bugbee, Anusuriya Devaraju, Peter Löwe
Mon, 23 May, 15:55–18:27 (CEST)
Room 0.31/32

Presentations: Mon, 23 May | Room 0.31/32

Chairpersons: Kaylin Bugbee, Paolo Diviacco, Peter Löwe
Metadata and Interoperability
On-site presentation
Lina Stein and Thorsten Wagener

The number of publications in the field of Hydrology (and in other geoscience fields) is rising at an almost exponential rate. In 2021 alone, more than 25 000 articles were listed in Web of Science on the topic of Water Resources. There is a tremendous wealth of knowledge and data hidden in these articles, which capture our experience in studying places, datasets or models. Hidden, because we currently do not possess (or at least, do not use) the necessary tools to access this knowledge resource in an effective manner. It is increasingly difficult for an individual researcher to build on existing knowledge. New ways to approach this problem are urgently needed.  

One approach to address this literature explosion might be to extend article metadata with geoscience-specific information that can facilitate knowledge search, accumulation and synthesis in a domain-specific manner. Imagine one could easily find all studies performed in a specific location, climate or land use, allowing a full picture of the hydrology of that region, climate or land use. For any geoscience, a field that depends strongly on experience, it is important that knowledge is not “forgotten” in a mountain of publications but can easily be integrated into a larger understanding.

So what meta-information would be most useful in knowledge synthesis? Study location? Spatial and/or temporal scale? Models used? Here, we would like to (re-)start the discussion on geoscience-relevant metadata enrichment. With the recent advancement in text mining scholarly literature, it is critical to have this discussion now or fall behind.
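A minimal sketch of what such an enriched metadata record and a domain-specific search over it could look like; all field names and values below are illustrative assumptions, not a proposed standard:

```python
# Sketch of a geoscience-aware article metadata record. Field names
# (study_location, koeppen_climate, models_used, ...) are illustrative
# extensions, not an existing metadata schema.

article = {
    # conventional bibliographic metadata
    "title": "Runoff generation in alpine catchments",
    "year": 2021,
    "keywords": ["runoff", "snowmelt"],
    # proposed geoscience-specific extensions
    "study_location": {"lat": 47.3, "lon": 11.4, "region": "Austrian Alps"},
    "koeppen_climate": "Dfc",
    "land_use": "alpine meadow",
    "spatial_scale_km2": 35.0,
    "temporal_scale": "daily, 1990-2015",
    "models_used": ["HBV"],
}

def matches(record, *, climate=None, land_use=None):
    """Return True if a record matches the requested climate/land-use filters."""
    if climate is not None and record.get("koeppen_climate") != climate:
        return False
    if land_use is not None and record.get("land_use") != land_use:
        return False
    return True

print(matches(article, climate="Dfc"))        # True
print(matches(article, land_use="cropland"))  # False
```

With such fields indexed, "all studies in a given climate and land use" becomes a trivial filter rather than a manual literature screen.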

The Geosciences strongly depend on experiences we gain, which we largely share through the articles we publish. Knowledge accumulation in our science is hindered if this exchange of knowledge becomes ineffective. We are afraid it already has!

How to cite: Stein, L. and Wagener, T.: Knowledge hidden in plain sight – Extending article metadata to support meta-analysis and knowledge accumulation, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-3590, 2022.

On-site presentation
Piotr Zaborowski, Rob Atkinson, Nils Hempelmann, and Marie-Francoise Voidrot

The FAIR data principles are at the core of the OGC mission, realised in the open geospatial standards and the open-data initiatives that use them. Although OGC is best known for technical interoperability, domain modelling and the semantic level play an essential role in standards definition and exploitation. On the one hand, we have a growing number of specialised profiles and implementations that selectively use the OGC modular specification model components. On the other hand, various domain ontologies already exist, enabling a better understanding of the data. As there can be multiple semantic representations, common data models support cross-ontology traversal. Defining a service in this technical-semantic space requires fixing some flexibility points, including optional and mandatory elements, additional constraints and rules, and content such as the normalised vocabularies to be used.

The proposed solution, the OGC Definition Server, is a multi-purpose application built around a triple-store database engine, integrated with ingestion, validation, and entailment tools, and exposing customised endpoints. The models are available in human-readable formats and in machine-to-machine encodings. For manual processes, it enables understanding of the technical and semantic definitions and the relationships between entities. Programmatic solutions benefit from a precise referential system, validation, and entailment.

Currently, the OGC Definition Server hosts several types of definitions, covering:

  • Register of OGC bodies, assets, and their modules
  • Ontological common semantic models (e.g., for Agriculture)
  • Dictionaries of subject domains (e.g., PipelineML Codelists)

In practice, this is a step forward in bridging conceptual and logical models. Concepts can be expressed as instances of various ontological classes and interpreted within multiple contexts, with the definition translated into entities, relationships, and properties. In the future, linking the data to the reference model and to external ontologies may be even more significant. Doing so can greatly improve the quality of the knowledge produced from the collected data. The ability to verify research outcomes and explainable AI are just two examples where a precise log of inferences and unambiguous semantic compatibility of the data will play a key role.
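The triple-store-plus-entailment pattern described above can be sketched in a few lines. This is a deliberately minimal in-memory stand-in, not the OGC Definition Server itself; all URIs and class names are invented for illustration:

```python
# Minimal in-memory triple store illustrating how a definition server can
# hold concepts as instances of ontological classes and answer queries with
# simple subclass entailment. Prefixed names (ogc:, agri:, ex:) are
# illustrative placeholders, not real OGC identifiers.

triples = {
    ("ogc:Observation", "rdf:type", "owl:Class"),
    ("agri:SoilMoistureObs", "rdfs:subClassOf", "ogc:Observation"),
    ("ex:obs42", "rdf:type", "agri:SoilMoistureObs"),
}

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

def types_of(subject):
    """Direct and inferred classes of a subject (one-step subclass entailment)."""
    direct = {o for _, _, o in query(s=subject, p="rdf:type")}
    inferred = set(direct)
    for cls in direct:
        inferred |= {o for _, _, o in query(s=cls, p="rdfs:subClassOf")}
    return inferred

print(types_of("ex:obs42"))  # includes the inferred superclass ogc:Observation
```

A production system would use a real triple store with SPARQL and richer entailment regimes, but the principle (instance data interpreted through multiple ontological contexts) is the same.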

How to cite: Zaborowski, P., Atkinson, R., Hempelmann, N., and Voidrot, M.-F.: Technical-semantic interoperability reference, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10032, 2022.

On-site presentation
Christian Pichot, Nicolas Beudez, Cécile Callou, André Chanzy, Alyssa Clavreul, Philippe Clastre, Benjamin Jaillet, François Lafolie, Jean-François Le Galliard, Chloé Martin, Florent Massol, Damien Maurice, Nicolas Moitrier, Ghislaine Monet, Hélène Raynal, Antoine Schellenberger, and Rachid Yahiaoui

The study of ecosystem characteristics and functioning requires multidisciplinary approaches and mobilises multiple research teams. Data are collected or computed in large quantities but are most often poorly standardised and therefore heterogeneous. In this context, the development of semantic interoperability is a major challenge for the sharing and reuse of these data. This objective is implemented within the framework of the AnaEE (Analysis and Experimentation on Ecosystems) Research Infrastructure, dedicated to experimentation on ecosystems and biodiversity. A distributed Information System (IS) is being developed, based on the semantic interoperability of its components, using common vocabularies (the AnaeeThes thesaurus and an OBOE-based ontology extended for disciplinary needs) to model observations and their experimental context. The modelling covers the measured variables and the different components of the experimental context, from sensor and plot to network. It consists of the atomic decomposition of the observations, identifying the observed entities, their characteristics and qualification, naming standards, and measurement units. This modelling allows the semantic annotation of relational databases and flat files for the production of graph databases. A first pipeline automates the annotation process and the production of the semantic data, a task that would represent enormous conceptual and practical work without such automation. A second pipeline is devoted to the exploitation of these semantic data through the generation of i) standardised GeoDCAT and ISO metadata records and ii) data files (NetCDF format) from selected perimeters (experimental sites, years, experimental factors, measured variables...). Applied to all the data generated by the experimental platforms, this practice will produce semantically interoperable data that meet linked open data standards.
The work carried out contributes to the development and use of semantic vocabularies within the ecology research community. The genericity of the tools makes them usable in different ontology and database contexts.
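The OBOE-style atomic decomposition described above can be sketched as a small data model feeding a triple generator. The structure below is a hedged illustration of the pattern (entity, characteristic, value, unit, context); the vocabulary URIs are placeholders, not actual AnaeeThes or AnaEE identifiers:

```python
# Hedged sketch of OBOE-style atomic decomposition of an observation:
# entity, characteristic, value and standard (unit) are separated so the
# record can be annotated against shared vocabularies. All URIs below are
# illustrative placeholders.

from dataclasses import dataclass

@dataclass
class Measurement:
    entity: str          # observed entity, e.g. a soil horizon
    characteristic: str  # measured characteristic
    value: float
    unit: str            # normalised measurement unit

@dataclass
class Observation:
    site: str            # experimental context: plot, sensor, network...
    measurements: list

obs = Observation(
    site="anaee:plot/FR-Avignon-03",
    measurements=[
        Measurement("soil:HorizonA", "soil:WaterContent", 0.23, "unit:m3_per_m3"),
        Measurement("air:Canopy", "atm:Temperature", 18.4, "unit:degC"),
    ],
)

def to_triples(o):
    """Emit graph-database triples from the decomposed observation."""
    for i, m in enumerate(o.measurements):
        subj = f"{o.site}/obs/{i}"
        yield (subj, "oboe:ofEntity", m.entity)
        yield (subj, "oboe:ofCharacteristic", m.characteristic)
        yield (subj, "oboe:hasValue", str(m.value))
        yield (subj, "oboe:usesStandard", m.unit)

print(len(list(to_triples(obs))))  # 8 triples for two measurements
```

An annotation pipeline of the kind described would apply this decomposition automatically across relational databases and flat files.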

How to cite: Pichot, C., Beudez, N., Callou, C., Chanzy, A., Clavreul, A., Clastre, P., Jaillet, B., Lafolie, F., Le Galliard, J.-F., Martin, C., Massol, F., Maurice, D., Moitrier, N., Monet, G., Raynal, H., Schellenberger, A., and Yahiaoui, R.: Developing semantic interoperability in ecosystem studies: semantic modelling and annotation for FAIR data production, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10213, 2022.

Virtual presentation
Armin Mehrabian, Irina Gerasimov, and Mohammad Khayat

As one of NASA's Science Mission Directorate data centers, the Goddard Earth Sciences Data and Information Services Center (GES-DISC) provides Earth science data, information, and services to the public. One of the objectives of our mission is to facilitate data discovery for users and systems that utilize our data. Metadata plays a very important role in data discovery. As a result, if a dataset is to be used efficiently, it needs to be enhanced with rich and comprehensive metadata. For example, most search engines rely on matching the search query with the indexed metadata in order to find relevant results. Here we present a tool that supports data custodians in the process of creating metadata by utilizing natural language processing (NLP).


Our approach involves combining several text corpora and training a semantic embedding. An embedding is a numerical representation of linguistic features that captures semantics and context. The text corpora we use to train our embedding model contain publication abstracts, our data collection metadata, and ontologies. Our recommendations are based on keywords selected from the Global Change Master Directory (GCMD) and a collection of ontologies including SWEET and ENVO. GCMD offers a comprehensive collection of Earth science vocabulary terms. This data lexicon enables data curators to easily search metadata and retrieve the data, services, and variables associated with each term. When a query is matched against the keywords in a GCMD branch, the probability of the query matching these keywords is calculated; a similarity score is then assigned to each branch, and the branches are sorted according to this similarity metric. In addition to being trained without supervision, our approach has the advantage of producing keyword recommendations for inputs of different sizes, ranging from sub-words to sentences and longer texts.
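The ranking step can be sketched as cosine similarity between an embedded query and embedded keywords. The tiny 3-d vectors below are stand-ins for a trained embedding model, and the keyword strings only mimic GCMD's hierarchical style:

```python
# Sketch of embedding-based keyword recommendation: queries and GCMD-style
# keywords are represented as vectors and ranked by cosine similarity.
# The toy 3-d vectors stand in for a trained semantic embedding.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# hypothetical embeddings for GCMD-style keywords
keyword_vectors = {
    "ATMOSPHERE > PRECIPITATION": [0.9, 0.1, 0.0],
    "OCEAN > SEA SURFACE TEMPERATURE": [0.1, 0.9, 0.2],
    "LAND SURFACE > SOILS": [0.2, 0.1, 0.9],
}

def recommend(query_vec, top_n=2):
    """Rank keywords by similarity to the embedded query."""
    scored = sorted(keyword_vectors.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [k for k, _ in scored[:top_n]]

# a query whose embedding lies close to the precipitation keyword
print(recommend([0.8, 0.2, 0.1]))
```

In the real tool, the same similarity machinery would operate on embeddings learned from abstracts, collection metadata, and ontologies rather than hand-written vectors.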

How to cite: Mehrabian, A., Gerasimov, I., and Khayat, M.: A Natural Language Processing-based Metadata Recommendation Tool for Earth Science Data, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10850, 2022.

Free and Open Source Software
Presentation form not yet defined
Etor E. Lucio-Eceiza, Christopher Kadow, Martin Bergemann, Mahesh Ramadoss, Brian Lewis, Andrej Fast, Jens Grieger, Andy Richling, Ingo Kirchner, Uwe Ulbrich, Hannes Thiemann, and Thomas Ludwig

The complexity of the climate system calls for a combined approach across different knowledge areas. For that, increasingly large projects need a coordinated effort that fosters active collaboration between members. At the same time, although continuously improving numerical models and greater observational data availability provide researchers with a growing amount of data to analyze, the need for resources to host, access, and evaluate these data efficiently through High Performance Computing (HPC) infrastructures is growing more than ever. Finally, the thriving emphasis on the FAIR data principles [1] and on easy reproducibility of evaluation workflows also requires a framework that facilitates these tasks. Freva (Free Evaluation System Framework [2, 3]) is an efficient solution for handling customizable evaluation systems of large research projects, institutes or universities in the Earth system community [4-6] in an HPC environment and in a centralized manner.


Freva is a scientific software infrastructure for standardized data and analysis tools (plugins) that provides all of its features in both shell and web environments. Written in Python, it is equipped with a standardized model database, an application programming interface (API), and a history of evaluations, among other components:

  • A metadata system implemented in Solr, with its own search tool, allows scientists and their plugins to retrieve the required information from a centralized database. The databrowser interface satisfies the international standards provided by the Earth System Grid Federation (ESGF, e.g. [7]).
  • An API allows scientific developers to connect their plugins with the evaluation system independently of the programming language. The connected plugins are able to access from and integrate their results back to the database, allowing for a concatenation of plugins as well. This ecosystem increases the number of scientists involved in the studies, boosting the interchange of results and ideas. It also fosters an active collaboration between plugin developers.
  • The history and configuration sub-system stores every analysis performed with Freva in a MySQL database. Analysis configurations and results can be searched and shared among the scientists, offering transparency and reproducibility, and saving CPU hours, I/O, disk space and time.

Freva efficiently frames the interaction between different technologies thus improving the Earth system modeling science.
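The plugin-plus-history pattern described in the bullets above can be sketched generically. This is an illustrative stand-in, not the actual Freva API: a registry connects analysis tools to the system, and every run is logged so configurations can be shared and reproduced:

```python
# Illustrative sketch (NOT the actual Freva API) of a plugin ecosystem:
# tools register against a common interface, and every run is recorded
# so analysis configurations can be searched, shared and reproduced.

plugins = {}
history = []

def register(name):
    """Decorator connecting an analysis tool to the framework."""
    def wrap(func):
        plugins[name] = func
        return func
    return wrap

def run(name, **config):
    """Run a registered plugin and log the call for reproducibility."""
    result = plugins[name](**config)
    history.append({"plugin": name, "config": config, "result": result})
    return result

@register("anomaly")
def anomaly(value, climatology):
    # toy analysis step: departure from a climatological mean
    return value - climatology

run("anomaly", value=16.2, climatology=14.9)
print(history[-1]["plugin"], round(history[-1]["result"], 2))
```

In Freva itself the history lives in a MySQL database and plugins may be written in any language behind the API, but the reproducibility idea is the same: the logged configuration is sufficient to re-run the analysis.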


This framework has undergone major refactoring and restructuring of the core that will also be discussed. Among others:

  • Major core Python update (2.7 to 3.9).
  • Easier deployment and containerization of the framework via Docker.
  • More secure system configuration via Vault integration.
  • Direct Freva function calls via python client (e.g. for jupyter notebooks).
  • Improvements in the dataset incorporation.

[2] Kadow, C. et al., 2021. Introduction to Freva – A Free Evaluation System Framework for Earth System Modeling. JORS.
How to cite: Lucio-Eceiza, E. E., Kadow, C., Bergemann, M., Ramadoss, M., Lewis, B., Fast, A., Grieger, J., Richling, A., Kirchner, I., Ulbrich, U., Thiemann, H., and Ludwig, T.: Freva, a software framework for the Earth System community. Overview and new features, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11270, 2022.

Presentation form not yet defined
David R. Steward

An open-source framework is presented to support geoscientific investigations of flow, conduction, and wave propagation. The Analytic Element Method (AEM) provides nearly exact solutions to complicated boundary and interface problems, typically with 6-8 significant digits. Examples are presented for seepage of water through soil and aquifers including fractured flow, groundwater/surface water interactions through stream beds, and ecological interactions of plant water uptake. Related applications include waves near coastal features and propagation of tsunamis through bathymetric shoals. This presentation overviews the concise AEM representation from Steward (2020), "Analytic Element Method: Complex Interactions of Boundaries and Interfaces", where solutions discretize the domain into features, develop mathematical representations of interactions, and develop coupled systems of equations to solve boundary conditions.  The companion site at Oxford University Press contains a wide range of open-source solutions to these problems and related applications across the geosciences.
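The core AEM idea, representing each feature as an analytic element and superposing their potentials, can be shown with the simplest possible element. The sketch below uses the classical complex potential of a single extraction well in uniform background potential; parameter values are illustrative and the code is a toy, not the book's companion software:

```python
# Sketch of the analytic element idea for groundwater flow: each element
# (here one extraction well) contributes a complex potential, and
# superposition gives a nearly exact solution everywhere in the domain.
# Parameter values are illustrative.

import cmath

Q = 100.0          # well discharge
z_well = 0 + 0j    # well location in the complex plane
W0 = 50.0          # uniform background potential

def potential(z):
    """Discharge potential: background constant plus a well element."""
    return W0 + (Q / (2 * cmath.pi)) * cmath.log(z - z_well)

# the real part is the discharge potential; equal values trace equipotentials
p1 = potential(1 + 0j).real
p2 = potential(2 + 0j).real
assert p2 > p1  # potential rises away from an extraction well
print(round(p2 - p1, 3))
```

Real AEM solutions add many such elements (line-sinks for streams, inhomogeneities for aquifer patches) and solve a coupled system for their strengths to satisfy the boundary conditions, which is where the quoted 6-8 significant digits come from.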

How to cite: Steward, D. R.: An open-source framework for nearly exact solutions to complex geoscience interactions (AEM), EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10575, 2022.

Coffee break
Chairpersons: Paolo Diviacco, Kaylin Bugbee, Peter Löwe
Virtual presentation
Ayberk Uyanik

Converting dynamic bottom-hole temperatures (BHTs) into static ones, for use either in calibrating basin models or in drilling planning, is a crucial step in hydrocarbon and geothermal exploration projects. However, records of temperature conversions might be ignored or lost from the archives for various reasons, such as a change of project team, a shift of focus to other areas, or simply deletion of data. The disappearance of previous studies not only disrupts geoscientific knowledge but also forces exploration geoscientists to repeat the time-consuming BHT conversion process all over again.

NE Mediterranean Dashboard v1.0 provides a solution to this issue by drawing on the data science instruments of the Python programming language. Implementing Plotly Dash for the front end and PostgreSQL for the back end, which stores the thermal records in data tables, this open-source project offers a user-friendly web application displaying temperature, geothermal gradient and heat-flow profiles in a dashboard style.

The application consists of three tabs. The Overview tab provides statistical information, while the 2D Plots tab allows users to interact with cross-plots showing thermal conditions for all wells or for a particular well selected by the user. It also compares the results of three BHT conversion methods: the Horner plot method, the AAPG correction, and Harrison et al. (1983). The last tab, Map View, shows temperature, geothermal gradient, and heat-flow maps for every 500 metres from the surface down to 4.5 km depth. The maps reveal the effects of regional tectonics and how it controls subsurface thermal behaviour along the Cilicia and Latakia Basins that dominate the NE Mediterranean region.
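Of the three conversion methods mentioned, the Horner plot is the most widely known; a minimal sketch of it follows, with synthetic numbers standing in for real log data (the dashboard's own implementation may differ in detail):

```python
# Hedged sketch of the Horner plot method: successive BHT measurements are
# regressed against log10((tc + dt) / dt), where tc is circulation time and
# dt is time since circulation stopped; extrapolating to ratio -> 1 (x -> 0)
# estimates the static formation temperature. Values are synthetic.

import math

tc = 4.0                      # circulation time, hours
dts = [6.0, 12.0, 24.0]       # hours since circulation stopped
bhts = [96.0, 101.0, 105.0]   # measured bottom-hole temperatures, degC

xs = [math.log10((tc + dt) / dt) for dt in dts]

# least-squares line fit T = slope * x + intercept;
# the intercept is the static temperature estimate
n = len(xs)
mx = sum(xs) / n
my = sum(bhts) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, bhts)) / \
        sum((x - mx) ** 2 for x in xs)
t_static = my - slope * mx

print(round(t_static, 1))  # extrapolated static temperature, degC
```

Because the borehole warms back towards equilibrium after circulation stops, the fitted slope is negative and the extrapolated static temperature exceeds every measured BHT.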

All maps and cross-plots are interactive, and their styles can be changed according to the user’s preferences. They can also be downloaded as images for use in scientific publications and/or presentations. The shared interface and visualisation style, accessed by username and password, also provides consistency across all project members.

The source code is available in a GitHub repository and can readily be adapted for exploration projects in other regions.

How to cite: Uyanik, A.: An open-source web application displaying present-day subsurface thermal conditions of the NE Mediterranean region, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-810, 2022.

Virtual presentation
Jenniffer Carolina Triana-Martinez, Jose A. Fernandez-Gallego, Oscar Barrero, Irene Borra-Serrano, Tom De Swaef, Peter Lootens, and Isabel Roldan-ruiz

For precision agriculture (PA) applications that use aerial platforms, researchers are likely interested in extracting, studying and understanding biophysical and structural properties in a spatio-temporal manner, using remotely sensed imagery to infer variations in vegetation biomass and/or plant vigour, irrigation strategies, nutrient use efficiency, stress, and disease, among others. This requires measuring the spectral response of the crop at specific wavelengths using, for instance, Vegetation Indices (VIs). However, to analyse this spectral response and its heterogeneity and spatial variability, a large amount of aerial imagery (data) must be collected and processed using photogrammetry software. Data extraction is often performed in a Geographic Information System (GIS) and then analysed using (in general) statistical software. A GIS provides the resources to manipulate, analyse, and display all forms of geographically referenced information. In this regard, Quantum GIS (QGIS) is one of the best-known open-source GIS packages and integrates geoprocessing tools from a variety of software libraries. QGIS is widely used to compute VIs through its raster calculator, although this computation requires band rasters to be provided manually by the user, one by one, which is time-consuming. QGIS also provides a Python interface to efficiently exploit the capabilities of a GIS in plugins, but writing such plugins can be a non-trivial task. In this work, we developed a QGIS-independent, semi-automatic tool called ViCTool (Vegetation index Computation Tool), released as free and open-source software (FOSS), for large-scale extraction of VIs from aerial raster images over a region of interest. The tool can extract several multispectral and RGB VIs employing Blue, Green, Red, NIR, LWIR, or Red edge bands.
The user must provide an input folder path containing one or more raster band folders, a shapefile with the regions of interest, an output path to store the VI rasters, and a file defining the VI computations. ViCTool was developed using Python PyQt for the User Interface (UI) and Python GDAL for raster processing, to simplify and speed up the computation of large amounts of data in an intuitive way.
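The pixel-wise index computation at the heart of such a tool can be illustrated with NDVI, one of the most common VIs. Plain nested lists stand in here for the GDAL raster bands the real tool reads:

```python
# Minimal sketch of a vegetation-index computation of the kind ViCTool
# automates: NDVI = (NIR - Red) / (NIR + Red), applied pixel-wise to two
# band rasters. Nested lists stand in for GDAL raster bands.

red = [[0.10, 0.12],
       [0.50, 0.08]]
nir = [[0.60, 0.55],
       [0.52, 0.40]]

def ndvi(red_band, nir_band):
    """Compute NDVI pixel-wise, guarding against division by zero."""
    out = []
    for r_row, n_row in zip(red_band, nir_band):
        out.append([(n - r) / (n + r) if (n + r) else 0.0
                    for r, n in zip(r_row, n_row)])
    return out

result = ndvi(red, nir)
print(round(result[0][0], 3))  # dense vegetation -> NDVI close to 1
```

With GDAL the same arithmetic is applied to whole band arrays read from raster files, which is exactly the step the raster calculator otherwise requires the user to set up band by band.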

How to cite: Triana-Martinez, J. C., Fernandez-Gallego, J. A., Barrero, O., Borra-Serrano, I., De Swaef, T., Lootens, P., and Roldan-ruiz, I.: ViCTool: An open-source tool for vegetation indices computation of aerial raster images using Python GDAL, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-8914, 2022.

Data Infrastructures
Virtual presentation
Lesley Wyborn, Nigel Rees, Jens Klump, Ben Evans, Tim Rawling, and Kelsey Druken

The Australian 2030 Geophysics Collections Project seeks to make accessible online a selection of rawer, high-resolution versions of geophysics datasets that comply with the FAIR and CARE principles, and to ensure they are suitable for programmatic access in HPC environments by the next generation of scalable, data-intensive computation (including AI and ML) expected by 2030. The 2030 project is not about building systems for the infrastructures and stakeholder requirements of today; rather, it is about positioning geophysical data collections to take advantage of next-generation technologies and computational infrastructures by 2030.

There are already many known knowns of 2030 computing: high-end computational power will be at exascale, and today’s emerging collaborative platforms will continue to evolve as a mix of HPC and cloud. Data volumes will be measured in zettabytes (10^21 bytes), about 10 times more than today. It will be mandatory for data access to be fully machine-to-machine, as envisaged by the FAIR principles in 2016. Whereas we currently discuss Big Data Vs (volume, variety, value, velocity, veracity, etc.), by 2030 the focus will be on Big Data Cs (community, capacity, confidence, consistency, clarity, crumbs, etc.).

So often, today’s research is undertaken on pre-canned, analysis-ready datasets (ARD) that are tuned towards the highest common denominator as determined by the data owner. However, increased computational power co-located with fast-access storage systems will mean that geophysicists will be able to work on less processed data levels and then transparently develop their own derivative products, more tuned to the parameters of their particular use case. By 2030, as research teams analyse larger volumes of high-resolution data, they will be able to assess the quality of their algorithms quickly, and there will be multiple versions of open software in use as researchers fine-tune individual algorithms to suit their specific requirements. We will be capable of more precise solutions, and in the hazards space and other relevant areas analytics will be done faster than real time.

The known unknowns emerging are how we will preserve and make transparent any result from this diversity and flexibility with regards to the exact software used, the precise version of the data accessed, and the platforms utilised, etc. When we obtain a scientific ‘product’, how will we vouch for its fidelity and ensure it can be consistently replicated to establish trust? How do we preserve who funded what so that sponsors can see which investments have had the greatest impact and uptake? 

To have any confidence in a data product, we will need transparency throughout the whole scientific process. We need to start working now on more automated systems that capture provenance through successive levels of processing, including how a product was produced and which dataset or dataset extract was used. But how do we do this in a scalable, machine-readable way?
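One machine-readable answer is a provenance chain in which each processing level records its inputs, exact software version, and a hash of its output. The sketch below is a minimal illustration of that idea; the field names and step names are invented, not a proposed standard:

```python
# Sketch of machine-readable provenance capture: each processing level
# records its inputs, software version, and a hash of the output, so a
# product can be traced back through the chain. Field and step names are
# illustrative.

import hashlib
import json

def step(name, software, inputs, output_bytes):
    """Build one provenance record for a processing level."""
    return {
        "step": name,
        "software": software,                     # exact version used
        "inputs": inputs,                         # hashes of upstream products
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
    }

raw = step("acquisition", "logger-fw 2.1", [], b"raw gravity readings")
l1 = step("calibration", "geocal 0.9.2", [raw["output_sha256"]],
          b"calibrated gravity readings")

chain = [raw, l1]
# the chain is serialisable, so it can travel with the data product
record = json.dumps(chain, indent=2)
print(len(chain))
```

Because each record names its upstream hashes, any derivative product can be walked back to the exact raw data and software versions that produced it, which is the replication guarantee the abstract asks for.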

And then there will be the unknown unknowns of 2030 computing. Time will progressively expose these to us in the next decade as the scale and speed at which collaborative research is undertaken increases.


How to cite: Wyborn, L., Rees, N., Klump, J., Evans, B., Rawling, T., and Druken, K.: The Known Knowns, the Known Unknowns and the Unknown Unknowns of Geophysics Data Processing in 2030, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11012, 2022.

Virtual presentation
Bente Lilja Bye, Georgios Sylaios, Arne-Jørgen Berre, Simon Van Dam, and Vivian Kiousi

The ILIAD Digital Twin of the Ocean, a H2020 funded project, builds on the assets resulting from two decades of investments in policies and infrastructures for the blue economy and aims at establishing an interoperable, data-intensive, and cost-effective Digital Twin of the Ocean. It capitalizes on the explosion of new data provided by many different Earth observation sources, advanced computing infrastructures (cloud computing, HPC, Internet of Things, Big Data, social networking, and more) in an inclusive, virtual/augmented, and engaging fashion to address all Earth data challenges. It will contribute towards a sustainable ocean economy as defined by the Centre for the Fourth Industrial Revolution and the Ocean, a hub for global, multistakeholder co-operation.
The ILIAD Digital Twin of the Ocean will fuse a large volume of diverse data in a semantically rich and data-agnostic approach to enable simultaneous communication with real-world systems and models. Ontologies and a standard style-layered descriptor will facilitate semantic information and intuitive discovery of the underlying information and knowledge to provide a seamless experience. The combination of geovisualisation, immersive visualisation and virtual or augmented reality allows users to explore, synthesise, present, and analyse the underlying geospatial data in an interactive manner. To realise its potential, the ILIAD Digital Twin of the Ocean will follow a System of Systems approach, integrating the plethora of existing EU Earth observing and modelling digital infrastructures and facilities. To promote additional applications, the partners will create the ILIAD Marketplace, including a market for geoscience-related applications and services. Like an app store, the ILIAD Marketplace will let providers distribute apps, plug-ins, interfaces, raw data, citizen science data, synthesised information, and value-adding services derived from the ILIAD Digital Twin of the Ocean. It will also be an efficient way for scientists to discover and find relevant applications and services.

How to cite: Bye, B. L., Sylaios, G., Berre, A.-J., Van Dam, S., and Kiousi, V.: Digital Twin of the Ocean - An Introduction to the ILIAD project, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-12617, 2022.

Presentation form not yet defined
Mohan Ramamurthy and Julien Chastang

Unidata has developed and deployed data infrastructure and data-proximate scientific workflows and software tools using cloud computing technologies for accessing, analyzing, and visualizing geoscience data. These resources are provided to educators and researchers through the Unidata Science Gateway, deployed on the U.S. National Science Foundation-funded Jetstream cloud facility. During the SARS-CoV-2/COVID-19 pandemic, the Unidata Science Gateway has been used by many universities to teach data-centric atmospheric science courses and to conduct several software training workshops that advance skills in data science.

The COVID-19 pandemic led to the closure of university campuses with little advance notice. Educators at institutions of higher learning had to urgently transition from in-person teaching to online classrooms. While such a sudden change was disruptive for education, it also presented an opportunity to experiment with instructional technologies that have been emerging over the last few years. Web-based computational notebooks, with their mixture of explanatory text, equations, diagrams and interactive code, are an effective tool for online learning, and their use is prevalent in many disciplines, including the geosciences. Multi-user computational notebook servers (e.g., Jupyter Notebooks) enable specialists to deploy pre-configured scientific computing environments for the benefit of students. The use of such tools and environments removes barriers for students, who would otherwise have to download and install complex software that can be time-consuming to configure, simplifying workflows and reducing time to analysis and interpretation. It also provides a consistent computing environment for all students and democratises access to resources. These servers can be provisioned with computational resources not found in a desktop computing setting and can leverage cloud computing environments and high-speed networks. They can be accessed from any web browser-enabled device, such as laptops and tablets.

Since spring 2020 when the Covid pandemic led to the closure of universities across the U. S., Unidata has assisted several earth science departments with computational notebook environments for their classes. We worked with educators to tailor these resources for their teaching objectives. We ensured the technology was correctly provisioned with appropriate computational resources and collaborated to have teaching material immediately available for students. There were many successful examples of online learning experiences.

In this paper, we describe the details of the Unidata Science Gateway resources and discuss how those resources enabled Unidata to support universities during the COVID-19 lockdown.

How to cite: Ramamurthy, M. and Chastang, J.: The use of the Unidata Science Gateway as a cyberinfrastructure resource to facilitate education and research during COVID-19, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10615, 2022.

Cloud Platforms
Virtual presentation
Maria Salama, Gordon Blair, Mike Brown, and Michael Hollaway

Research in environmental data science is typically transdisciplinary in nature, with scientists, practitioners, and stakeholders creating data-driven solutions to the environment’s grand challenges, often using large amounts of highly heterogeneous data along with complex analytical methods. The concept of virtual labs allows collaborating scientists to explore big data, develop and share new methods, and communicate their results to stakeholders, practitioners, and decision-makers across different scales (individual, local, regional, or national).

Within the Data Science of the Natural Environment (DSNE) project, a transdisciplinary team of environmental scientists, statisticians, computer scientists and social scientists is collaborating to develop statistical and data science algorithms for environmental grand challenges through the medium of a virtual labs platform, named DataLabs. DataLabs, under continuous agile development by UKCEH, is a consistent and coherent cloud-based research environment that advocates open and collaborative science by providing the infrastructure and software tools to bring users with different areas of expertise (scientists, stakeholders, policy-makers, and the public) interested in environmental science into one virtual space to tackle environmental problems. It supports end-to-end analysis, from the assimilation and analysis of data through to the visualisation, interpretation, and discussion of the results.

DataLabs draws on existing technologies to provide a range of functionality and modern tools to support research collaboration, including: (i) parallel data cluster services, such as Dask and Spark; (ii) executable notebook technologies, such as Jupyter, Zeppelin and R; (iii) lightweight applications such as R Shiny to allow rapid collaboration among diverse research teams; and (iv) containerised application deployment (e.g. using Docker) so that the technologies developed can be more easily moved to other cloud platforms as required. Following the principles of service-oriented architectures, the design enables selecting the most appropriate technology for each component and exposing its functions to other systems as services via HTTP. Within each component, a modular layered architecture ensures separation of concerns and a separated presentation layer. DataLabs uses JASMIN as the host computing platform, giving researchers seamless access to HPC resources while taking advantage of cloud scalability. Data storage is available to all systems through shared block storage (an NFS cluster) and object storage (Quobyte S3).
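The parallel data cluster services mentioned above all follow the same basic pattern: partition the data, compute partial results concurrently, and combine them. As a minimal standard-library sketch of that pattern (Dask and Spark generalise it to distributed task graphs; the chunking scheme here is purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_stats(chunk):
    """Partial result for one chunk: (sum, count)."""
    return sum(chunk), len(chunk)

def parallel_mean(values, n_chunks=4):
    """Split the data, process chunks in parallel, combine the partials."""
    size = max(1, len(values) // n_chunks)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(chunk_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

result = parallel_mean(list(range(1, 101)))  # mean of 1..100
```

Because the combine step only needs the per-chunk sums and counts, the chunks never have to sit on the same worker, which is what lets frameworks like Dask scale the same computation across a cluster.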

Research into and development of virtual labs for environmental data science are taking place within the DSNE project. This requires studying the current experiences, barriers and opportunities associated with virtual labs, as well as the requirements for future developments and extensions. For this purpose, we conducted an online user engagement survey targeting DSNE researchers and the wider user community, as well as the international research groups and organisations that contribute to virtual lab design. The survey results are feeding into the continuous development of DataLabs. For instance, researchers' requirements include the ability to submit their own containers to DataLabs and secure access to external data storage. Other users have indicated the importance of having libraries of data science and data visualisation methods, which are currently being populated by DSNE researchers to be explored across different environmental problems.

How to cite: Salama, M., Blair, G., Brown, M., and Hollaway, M.: Virtual Labs for Collaborative Environmental Data Science, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10719, 2022.

Virtual presentation
Fabrizio Antonio, Donatello Elia, Andrea Giannotta, Alessandra Nuzzo, Guillaume Levavasseur, Atef Ben Nasser, Paola Nassisi, Alessandro D'Anca, Sandro Fiore, Sylvie Joussaume, and Giovanni Aloisio

The scientific discovery process has been deeply influenced by the data deluge that started at the beginning of this century. This has caused a profound transformation in several scientific domains, which are now moving towards much more collaborative processes.

In the climate sciences domain, the ENES Data Space aims to provide an open, scalable, cloud-enabled data science environment for climate data analysis. It represents a collaborative research environment, deployed on top of the EGI federated cloud infrastructure, specifically designed to address the needs of the ENES community. The service, developed in the context of the EGI-ACE project, provides ready-to-use compute resources and datasets, as well as a rich ecosystem of open source Python modules and community-based tools (e.g., CDO, Ophidia, Xarray, Cartopy, etc.), all made available through the user-friendly Jupyter interface. 

In particular, the ENES Data Space provides access to a multi-terabyte set of variable-centric collections from large community experiments to support researchers in climate model data analysis. The data pool of the ENES Data Space consists of a mirrored subset of CMIP datasets from the ESGF federated data archive, collected using the Synda community tool in order to provide the most up-to-date datasets in a single location. Results and output products, as well as experiment definitions (in the form of Jupyter Notebooks), can be easily shared among users through data sharing services, such as EGI DataHub, which are also being integrated into the infrastructure.
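A typical variable-centric analysis in such an environment is a weighted spatial aggregation. The sketch below uses NumPy on a synthetic temperature field to illustrate the cosine-latitude weighting involved in a global mean; in the Data Space itself one would operate on the CMIP collections with the provided tools such as Xarray or CDO, and the field values here are invented for illustration only.

```python
import numpy as np

# Synthetic near-surface temperature field on a 1-degree grid (illustrative
# values: warm equator, cool poles), shape (n_lat, n_lon).
lats = np.arange(-89.5, 90.0, 1.0)
lons = np.arange(0.5, 360.0, 1.0)
temps = 288.0 - 30.0 * np.abs(np.sin(np.radians(lats)))[:, None] * np.ones((1, lons.size))

# Grid cells shrink towards the poles, so each latitude row is weighted
# by cos(latitude) when averaging over the sphere.
weights = np.cos(np.radians(lats))
global_mean = np.average(temps.mean(axis=1), weights=weights)
```

Xarray expresses the same idea directly on labelled CMIP data via `ds["tas"].weighted(weights).mean()`, which is why notebook environments with these libraries preinstalled lower the barrier to this kind of analysis.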

The service was opened in the second half of 2021 and is now accessible in the European Open Science Cloud (EOSC) through the EOSC Portal Marketplace. This contribution will present an overview of the ENES Data Space service and its main features.

How to cite: Antonio, F., Elia, D., Giannotta, A., Nuzzo, A., Levavasseur, G., Ben Nasser, A., Nassisi, P., D'Anca, A., Fiore, S., Joussaume, S., and Aloisio, G.: ENES Data Space: an open, cloud-enabled data science environment for climate analysis, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7330, 2022.

On-site presentation
Benjamin Schumacher, Patrick Griffiths, Edzer Pebesma, Jeroen Dries, Alexander Jacob, Daniel Thiex, Matthias Mohr, and Christian Briese

The growing data stream from Earth Observation (EO) satellites has advanced scientific knowledge about the environmental status of planet Earth and has enabled detailed environmental monitoring services. The openEO API, developed in the Horizon 2020 project openEO (2017–2020), demonstrated that large-scale EO data processing needs can be expressed as a common set of analytic operators that are implemented in many GIS or image analysis software products. The openEO Platform service implements the API as an operational, federated service, currently running on back-ends at EODC and VITO with access to Sentinel Hub data, to meet the processing needs of a wide user community.

openEO Platform enables users to access a large collection of open EO data and perform scientific computations with intuitive client libraries that hide the underlying complexity. The platform is currently under development with a strong focus on user co-creation and input from various disciplines, incorporating a range of use cases and a free-of-charge Early Adopter programme that allows users to test the platform and communicate directly with its developers. The use cases include CARD4L-compliant ARD data creation with user-defined parameterisation, forest dynamics mapping including time series fitting and prediction functionality, crop type mapping including EO feature engineering to support machine-learning-based crop mapping, and forest canopy mapping supporting regression-based fractional cover mapping.

The interaction with the platform includes multiple programming interfaces (R, Python, JavaScript) and a browser-based management console and model builder that allow direct, interactive display and modification of processing workflows. The resulting processing graph is then forwarded via the openEO API to the federated back-ends.
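The workflows assembled in the clients and the model builder are serialised into a JSON process graph, the common format the openEO API sends to back-ends: each node names a process and its arguments, "from_node" wires outputs to inputs, and one node is marked as the result. The hand-built sketch below illustrates that structure with the standard library only; the node ids and the NDVI workflow are illustrative, not taken from the platform.

```python
import json

# Hand-built openEO-style process graph for an illustrative NDVI workflow.
process_graph = {
    "load1": {
        "process_id": "load_collection",
        "arguments": {
            "id": "SENTINEL2_L2A",
            "temporal_extent": ["2021-06-01", "2021-06-30"],
        },
    },
    "ndvi1": {
        "process_id": "ndvi",
        "arguments": {"data": {"from_node": "load1"}},
    },
    "save1": {
        "process_id": "save_result",
        "arguments": {"data": {"from_node": "ndvi1"}, "format": "GTiff"},
        "result": True,  # exactly one node carries the final result
    },
}

# A client would submit this JSON to a back-end for execution.
payload = json.dumps({"process": {"process_graph": process_graph}})
```

Because the graph is back-end-agnostic, the same workflow can be dispatched to whichever federated back-end holds the data, which is the basis of the federation described above.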

In the future, users will be able to process continental-scale EO data and create ready-to-use environmental monitoring services with analysis-ready data (ARD) and predefined processes. This presentation will provide an overview of the current capabilities and the evolution roadmap of openEO Platform. It will demonstrate the utility of the platform to process large amounts of EO data into meaningful information products, supporting environmental monitoring, scientific research and political decision-making.

How to cite: Schumacher, B., Griffiths, P., Pebesma, E., Dries, J., Jacob, A., Thiex, D., Mohr, M., and Briese, C.: openEO Platform: Enabling analysis of large-scale Earth Observation data repositories with federated computational infrastructure, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-9101, 2022.

Virtual presentation
Pavel Golodoniuc, Vincent Fazio, Samuel Bradley, YunLong Li, and Jens Klump

The AuScope Virtual Research Environment (AVRE) program's Engage activity was devised as a vehicle to promote low-barrier collaboration projects with Australian universities and publicly funded research agencies and to provide an avenue for exploring new applications and technologies that could become part of the broader AuScope AVRE portfolio. In its second year, we developed two projects with another cohort of project proponents from two Australian research institutions. Both projects leveraged and extended previously developed open-source projects while tailoring them to clients' specific needs.

The latest projects developed under the AuScope AVRE Engage program were the AuScope Geochemistry Network (AGN) Lab Finder application and the Magnetic Component Symmetry (MCS) Analysis application. The Lab Finder application fits within a broader ecosystem of AGN projects. It is an online tool that provides an overview of participating laboratories, their equipment, techniques and contact information, with a catalogue that summarises the capabilities of each analytical technique and a user-friendly search and browsing interface. The MCS Analysis application implements the CSIRO Orthogonal Magnetic Component (OMC) analysis method for the detection of variations in the magnetic field (i.e., anomalies) that result from subsurface magnetisations. Both applications were developed using free and open-source software (FOSS) and leveraged and expanded on prior work. The AGN Lab Finder is an adaptation of the Technique Finder originally developed by Intersect for Microscopy Australia, redesigned to accommodate geochemistry-specific equipment and describe its analytical capabilities. It provides an indexing mechanism and search functionality that allow researchers to efficiently locate laboratories with the equipment necessary for their research and the required analytical capabilities. The MCS Analysis application is a derivative product based on the Geophysical Processing Toolkit (GPT) that implements a user-centred approach to visual data analytics and modelling. It significantly improves the user experience by integrating with open data services, adding complex interactivity and data visualisation functionality, and improving overall exploratory data analysis capability.
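The indexing-and-search pattern behind a tool like the Lab Finder can be sketched in a few lines: build an inverted index from capability keywords to laboratories, then intersect the postings for each query word. The laboratory records, field names and techniques below are purely illustrative, not taken from the actual application.

```python
# Illustrative laboratory records (not real Lab Finder data).
labs = [
    {"name": "Lab A", "techniques": ["U-Pb geochronology", "LA-ICP-MS"]},
    {"name": "Lab B", "techniques": ["Ar-Ar dating", "LA-ICP-MS"]},
]

# Inverted index: lower-cased keyword -> set of lab names offering it.
index = {}
for lab in labs:
    for technique in lab["techniques"]:
        for word in technique.lower().split():
            index.setdefault(word, set()).add(lab["name"])

def find_labs(query):
    """Return labs matching every word of the query."""
    words = query.lower().split()
    hits = [index.get(w, set()) for w in words]
    return sorted(set.intersection(*hits)) if hits else []
```

Here `find_labs("LA-ICP-MS")` returns both labs, while `find_labs("dating")` returns only Lab B; a production tool would add ranking, synonyms and faceted browsing on top of the same idea.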

The Engage approach to running collaborative projects has proved successful over the last two years and has produced low-maintenance tools that are freely accessible to researchers. This approach to engaging a wider audience and improving the speed of science delivery has influenced other projects within the CSIRO Mineral Resources business unit to implement similar programs.

This case study will demonstrate the social aspects of our experience in cross-institutional collaboration, showcase our learnings during the development of pilot projects, and outline our vision for future work.

How to cite: Golodoniuc, P., Fazio, V., Bradley, S., Li, Y., and Klump, J.: Cross-institutional collaboration through the prism of FOSS and Cloud technologies, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10855, 2022.

Virtual presentation
Ulrich Kelka, Chris Peters, Owen Kaluza, Jens Klump, Steven Micklethwaite, and Nathan Reid

The mapping of fracture networks from aerial photographs, tracing of fault scarps in digital elevation models, and digitisation of boundaries from potential field data is fundamental to many geological applications (e.g. resource management, natural hazard assessment, geotechnical stability etc.). However, conventional approaches to digitising geological features are labour intensive and do not scale.

We describe how we designed an automated fracture detection workflow and implemented it in a cloud environment, using free and open-source software, as part of the Australian Scalable Drone Cloud (ASDC), a national initiative. The ASDC aims to standardise and scale drone data, then analyse and translate it for users in academia, government, and industry.

In this use case, we applied automatic ridge/edge detection techniques to generate trace maps of discontinuities (e.g. fractures or lineaments). The approach allows for internal classification based on statistical description and/or geometry and enhances the understanding of the internal structure of such networks. Photogrammetry and image analysis at scale can be limited by the available computing resources, but this issue was overcome through implementation in the cloud. These simple methods serve as a basis for emerging techniques that use machine learning to fully automate discontinuity identification, and they represent an important step in the cultural adoption of such tools in the Earth science community.

To implement this case study, we deployed OpenDroneMap (ODM) on a cloud infrastructure to produce orthophoto mosaics from aerial images taken by UAVs. For the image analysis, we ported a fracture detection and mapping algorithm from MATLAB to Python. The image analysis workflow is orchestrated through a Jupyter Notebook on a JupyterHub. The resulting prototype workflow will be used to better scope the services needed to manage the ASDC platform, such as user management and data logistics.
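The core idea of ridge-based fracture detection can be illustrated without the full toolchain: a dark linear feature in an orthophoto shows up as a line of strong positive second derivative (curvature) in the image intensity. The NumPy sketch below applies that test to a tiny synthetic image; the image, threshold and direction (columns only) are illustrative simplifications, and the ported algorithm is considerably more elaborate.

```python
import numpy as np

# Synthetic grey-level image: bright background with one dark "fracture"
# running down column 4 (values are illustrative).
img = np.full((8, 8), 200.0)
img[:, 4] = 50.0

# Discrete second derivative across columns: f(x-1) - 2 f(x) + f(x+1).
# A dark ridge produces a large positive value at its centre pixel.
curvature = img[:, :-2] - 2 * img[:, 1:-1] + img[:, 2:]
trace = curvature > 100.0                    # boolean trace map (interior cols)

ridge_cols = np.where(trace.any(axis=0))[0] + 1   # shift back to image coords
```

On this image `ridge_cols` picks out column 4 only; a real detector would combine curvature responses over multiple orientations and scales and then vectorise the trace map into fracture polylines for statistical analysis.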

How to cite: Kelka, U., Peters, C., Kaluza, O., Klump, J., Micklethwaite, S., and Reid, N.: UAV Data Analysis in the Cloud - A Case Study, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10796, 2022.