Session ESSI2.2

[Programme]

ESSI2.2 | Data Spaces: Battling Environmental and Earth Science Challenges with Floods of Data

Orals |

Wed, 08:30

Posters on site

Thu, 16:15

Data Spaces: Battling Environmental and Earth Science Challenges with Floods of Data

Co-organized by GI2

Convener: Magdalena Brus | Co-conveners: Kaori OtsuECSECS, Paolo Mazzetti, Lesley Wyborn, Francesca Piatto

Orals

| Wed, 26 Apr, 08:30–10:10 (CEST)

Room 0.51

Posters on site

| Attendance Thu, 27 Apr, 16:15–18:00 (CEST)

Hall X4

Orals: Wed, 26 Apr | Room 0.51

Chairpersons: Lesley Wyborn, Kaori Otsu, Magdalena Brus

08:30–08:40

EGU23-8788

Highlight

On-site presentation

Towards the European Green Deal Data Space

Marta Gutierrez David, Mark Dietrich, Nevena Raczko, Sebastien Denvil, Mattia Santoro, Charis Chatzikyriakou, and Weronika Borejko

The European Commission has a program to accelerate the Digital Transition and is putting forward a vision based on cloud, common European Data Spaces and AI. As the data space paradigm unfolds across Europe, the Green Deal Data Space emerges. Its foundational pillars are to be built by the GREAT project.

Data Spaces will be built over federated data infrastructures with common technical requirements (where possible) taking into account existing data sharing initiatives. Services and middleware developed to enable a federation of cloud-to-edge capacities will be at the disposal of all data spaces.

GREAT, the Green Deal Data Space Foundation and its Community of Practice, has the ambitious goal of defining how data with the potential to help combat climate and environmental related challenges, in line with the European Green Deal, can be shared more broadly among many stakeholders, sectors and boundaries, according to European values such as fair access, privacy and security.

The project will consider and incorporate community defined requirements and use case analyses to ensure that the resulting data space infrastructure is designed and built with and for the users.

An implementation roadmap will guide the efforts of multiple actors to converge toward a blueprint technical architecture, a data governance scheme that enables innovative business cases, and an inventory of high value datasets, that will enable proof of concept, implementation and scale-up of a minimum viable green deal data space.This roadmap will identify the resources and other key ingredients needed for the Green Deal Data Space to be successful. Data sharing by design and data sovereignty are some of the main principles that will apply from the early stages ensuring cost effective and sustainable infrastructures that will drive Europe towards a single data market and green economic growth.

This talk will present how to engage with the project, the design methodology, progress towards the roadmap for deployment and the collaborative approach to building data spaces in conjunction with all the sectoral data spaces and the Data Space Support Centre.

How to cite: Gutierrez David, M., Dietrich, M., Raczko, N., Denvil, S., Santoro, M., Chatzikyriakou, C., and Borejko, W.: Towards the European Green Deal Data Space, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-8788, https://doi.org/10.5194/egusphere-egu23-8788, 2023.

08:40–08:50

EGU23-5936

On-site presentation

What does the European Spatial Data Infrastructure INSPIRE need in order to become a Green Deal Data Space?

Joan Masó, Alba Brobia, Ivette Serral, Ingo Simonis, Francesca Noardo, Lucy Bastin, Carlos Cob Parro, Joaquín García, Raul Palma, and Sébastien Ziegler

In May 2007, the INSPIRE directive established the path towards creating the European Spatial Data Infrastructure (ESDI). While the Joint Research Centre (JRC) defined a set of detailed implementation guidelines, the European member states determined the agencies responsible for delivering the different topics specified in the directive’s annexes. INSPIRE’s goal was - and still is - to organize and share Europe’s data supporting environmental policies and actions. However, the way that INSPIRE was defined limited contributions to the public sector, and limited topics to those specifically listed in its annexes. Technical challenges and a lack of appropriate tools have impeded INSPIRE from implementing its own guidelines, and even after 15 years, the dream of a continuous, consistent description of Europe’s environment has still not completely materialized. We should apply the lessons learnt in INSPIRE when we build the Green Deal Data Space (GDDS). To create the GDDS, we should start with ESDI (the European Spatial Data Infrastructure), but also engage and align with the ongoing preparatory actions for data spaces (e.g., for green deal and agriculture) as well as include actors and networks that have emerged or been organized in the recent years. These include: networks of in situ observations (e.g. the Environmental Research Infrastructures (ENVRI) community); Citizen Science initiatives (such as the biodiversity observations integrated in the Global Biodiversity Information Facility (GBIF), or sensor communities for e.g. air quality); predictive algorithms and machine learning models and simulations based on artificial intelligence (such as the ones deployed in the European Open Science Cloud, International Data Space Association and Gaia-X; services driven both by the scientific community and the private sector); remote sensing derived products developed by the Copernicus Services. Most of these data providers have already embraced the FAIR principles and open data, providing many examples of best practice which can assist newer adopters on the path to open science. In the Horizon Europe project AD4GD (AllData4GreenDeal), we believe that, instead of trying to force data producers to adopt cumbersome new protocols, we should take advantage of the latest developments in geospatial standards and APIs. These allow loosely coupled but well documented and interlinked data sources and models in the GDDS while achieving scientifically robust integration and easy access to data in the resulting workflows. Another fundamental element will be the adoption of a common and extensible information model enabling the representation and exchange of Green Deal related data in an unambiguous manner, including vocabularies for Essential Variables to organize the observable measurements and increase the level of semantic interoperability. This will allow systems and components from different technology providers to seamless interoperate and exchange data, and to have an integrated view and access to exploit the full value of the available data. The project will validate the approach in three pilot cases: water quality and availability of Berlin lakes, biodiversity corridors in the metropolitan area of Barcelona and low cost air quality sensors in Europe. The AD4GD project is funded by the European Union under the Horizon Europe program.

How to cite: Masó, J., Brobia, A., Serral, I., Simonis, I., Noardo, F., Bastin, L., Cob Parro, C., García, J., Palma, R., and Ziegler, S.: What does the European Spatial Data Infrastructure INSPIRE need in order to become a Green Deal Data Space?, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-5936, https://doi.org/10.5194/egusphere-egu23-5936, 2023.

08:50–09:00

EGU23-7842

Highlight

On-site presentation

Destination Earth - Processing Near Data and Massive Data Handling

Danaele Puechmaille, Jordi Duatis Juarez, Miruna Stoicescu, Michael Schick, and Borys Saulyak

Destination Earth is an operational service under the lead of the European Commission being implemented jointly by ESA, ECMWF and EUMETSAT.

The presentation will provide insights of how Destination Earth provides Near Data Processing and deals with Massive Data.

The objective of the European Commission’s Destination Earth (DestinE) initiative is to deploy several highly accurate digital replicas of the Earth (Digital Twins) in order to monitor and simulate natural as well as human activities and their interactions, to develop and test “what-if” scenarios that would enable more sustainable developments and support European environmental policies. DestinE addresses the challenge to manage and make accessible the sheer amount of data generated by the Digital Twins and observation data located at external sites such as the ones depicted in the figure below. This data will be made available fast enough and in a format ready to support analysis scenarios proposed by the DestinE service users.

Figure 1 : DestinE Data Sources (green) and Stakeholders (orange)

The “DestinE Data Lake” (DEDL) is one of the three Destination Earth components interacting with:

the Digital Twin Engine (DTE), which runs the simulation models, under ECMWF responsibility
the DestinE Core Service Platform (DESP), which represents the user entry point to the DestinE services and data, under ESA responsibility

The DestinE Data Lake (DEDL) fulfils the storage and access requirements for any data that is offered to DestinE users. It provides users with a seamless access to the datasets, regardless of data type and location. Furthermore, the DEDL supports big data processing services, such as near-data processing to maximize throughput and service scalability. The data lake is built inter alia upon existing data lakes such as Copernicus DIAS, ESA, EUMETSAT, ECMWF as well as complementary data from diverse sources like federated data spaces, in-situ or socio-economic data. The DT Data Warehouse is a sub-component of the DEDL which stores relevant subsets of the output from each digital twin (DT) execution being powered by ECMWFs Hyper-Cube service.

During the session, EUMETSAT’s representative will share to the community how the Destination Earth Data Lake component implements and takes advantage of Near Data Processing and also how the System handles massive data access and exchange. The Destination Earth Data Portfolio will be presented.

Figure 2: Destination Earth Data Portfolio

How to cite: Puechmaille, D., Duatis Juarez, J., Stoicescu, M., Schick, M., and Saulyak, B.: Destination Earth - Processing Near Data and Massive Data Handling, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7842, https://doi.org/10.5194/egusphere-egu23-7842, 2023.

09:00–09:10

EGU23-5862

On-site presentation

The ESA Green Transition Information Factories – using Earth Observation and cloud-based analytics to address the Green Transition information needs.

Patrick Griffiths, Stefanie Lumnitz, Christian Retscher, Frank-Martin Seifert, and Yves-Louis Desnos

In response to the global climate and sustainability crisis, many countries have expressed ambitions goals in terms of carbon neutrality and a green economy. In this context, the European Green Deal comprises several policy elements aimed to achieve carbon neutrality by 2050.

In response to these ambitions, the European Space Agency (ESA) is initiating various efforts to leverage on space technologies and data and support various Green Deal ambitions. The ESA Space for Green Future (S4GF) Accelerator will explore new mechanisms to promote the use of space technologies and advanced modelling approaches for scenario investigations on the Green Transition of economy and society.

A central element of the S4GF accelerator are the Green Transition Information Factories (GTIF). GTIF takes advantage of Earth Observation (EO) capabilities, geospatial and digital platform technologies, as well as cutting edge analytics to generate actionable knowledge and decision support in the context of the Green Transition.

A first national scale GTIF demonstrator has now been developed for Austria.
It addressed the information needs and national priorities for the Green Deal in Austria. This is facilitated through a bottom-up consultation and co-creation process with various national stakeholders and expert entities. These requirements are matched with various EO industry teams that

The current GTIF demonstrator for Austria (GTIF-AT) builds on top of federated European cloud services, providing efficient access to key EO data repositories and rich interdisciplinary datasets. GTIF-AT initially addresses five Green Transition domains: (1) Energy Transition, (2) Mobility Transition, (3) Sustainable Cities, (4) Carbon Accounting and (5) EO Adaptation Services.

For each of these domains, scientific narratives are provided and elaborated using scrollytelling technologies. The GTIF interactive explore tools allow various users to explore the domains and subdomains in more detail to investigate better understand the challenges, complexities, and underlying socio-economic and environmental conflicts. The GTIF interactive explore tools combine domain specific scientific results with intuitive Graphical User Interfaces and modern frontend technologies. In the GTIF Energy Transition domain, users can interactively investigate the suitability of locations at 10m resolution for the expansion of renewable (wind or solar) energy production. The tools also allow investigating the underlying conflicts e.g., with existing land uses or biodiversity constraints. Satellite based altimetry is used to dynamically monitor the water levels in hydro energy reservoirs to infer the related energy storage potentials. In the sustainable cities’ domain, users can investigate the photovoltaic installments on rooftops and assess the suitability in terms of roof geometry and expected energy yields.

GTIF enables various users to inform themselves and interactively investigate the challenges but also opportunities related to the Green Transition ambitions. This enables, for example, citizens to engage in the discussion process for the renewable energy expansion or support energy start-ups to develop new services. The GTIF development follows an open science and open-source approach and several new GTIF instances are planned for the next years, addressing the Green Deal information needs and accelerating the Green Transition. This presentation will showcase some of the GTIF interactive explore tools and provide an outlook on future efforts.

How to cite: Griffiths, P., Lumnitz, S., Retscher, C., Seifert, F.-M., and Desnos, Y.-L.: The ESA Green Transition Information Factories – using Earth Observation and cloud-based analytics to address the Green Transition information needs., EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-5862, https://doi.org/10.5194/egusphere-egu23-5862, 2023.

09:10–09:20

EGU23-14237

On-site presentation

Environmental data value stream as traceable linked data - Iliad Digital Twin of the Ocean case

Piotr Zaborowski, Rob Atkinson, Alejandro Villar Fernandez, Raul Palma, Ute Brönner, Arne Berre, Bente Lilja Bye, Tom Redd, and Marie-Françoise Voidrot

In the distributed heterogeneous environmental data ecosystems, the number of data sources, volume and variances of derivatives, purposes, formats, and replicas are increasingly growing. In theory, this can enrich the information system as a whole, enabling new data value to be revealed via the combination and fusion of several data sources and data types, searching for further relevant information hidden behind the variety of expressions, formats, replicas, and unknown reliability. It is now visible how complex data alignment is, and even more, it is not always justified due to capacity and business issues. One of the challenging, but also most rewarding approaches is semantic alignment, which promises to fill the information gap of data discovery and joins. To formalise one, an inevitable enabler is an aligned, linked, and machine readable data model enabling the specification of relations between data elements generated information. The Iliad - digital twins of the ocean are cases of this kind, where in-situ data and citizen science observations are mixed with multidimensional environmental data to enable data science and what-if models implementation and to be integrated into even broader ecosystems like the European Digital Twin Ocean (EDITO) and European Data Spaces. An Ocean Information Model (OIM) that will enable traversals and profiles is the semantic backbone of the ecosystem. Defined as the multi-level ontology, it will explain data using well known generic (Darwin Core, WoT), spatio-temporal (SOSA/SSN, OGC Geo, W3C Time, QUDT, W3C RDF Data Cube, WoT) and domain (WORMS, AGROVOC) ontologies. Machine readability and unambiguity allow for both automated validation and some translations.
On the other hand, efficient use of this requires yet another skill in data management and development besides GIS, ICT and domain expertise. In addition, as the semantics used in the data and metadata have not yet been stabilised on the implementation level, it introduces a few more flexibilities of data expression. Following the GEO data sharing and data management principles along with FAIR, CARE and TRUST, the environmental data is prepared for harmonisation. Furthermore, to ease the entry and to harmonise conventions, the authors introduce a multi-touchpoint data value chain API suite with an aligned approach to semantically enrich, entail and validate data sets such as observations streams in JSON or JSON-LD based on OIM, through storage and scientific data in NetCDF to exposing this semantically aligned data via the newly endorsed and already successful OGC Environmental Data Retrieval API. The practical approach is supported by a ready-to-use toolbox of components that presents portable tools to build and validate multi-source geospatial data integrations keeping track of the information added during mesh-up and predictions and what-if implementations.

How to cite: Zaborowski, P., Atkinson, R., Villar Fernandez, A., Palma, R., Brönner, U., Berre, A., Bye, B. L., Redd, T., and Voidrot, M.-F.: Environmental data value stream as traceable linked data - Iliad Digital Twin of the Ocean case, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-14237, https://doi.org/10.5194/egusphere-egu23-14237, 2023.

09:20–09:30

EGU23-9291

On-site presentation

Harmonising the sharing of marine observation data considering data quality information

Simon Jirka, Christian Autermann, Joaquin Del Rio Fernandez, Markus Konkol, and Enoc Martínez

Marine observation data is an important source of information for scientists to investigate the state of the ocean environment. In order to use data from different sources it is critical to understand how the data was acquired. This includes not only information about the measurement process and data processing steps, but also details on data quality and uncertainty. The latter aspect becomes especially important if data from different types of instruments shall be used. An example for this is the combined use of expensive high-precision instruments in conjunction with lower-cost but less precise instruments in order to densify observation networks.

Within this contribution we will present the work of the European MINKE project which intends, among further objectives, to facilitate the quality-aware and interoperable exchange of marine observation data.

For this purpose, a comprehensive review of existing interoperability standards and encodings has been performed by the project partners. This included aspects such as:

standards for encoding observation data
standards for describing sensor data (metadata)
Internet of Things protocols for transmitting data from sensing devices
interfaces for data access

From a technical perspective, the evaluation has especially considered developments such as the OGC API family of standards, lightweight data and metadata encodings, as well as developments coming from the Internet of Things community. This has been complemented by an investigation of relevant vocabularies that may be used for enabling semantic interoperability through a common terminology within data sets and corresponding metadata.

Furthermore, specific consideration was given to the description of different properties that help to assess the quality of an observation data sets. This comprises not only the description of the data itself but also quality related aspects of data acquisition processes. For this purpose, the MINKE project is working on recommendations how to enhance the analysed (meta) data models and encodings with further elements to better transport critical information for better interpreting data sources with regard to the accuracy, uncertainty and re-usability.

Within our contribution we will present the current state of the work within the MINKE project, the results achieved so far and the practical implementations that are performed in cooperation with the project partners.

How to cite: Jirka, S., Autermann, C., Del Rio Fernandez, J., Konkol, M., and Martínez, E.: Harmonising the sharing of marine observation data considering data quality information, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-9291, https://doi.org/10.5194/egusphere-egu23-9291, 2023.

09:30–09:40

EGU23-3657

On-site presentation

CADS 2.0: A FAIRest Data Store infrastructure blooming in a landscape of Data Spaces.

Angel lopez alos, Baudouin raoult, Edward comyn-platt, and James varndell

First launched as the Climate Data Store (CDS) supporting the Climate Change Service (C3S) and later instantiated as the Atmosphere Data Store (ADS) for the Atmosphere Monitoring Service (CAMS), the shared underlaying Climate & Atmosphere Data Store Infrastructure (CADS) represents the technical backbone for the implementation of Copernicus services entrusted to ECMWF on behalf of the European Commission. CDS in addition also offer access to a selection of datasets from the Emergency Management Service (CEMS). As the flagship instance of the infrastructure, CDS counts with more than 160k registered users and delivers a daily average over 100 TBs of data from a catalogue of 141 datasets.

CADS Software Infrastructure is designed as a distributed system and open framework that facilitates improved access to a broad spectrum of data and information via a powerful service-oriented architecture offering seamless web-based and API-based search and retrieve capabilities. CADS also provides a generic software toolbox that allow users to make use of available datasets and a series of state-of-the-art data tools that can be combined into more elaborated processes, and present results graphically in the form of interactive web applications. CADS Infrastructure is hosted in an on-premises Cloud physically located within ECMWF Data Centre and implemented using a collection of virtual machines, networks and large data volumes. Fully customized instances of CADS, including dedicated Virtual Hardware Infrastructure, Software Application and Catalogued content can be easily deployed thanks to implemented automatization and configuration software tools and a set of configuration files which are managed by a distributed version control system. Tailored scripts and templates allow to easily accommodate different standards and interoperate with external platforms.

ECMWF in partnership with EUMETSAT, ESA and EEA also implement the Data and Information Access Services (DIAS) platform called WEkEO, a distributed cloud-computing infrastructure used to process and make the data generated by Copernicus Services accessible to users together with derived products and all satellite data from the Copernicus Sentinels. Within the partnership ECMWF is responsible for the procurement of the software to implement Data Access Services, Processing and Tools which specifications build on the same fundamentals than CADS. Adoption of FAIR principles has demonstrated cornerstone to maximize synergies and interactions between CADS, WEkEO and other related platforms.

Driven by the increasing demand and the evolving landscape of platforms and services a major project for the modernization of the CADS infrastructure is currently underway. The coming CADS 2.0 aims to capitalize experience, feedbacks, lesson learned, know-how from current CADS, embrace advanced technologies, engage with a broader user community, make the current platform more versatile and cloud oriented, improve workflows and methodologies, ensure compatibility with state-of-the-art solutions such as machine learning, data cubes and interactive notebooks, consolidate the adoption of FAIR principles and strength synergies with related platforms.

As complementary Infrastructures, WEkEO will allow users to harness compute resources without the networking and storage costs associated with public Cloud offerings in where CADS Toolbox 2.0 will deploy and run allowing heavy jobs (retrieval and reduction) to be submitted to CADS 2.0 core infrastructure as services.

How to cite: lopez alos, A., raoult, B., comyn-platt, E., and varndell, J.: CADS 2.0: A FAIRest Data Store infrastructure blooming in a landscape of Data Spaces., EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3657, https://doi.org/10.5194/egusphere-egu23-3657, 2023.

09:40–09:50

EGU23-10223

On-site presentation

Identifying and Describing Billions of Objects: an Architecture to Tackle the Challenges of Volume, Variety, and Variability

Jens Klump, Doug Fils, Anusuriya Devaraju, Sarah Ramdeen, Jesse Robertson, Lesley Wyborn, and Kerstin Lehnert

Persistent identifiers are applied to an ever-increasing diversity of research objects, including data, software, samples, models, people, instruments, grants, and projects. There is a growing need to apply identifiers at a finer and finer granularity. The systems developed over two decades ago to manage identifiers and the metadata describing the identified objects struggle with this increase in scale. Communities working with physical samples have grappled with these challenges of the increasing volume, variety, and variability of identified objects for many years. To address this dual challenge, the IGSN 2040 project explored how metadata and catalogues for physical samples could be shared at the scale of billions of samples across an ever-growing variety of users and disciplines. This presentation outlines how identifiers and their describing metadata can be scaled to billions of objects. In addition, it analyses who the actors involved with this system are and what their requirements are. This analysis resulted in the definition of a minimum viable product and the design of an architecture that addresses the challenges of increasing volume and variety. The system is also easy to implement because it reuses commonly used Web components. Our solution is based on a Web architectural model that utilises Schema.org, JSON-LD and sitemaps. Applying these commonly used architectural patterns on the internet allows us not only to handle increasing volume, variety and variability but also enable better compliance with the FAIR Guiding Principles.

How to cite: Klump, J., Fils, D., Devaraju, A., Ramdeen, S., Robertson, J., Wyborn, L., and Lehnert, K.: Identifying and Describing Billions of Objects: an Architecture to Tackle the Challenges of Volume, Variety, and Variability, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-10223, https://doi.org/10.5194/egusphere-egu23-10223, 2023.

09:50–10:00

EGU23-2454

ECS

On-site presentation

The future of NASA Earth Science in the commercial cloud: Challenges and opportunities

Alexey Shiklomanov, Manil Maskey, Yoseline Angel, Aimee Barciauskas, Philip Brodrick, Brian Freitag, and Jonas Sølvsteen

NASA produces a large volume and variety of data products that are used every day to support research, decision making, and education. The widespread use of NASA’s Earth Science data is enabled by NASA’s Earth Science Data System (ESDS) program, which oversees the archiving and distribution of these data and invests in the development of new data systems and tools. However, NASA’s current approach to Earth Science data distribution — based on distributed institutional archives with individual on-premises high-performance computing capabilities — faces some significant challenges, including massive increases in data volume from upcoming missions, a greater need for transdisciplinary science that synthesizes many different kinds of observations, and a push to make science more open, inclusive, and accessible. To address these challenges, NASA is aggressively migrating its Earth Science data and related tools and services into the commercial cloud. Migration of data into the commercial cloud can significantly improve NASA’s existing data system capabilities by (1) providing more flexible options for storage and compute (including rapid, as-needed access to state-of-the-art capabilities); (2) by centralizing and standardizing data access, which gives all of NASA’s institutional data centers access to all of each other’s datasets; and (3) by facilitating “analysis-in-place”, whereby users can bring their own computational workflows and tools to the data rather than having to maintain their own copies of NASA datasets. However, migration to the commercial cloud also poses some significant challenges, including (1) managing costs under a “pay-as-you-go” model; (2) incompatibility with existing tools and data formats with object-based storage and network access; (3) vendor lock-in; (4) challenges with data access for workflows that mix on-premise and cloud computing; and (5) standardization for highly diverse data as is present in NASA’s data archive. I conclude with two examples of recent NASA activities showcasing capabilities enabled by the commercial cloud: An interactive analysis and development platform for analyzing airborne imaging spectroscopy data, and a new collection of tools and services for data discovery, analysis, publication, and data-driven storytelling (Visualization, Exploration, and Data Analysis, VEDA).

How to cite: Shiklomanov, A., Maskey, M., Angel, Y., Barciauskas, A., Brodrick, P., Freitag, B., and Sølvsteen, J.: The future of NASA Earth Science in the commercial cloud: Challenges and opportunities, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-2454, https://doi.org/10.5194/egusphere-egu23-2454, 2023.

10:00–10:10

EGU23-7052

On-site presentation

FAIRiCUBE: Enabling Gridded Data Analysis for All

Katharina Schleidt and Stefan Jetschny

Previously, collecting, storing, owning and, if necessary, digitizing data was vital for any data-driven application. Nowadays, we are swimming in data, whereby one could postulate that we are drowning. However, downloading vast data to local storage and subsequent in-house processing on dedicated hardware is inefficient and not in line with the big data processing philosophy. While the FAIR principles are fulfilled as the data is findable, accessible, and interoperable, the actual reuse of the data to gain new insights depends on the data user’s local capabilities. Scientists aware of the potentially available data and processing capabilities are still not able to easily leverage these resources as required to perform their work; while the analysis gap entailed by the information explosion is being increasingly highlighted, remediation lags.

The core objective of the FAIRiCUBE project is to enable players from beyond classic Earth Observation (EO) domains to provide, access, process, and share gridded data and algorithms in a FAIR and TRUSTable manner. To reach this objective, we are creating the FAIRiCUBE HUB, a crosscutting platform and framework for data ingestion, provision, analysis, processing, and dissemination, to unleash the potential of environmental, biodiversity and climate data through dedicated European data spaces.

In order to gain a better understanding of the various obstacles to leveraging available assets in regard to both data as well as analysis and processing modalities, several use cases have been defined addressing diverse aspects of European Green Deal (EGD) priority actions. Each of the use cases has a defined objective, approach, research question and data requirements.

The use cases selected to guide the creation of the FAIRiCUBE HUB are as follows:

Urban adaptation to climate change
Biodiversity and agriculture nexus
Biodiversity occurrence cubes
Drosophila landscape genomics
Spatial and temporal assessment of neighborhood building stock

Many of the issues encountered within the FAIRiCUBE project are formally considered solved. Catalogues are available detailing the available datasets, standards define how the datasets are to be structured and annotated with the relevant metainformation. A vast array of processing functionality has emerged that can be applied to such resources. However, while all this is considered state-of-the-art in the EO community, there is a subtle delta blocking access to wider communities that could make good use of the available resources pertaining to their own domains of work. These include, but are not limited to:

Identifying available data sources
Determining fitness for use
Interoperability of data with divergent spatiotemporal basis
Understanding access modalities
Scoping required resources
Providing non-gridded data holdings in a gridded manner

There is great potential in integrating the diverse gridded resources available from EO sources within wider research domains. However, at present, there are subtle barriers blocking this potential. Within FAIRiCUBE, these issues are being collected and evaluated, mitigation measures are being explored together with researchers not from traditional EO domains, with the goal of breaking down these barriers, and enabling powerful research and data analysis potential to a wide range of scientists.

How to cite: Schleidt, K. and Jetschny, S.: FAIRiCUBE: Enabling Gridded Data Analysis for All, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7052, https://doi.org/10.5194/egusphere-egu23-7052, 2023.

Posters on site: Thu, 27 Apr, 16:15–18:00 | Hall X4

Chairpersons: Paolo Mazzetti, Francesca Piatto

X4.140

EGU23-7074

ECS

An EOSC-enabled Data Space environment for the climate community

Fabrizio Antonio, Donatello Elia, Guillaume Levavasseur, Atef Ben Nasser, Paola Nassisi, Alessandro D'Anca, Alessandra Nuzzo, Sandro Fiore, Sylvie Joussaume, and Giovanni Aloisio

The exponential increase in data volumes and complexities is causing a radical change in the scientific discovery process in several domains, including climate science. This affects the different stages of the data lifecycle, thus posing significant data management challenges in terms of data archiving, access, analysis, visualization, and sharing. The data space concept can support scientists' workflow and simplify the process towards a more FAIR use of data.

In the context of the European Open Science Cloud (EOSC) initiative launched by the European Commission, the ENES Data Space (EDS) represents a domain-specific implementation of the data space concept. The service, developed in the frame of the EGI-ACE project, aims to provide an open, scalable, cloud-enabled data science environment for climate data analysis on top of the EOSC Compute Platform. It is accessible in the European Open Science Cloud (EOSC) through the EOSC Catalogue and Marketplace (https://marketplace.eosc-portal.eu/services/enes-data-space) and it also provides a web portal (https://enesdataspace.vm.fedcloud.eu) including information, tutorials and training materials on how to get started with its main features.

The EDS integrates into a single environment ready-to-use climate datasets, compute resources and tools, all made available through the Jupyter interface, with the aim of supporting the overall scientific data processing workflow. Specifically, the data store linked to the ENES Data Space provides access to a multi-terabyte set of variable-centric collections from large-scale global climate experiments. The data pool consists of a mirrored subset of CMIP (Coupled Model Intercomparison Project) datasets from the ESGF (Earth System Grid Federation) federated data archive, collected and kept synchronized with the remote copies by using the Synda tool developed within the scope of the IS-ENES3 H2020 project. Community-based, open source frameworks (e.g., Ophidia) and libraries from the Python ecosystem provide the capabilities for data access, analysis and visualisation. Results and experiment definitions (i.e., Jupyter Notebooks) can be easily shared among users promoting data sharing and application re-use towards a more Open Science approach.

An overview of the data space capabilities along with the key aspects in terms of data management will be presented in this work.

How to cite: Antonio, F., Elia, D., Levavasseur, G., Ben Nasser, A., Nassisi, P., D'Anca, A., Nuzzo, A., Fiore, S., Joussaume, S., and Aloisio, G.: An EOSC-enabled Data Space environment for the climate community, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7074, https://doi.org/10.5194/egusphere-egu23-7074, 2023.

X4.141

EGU23-8041

Coastal Digital Twins: building knowledge through numerical models and IT tools

Anabela Oliveira, André B. Fortunato, Gonçalo de Jesus, Marta Rodrigues, and Luís David

Digital Twins integrate continuously, in an interactive, two-way data connection, the real and the virtual assets. They provide a virtual representation of a physical asset enabled through data and models and can be used for multiple applications such as real-time forecast, system optimization, monitoring and controlling, and support enhanced decision making. These recent tools take advantage of the huge online volume of data streams provided by satellites, IoT sensing and many real time surveillance platforms, and the availability of powerful computational resources that made process-solving, high resolution models or AI-based models possible, to build high accuracy replicas of the real world.
In this paper, the adaptation of the concept of Digital Twins is extended from the ocean to the coastal zones, handling the high non-linear physics and the complexity of monitoring these regions, using the on-demand coastal forecast framework OPENCoastS (Oliveira et al., 2020; Oliveira et al., 2021) to build a user-centered data spaces where multiple services, from early-warning tools to collaboratory platforms, are customized to meet the users needs. Computational effort and data requirements for these services is high, integration of Coastal Digital Twins in federated computational infrastructures, such as European Open Science Cloud (EOSC) or INCD in Portugal, to guarantee the capacity to serve multiple users simultaneously.

This tool is demonstrated in the coastal area of Albufeira, located in the southern part of Portugal, in the scope of the SINERGEA innovation project. Coastal cities face growing challenges from flooding, sea water quality and energy sustainability, which increasingly require an intelligent, real-time management. The urban drainage infrastructures transport to the wastewater treatment plants all waters likely to pollute downstream beaches. Real-time tools are required to support the assessment and prediction of the quality of bathing waters, to assess the possible need to prohibit beach water usage. During heavy rainfall events, a decentralized management systems can also contribute to mitigate downstream flooding. This requires the operation of the entire system to be optimized depending on the specific environmental conditions, and the participation and access to all the information by the several stakeholders. This system integrates real-time information provided by different entities, including monitoring networks, infrastructure operation data and a forecasting framework. The forecasting system includes several models covering all relevant water compartments: atmospheric, rivers and streams, urban stormwater and wastewater infrastructure, and receiving coastal water bodies circulation and water quality predictions.

References

A. Oliveira, A.B. Fortunato, M. Rodrigues, A. Azevedo, J. Rogeiro, S. Bernardo, L. Lavaud, X. Bertin, A. Nahon, G. Jesus, M. Rocha, P. Lopes, 2021. Forecasting contrasting coastal and estuarine hydrodynamics with OPENCoastS, Environmental Modelling & Software, Volume 143,105132, ISSN 1364-8152, https://doi.org/10.1016/j.envsoft.2021.105132.

A. Oliveira, A.B. Fortunato, J. Rogeiro, J. Teixeira, A. Azevedo, L. Lavaud, X. Bertin, J. Gomes, M. David, J. Pina, M. Rodrigues, P. Lopes, 2019. OPENCoastS: An open-access service for the automatic generation of coastal forecast systems, Environmental Modelling & Software, Volume 124, 104585, ISSN 1364-8152, https://doi.org/10.1016/j.envsoft.2019.104585.

How to cite: Oliveira, A., B. Fortunato, A., de Jesus, G., Rodrigues, M., and David, L.: Coastal Digital Twins: building knowledge through numerical models and IT tools, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-8041, https://doi.org/10.5194/egusphere-egu23-8041, 2023.

X4.142

EGU23-13501

The brokering approach empowering the WMO data space for hydrology

Enrico Boldrini, Paolo Mazzetti, Fabrizio Papeschi, Roberto Roncella, Massimiliano Olivieri, Washington Otieno, Igor Chernov, Silvano Pecora, and Stefano Nativi

WMO is coordinating the efforts to build a data space for hydrology, called the WMO Hydrological Observing System (WHOS).
Hydrological datasets have intrinsic value and are worth the enormous human, technological and financial resources required to collect them over long periods of time. Their value is maximized when data is open, of quality, discoverable, accessible, interoperable, standardized, and addressing user needs, enabling various sector users to use and reuse the data. It is essential that hydrological data management and exchange is implemented effectively to maximize the benefits of data collection and optimize reuse.
WHOS provides a service-oriented framework that connects data providers to data consumers. It realizes a system of systems that provides registry, discovery, and access capabilities to hydrology data at different levels (local, basin, regional, global). In 2015, the World Meteorological Congress supported the full implementation of WHOS, which is currently publicly available at https://community.wmo.int/activity-areas/wmo-hydrological-observing-system-whos, along with information for both end users and data providers about how to use and join it.
End users (such as hydrologists, forecasters, decision makers, general public, academia) can discover, access, download and further process hydrological data available through WHOS portal by means of their preferred clients (web applications, tools and libraries).
Data providers (such as National Meteorological and Hydrological Services - NMHSs, river basin authorities, private companies, academia) can share their data through WHOS by publishing it online by means of machine-to-machine web services.
The brokering approach powered by the Discovery and Access Broker (DAB) technology enables the interoperability between data providers’ services and end users’ clients. A mediation layer implemented by the DAB brokering framework mediates between the different standard protocols and data models used by both providers and consumers to seamlessly enable the data flow from heterogeneous data providers to the clients of each end user.
In parallel, WHOS experts are working in constant collaboration with the data providers to support the implementation of the latest standards required by the international guidelines (e.g., WaterML2.0 and WIGOS Metadata Standard), optimize the data publication and improve the metadata and data quality.
The WHOS Distance Learning course has been successfully conducted; attenders from NMHSs were provided updated information and guidelines to optimize their hydrological data sharing. The course is currently being translated into Spanish to carry out it for Spanish speaking countries in 2023.
WHOS is a hydrological component of WMO Information System (WIS), which is currently in its pilot phase. WHOS and WIS Interoperability tests are currently being piloted and expected to end in 2023. The aim of this interoperability is to promote smooth data exchange between Hydrology community and the wider WMO community. Finally, hydrological data shared through WHOS will be accessible to general WIS users (all piloted programmes, including climate through OpenCDMS, and cryoshere) and at the same time WHOS users will make use of observations made available by WIS.

How to cite: Boldrini, E., Mazzetti, P., Papeschi, F., Roncella, R., Olivieri, M., Otieno, W., Chernov, I., Pecora, S., and Nativi, S.: The brokering approach empowering the WMO data space for hydrology, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13501, https://doi.org/10.5194/egusphere-egu23-13501, 2023.

X4.143

EGU23-8394

Spatio-Temporal Datacube Infrastructures as a Basis for Targeted Data Spaces

Peter Baumann

Data Spaces promise an innovative packaging of data and services into targeted one-stop shops of insight. A key ingredient for the fulfilment of the Data Space promise is easier, analysis-ready and fit-for-purpose access in particular to the Big Data which EO pixels and voxels constitute. Datacubes have proven to offer suitable service concepts and today are considered an acknowledge cornerstone.

In the GAIA-X EO Expert Group, a subgroup of the Geoinformation Working Group, one of the use cases investigated is the EarthServer federation. It bridges a seeming contradiction: a decentralized approach of independent data providers - with heterogeneous offerings, paid as well as free - versus
a single, common pool of datacubes where users do not need to know where data sit inorder to access, analyse, mix, and match them. Currently, a total of 140+ Petabyte is online available.

Membership in EarthServer is open and free, with a Charter being finalized ensuring transparent and democratic governance (one data provider - one vote). EarthServer thereby presents a key building block for the forthcoming Data Spaces: not only does it allow unifying data within a given Data Space, it also acts as a natural enabler for bridging and integrating different Data Spaces. This is amplified by the fact that the technology underlying EarthServer
is both the OGC datacube reference implementation and the INSPIRE Good Practice.

In our talk we present concept and practice of location-transparent datacube federations, exemplified by EarthServer, and its opportunities for future-directed Data Spaces.

How to cite: Baumann, P.: Spatio-Temporal Datacube Infrastructures as a Basis for Targeted Data Spaces, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-8394, https://doi.org/10.5194/egusphere-egu23-8394, 2023.

X4.144

EGU23-5038

EO4EU - AI-augmented ecosystem for Earth Observation data accessibility with Extended reality User Interfaces for Service and data exploitation

Vasileios Baousis, Stathes Hadjiefthymiades, Charalampos Andreou, Kakia Panagidh, and Armagan Karatosun

EO4EU is a European Commission-funded innovation project bringing forward the EO4EU Platform which will access and use of EO data easier for environmental, government, and even business forecasts and operations.

The EO4EU Platform, which will be accessible at www.eo4eu.eu, will link already existing major EO data sources such as GEOSS, INSPIRE, Copernicus, Galileo, DestinE among others and provide a number of tools and services to assist users to find and access the data they are interested in, as well as to analyse and visualise this data. The platform will leverage machine learning to support the handling of the characteristically large volume of EO data as well as a combination of Cloud computing infrastructure and pre-exascale high-performance computing to manage processing workloads.

Specific attention is also given to developing user-friendly interfaces for EO4EU allowing users to intuitively use EO data freely and easily, even with the use of extended reality.

EO4EU objectives are:

Holistic DataOps ecosystem to enhance access and usability of EO information.
A semantic-enhanced knowledge graph that augments the FAIRness of EO data and supports sophisticated data representation and dynamics.
A machine learning pipeline that enables the dynamic annotation of the various EO data sources.
Efficient, reliable and interoperable inter- and intra- data layer communications
Advance stakeholders’ knowledge capacity through informed decision-making and policy-making support.
A full range of use case scenarios addressing current data needs, capitalizing existing digital services and platforms, fostering their usability and practicality, and taking into account ethical aspects aiming at social impact maximization.

Technical and scientific innovation can be summarised as follows:

Improve compression rates for image quality and reduce data volumes.
Improve the quality of reconstructed compressed images, maintaining the same comparison rates
Facilitate the design of custom services with a minimized labelled data requirement
Learn robust and transferable representations of EO data
Publishing original trained models on EO data with all relevant assisting material to support reusability in a public repository.
Data fusion optimized execution in HPC and GPU environment
Better accuracy of data representation
Customizable visualization tools tailored to the needs of each use case
Dedicated graphs for end-users with various granularities, modalities, metrics and statistics to observe the overall trends in time, correlations, and cause-and-effect relationships through a responsive web-interfaced module.

In this presentation, the status of the project, the adopted architecture and the findings from our initial user surveys pertaining to EO data access and discovery will be analysed. Finally, the next steps of the project, the early access to the developed platform and the challenges and opportunities will be discussed.

How to cite: Baousis, V., Hadjiefthymiades, S., Andreou, C., Panagidh, K., and Karatosun, A.: EO4EU - AI-augmented ecosystem for Earth Observation data accessibility with Extended reality User Interfaces for Service and data exploitation, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-5038, https://doi.org/10.5194/egusphere-egu23-5038, 2023.

X4.145

EGU23-15907

GSSC Now - ESA's Thematic Exploitation Platform for Navigation Science Data

Vicente Navarro, Sara del Rio, Emilio Fraile, Luis Mendes, and Javier Ventura

Nowadays, the sheer amount of data collected from space-borne and ground-based sensors, is changing past approaches towards data processing and storage. In the Information Technology domain, the rapid growth of data generation rates, expected to produce 175 zettabytes worldwide by 2025, is changing approaches to data processing and storage dramatically. This landscape has led to a new golden age of Machine learning (ML), able to extract knowledge and discover patterns between input and output variables given the sheer volume of available training data.

In space, over 120 satellites from four Global Navigation Satellite Systems (GNSS), including Galileo, will provide, already this decade, continuous, worlwide signals in several frequencies. On ground, the professional market represented by thousands of permanent GNSS stations has been complemented by billions of mass-market receivers integrated in smartphones and Internet-of-Things (IoT) devices.

Along their travel down to Earth through the atmosphere, multiple sources alter the precisely modulated GNSS signals. As they pass through irregular plasma patches in the ionosphere, GNSS signals undergo delay and fading, formally known as 'scintillation'. Further down, they are modified by the amount of water vapor in the troposphere. These alterations, recorded by GNSS receivers as digital footprints in massive streams of data, represent a valuable resource for science, increasingly employed to study Earth’s atmosphere, oceans, and surface environments.

In order to realize the scientific potential of GNSS data, at the European Space Astronomy Centre (ESAC) near Madrid, the GNSS Science Support Centre (GSSC) led by ESA’s Navigation Science Office, hosts ESA’s data archive for scientific exploitation of GNSS data.

Analysis of Global Navigation Satellite Systems (GNSS) data has traditionally pivoted around the idea of datasets search and download from multiple repositories that act as data-hubs for different types of GNSS resources generated worldwide. In this work we introduce an innovative GNSS Thematic Exploitation Platform, GSSC Now, which expands a GNSS-centric data lake with novel capabilities for discovery and high-performance-computing.

We explain how this platform performs GNSS data fusion from multiple data sources, enabling the deployment of Machine Learning (ML) processors to unleash synergies across science domains.

Finally, through the presentation of several GNSS science use cases, we discuss the implementation of GSSC Now’s cyber-infrastructure, current status, and future plans to accelerate the development of innovative applications and citizen-science.

How to cite: Navarro, V., del Rio, S., Fraile, E., Mendes, L., and Ventura, J.: GSSC Now - ESA's Thematic Exploitation Platform for Navigation Science Data, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-15907, https://doi.org/10.5194/egusphere-egu23-15907, 2023.

X4.146

EGU23-5370

Harmonizing Agricultural Long-Term Experiments data through A Geospatial Infrastructure

(withdrawn)

Cenk Dönmez, Carsten Hoffmann, Nikolai Svoboda, Wilfried Hierold, and Xenia Specka

X4.147

EGU23-7786

Constructing a Searchable Knowledge Repository for FAIR Climate Data

Mark Roantree, Branislava Lalić, Stevan Savić, Dragan Milošević, and Michael Scriney

The development of a knowledge repository for climate science data is a multidisciplinary effort between the domain experts (climate scientists), data engineers who's skills include design and building a knowledge repository, and machine learning researchers who provide expertise on data preparation tasks such as gap filling and advise on different machine learning models that can exploit this data.

One of the main goals of the CA20108 cost action is to develop a knowledge portal that is fully compliant with the FAIR principles for scientific data management. In the first year, a bespoke knowledge portal was developed to capture metadata for FAIR datasets. Its purpose was to provide detailed metadata descriptions for shareable micro-meteorological (micromet) data using the WMO standard. While storing Network, Site and Sensor metadata locally, the system passes the actual data to Zenodo, receives back the DOI and thus, creates a permanent link between the Knowledge Portal and the storage platform Zenodo. While the user searches the Knowledge portal (metadata), results provide both detailed descriptions and links to data on the Zenodo platform. Our adherence to FAIR principles are documented below:

Findable. Machine-readable metadata is required for automatic discovery of datasets and services. A metadata description is supplied by the data owners for all micro-meteorological data shared on the system which subsequently drives the search engine, using keywords or network, site and sensor search terms.
Accessible. When suitable datasets have been identified, access details should be provided. Assuming data is freely accessible, Zenodo DOIs and links are provided for direct data access.
Interoperable. Data interoperability means the ability to share and integrate data from different users and sources. This can only happen if a standard (meta)data model is employed to describe data, an important concept which generally requires data engineering skills to deliver. In the knowledge portal presented here, the WMO guide provides the design and structure for metadata.
Reusable. To truly deliver reusability, metadata should be expressed in as detailed a manner as possible. In this way, data can be replicated and integrated according to different scientific requirements. While the Knowledge Portal facilitates very detailed metadata descriptions, not all metadata is compulsory as it was accepted that in some cases, the overhead in providing this information can be very costly.

Simple analytics are in place to monitor the volume and size of networks in the system. Current metrics include: network count; average size of network (number of sites); dates and size of datasets per network/site; numbers and types of sensors in each site, etc. The current Portal is in Beta version meaning that the system is currently functional but open only to members of the Cost Action who are nominated testers. This status is due to change in Q1/2023 when access will be open to the wider climate science community.

Current plans include new Tools and Services to assess the quality of data, including the level of gaps and in some cases, machine learning tools will be provided to attempt gap filling for datasets meeting certain requirements.

How to cite: Roantree, M., Lalić, B., Savić, S., Milošević, D., and Scriney, M.: Constructing a Searchable Knowledge Repository for FAIR Climate Data, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7786, https://doi.org/10.5194/egusphere-egu23-7786, 2023.

X4.148

EGU23-13331

Data challenges and opportunities from nascent kilometre-scale simulations

Valentine Anantharaj, Samuel Hatfield, Inna Polichtchouk, and Nils Wedi

Computational experiments using earth system models, approaching kilometre-scale (k-scale) horizontal resolutions, are becoming increasingly common across modeling centers. Recent advances in high performance computing systems, along with efficient parallel algorithms that are capable of leveraging accelerator hardware, have made k-scale models affordable for specific purposes. Surrogate models developed using machine learning methods also promise to further reduce the computational cost while enhancing model fidelity. The “avalanche of data from k-scale models” (Slingo et al., 2022) has also posed new challenges in processing, managing, and provisioning data to the broader user community.

During recent years, a joint effort between the European Center for Medium-Range Weather Forecasts (ECMWF) and the Oak Ridge National Laboratory (ORNL) has succeeded in simulating “a baseline for weather and climate simulations at 1-km resolution,” (Wedi et al., 2020) using the Summit supercomputer at the Oak Ridge Leadership Facility (OLCF). The ECMWF hydrostatic Integrated Forecasting System (IFS), with explicit deep convection on an average grid spacing of 1.4 km, was used to perform a set of experimental nature runs (XNR) spanning two seasons corresponding to a northern hemispheric winter (NDJF), and August - October (ASO) months corresponding the tropical cyclone season in the North Atlantic.

We developed a bespoke workflow to process and archive over 2 PB of data from the 1-km XNR simulations (XNR1K). Further, we have also facilitated access to the XNR1K data via an open science data hackathon. The hackathon projects also have access to a data analytics cluster to further process and analyze the data. The OLCF data center supports high speed data sharing via globus data transfer mechanism. External users are using the XNR1K data for a number of ongoing research projects, including observing system simulation experiments, designing satellite instruments for severe storms, developing surrogate models, understanding atmospheric processes, and generating high-fidelity visualizations.

During our presentation we will share our challenges, experiences and lessons learned related to the processing, provisioning and management of the large volume of data, and the stakeholder engagement and logistics of the open science data hackathon.

Slingo, J., Bates, P., Bauer, P. et al. (2022) Ambitious partnership needed for reliable climate prediction. Nat. Clim. Chang. https://doi.org/10.1038/s41558-022-01384-8

Wedi, N., Polichtchouk, I., et al. (2020) A Baseline for Global Weather and Climate Simulations at 1 km Resolution, JAMES. https://doi.org/10.1029/2020MS002192

How to cite: Anantharaj, V., Hatfield, S., Polichtchouk, I., and Wedi, N.: Data challenges and opportunities from nascent kilometre-scale simulations, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13331, https://doi.org/10.5194/egusphere-egu23-13331, 2023.

X4.149

EGU23-16662

NLP-based Cognitive Search Engine for the GEOSS Platform data

Yannis Kopsinis, Zisis Flokas, Pantelis Mitropoulos, Christos Petrou, Thodoris Siozos, and Giorgos Siokas

Effectively querying unstructured text information in large databases is a highly demanding task. Conventional approaches, such as an exact match or fuzzy search, return valid and thorough results only when the user query adequately matches the wording within the text or the query is included in keyword-tag lists. The GEOSS portal relies on conventional search tools for data and services exploration and retrieval, limiting its capacity. This challenge, recent advances in Artificial Intelligence (AI)-based Natural Language Processing (NLP) try to surpass with enhanced information retrieval and cognitive search. Rather than relying on exact or fuzzy text matching, it detects documents that semantically and conceptually are close enough to the search query.

The EIFFEL EU-funded project aims to reveal the role of GEOSS as the default Digital Portal for building Climate Change (CC) adaption and mitigation applications and offer the Earth Observation community the ground-breaking capacity of exploiting existing GEOSS datasets. To this end, as a lead technological partner of the EIFFEL consortium, LIBRA AI Technologies, designs and develops an end-to-end advanced cognitive search system dedicated to the GEOSS Portal and exceeds current challenges.

The proposed system comprises an AI language model optimized for CC-related text and queries, a framework for collecting a sizeable CC-specific corpus used for the language model specialization, a back-end that adopts modern database technologies with advanced capabilities for embedding-based cognitive search matching, and an open Application Programming Interface (API). The cognitive search component is the backbone of the EIFFEL visualisation engine, which will allow any GEOSS user, as well as the EIFFEL Climate Change application developing teams, to detect GEOSS data objects and services that are of interest for their research and application but could not effectively get accessed with the available GEOSS Portal search engine.

The work described in this abstract is part of the EIFFEL European project. The EIFFEL project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101003518. We thank all partners for their valuable contributions.

How to cite: Kopsinis, Y., Flokas, Z., Mitropoulos, P., Petrou, C., Siozos, T., and Siokas, G.: NLP-based Cognitive Search Engine for the GEOSS Platform data, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-16662, https://doi.org/10.5194/egusphere-egu23-16662, 2023.

X4.150

EGU23-16795

Harmonizing Diverse Geo-Spatiotemporal Data for Event Analytics

Michael Rilee, Kwo-Sen Kuo, Michael Bauer, Niklas Griessbaum, and Dai-Hai Ton-That

Parallelization is the only means by which it is possible to process large amounts of diverse data on reasonably short time scales. However, while parallelization is necessary for performant and scalable BigData analysis, it is insufficient. We observe that we most often require spatiotemporal coincidence (i.e., at the same space and time) in geo-spatiotemporal analyses that integrate diverse datasets. Therefore, for parallelization, these large volumes of diverse data must be partitioned and distributed to cluster nodes with spatiotemporal colocation to avoid data movement among the nodes necessitated by misalignment. Such data movement devastates scalability.

The prevalent data structure for most geospatial data, e.g., simulation model output and remote sensing data products, is the (Raster) Array, with accompanying geolocation arrays, i.e., longitude-latitude, of the same shape and size establishing, through the array index, a correspondence between a data array element and its geolocation. However, this array-index-to-geolocation relation is ever-changing from dataset to dataset and even within a dataset (e.g., swath data from LEO satellites). Consequently, it is impossible to use array indices for partitioning and distribution to achieve consistent spatiotemporal colocation.

A simplistic way to address this diversity is through homogenization, i.e., resampling (aka re-gridding) all data involved onto the same grid to establish a fixed array-index-to-geolocation relation. Indeed, this crude approach has become the existing common practice. However, different applications have different requirements for resampling, influencing the choice of the interpolation algorithm (e.g., linear, spline, flux-conserved, etc.). Regardless of which algorithm is applied, large amounts of modified and redundant data are created, which not only exacerbates the BigData Volume challenge but also obfuscates the processing and data provenance.

SpatioTemporal Adaptive-Resolution Encoding, STARE, was invented to address the scalability challenge through data harmonization, allowing efficient spatiotemporal colocation of the “native data” without re-gridding. STARE (1) ties its indices directly to space-time coordinate locations, unlike raster array indices used in the current practice which must go indirectly through the floating-point longitude-latitude arrays to reference geolocation, and (2) embeds neighborhood information in the indices to enable highly performant numerical operations for “joins” such as intersect, union, difference, and complement. These two properties together give STARE its exceptional data-harmonizing power because, when given a pair of STARE indices are associated with a data element, we know not only its spatiotemporal location but also its neighborhood, i.e., the spatiotemporal volume (2D in space plus 1D in time) that the data element represents.

These capabilities of STARE-based technologies allow not only the harmonization of diverse datasets but also sophisticated event analytics. In this presentation, we will discuss the application of STARE to the integrative analysis of Extra-Tropical Cyclones and precipitation events, wherein we use STARE to identify and catalog co-occurrences of these two kinds of events so that we may study their relationships using diverse data of the best spatiotemporal resolution available.

How to cite: Rilee, M., Kuo, K.-S., Bauer, M., Griessbaum, N., and Ton-That, D.-H.: Harmonizing Diverse Geo-Spatiotemporal Data for Event Analytics, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-16795, https://doi.org/10.5194/egusphere-egu23-16795, 2023.