EGU23-14061, updated on 26 Feb 2023
https://doi.org/10.5194/egusphere-egu23-14061
EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

The Earth Observation Training Data Lab (EOTDL) - addressing training data related needs in the Earth Observation community.

Patrick Griffiths1, Juan Pedro2, Gunnar Brandt3, Stephan Meissl4, Grega Milcinski5, and Laura Moreno2
  • 1ESA, Earth Observation, Frascati (Rome), Italy (patrick.griffiths@esa.int)
  • 2EarthPulse, Sant Joan de la Salle 42, 08022 Barcelona, Spain
  • 3Brockmann Consult GmbH, Chrysanderstr. 1, 21029 Hamburg, Germany
  • 4EOX IT Services GmbH, Thurngasse 8/4, 1090 Wien, Austria
  • 5Sinergise, Cvetkova ulica 29 SI-1000 Ljubljana, Slovenia

The availability of large training datasets (TDS) has enabled much of the innovative use of Machine Learning (ML) and Artificial Intelligence (AI) in fields such as computer vision and language processing. In Earth Observation (EO) and geospatial science and applications, the availability of TDS has generally been limited, and there are a number of specific geospatial challenges to consider (e.g. spatial reference systems and spatial, spectral, radiometric and temporal resolutions). Creating TDS for EO applications commonly involves labor-intensive processes, and the willingness to share such datasets has been very limited. While the current open accessibility of EO datasets is unprecedented, the availability of training and ground truth datasets has not improved much in recent years, which limits the innovative impact that new ML/AI methodologies could have in the EO domain. Beyond general availability and accessibility, further challenges need to be addressed: making TDS interoperable and findable, and lowering the barriers for non-geospatial experts.

In response to these challenges, ESA has initiated the development of the Earth Observation Training Data Lab (EOTDL). EOTDL is being developed on top of federated European cloud infrastructure and aims to address the EO community's requirements for working with TDS in EO workflows, adopting FAIR data principles and following open science best practices.

The specific capabilities that EOTDL will support include:

  • Repository and Curation: host, import and maintain training datasets, ground truth data, pretrained models and benchmarks, providing versioning, tracking and provenance.
  • Tooling: provide a set of integrated open-source tools compatible with the major ML/AI frameworks to create, analyze and optimize TDS and to support data ingestion, model training and inference operations.
  • Feature engineering: link with the main EO data archives and EO analytics platforms to support feature engineering and large-scale inference.
  • Quality assurance: embed QA throughout the offered capabilities, also taking advantage of automated deterministic checks and defined levels of TDS maturity.
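As an illustration of what "automated deterministic checks" on a TDS could look like, the following is a minimal sketch and not EOTDL's actual QA logic. The `Sample` record, the function names and the three rules (valid bounding box, known label, unique identifier) are all assumptions made for this example; coordinates are assumed to be in degrees (EPSG:4326).

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical training sample record (not an EOTDL data structure)."""
    sample_id: str
    bbox: tuple  # (west, south, east, north), assumed to be degrees in EPSG:4326
    label: str


def check_sample(sample, allowed_labels):
    """Return a list of problems for one sample; an empty list means it passed."""
    problems = []
    west, south, east, north = sample.bbox
    # Simplification: boxes crossing the antimeridian (west > east) are rejected.
    if not (-180 <= west < east <= 180):
        problems.append("longitude bounds invalid")
    if not (-90 <= south < north <= 90):
        problems.append("latitude bounds invalid")
    if sample.label not in allowed_labels:
        problems.append(f"unknown label: {sample.label}")
    return problems


def check_dataset(samples, allowed_labels):
    """Run all checks and return {sample_id: [problems]} for failing samples only."""
    report = {}
    seen_ids = set()
    for sample in samples:
        problems = check_sample(sample, allowed_labels)
        if sample.sample_id in seen_ids:
            problems.append("duplicate sample id")
        seen_ids.add(sample.sample_id)
        if problems:
            report.setdefault(sample.sample_id, []).extend(problems)
    return report


samples = [
    Sample("a", (2.1, 41.3, 2.2, 41.4), "water"),
    Sample("b", (2.2, 41.3, 2.1, 41.4), "forest"),  # west > east
    Sample("a", (0.0, 0.0, 1.0, 1.0), "lava"),      # reused id, unknown label
]
report = check_dataset(samples, allowed_labels={"water", "forest"})
print(report)
```

Because every rule is a pure function of the sample, the same input always yields the same report, which is what makes such checks suitable for automated gating of dataset contributions.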

To achieve these goals, EOTDL is building on proven technologies, such as STAC (SpatioTemporal Asset Catalog) to support data cataloguing and discoverability, openEO and SentinelHub APIs for EO data access and feature engineering, GeoDB for vector geometry and attribute handling, and EoxHub to support interactive tooling. The EOTDL functionality will be exposed via web-based GUIs, Python libraries and command line interfaces.
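To make the role of STAC in cataloguing more concrete, the sketch below builds a minimal STAC Item (a GeoJSON Feature) for a single training sample using only the Python standard library. It illustrates the general STAC Item structure and is not taken from the EOTDL code base: the helper `make_stac_item`, the identifier, the coordinates and the `example.com` URLs are placeholders invented for this example.

```python
import json
from datetime import datetime, timezone


def make_stac_item(item_id, bbox, acquired, image_href, label_href):
    """Build a minimal STAC Item as a plain dict (hypothetical helper)."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": [west, south, east, north],
        # Footprint as a closed polygon ring derived from the bounding box.
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        # STAC expects an RFC 3339 timestamp; UTC is written with a "Z" suffix.
        "properties": {
            "datetime": acquired.isoformat().replace("+00:00", "Z"),
        },
        "links": [],
        # One asset for the imagery and one for the labels (placeholder roles).
        "assets": {
            "image": {
                "href": image_href,
                "type": "image/tiff; application=geotiff",
            },
            "labels": {
                "href": label_href,
                "type": "application/geo+json",
            },
        },
    }


item = make_stac_item(
    "sample-0001",
    (2.10, 41.35, 2.20, 41.45),
    datetime(2023, 1, 15, 10, 30, tzinfo=timezone.utc),
    "https://example.com/sample-0001/image.tif",
    "https://example.com/sample-0001/labels.geojson",
)
print(json.dumps(item, indent=2))
```

Keeping image and label files as separate assets of one Item lets a catalogue client discover both through a single spatial and temporal query.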

A further central objective is to incentivize community engagement in order to support quality assurance and encourage the contribution of datasets. To this end, award mechanisms are being established. The initial data population consists of around 100 datasets, while intuitive data ingestion pipelines allow for continuous community contributions. Three defined product maturity levels are linked to QA procedures and support the trustworthiness of the data population. The development is coordinated with Radiant ML Hub to seek synergies rather than duplicate the offered capabilities.

This presentation will showcase the current development status of EOTDL and discuss in detail some key aspects, such as data curation with STAC and the adopted quality assurance and feature engineering approaches. A set of use cases that establish new TDS creation tools and result in large-scale datasets will be presented as well.

How to cite: Griffiths, P., Pedro, J., Brandt, G., Meissl, S., Milcinski, G., and Moreno, L.: The Earth Observation Training Data Lab (EOTDL) - addressing training data related needs in the Earth Observation community, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-14061, https://doi.org/10.5194/egusphere-egu23-14061, 2023.