ESSI1.8 | Challenges and Opportunities for Findable, Accessible, Interoperable and Re-usable Training Dataset
EDI | PICO
Convener: Sara Saeedi | Co-conveners: Samantha Lavender, Caitlin Adams
Wed, 26 Apr, 16:15–18:00 (CEST) | PICO spot 2
Across the many fields of artificial intelligence (AI), the availability of high-quality training datasets is an active area of research with great potential to enable accurate predictions and reliable task performance. Training data is the initial dataset used to train machine learning algorithms and models; it is also known as the training dataset (TDS), learning set, or training set. The goals of this session are:
1) to discuss cutting-edge topics in machine learning training data for the geospatial community;
2) to describe the spatial, temporal, and thematic representativeness of TDS and their uncertainties;
3) to focus on the sharing and reusability of TDS to increase their adoption in geospatial analysis.

This session will focus on the following topics around training datasets:
-How can a training dataset be described to enable efficient re-use in ML/AI applications?
-What are the main characteristics of a training dataset, and what additional information must be provided to sufficiently understand the privacy, nature, and usability of the dataset?
-Exploring the effect of training data accuracy, measurement uncertainty, the labelling procedure used to generate the training data, the original data used to create labels, and external classification schemes for label semantics, e.g. ontologies or vocabularies;
-What metadata is required, recommended, or optional?
-How can the quality of a TDS be expressed? Is it possible to auto-generate quality indicators?
-Evaluating the effect of training data size, spatial resolution and structure, temporal resolution and currency, the spectral resolution of imagery used for annotation, and annotation accuracy;
-Methods for documenting, storing, evaluating, publishing, and sharing training datasets;
-Transfer learning and the impact of combining various training datasets;
-Open standards and open-source training datasets;
-How to place FAIR (findable, accessible, interoperable, and reusable) data principles at the heart of future TDS standardization.

PICO: Wed, 26 Apr | PICO spot 2

Chairpersons: Samantha Lavender, Sara Saeedi, Nils Hempelmann
16:15–16:20
16:20–16:30 | PICO2.1 | EGU23-14061 | solicited | On-site presentation
The Earth Observation Training Data Lab (EOTDL) - addressing training data related needs in the Earth Observation community
Patrick Griffiths, Juan Pedro, Gunnar Brandt, Stephan Meissl, Grega Milcinski, and Laura Moreno

The availability of large training datasets (TDS) has enabled much of the innovative use of Machine Learning (ML) and Artificial Intelligence (AI) in fields such as computer vision or language processing. In Earth Observation (EO) and geospatial science and applications, the availability of TDS has generally been limited, and there are a number of specific geospatial challenges to consider (e.g. spatial reference systems and spatial/spectral/radiometric/temporal resolutions). Creating TDS for EO applications commonly involves labor-intensive processes, and the willingness to share such datasets has been very limited. While the current open accessibility of EO datasets is unprecedented, the availability of training and ground truth datasets has not improved much in recent years, and this is limiting the potential innovative impact that new ML/AI methodologies could have in the EO domain. Beyond general availability and accessibility, further challenges need to be addressed in making TDS interoperable and findable and in lowering the barriers for non-geospatial experts.

In response to these challenges, ESA has initiated the development of the Earth Observation Training Data Lab (EOTDL). EOTDL is being developed on top of federated European cloud infrastructure and aims to address the EO community's requirements for working with TDS in EO workflows, adopting FAIR data principles and following open science best practices.

The specific capabilities that EOTDL will support include:

  • Repository and curation: host, import, and maintain training datasets, ground truth data, pretrained models, and benchmarks, providing versioning, tracking, and provenance.
  • Tooling: provide a set of integrated open-source tools compatible with the major ML/AI frameworks to create, analyze, and optimize TDS and to support data ingestion, model training, and inference operations.
  • Feature engineering: link with the main EO data archives and EO analytics platforms to support feature engineering and large-scale inference.
  • Quality assurance: embed QA throughout the offered capabilities, taking advantage of automated deterministic checks and defined levels of TDS maturity.

To achieve these goals, EOTDL is building on proven technologies such as STAC (SpatioTemporal Asset Catalog) to support data cataloguing and discoverability, the openEO and SentinelHub APIs for EO data access and feature engineering, GeoDB for vector geometry and attribute handling, and EoxHub to support interactive tooling. The EOTDL functionality will be exposed via web-based GUIs, Python libraries, and command line interfaces.
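As a minimal sketch of the STAC-based cataloguing approach (using pystac; this is not the EOTDL API, and the dataset id, file names, and coordinates are hypothetical), one training chip and its label mask could be registered as a single STAC item:

```python
# Sketch: catalogue a training dataset as a STAC Collection with pystac,
# so samples become discoverable via standard STAC tooling.
from datetime import datetime
import pystac

collection = pystac.Collection(
    id="example-tds",  # hypothetical dataset id
    description="Labelled Sentinel-2 chips for land-cover segmentation",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent([[datetime(2020, 1, 1), None]]),
    ),
    license="CC-BY-4.0",
)

item = pystac.Item(
    id="sample-0001",
    geometry={"type": "Polygon", "coordinates": [[[16.3, 48.2], [16.4, 48.2],
                                                  [16.4, 48.3], [16.3, 48.3],
                                                  [16.3, 48.2]]]},
    bbox=[16.3, 48.2, 16.4, 48.3],
    datetime=datetime(2021, 6, 1),
    properties={},
)
# Pair the image chip with its label mask as two assets of one item.
item.add_asset("image", pystac.Asset(href="s2_chip_0001.tif", media_type="image/tiff"))
item.add_asset("labels", pystac.Asset(href="labels_0001.tif", media_type="image/tiff"))

collection.add_item(item)
collection.normalize_hrefs("./example-tds")
collection.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)
```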

A central objective is also the incentivization of community engagement to support quality assurance and encourage the contribution of datasets; for this, award mechanisms are being established. The initial data population consists of around 100 datasets, while intuitive data ingestion pipelines allow for continuous community contributions. Three defined product maturity levels are linked to QA procedures and support the trustworthiness of the data population. The development is coordinated with Radiant MLHub to seek synergies rather than duplicate capabilities.

This presentation will showcase the current development status of EOTDL and discuss in detail some key aspects, such as data curation with STAC and the adopted quality assurance and feature engineering approaches. A set of use cases that establish new TDS creation tools and result in large-scale datasets will also be presented.

How to cite: Griffiths, P., Pedro, J., Brandt, G., Meissl, S., Milcinski, G., and Moreno, L.: The Earth Observation Training Data Lab (EOTDL) - addressing training data related needs in the Earth Observation community., EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-14061, https://doi.org/10.5194/egusphere-egu23-14061, 2023.

16:30–16:32 | PICO2.2 | EGU23-3493 | On-site presentation
OGC Testbed-18 Machine Learning Training Datasets Task: Application of standards to Machine Learning training datasets
Samantha Lavender, Caitlin Adams, Ivana Ivánová, and Kate Williams

Training datasets are a crucial component of any machine learning approach, with significant human effort spent creating and curating these for specific applications. However, a historical absence of standards has resulted in inconsistent and heterogeneous training datasets with limited discoverability and interoperability. Therefore, there is a need for best practices and guidelines for generating, structuring, describing, and curating training datasets.

The Open Geospatial Consortium (OGC) Testbed-18 initiative covered several topics related to geospatial data, focussing on issues around cataloguing and interoperability. Within Testbed-18, the Machine Learning Training Datasets task aimed to develop a foundation for future standardization of training datasets for Earth observation applications.

For this task, members from Pixalytics, FrontierSI, and Curtin University authored an Engineering Report that reviewed:
  • Examples of how training datasets have been used in Earth observation applications
  • The current best-practice methods for documenting training datasets
  • The various requirements for training dataset metadata
  • How the Findability, Accessibility, Interoperability, and Reuse (FAIR) principles apply to training datasets

The Engineering Report provides a foundation that OGC can leverage in creating a future standard for machine learning training data for Earth observation applications. It also offers a useful overview of the state of work and key considerations for anyone wishing to improve how they document their training datasets.

In our presentation, we discuss the key findings from the Engineering Report, including key metadata identified from Earth observation use cases, the current state of the art, thoughts on cataloguing and describing training data quality, and how the FAIR principles apply to training data. 

How to cite: Lavender, S., Adams, C., Ivánová, I., and Williams, K.: OGC Testbed-18 Machine Learning Training Datasets Task: Application of standards to Machine Learning training datasets, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3493, https://doi.org/10.5194/egusphere-egu23-3493, 2023.

16:32–16:34 | PICO2.3 | EGU23-16998 | Virtual presentation
The OGC Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Standard
Peng Yue, Boyi Shangguan, and Danielle Ziebelin

The development of Artificial Intelligence (AI), especially Machine Learning (ML) technology, has injected new vitality into the geospatial domain. Training Data (TD) play a fundamental role in geospatial AI/ML: they are key items for training, validating, and testing AI/ML models. At present, open-access Training Datasets (TDS) are usually packaged into public or personal file repositories without a standardized method to express their metadata and data content, making them difficult to find, access, interoperate with, and reuse.

Therefore, based on the Open Geospatial Consortium (OGC) standards baseline, the OGC Training Data Markup Language for AI (TrainingDML-AI) Standard Working Group (SWG) set out to develop a TD model and encoding methods to exchange and retrieve TD in the Web environment. The scope includes: how TD are prepared, how to specify the different metadata used for different AI/ML tasks, and how to differentiate the high-level TD information model from extended information models specific to various AI/ML applications. This contribution describes the latest progress and status of the standard development.

The TrainingDML-AI conceptual model includes the most relevant entities of TD, from the dataset level down to individual training samples and labels, and specifies how TD should be decomposed and classified. The core concepts include: AI_TrainingDataset, which represents a collection of training samples; AI_TrainingData, an individual training sample in a TDS; AI_Task, which identifies what task the TDS is used for; AI_Label, which represents the label semantics for TD; AI_Labeling, which provides the provenance for the TD; AI_TDChangeset, which records TD changes between two TDS versions; and DataQuality, which can be associated with the TDS to document its quality.

The TrainingDML-AI content model focuses on implementations, with basic attributes defined for off-the-shelf deployment. Concepts related to EO AI/ML applications are defined as additional elements. Six key components are highlighted (a hypothetical instance is sketched after this list):

  • Training Dataset/Data: AI_AbstractTrainingDataset represents the TDS, while each training sample is represented as AI_AbstractTrainingData. AI_EOTrainingDataset and AI_EOTrainingData are defined to convey attributes specific to the EO domain.
  • AI_EOTask extends AI_AbstractTask to represent specific AI/ML tasks in the EO domain. The task type can refer to a particular type defined by an external category.
  • Labels for each individual training sample can be represented using features, coverages, or semantic classes. AI_AbstractLabel is extended into AI_ObjectLabel, AI_PixelLabel, and AI_SceneLabel, respectively.
  • AI_Labeling records basic provenance information on how the TDS was created, including the labeler and labeling procedure, which can be mapped to the agent and activity, respectively, in W3C PROV.
  • DataQuality and QualityElements defined in ISO 19157-1 are used to align with existing efforts on geographic data quality.
  • Change procedures of the TDS are documented in the AI_TDChangeset, which is composed of changed training samples at the collection level.
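To make the model concrete, the following hypothetical instance, written as a plain Python dictionary, mirrors the entities above; it is not the normative TrainingDML-AI encoding (pyTDML provides the reference implementation), and all identifiers and file names are invented:

```python
# Sketch: a training dataset instance mirroring the TrainingDML-AI
# conceptual entities (illustrative only, not the normative encoding).
import json

training_dataset = {
    "type": "AI_EOTrainingDataset",
    "id": "landcover_tds_v1",            # hypothetical identifier
    "tasks": [{
        "type": "AI_EOTask",
        "taskType": "semantic segmentation",
    }],
    "data": [{
        "type": "AI_EOTrainingData",
        "id": "sample_0001",
        "dataURL": ["s2_chip_0001.tif"],
        "labels": [{
            "type": "AI_PixelLabel",     # per-pixel (coverage) labels
            "imageURL": ["labels_0001.tif"],
        }],
    }],
    "labeling": [{
        "type": "AI_Labeling",
        # Labeler/procedure map to the W3C PROV agent/activity.
        "labelers": [{"name": "expert annotator"}],
    }],
}

print(json.dumps(training_dataset, indent=2))
```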

Finally, use case scenarios and best practices are provided to illustrate the intended use and benefits of TrainingDML-AI for EO AI/ML applications. Five different tasks are covered in total: scene classification, object detection, semantic segmentation, change detection, and 3D model reconstruction. Software implementations, including pyTDML and LuojiaSet, are also presented.

How to cite: Yue, P., Shangguan, B., and Ziebelin, D.: The OGC Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Standard, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-16998, https://doi.org/10.5194/egusphere-egu23-16998, 2023.

16:34–16:36 | PICO2.4 | EGU23-5394 | ECS | On-site presentation
AwesomeGeodataTable - Towards a community-maintained searchable table for data sets easily usable as predictors for spatial machine learning
Maximilian Nölscher, Anne-Karin Cooke, Sandra Willkommen, Mariana Gomez, and Stefan Broda

In the field of spatial machine learning, access to high-quality data sets is a crucial factor in the success of any analysis or modeling project, especially in subsurface hydrology. However, finding and utilizing such data sets can be a challenging and time-consuming process. This is where AwesomeGeodataTable comes in. AwesomeGeodataTable aims to establish a community-maintained searchable table of data sets that are easily usable as predictors for spatial machine learning, starting with a focus on subsurface hydrology. With its user-friendly interface and currently small but growing number of data sets, AwesomeGeodataTable will make it easier for researchers and practitioners to find and use the data they need for their work. It takes the usability of existing data set collections to the next level by adding features for filtering and searching meta-information on data sets. This talk will introduce attendees to the AwesomeGeodataTable project, its goals and features, and how they can get involved in maintaining and extending its database and expanding its features and user experience. Overall, AwesomeGeodataTable is a valuable resource for anyone working in the field of spatial machine learning, and we hope to see it become a widely used and respected resource in the community.

How to cite: Nölscher, M., Cooke, A.-K., Willkommen, S., Gomez, M., and Broda, S.: AwesomeGeodataTable - Towards a community-maintained searchable table for data sets easily usable as predictors for spatial machine learning, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-5394, https://doi.org/10.5194/egusphere-egu23-5394, 2023.

16:36–16:38 | PICO2.5 | EGU23-10594 | Virtual presentation
Automatic labeling of the Training Dataset for Individual and Group Activities Detection (withdrawn)
Sepehr Honarparvar, Mahnoush Mohammadi Jahromi, Sara Saeedi, and Steve Liang

16:38–16:40 | PICO2.6 | EGU23-3254 | ECS | On-site presentation
Construction of Interactive Websites for Remote Sensing Datasets
Kai Norman Clasen and Begüm Demir

As a result of advancements in satellite technology, archives of remote sensing (RS) images are continuously growing, providing a valuable source of information for monitoring the Earth's surface. Researchers construct well-designed and ready-to-use datasets from the plethora of RS images for the broader community to make it easier to develop and compare novel algorithms, models, and architectures to further deepen our understanding of our planet from space. However, the descriptions of these datasets are often published in scientific papers as PDF files with several limitations:

  • The target audience is typically domain experts familiar with scientific jargon;
  • The work is required to adhere to a specific page limit;
  • Once the document is published, it is difficult to update sections or to centralize discussions around it. 

To overcome these issues, here we introduce the concept of interactive dataset websites that aim to make a dataset, and the research based on it, more accessible. With visual and interactive examples, users can see exactly how the data is structured and how it can be used in different contexts. For example, when working with RS data, it is beneficial to get a quick overview of the geographical distribution. By providing more in-depth background information about data sources and product specifications, these websites can also help users understand the context in which the data was collected, how it might be relevant to their work, and how to avoid common pitfalls. Another important aspect of interactive dataset websites is the inclusion of example code for using, loading, and visualizing the data. Especially when working with RS images (e.g., multispectral, hyperspectral, or synthetic aperture radar data), visualization is often not trivial. Providing example code can be especially useful for researchers unfamiliar with the specific tools required to work with the data, or to introduce tools written specifically to ease working with the dataset. Quick feedback can be vital, as it allows researchers to report problems or ask questions that the authors or community can address in an open and centralized manner. Creating these "living, ever-evolving documents" makes them an increasingly valuable resource for anyone working with the dataset, leading to more robust and reliable research.
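A sketch of the kind of loading-and-visualization snippet such a website might embed; the file name is hypothetical and the band indexes loosely follow Sentinel-2 conventions:

```python
# Sketch: read three bands of a multispectral GeoTIFF and display an
# RGB composite with a percentile contrast stretch.
import numpy as np
import rasterio
import matplotlib.pyplot as plt

with rasterio.open("patch.tif") as src:       # hypothetical file name
    rgb = src.read([4, 3, 2]).astype(np.float32)  # 1-based band indexes

# Percentile stretch so the composite is not washed out.
lo, hi = np.percentile(rgb, (2, 98))
rgb = np.clip((rgb - lo) / (hi - lo), 0, 1)

plt.imshow(np.transpose(rgb, (1, 2, 0)))      # (bands, y, x) -> (y, x, bands)
plt.axis("off")
plt.title("RGB composite of one dataset patch")
plt.show()
```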

It might seem daunting at first to create such an interactive dataset website, but due to recent open-source projects such as Executable Books (https://executablebooks.org/) and free hosting providers such as GitHub Pages (https://pages.github.com/), it has become relatively easy to produce and host such websites. The HTML content can be generated from Jupyter Notebooks, a tool that many researchers and data scientists are familiar with. To provide an example, in our talk we will showcase an interactive dataset website for the BigEarthNet-MM dataset, which you can find here: https://docs.kai-tub.tech/ben-docs/

How to cite: Clasen, K. N. and Demir, B.: Construction of Interactive Websites for Remote Sensing Datasets, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3254, https://doi.org/10.5194/egusphere-egu23-3254, 2023.

16:40–16:42 | PICO2.7 | EGU23-12352 | ECS | On-site presentation
Point-Cloud Class Separability: Identifying the Most Discriminative Features
Max Hess, Aljoscha Rheinwalt, and Bodo Bookhagen

The global availability of dense point-clouds provides the potential to better assess changes in our dynamic world, particularly environmental changes and natural hazards. A core step in making use of modern point-clouds is reliable classification, which depends on identifying the features that matter for it. The quality of classification is affected by both the classifier and the complexity of the features that describe the classes. To address the limitations of classification performance, we attempt to answer the question: to what extent can a classifier learn the separation into different classes based on the available features in a given training dataset?

We compare several measures of class separability to assess the descriptive value of each feature. A ranked list is generated that includes all individual features as well as all possible combinations within specific groups. Selecting high-ranked features based on their descriptive value allows us to summarize datasets without losing essential information about the individual classes. This is an important step in processing existing training data or in setting priorities for future data collection.
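The abstract does not specify which separability measures are compared; as one common example, the sketch below ranks synthetic per-point features (feature names hypothetical) by Fisher's discriminant ratio for a two-class ground/vegetation problem:

```python
# Sketch: rank features by a simple class-separability score.
import numpy as np

def fisher_ratio(x_a: np.ndarray, x_b: np.ndarray) -> float:
    """(gap between class means)^2 / pooled variance, for one feature."""
    num = (x_a.mean() - x_b.mean()) ** 2
    den = x_a.var() + x_b.var() + 1e-12
    return num / den

rng = np.random.default_rng(0)
# Hypothetical per-point feature values for ground vs. vegetation points.
features = {
    "height": (rng.normal(0.1, 0.05, 1000), rng.normal(2.0, 1.0, 1000)),
    "planarity": (rng.normal(0.9, 0.05, 1000), rng.normal(0.4, 0.2, 1000)),
    "echo_ratio": (rng.normal(0.95, 0.1, 1000), rng.normal(0.6, 0.25, 1000)),
}

ranking = sorted(((name, fisher_ratio(a, b)) for name, (a, b) in features.items()),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```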

In our application experiments, we compare geometric and echo-based features of lidar point-clouds to obtain the most useful sets of features for separating ground and vegetation points into their respective classes. Different scenarios of suburban and natural areas are studied to collect insights for different classification tasks. In addition, we group features by attributes such as acquisition or computational cost and evaluate whether these efforts pay off in better classification results.

How to cite: Hess, M., Rheinwalt, A., and Bookhagen, B.: Point-Cloud Class Separability: Identifying the Most Discriminative Features, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-12352, https://doi.org/10.5194/egusphere-egu23-12352, 2023.

16:42–16:44 | PICO2.8 | EGU23-7299 | On-site presentation
Towards generation of synthetic hyperspectral image datasets with GAN
François De Vieilleville, Adrien Lagrange, Nicolas Dublé, and Bertrand Le Saux

In the context of the CORTEX project, a study was carried out to build a method for generating synthetic images with associated labels for hyperspectral use cases. Such a method is valuable when too few annotated data are available to train a deep neural network (DNN). Hyperspectral imaging is particularly suited to this problem, since labeled datasets of hyperspectral images are scarce and generally very small.

The first step of the project was therefore to define a suitable hyperspectral use case for the study. Generative models must be trained to achieve this objective, which means that a set of hyperspectral images and their associated ground truth is necessary. A dataset was created from PRISMA images associated with the IGN BD Forêt v2. The result is a segmentation dataset of 1268 images of size 256x256 pixels with 234 spectral bands. The associated ground truth includes 4 classes: not-forest, broad-leaved forest, coniferous forest, and mixed forest. To correctly match the ground truth and the images, substantial work went into improving the geolocation of the PRISMA images by coregistering patches with Sentinel-2 images. We want to underline the value of this database, which remains, to our knowledge, one of the few large-scale hyperspectral databases, and which is made available on Zenodo.

A segmentation model was then trained on the dataset to assess its quality and the feasibility of forest-type segmentation. Good results were obtained using a Unet-EfficientNet segmentation DNN, showing that the dataset is coherent, though the problem remains difficult: the 'mixed forest' class is still challenging to identify.

Finally, an important research effort was devoted to developing a Generative Adversarial Network (GAN) method able to generate synthetic hyperspectral images. The state-of-the-art StyleGAN2 was modified for this purpose: an additional discriminator was added, tasked with discriminating synthetic from real images in a reduced image space. Good results were obtained for the generation of 32-band images, but results worsen as the number of bands increases further; the difficulty of the problem appears directly linked to the number of bands to be generated.
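A minimal sketch of the reduced-space discriminator idea as we read it (not the authors' implementation; layer sizes and the fixed random 1x1 projection are illustrative assumptions):

```python
# Sketch: a second discriminator that judges hyperspectral images only
# after a fixed 1x1-convolution projection into a reduced band space.
import torch
import torch.nn as nn

class ReducedSpaceDiscriminator(nn.Module):
    def __init__(self, in_bands: int = 234, reduced_bands: int = 3):
        super().__init__()
        # Fixed (non-trainable) spectral projection shared by real and
        # synthetic images, so only the reduced representation is judged.
        self.reduce = nn.Conv2d(in_bands, reduced_bands, kernel_size=1, bias=False)
        for p in self.reduce.parameters():
            p.requires_grad_(False)
        self.disc = nn.Sequential(
            nn.Conv2d(reduced_bands, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.disc(self.reduce(x))

# Real/fake logits for a batch of 234-band images (small spatial size
# for the demo; the dataset itself uses 256x256 patches).
d = ReducedSpaceDiscriminator()
logits = d(torch.randn(2, 234, 64, 64))   # shape: (2, 1)
```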

The final goal was to generate synthetic ground truth masks alongside the images, and the SemanticGAN method was selected to address this problem. Since this method is based on StyleGAN2, the improvements of StyleGAN2 for hyperspectral images were included in the method. In the end, a modified version of SemanticGAN was proposed: the discriminator assessing the coherence between masks and images was modified to use an image of reduced dimension, and a specific training strategy was introduced to help convergence. The initial expectation was that the generation of masks would help stabilize the generation of images, but the experiments showed the contrary. Early results are promising, but more research will be necessary to obtain pairs of images and masks that could be used to train a DNN.

How to cite: De Vieilleville, F., Lagrange, A., Dublé, N., and Le Saux, B.: Towards generation of synthetic hyperspectral image datasets with GAN, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7299, https://doi.org/10.5194/egusphere-egu23-7299, 2023.

16:44–16:46 | PICO2.9 | EGU23-888 | ECS | On-site presentation
Generating self-labeled geological datasets for semantic segmentation using pretrained GANs
Ivan Ferreira and Ardiansyah Koeshidayatullah

Recent advancements in deep generative models, particularly GANs, have drawn the attention of many researchers to the feasibility of using realistic synthetic data as (i) a digital twin of the original dataset and (ii) a new approach to augmenting the original dataset. Previous works highlighted that GANs can replicate both the aesthetic and statistical characteristics of datasets, to the point of being indistinguishable from real samples even when examined by domain experts. In addition, the weights learned during the unsupervised training of these generative models are useful for further extracting specific features of interest from the given dataset. In geosciences, many computer vision tasks involve semantic segmentation, from pore quantification to fossil characterization. In such tasks, the labeling process becomes the main limiting factor, being both time-consuming and requiring domain experts. Hence, in this study, we repurpose GANs to obtain self-labeled geological datasets for semantic segmentation that are readily applicable in geological machine learning workflows. We used style-based GANs trained on foraminifera specimens, ooids, and mudstones. Our experiments show that with one or a few labels, we can successfully generate self-labeled, synthetic datasets featuring the labels of interest. This achievement is pivotal in geosciences for exploring GANs for one-shot and few-shot segmentation and for minimizing the manual labeling efforts that require domain experts.
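A rough sketch of the self-labeling workflow under a DatasetGAN-style reading of the abstract; `DummyGen` is a stand-in for a pretrained style-based GAN exposing feature maps, and all sizes and class counts are invented:

```python
# Sketch: generate (image, mask) pairs by classifying GAN feature maps
# with a small pixel-wise head trained on only a few labeled samples.
import torch
import torch.nn as nn

class DummyGen(nn.Module):
    """Stand-in for a pretrained style-based GAN that returns an image
    plus intermediate feature maps upsampled to image resolution."""
    z_dim = 64
    def forward(self, z):
        image = torch.randn(1, 3, 128, 128)     # placeholder synthesis
        feats = torch.randn(1, 512, 128, 128)   # placeholder features
        return image, feats

# 1x1-conv pixel classifier (512 feature channels, 4 classes: hypothetical),
# assumed to have been trained on the few manually labeled samples.
pixel_head = nn.Conv2d(512, 4, kernel_size=1)

def self_label(generator, head, n_samples):
    """Sample the GAN and label each pixel from its feature vector."""
    pairs = []
    with torch.no_grad():
        for _ in range(n_samples):
            z = torch.randn(1, generator.z_dim)
            image, feats = generator(z)
            mask = head(feats).argmax(dim=1)    # per-pixel class indices
            pairs.append((image, mask))
    return pairs

dataset = self_label(DummyGen(), pixel_head, n_samples=4)
```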

How to cite: Ferreira, I. and Koeshidayatullah, A.: Generating self-labeled geological datasets for semantic segmentation using pretrained GANs, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-888, https://doi.org/10.5194/egusphere-egu23-888, 2023.

16:46–16:48 | PICO2.10 | EGU23-2203 | On-site presentation
Object detection and classification applying AI (computer vision) to underwater images
Young-Tae Son, Sang-Yeop Jin, and Tae-Soon Kang

The YOLOv5 visual AI (Artificial Intelligence) algorithm was used to detect marine organisms in underwater images, and the test results showed a high average detection rate (>90%). As performance indicators of the AI model, both precision and recall exceeded 0.95. To minimize changes in object detection performance under varying underwater conditions, image correction was conducted, and more objects could be detected after correction.

To determine which species a detected object corresponds to, performance was evaluated with a deep learning classification model (YOLO-Classification); accuracy improved by approximately 3% after image correction. We attempted to identify the taxonomic species of organisms using deep learning and, although the number of target species was small, achieved a classification accuracy of about 80% or more on the data collected so far.

To accurately classify object (fish) species, a high-quality image database of the target species has to be established from a long-term perspective, and images taken from various angles of the target species must be collected simultaneously to improve performance.

As a prerequisite for measuring the size of an object detected in an image, monocular depth estimation (MDE), a deep learning approach to estimating depth from a single-camera image, was applied, and the distance from a given reference point was calculated with the MiDaS v3 algorithm. In tests of the MiDaS v3 algorithm, the previously excessive error was reduced, and a distance measurement accuracy of up to 2 m, longer than the guide stick, was obtained.
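A minimal sketch of how the two off-the-shelf components described above can be chained; the torch.hub entry points for YOLOv5 and MiDaS are real, but the image path is hypothetical and the pretrained COCO weights stand in for a model fine-tuned on marine species:

```python
# Sketch: run object detection and monocular depth estimation on one frame.
import cv2
import torch

# Detection: pretrained YOLOv5 (would be fine-tuned on marine organisms).
detector = torch.hub.load("ultralytics/yolov5", "yolov5s")

# Depth: MiDaS small model plus its matching input transform.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("underwater.jpg"), cv2.COLOR_BGR2RGB)

detections = detector(img).pandas().xyxy[0]   # one row per detected object
with torch.no_grad():
    depth = midas(transform(img)).squeeze().numpy()  # relative inverse depth

print(detections[["name", "confidence"]])
```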

How to cite: Son, Y.-T., Jin, S.-Y., and Kang, T.-S.: Object detection and classification applying AI (computer vision) to underwater images, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-2203, https://doi.org/10.5194/egusphere-egu23-2203, 2023.

16:48–16:50 | PICO2.11 | EGU23-17570 | Virtual presentation
A dataset of Earth Observation Data for Lithological Mapping using Machine Learning
Ioannis Vernikos, Georgios Giannopoulos, Aikaterini Christopoulou, Anxhelo Begaj, Marianthi Stefouli, Emmanuel Bratsolis, and Eleni Charou

Machine Learning (ML) algorithms have successfully contributed to the creation of automated methods for recognizing patterns in high-dimensional data. Remote sensing data cover wide geographical areas and can reduce the demand for various in-situ data. Lithological mapping using remotely sensed data is one of the most challenging applications of ML algorithms. In the framework of the "AI for Geoapplications" project, ML and especially Deep Learning (DL) methodologies are investigated for the identification and characterization of lithology based on remote sensing data in various pilot areas in Greece. In order to train and test the various ML algorithms, a dataset was created consisting of 30 ROIs, selected mainly from low-vegetation areas, that cover 2% of the total area of Greece. For each ROI, the following are provided:

  • the corresponding shapefile with the lithological units
  • the corresponding Sentinel-2 (10 bands) and/or ASTER (14 bands) images

The dataset is publicly available in the cloud along with the necessary code for its visualization and processing.
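A sketch (all file and column names hypothetical) of how one ROI's Sentinel-2 image could be paired with its lithological-unit shapefile by rasterizing the vector labels onto the image grid:

```python
# Sketch: burn per-unit lithology class codes into a label raster that
# aligns pixel-for-pixel with the Sentinel-2 image of one ROI.
import geopandas as gpd
import rasterio
from rasterio.features import rasterize

with rasterio.open("roi_01_sentinel2.tif") as src:   # hypothetical file
    image = src.read()                               # (10 bands, rows, cols)
    transform, shape = src.transform, (src.height, src.width)
    crs = src.crs

# Reproject the vector units onto the raster's CRS before rasterizing.
units = gpd.read_file("roi_01_lithology.shp").to_crs(crs)

labels = rasterize(
    # "class_id" is a hypothetical integer attribute per lithological unit.
    ((geom, int(code)) for geom, code in zip(units.geometry, units["class_id"])),
    out_shape=shape, transform=transform, fill=0,    # 0 = unlabelled
)
```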

How to cite: Vernikos, I., Giannopoulos, G., Christopoulou, A., Begaj, A., Stefouli, M., Bratsolis, E., and Charou, E.: A dataset of Earth Observation Data for Lithological Mapping using Machine Learning, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-17570, https://doi.org/10.5194/egusphere-egu23-17570, 2023.

16:50–16:52 | PICO2.12 | EGU23-10590 | ECS | Virtual presentation
The impact of training dataset on a vision-based smart road sensor to measure the level of flood on the streets (withdrawn)
Mahnoush Mohammadi Jahromi, Sepehr Honarparvar, Sara Saeedi, and Steve Liang
16:52–18:00