Addressing Training Data Challenges to Accelerate Earth Science Machine Learning


Most progress in artificial intelligence (AI) and machine learning (ML) can be traced back to data. Large-scale, openly accessible training data in particular are critical to the adoption and acceleration of ML. While there are successful applications of ML in Earth science, wider adoption has been limited. Access to high-quality labeled training data is required to entice ML practitioners to tackle supervised learning problems in Earth science. However, creating labeled data at the scale needed to support ML models is still a bottleneck, and new strategies to increase the size and diversity of training datasets need to be explored. Enabling the discovery and open sharing of existing training data and corresponding models, so that research is reproducible and duplication is minimized, is a further challenge.

This session seeks submissions from ML practitioners and data curators using different approaches to create labeled training data, catalog training data and models, and provide search, discovery and distribution of training data and models.

Co-organized by GI2
Convener: Manil Maskey | Co-conveners: Hamed Alemohammad, Anirudh Koul, Rahul Ramachandran, Nicolas Longépé
vPICO presentations
| Wed, 28 Apr, 15:30–16:15 (CEST)

vPICO presentations: Wed, 28 Apr

Chairpersons: Hamed Alemohammad, Manil Maskey, Anirudh Koul
Benjamin Kellenberger, Devis Tuia, and Dan Morris

Ecological research like wildlife censuses increasingly relies on data at the terabyte scale. For example, modern camera trap datasets contain millions of images that require prohibitive amounts of manual labour to be annotated with species, bounding boxes, and the like. Machine learning, especially deep learning [3], could greatly accelerate this task through automated predictions, but typically involves extensive coding and expert knowledge.

In this abstract we present AIDE, the Annotation Interface for Data-driven Ecology [2]. First, AIDE is a web-based annotation suite for image labelling with support for concurrent access and scalability, up to the cloud. Second, it tightly integrates deep learning models into the annotation process through active learning [7]: models learn from user-provided labels and in turn select the most relevant images for review from the large pool of unlabelled ones (Fig. 1). The result is a system where users only need to label what is required, which saves time and reduces errors due to fatigue.

Fig. 1: AIDE offers concurrent web image labelling support and uses annotations and deep learning models in an active learning loop.

AIDE includes a comprehensive set of built-in models, such as ResNet [1] for image classification, Faster R-CNN [5] and RetinaNet [4] for object detection, and U-Net [6] for semantic segmentation. All models can be customised and used without having to write a single line of code. Furthermore, AIDE accepts any third-party model with minimal implementation requirements. To complete the package, AIDE offers both user annotation and model prediction evaluation, access control, customisable model training, and more, all through the web browser.
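AIDE's model-in-the-loop selection can be illustrated with a minimal uncertainty-sampling sketch (our own toy code, not part of AIDE): rank unlabelled images by predictive entropy and queue the most uncertain ones for human review.

```python
import numpy as np

def select_for_review(probs, budget):
    """Rank unlabelled images by predictive uncertainty (entropy) and
    return the indices of the `budget` most uncertain ones.

    probs: (n_images, n_classes) array of softmax outputs.
    """
    eps = 1e-12
    # Shannon entropy per image; higher = model less certain.
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # Most uncertain images first.
    return np.argsort(entropy)[::-1][:budget]

# Toy example: three images, two classes.
probs = np.array([[0.99, 0.01],   # confident -> skip
                  [0.55, 0.45],   # uncertain -> review
                  [0.90, 0.10]])
print(select_for_review(probs, 1))  # -> [1]
```

Labels collected on the selected images are then used to retrain the model, closing the active learning loop shown in Fig. 1.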

AIDE is fully open source and available under



How to cite: Kellenberger, B., Tuia, D., and Morris, D.: Introducing AIDE: a Software Suite for Annotating Images with Deep and Active Learning Assistance, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12065, 2021.

Rudy Venguswamy, Mike Levy, Anirudh Koul, Satyarth Praveen, Tarun Narayanan, Ajay Krishnan, Jenessa Peterson, Siddha Ganju, and Meher Kasam

Machine learning modeling for Earth events at NASA is often limited by the availability of labeled examples. For example, training classifiers for forest fires or oil spills from satellite imagery requires curating a massive and diverse dataset of example forest fires, a tedious multi-month effort requiring careful review of over 196.9 million square miles of data per day for 20 years. While such images might exist in abundance within 40 petabytes of unlabeled satellite data, finding these positive examples to include in a training dataset for a machine learning model is extremely time-consuming and requires researchers to "hunt" for positive examples, like finding a needle in a haystack. 

We present a no-code open-source tool, Curator, whose goal is to minimize the amount of human manual image labeling needed to achieve a state of the art classifier. The pipeline, purpose-built to take advantage of the massive amount of unlabeled images, consists of (1) self-supervision training to convert unlabeled images into meaningful representations, (2) search-by-example to collect a seed set of images, (3) human-in-the-loop active learning to iteratively ask for labels on uncertain examples and train on them. 

In step 1, a model capable of representing unlabeled images meaningfully is trained with a self-supervised algorithm (such as SimCLR) on a random subset of the dataset that conforms to the researchers' specified "training budget". Since real-world datasets are often imbalanced, leading to suboptimal models, the initial model is used to generate embeddings for the entire dataset, and images with equidistant embeddings are then sampled. This iterative training and resampling strategy improves both the balance of the training data and the model at every iteration. In step 2, researchers supply an example image of interest, and the embedding generated from this image is used to find other images whose embeddings lie near it in Euclidean space (hence visually similar images). These proposed candidate images contain a higher density of positive examples and are annotated manually as a seed set. In step 3, the seed labels are used to train a classifier that identifies further candidate images for human inspection via active learning. In each classification training loop, candidate images for labeling are sampled from the larger unlabeled dataset based on the images the model is most uncertain about (p ≈ 0.5).
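Steps 2 and 3 can be sketched in a few lines (a toy illustration with made-up data, not Curator's actual implementation):

```python
import numpy as np

def search_by_example(query_emb, embeddings, k=3):
    """Step 2: return indices of the k images whose embeddings are
    nearest to the query embedding in Euclidean space."""
    dists = np.linalg.norm(embeddings - query_emb, axis=1)
    return np.argsort(dists)[:k]

def uncertainty_sample(pos_probs, budget):
    """Step 3: pick the unlabeled images whose predicted probability of
    the positive class is closest to 0.5 (most uncertain)."""
    return np.argsort(np.abs(pos_probs - 0.5))[:budget]

# Toy data: four embeddings in 2-D; the query sits near indices 2 and 3.
embs = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [5.5, 5.0]])
print(search_by_example(np.array([5.1, 5.0]), embs, k=2))  # -> [2 3]
print(uncertainty_sample(np.array([0.9, 0.52, 0.1]), 1))   # -> [1]
```

In practice the embeddings come from the self-supervised model of step 1 and the probabilities from the classifier being trained in step 3.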

Curator is released as an open-source package built on PyTorch Lightning and is run from the command line. The pipeline uses GPU-based transforms from the NVIDIA DALI package for augmentation, leading to a 5–10x speed-up in self-supervised training.

By iteratively training a self-supervised model and a classifier in tandem with human manual annotation, this pipeline is able to unearth more positive examples from severely imbalanced datasets which were previously untrainable with self-supervision algorithms. In applications such as detecting wildfires, atmospheric dust, or turning outward with telescopic surveys, increasing the number of positive candidates presented to humans for manual inspection increases the efficacy of classifiers and multiplies the efficiency of researchers’ data curation efforts.

How to cite: Venguswamy, R., Levy, M., Koul, A., Praveen, S., Narayanan, T., Krishnan, A., Peterson, J., Ganju, S., and Kasam, M.: Curator: A No-Code Self-Supervised Learning and Active Labeling Tool to Create Labeled Image Datasets from Petabyte-Scale Imagery, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-6853, 2021.

Jason Meil

The data preparation process generally consumes up to 80% of a data scientist's time, with 60% of that attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially for real-world, unstructured, and unlabeled datasets, is to leverage Snorkel, a tool specifically designed to rapidly create, manage, and model training data. Configured properly, Snorkel can temper this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions (heuristics, distant supervision, SME input, or knowledge bases) scripted in Python to generate "noisy labels". Each function traverses the entire dataset, and the resulting labels are fed into a generative (conditionally probabilistic) model. The role of this model is to output the distribution of each response variable and predict the conditional probability based on a joint probability distribution, by comparing the various labeling functions and the degree to which their outputs agree. A labeling function that agrees strongly with the other labeling functions will have a high learned accuracy, that is, a high fraction of predictions the model got right; conversely, labeling functions that agree poorly with the others will have low learned accuracy. The predictions are then combined, weighted by estimated accuracy, so that the predictions of functions with higher learned accuracy count more. The result is a transformation from a binary label of 0 or 1 to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y".
As data are added to this generative model, multi-class inference is made over the response variables positive, negative, or abstain, assigning probabilistic labels to potentially millions of data points. We thus generate a discriminative ground truth for all further labeling efforts and improve the scalability of our models. These labeling functions can then be applied to unlabeled data to support further machine learning efforts.
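As a toy sketch of the idea (not Snorkel's API; Snorkel's generative model weights functions by learned accuracy rather than this simple vote fraction), labeling functions can be written as plain Python functions that vote or abstain, with their votes combined into a fuzzy label:

```python
import numpy as np

ABSTAIN = -1  # a labeling function may decline to vote

# Hypothetical heuristics over toy records of the form {"text": ...}
def lf_keyword(x):
    return 1 if "spill" in x["text"] else ABSTAIN

def lf_negation(x):
    return 0 if "no spill" in x["text"] else ABSTAIN

def apply_lfs(records, lfs):
    """Build the (n_records, n_lfs) label matrix."""
    return np.array([[lf(r) for lf in lfs] for r in records])

def fuzzy_labels(L):
    """Combine votes into a label in [0, 1]: the fraction of
    non-abstaining functions that voted positive; 0.5 when no
    function voted (maximal uncertainty)."""
    votes = (L != ABSTAIN).sum(axis=1)
    pos = (L == 1).sum(axis=1)
    return np.where(votes > 0, pos / np.maximum(votes, 1), 0.5)

recs = [{"text": "oil spill near coast"},   # clear positive vote
        {"text": "no spill detected"},      # the two heuristics conflict
        {"text": "clear water"}]            # no heuristic fires
L = apply_lfs(recs, [lf_keyword, lf_negation])
print(fuzzy_labels(L))  # conflict and no-vote rows both come out as 0.5
```

The generative model replaces the naive vote fraction here with accuracies learned from inter-function agreement, which is what makes the approach robust to individually noisy heuristics.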
Once our datasets are labeled and a ground truth is established, we persist the data into our delta lake, since it combines the most performant aspects of a data warehouse with the low-cost storage of a data lake. In addition, the lake can accept unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw ingestion, cleaned, and feature-engineered data layers. By sectioning off the data sources into these "layers", the data engineering portion is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.
The entire ecosystem is designed to eliminate as much technical debt in machine learning paradigms as possible in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post-verification.

How to cite: Meil, J.: Programmatic Labeling of Dark Data for Artificial Intelligence in Spatial Informatics, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16326, 2021.

Myroslava Lesiv, Dmitry Schepaschenko, Martina Dürauer, Marcel Buchhorn, Ivelina Georgieva, and Steffen Fritz

Spatially explicit information on forest management at a global scale is critical for understanding the current status of forests for sustainable forest management and restoration. While remote-sensing-based datasets, developed by applying ML and AI algorithms, can successfully depict tree cover and other land cover types, they have not yet been used to distinguish untouched forest from different degrees of forest management. We show for the first time that, with sufficient training data derived from very high-resolution imagery, a differentiation of various levels of forest management within the tree cover class is possible.

In this session, we would like to present our approach to labeling forest-related training data using the Geo-Wiki application. Moreover, we would like to share a new open global training dataset on forest management collected through a series of Geo-Wiki campaigns. In February 2019, we organized an expert workshop to (1) discuss the variety of forest management practices that take place in different parts of the world; (2) generalize the definitions for application at global scale; (3) finalize the Geo-Wiki interface for the crowdsourcing campaigns; and (4) build a dataset of control points (the expert dataset), which we later used to monitor the quality of the crowdsourced contributions by the volunteers. We involved forest experts from different regions around the world to explore what types of forest management information could be collected from visual interpretation of very high-resolution images from Google Maps and Microsoft Bing, in combination with Sentinel time series and Normalized Difference Vegetation Index (NDVI) profiles derived from Google Earth Engine (GEE). Based on the results of this analysis, we expanded these campaigns to a broader group of participants, mainly people recruited from remote sensing, geography, and forest research institutes and universities.

In total, we collected forest data for approximately 230,000 locations globally. These data are of sufficient density and quality to be used in many ML and AI applications for forests at regional and local scales. We also provide an example ML application: a remotely sensed global forest management map at 100 m resolution (PROBA-V) for the year 2015. It includes classes such as intact forests, forests with signs of human impact (including clear cuts and logging), replanted forest, woody plantations with a rotation period of up to 15 years, oil palms, and agroforestry. Independent statistical validation shows that the map's overall accuracy is 81%.

How to cite: Lesiv, M., Schepaschenko, D., Dürauer, M., Buchhorn, M., Georgieva, I., and Fritz, S.: Collecting training data to map forest management at global scale, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15297, 2021.

Craig Warren, Iraklis Giannakis, and Antonios Giannopoulos

A lack of well-labelled and coherent training data is the main reason why machine learning (ML) and data-driven interpretation are not established in the field of Ground-Penetrating Radar (GPR). Non-representative and limited datasets lead to unreliable ML schemes that overfit and are unable to compete with traditional deterministic approaches. To address this, numerical data can potentially complement insufficient measured datasets and overcome this lack of data, even in the presence of large feature spaces.

Using synthetic data in ML is not new, and it has been extensively applied in computer vision. Applying numerical data in ML requires a numerical framework capable of generating synthetic but nonetheless realistic datasets. For GPR, such a framework is available in gprMax, an open-source electromagnetic solver fine-tuned for GPR applications [1], [2], [3]. gprMax is fully parallelised and can be run on multiple CPUs and GPUs. In addition, it has a flexible scriptable input format that makes it easy to generate big data in a trivial manner. Stochastic geometries, realistic soils, vegetation, targets [3], and models of commercial antennas [4], [5] are some of the features that can easily be incorporated in the training data.

The capability of gprMax to generate realistic numerical datasets is demonstrated in [6], [7]. The problem investigated is assessing the depth and diameter of rebars in reinforced concrete. Estimating the diameter of rebars using GPR is particularly challenging, with no conclusive solution. Using a synthetic training set generated with gprMax, we effectively trained ML schemes capable of estimating the diameter of rebars accurately and efficiently [6], [7]. These case studies support the premise that gprMax can provide realistic training data for applications where well-labelled data are not available, such as landmine detection, non-destructive testing, and planetary sciences.


[1] Warren, C., Giannopoulos, A. & Giannakis, I., (2016). gprMax: Open Source software to simulate electromagnetic wave propagation for Ground Penetrating Radar, Computer Physics Communications, 209, 163-170.

[2] Warren, C., Giannopoulos, A., Gray, A., Giannakis, I., Patterson, A., Wetter, L. & Hamrah, A., (2018). A CUDA-based GPU engine for gprMax: Open source FDTD, electromagnetic simulation software. Computer Physics Communications, 237, 208-218.

[3] Giannakis, I., Giannopoulos, A. & Warren, C. (2016). A realistic FDTD numerical modeling framework of Ground Penetrating Radar for landmine detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. 9(1), 37-51.

[4] Giannakis, I., Giannopoulos, A. & Warren, C., (2018). Realistic FDTD GPR antenna models optimized using a novel linear/non-linear full waveform inversion. IEEE Transactions on Geoscience and Remote Sensing, 207(3), 1768-1778.

[5] Warren, C. & Giannopoulos, A. (2011). Creating finite-difference time-domain models of commercial ground-penetrating radar antennas using Taguchi's optimization method. Geophysics, 76(2), G37-G47.

[6] Giannakis, I., Giannopoulos, A. & Warren, C. (2021). A Machine Learning Scheme for  Estimating the Diameter of Reinforcing Bars Using Ground Penetrating Radar. IEEE Geoscience and Remote Sensing Letters.

[7] Giannakis, I., Giannopoulos, A., & Warren, C. (2019). A machine learning-based fast-forward solver for ground penetrating radar with application to full-waveform inversion. IEEE Transactions on Geoscience and Remote Sensing. 57(7), 4417-4426.

How to cite: Warren, C., Giannakis, I., and Giannopoulos, A.: gprMax: An Open Source Electromagnetic Simulator for Generating Big Data for Ground Penetrating Radar Applications, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10347, 2021.

Surya Ambardar, Siddha Ganju, and Peter Jenniskens

Meteor showers are some of the most dazzling and memorable events occurring in the night sky. Caused by bits of celestial debris from comets and asteroids entering Earth's atmosphere at astronomical speeds, meteors are bright streaks of light in the night sky, sometimes called shooting stars. These meteors are recorded, tracked, and triangulated by low-light surveillance cameras in a project called CAMS: Cameras for Allsky Meteor Surveillance. CAMS offers insights into a universe of otherwise invisible solar system bodies, but the task has proven difficult due to the lack of automated supervision; until recently, much of the data control was done by hand. Labeled training data are essential for building supervised classification models, because other man-made objects such as airplanes and satellites can be mistaken for meteors. To address this issue, we leverage one year's worth of meteor activity data from CAMS to provide weak supervision for over a decade of collected data, drastically reducing the amount of manual annotation necessary and expanding the available labelled meteor training data.


Founded in 2010, CAMS aims to automate video surveillance of the night sky to validate the International Astronomical Union's Working List of Meteor Showers, discover new meteor showers, and predict future ones. Since 2010, CAMS has collected a decade's worth of night-sky activity data in the form of astrometric tracks and brightness profiles, a year of which has been manually annotated. We use this one year of labelled data to train a high-confidence LSTM meteor classifier that generates low-confidence labels for the remaining decade's worth of meteor data. Our classifier yields confidence levels for each prediction, and when the confidence lies above a statistically significant threshold, the predicted labels can be treated as weak supervision for future training runs. The remaining predictions below the threshold can be manually annotated. Using a high threshold minimizes label noise and ensures instances are correctly labeled, while considerably reducing the amount of data that needs to be annotated. The weak supervision can be confirmed by checking date ranges and data distributions for known meteor showers to verify the predicted labels.
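The thresholding step can be sketched as follows (a simplified illustration with made-up numbers, not the CAMS codebase): predictions confident in either direction become weak labels, and the rest go to manual review.

```python
import numpy as np

def split_by_confidence(probs, threshold=0.95):
    """Partition model predictions on unlabelled tracks into
    high-confidence weak labels and a residue for manual annotation.

    probs: (n,) predicted probability that each track is a meteor.
    Returns (weak_label_idx, weak_labels, review_idx)."""
    conf = np.maximum(probs, 1 - probs)        # confidence either way
    keep = conf >= threshold
    weak_labels = (probs[keep] >= 0.5).astype(int)
    return np.flatnonzero(keep), weak_labels, np.flatnonzero(~keep)

# Toy predictions: confident meteor, unsure, confident non-meteor, unsure.
probs = np.array([0.99, 0.60, 0.02, 0.80])
idx, labels, review = split_by_confidence(probs, threshold=0.95)
print(idx, labels, review)  # -> [0 2] [1 0] [1 3]
```

Raising the threshold trades coverage for label purity, which is the design choice described above for minimizing label noise.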


To encourage discovery and distribution of training data and models, we additionally provide scripts to automate data ingestion and model training from raw camera data files. The data scripts handle processing of CAMS data, providing a pipeline that encourages open sharing and reproduction of our research. Additionally, we provide code for a baseline LSTM classifier model that can identify probable meteors. This baseline model script allows further exploration of CAMS data and an opportunity to experiment with other model types.


In conclusion, our contributions are (1) a weak supervision method utilizing a year's worth of labelled CAMS data to generate labels for a decade's worth of data, along with (2) baseline data processing and model scripts to encourage open discovery and distribution. Our contributions expand access to labeled meteor training data and make the data globally and publicly accessible through daily generated maps of meteor shower activity posted at

How to cite: Ambardar, S., Ganju, S., and Jenniskens, P.: It's a Bird it's a Plane it's a Meteor, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-6490, 2021.

Francesco Asaro, Gianluca Murdaca, and Claudio Maria Prati

This work presents a methodology to improve supervised learning of segmentation tasks for convolutional architectures in the unbalanced, weakly labeled synthetic aperture radar (SAR) dataset scenarios that characterize the Earth observation (EO) domain. The methodology exploits multitemporality and stochasticity to regularize training, reducing overfitting and thus improving validation and test performance.

Traditional precisely annotated datasets are made of patches extracted from a set of image-label pairs, often in a deterministic fashion. Through a set of experiments, we show that this approach is sub-optimal when using weak labels since it leads to early overfitting, mainly because weak labels only mark the simplest features of the target class.

The presented methodology builds up the dataset from a multitemporal stack of images aligned with the weakly labeled ground truth and samples patches both in time and space. A patch is selected only if a given condition on the positive-class frequency is met. We show learning improvements over the traditional methodology by applying our strategy to a benchmark task: training a typical deep convolutional network, U-Net (Ronneberger et al., 2015), for the segmentation of water surfaces in SAR images.
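The conditional spatio-temporal sampling can be sketched as follows (our own toy illustration, not the authors' code): crops are drawn at random positions and dates, and kept only when the weak-label patch contains enough of the positive class.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patch(stack, labels, size=64, min_pos=0.05, max_tries=100):
    """Draw a random (time, y, x) crop from a multitemporal SAR stack,
    accepting it only if the weak-label patch contains at least
    `min_pos` fraction of the positive (e.g. water) class.

    stack: (T, H, W) image stack; labels: (H, W) weak-label mask."""
    T, H, W = stack.shape
    for _ in range(max_tries):
        t = rng.integers(T)              # stochastic in time...
        y = rng.integers(H - size)       # ...and in space
        x = rng.integers(W - size)
        lab = labels[y:y + size, x:x + size]
        if lab.mean() >= min_pos:        # conditional selection
            return stack[t, y:y + size, x:x + size], lab
    return None  # condition never met within the retry budget

# Toy usage: a 3-date stack with the positive class in one quadrant.
stack = np.zeros((3, 128, 128))
labels = np.zeros((128, 128))
labels[:64, :64] = 1.0
patch, lab = sample_patch(stack, labels)
print(patch.shape, lab.mean())
```

The `min_pos` condition is what compensates for the under-representation of the positive class, while drawing `t` at random exposes the network to different speckle realizations of the same scene.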

The dataset sources are Sentinel-1 calibrated sigma-zero, VV-VH polarized, single-look intensity images for the inputs, and the Copernicus "Water and Wetness High Resolution Layer" for the weak labels. To avoid spatial autocorrelation phenomena, the training set covers the Low Countries (Belgium, the Netherlands, and Luxembourg), while the validation and test sets span the Padana plain area (Italy). The training dataset is built up according to the methodology, while the validation and test datasets are defined deterministically, as usual.

We show the beneficial effects of multitemporality, stochasticity, and conditional selection in three different sets of experiments, as well as in a combined one. In particular, we observe performance improvements in terms of the F-1 score, which increases with the degree of multitemporality (number of images in the stack), as well as when stochasticity and conditional rules that compensate for the under-representation of the positive class are added. Furthermore, we show that in the specific framework of SAR data, the introduction of multitemporality improves the learned representation of the speckle, thus implicitly optimizing the U-Net for both the filtering and segmentation tasks. We demonstrate this by comparing the number of looks of the input patch to that of the patch reconstructed before the classification layer.

Overall, in this framework, we show that using the presented training strategy alone, the classifier's performance improves by up to 5% in terms of the F-1 score.

How to cite: Asaro, F., Murdaca, G., and Prati, C. M.: Conditional spatio-temporal random crop for weak labeled SAR datasets, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-11957, 2021.

Christian Schroeder de Witt, Catherine Tong, Valentina Zantedeschi, Daniele De Martini, Alfredo Kalaitzis, Matthew Chantry, Duncan Watson-Parris, and Piotr Bilinski

Climate change is expected to aggravate extreme precipitation events, directly impacting the livelihood of millions. Without a global precipitation forecasting system in place, many regions, especially those constrained in resources to collect expensive ground station data, are left behind. To mitigate such unequal reach of climate change, one solution is to alleviate the reliance on numerical models (and by extension ground station data) by enabling machine-learning-based global forecasts from satellite imagery. Though prior work exists on regional precipitation nowcasting, work on global, medium-term precipitation forecasting is lacking, and, importantly, a common, accessible baseline for meaningful comparison is absent. In this work, we present RainBench, a multi-modal benchmark dataset dedicated to advancing global precipitation forecasting. We establish baseline tasks and release PyRain, a data-handling pipeline that enables efficient processing of decades' worth of data by any modeling framework. While our work serves as a basis for a new chapter on global precipitation forecasting from satellite imagery, the greater promise lies in the community joining forces to use our released datasets and tools to develop machine learning approaches to this important challenge.

How to cite: Schroeder de Witt, C., Tong, C., Zantedeschi, V., De Martini, D., Kalaitzis, A., Chantry, M., Watson-Parris, D., and Bilinski, P.: RainBench: Enabling Data-Driven Precipitation Forecasting on a Global Scale, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-1762, 2021.

Octavian Dumitru, Gottfried Schwarz, Mihai Datcu, Dongyang Ao, Zhongling Huang, and Mila Stillman

In recent years, much progress has been made with machine learning algorithms. Typical application fields of machine learning include many technical and commercial applications as well as Earth science analyses, where most often indirect and distorted detector data have to be converted to well-calibrated scientific data, a prerequisite for a correct understanding of the desired physical quantities and their relationships.

However, the provision of sufficient calibrated data is not enough for the testing, training, and routine processing of most machine learning applications. In principle, one also needs a clear strategy for the selection of necessary and useful training data and an easily understandable quality control of the finally desired parameters.

At first glance, one might assume that this problem can be solved by a careful selection of representative test data covering many typical cases as well as some counterexamples. These test data can then be used to train the internal parameters of a machine learning application. On closer inspection, however, many researchers have found that simply stacking up plain examples is not the best choice for many scientific applications.

To get improved machine learning results, we concentrated on the analysis of satellite images depicting the Earth’s surface under various conditions such as the selected instrument type, spectral bands, and spatial resolution. In our case, such data are routinely provided by the freely accessible European Sentinel satellite products (e.g., Sentinel-1, and Sentinel-2). Our basic work then included investigations of how some additional processing steps – to be linked with the selected training data – can provide better machine learning results.

To this end, we analysed and compared three different approaches to identify machine learning strategies for the joint selection and processing of training data for our Earth observation images:

  • One can optimize the training data selection by adapting it to the specific instrument, target, and application characteristics [1].
  • Alternatively, one can dynamically generate new training samples with Generative Adversarial Networks, comparable to the role of a sparring partner in boxing [2].
  • One can also use a hybrid semi-supervised approach for Synthetic Aperture Radar images with limited labelled data. The method is split into polarimetric scattering classification, topic modelling for scattering labels, unsupervised constraint learning, and supervised label prediction with constraints [3].

We applied these strategies in the ExtremeEarth sea-ice monitoring project. As a result, we can demonstrate for which application cases these three strategies provide a promising alternative to a simple conventional selection of available training data.

[1] C.O. Dumitru et al., "Understanding Satellite Images: A Data Mining Module for Sentinel Images", Big Earth Data, 2020, 4(4), pp. 367-408.

[2] D. Ao et al., "Dialectical GAN for SAR Image Translation: From Sentinel-1 to TerraSAR-X", Remote Sensing, 2018, 10(10), pp. 1-23.

[3] Z. Huang et al., "HDEC-TFA: An Unsupervised Learning Approach for Discovering Physical Scattering Properties of Single-Polarized SAR Images", IEEE Transactions on Geoscience and Remote Sensing, 2020, pp. 1-18.

How to cite: Dumitru, O., Schwarz, G., Datcu, M., Ao, D., Huang, Z., and Stillman, M.: Improved Training for Machine Learning: The Additional Potential of Innovative Algorithmic Approaches, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4683, 2021.

Alastair McKinstry, Oisin Boydell, Quan Le, Inder Preet, Jennifer Hanafin, Manuel Fernandez, Adam Warde, Venkatesh Kannan, and Patrick Griffiths

The ESA-funded AIREO project [1] sets out to produce AI-ready training dataset specifications and best practices to support the training and development of machine learning models on Earth observation (EO) data. While the quality and quantity of EO data have increased drastically over the past decades, the availability of training data for machine learning applications is considered a major bottleneck. The goal is to move towards implementing FAIR data principles for training data in EO, enhancing especially the findability, interoperability, and reusability aspects. To achieve this goal, AIREO sets out to provide a training data specification and to develop best practices for the use of training datasets in EO. An additional goal is to make training datasets self-explanatory ("AI-ready") in order to expose challenging problems to a wider audience that does not have expert geospatial knowledge.

Key elements addressed in the AIREO specification are granular and interoperable metadata (based on STAC), innovative quality assurance metrics, data provenance and processing history, as well as integrated feature engineering recipes that optimize platform independence. Several initial pilot datasets are being developed following the AIREO data specifications. These pilot applications include, for example, forest biomass, sea ice detection, and the estimation of atmospheric parameters. An API for the easy exploitation of these datasets will allow the training datasets (TDS) to work against EO catalogs (based on OGC STAC catalogs and best practices from the ML community), supporting updates and repeated model training over time.


This presentation will introduce the first version of the AIREO training dataset specification and showcase some of the best practices that were developed. The AIREO-compliant pilot datasets, which are openly accessible, will be presented, and community feedback is explicitly encouraged.


How to cite: McKinstry, A., Boydell, O., Le, Q., Preet, I., Hanafin, J., Fernandez, M., Warde, A., Kannan, V., and Griffiths, P.: AI-Ready Training Datasets for Earth Observation: Enabling FAIR data principles for EO training data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12384, 2021.