OGC Testbed-18 Machine Learning Training Datasets Task: Application of standards to Machine Learning training datasets
- ¹ Pixalytics Ltd, Plymouth, United Kingdom (slavender@pixalytics.com)
- ² FrontierSI, Australia
- ³ Curtin University, Australia
Training datasets are a crucial component of any machine learning approach, and significant human effort goes into creating and curating them for specific applications. However, a historical absence of standards has resulted in inconsistent and heterogeneous training datasets with limited discoverability and interoperability. Best practices and guidelines are therefore needed for generating, structuring, describing, and curating training datasets.
The Open Geospatial Consortium (OGC) Testbed-18 initiative covered several topics related to geospatial data, focussing on issues around cataloguing and interoperability. Within Testbed-18, the Machine Learning Training Datasets task aimed to develop a foundation for future standardization of training datasets for Earth observation applications.
For this task, members from Pixalytics, FrontierSI, and Curtin University authored an Engineering Report that reviewed:
· Examples of how training datasets have been used in Earth observation applications
· The current best-practice methods for documenting training datasets
· The various requirements for training dataset metadata
· How the FAIR (Findable, Accessible, Interoperable, Reusable) principles apply to training datasets
The Engineering Report provides a foundation that OGC can build on when creating a future standard for machine learning training data for Earth observation applications. It also gives a useful overview of the current state of work and the main considerations for anyone wishing to improve how they document their training datasets.
In our presentation, we discuss the main findings of the Engineering Report, including the key metadata identified from Earth observation use cases, the current state of the art, thoughts on cataloguing and describing training data quality, and how the FAIR principles apply to training data.
How to cite: Lavender, S., Adams, C., Ivánová, I., and Williams, K.: OGC Testbed-18 Machine Learning Training Datasets Task: Application of standards to Machine Learning training datasets, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-3493, https://doi.org/10.5194/egusphere-egu23-3493, 2023.