The OGC Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Standard
- Wuhan University, Wuhan, China (pyue@whu.edu.cn)
The development of Artificial Intelligence (AI), especially Machine Learning (ML) technology, has injected new vitality into the geospatial domain. Training Data (TD) plays a fundamental role in geospatial AI/ML: it is the key input for training, validating, and testing AI/ML models. At present, open-access Training Datasets (TDS) are usually packaged into public or personal file repositories without a standardized method for expressing their metadata and data content, which makes them difficult to find, access, interoperate with, and reuse.
Therefore, building on the Open Geospatial Consortium (OGC) standards baseline, the OGC Training Data Markup Language for AI (TrainingDML-AI) Standard Working Group (SWG) is developing a TD model and encoding methods for exchanging and retrieving TD in the Web environment. The scope includes how TD are prepared, how to specify the different metadata used for different AI/ML tasks, and how to differentiate the high-level TD information model from the extended information models specific to various AI/ML applications. This presentation describes the latest progress and status of the standard development.
The TrainingDML-AI conceptual model includes the most relevant entities of TD, ranging from the dataset down to individual training samples and labels. It specifies how the TD should be decomposed and classified, and into which parts. The core concepts are: AI_TrainingDataset, which represents a collection of training samples; AI_TrainingData, which is an individual training sample in a TDS; AI_Task, which identifies what task the TDS is used for; AI_Label, which represents the label semantics for TD; AI_Labeling, which provides the provenance for the TD; AI_TDChangeset, which records TD changes between two TDS versions; and DataQuality, which can be associated with the TDS to document its quality.
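The relationships among these core concepts can be sketched as plain Python data classes. This is only an illustrative sketch: the class names mirror the concepts listed above, but every attribute name (`sample_id`, `data_url`, `label_class`, `labeler`, and so on), the helper `build_example`, and the example values are assumptions made for this illustration, not the attributes defined by the standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AILabel:
    """Label semantics for a training sample (AI_Label). Field name is hypothetical."""
    label_class: str


@dataclass
class AILabeling:
    """Provenance of the labels (AI_Labeling). Field names are hypothetical."""
    labeler: str
    procedure: str


@dataclass
class AITask:
    """The task the dataset is used for (AI_Task). Field name is hypothetical."""
    task_type: str


@dataclass
class DataQuality:
    """Quality statement attached to the dataset. Field name is hypothetical."""
    description: str


@dataclass
class AITrainingData:
    """One training sample in a dataset (AI_TrainingData)."""
    sample_id: str
    data_url: str
    labels: List[AILabel] = field(default_factory=list)
    labeling: Optional[AILabeling] = None


@dataclass
class AITrainingDataset:
    """A collection of training samples (AI_TrainingDataset)."""
    name: str
    version: str
    tasks: List[AITask] = field(default_factory=list)
    data: List[AITrainingData] = field(default_factory=list)
    quality: Optional[DataQuality] = None

    def classes(self) -> List[str]:
        """Return the sorted set of label classes used across all samples."""
        return sorted({lab.label_class for s in self.data for lab in s.labels})


def build_example() -> AITrainingDataset:
    """Build a tiny, made-up scene classification dataset."""
    provenance = AILabeling(labeler="annotator-1", procedure="manual")
    samples = [
        AITrainingData("s1", "images/s1.tif", [AILabel("forest")], provenance),
        AITrainingData("s2", "images/s2.tif", [AILabel("airport")], provenance),
    ]
    return AITrainingDataset(
        name="demo-scenes",
        version="1.0",
        tasks=[AITask("scene classification")],
        data=samples,
        quality=DataQuality("labels reviewed once"),
    )
```

Calling `build_example().classes()` collects the label classes from every sample, which shows how dataset-level information can be derived from the sample-level entities.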
The TrainingDML-AI content model focuses on implementation, with basic attributes defined for off-the-shelf deployment. Concepts related to Earth Observation (EO) AI/ML applications are defined as additional elements. Six key components are highlighted:
- Training Dataset/Data. AI_AbstractTrainingDataset represents the TDS, while each training sample is represented as AI_AbstractTrainingData. AI_EOTrainingDataset and AI_EOTrainingData are defined to convey attributes specific to the EO domain.
- AI_EOTask is proposed by extending AI_AbstractTask to represent specific AI/ML tasks in the EO domain. The task type can refer to a particular type defined by an external category.
- Labels for each individual training sample can be represented using semantic classes, features, or coverages. AI_AbstractLabel is extended to specify AI_SceneLabel, AI_ObjectLabel, and AI_PixelLabel, respectively.
- AI_Labeling records basic provenance information on how the TDS was created. It includes the labeler and the labeling procedure, which can be mapped to the agent and the activity in W3C PROV, respectively.
- DataQuality and QualityElements, as defined in ISO 19157-1, are used to align with existing efforts on geographic data quality.
- Changes to the TDS are documented in AI_TDChangeset, which is composed of the changed training samples at the collection level.
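As an illustration of the changeset idea in the last item, the sketch below compares two versions of a dataset and lists the training samples that differ. It is a minimal sketch under stated assumptions: each version is reduced to a mapping from a sample identifier to a content fingerprint (for example a checksum), and the grouping into `add`, `modify`, and `delete` as well as the function name `diff_versions` are choices made for this example, not definitions taken from the standard.

```python
from typing import Dict, List


def diff_versions(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two dataset versions given as {sample_id: fingerprint}.

    Returns the identifiers of samples that were added, modified, or
    deleted when moving from `old` to `new` (hypothetical grouping).
    """
    added = sorted(k for k in new if k not in old)
    deleted = sorted(k for k in old if k not in new)
    modified = sorted(k for k in new if k in old and new[k] != old[k])
    return {"add": added, "modify": modified, "delete": deleted}


# Made-up fingerprints for two versions of the same dataset.
v1 = {"s1": "a1", "s2": "b1", "s3": "c1"}
v2 = {"s1": "a1", "s2": "b2", "s4": "d1"}
changeset = diff_versions(v1, v2)
```

Here `s2` changed its fingerprint, `s3` is missing from the newer version, and `s4` is new, so only those three samples need to be exchanged rather than the whole collection.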
Finally, use case scenarios and best practices are provided to illustrate the intended use and benefits of TrainingDML-AI for EO AI/ML applications. Five different tasks are covered in total: scene classification, object detection, semantic segmentation, change detection, and 3D model reconstruction. Software implementations, including pyTDML and LuojiaSet, are also presented.
How to cite: Yue, P., Shangguan, B., and Ziebelin, D.: The OGC Training Data Markup Language for Artificial Intelligence (TrainingDML-AI) Standard, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-16998, https://doi.org/10.5194/egusphere-egu23-16998, 2023.