A Framework for Bootstrapping Event Classification Datasets
Graz University of Technology, Engineering Geodesy and Measurement Systems, Graz, Austria (dumitru@tugraz.at)
Machine learning models require large numbers of training samples: in most cases, thousands of unique instances are needed for each class of events.
Constructing such a dataset is extremely expensive, chiefly because of the time it takes to identify and label meaningful events.
Additionally, the large volume of data produced by distributed acoustic sensing (DAS) interrogator units makes storage prohibitively expensive: the storage media need to be both fast (storing one second of data should take less than one second) and of high capacity (spanning periods long enough to capture rarely occurring events of interest). To alleviate these constraints, a typical approach is to store only "interesting" events. But how can one know which events to store without having a classifier in the first place?
To solve this chicken-and-egg problem, we propose a framework for constructing datasets used to train classifiers of events detectable through the acoustic fingerprinting of DAS measurement data.
Our framework progressively builds up a dataset starting from one or more hard-coded anomaly detection rules (even as simple as energy thresholding) to solve the initial problem of managing limited storage space. This creates an intermediate dataset of unlabeled events.
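As an illustration of the simplest kind of hard-coded rule mentioned above, the following minimal sketch flags stretches of a single-channel trace whose mean windowed energy exceeds a fixed threshold. The function name `detect_events`, the `window` and `threshold` parameters, and the toy trace are assumptions made for this example; they are not taken from the framework itself.

```python
from typing import List, Tuple


def detect_events(samples: List[float], window: int,
                  threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges whose windowed energy exceeds threshold.

    Hypothetical helper: the trace is cut into non-overlapping windows, and
    consecutive windows above the threshold are merged into one event.
    """
    events = []
    start = None
    for i in range(0, len(samples) - window + 1, window):
        # Mean squared amplitude of the current window.
        energy = sum(x * x for x in samples[i:i + window]) / window
        if energy > threshold:
            if start is None:
                start = i  # a new event begins here
        elif start is not None:
            events.append((start, i))  # the event ended at this window
            start = None
    if start is not None:
        # The event runs until the last complete window.
        events.append((start, (len(samples) // window) * window))
    return events


# Toy trace (made-up values): silence, a short burst, silence again.
trace = [0.0] * 8 + [1.0] * 4 + [0.0] * 4
events = detect_events(trace, window=4, threshold=0.5)
```

Only the sample ranges returned by such a rule would need to be written to storage, which is what keeps the intermediate dataset small.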
The intermediate dataset is then evaluated by means of acoustic fingerprinting, which assigns a feature vector to each event. To facilitate further user input, the feature vectors are projected onto a two-dimensional space. Clustering is then performed on the projected representations under user supervision: users select the granularity of the clustering process, and consequently the number of resulting intermediate classes.
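The projection and clustering step might look roughly like the sketch below. The abstract does not name the projection or clustering method, so PCA and single-linkage clustering with a distance cut-off are stand-ins chosen for this example only; `project_2d`, `cluster`, the `max_gap` parameter and the feature values are likewise hypothetical.

```python
import numpy as np


def project_2d(features: np.ndarray) -> np.ndarray:
    """Project feature vectors onto their first two principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def cluster(points: np.ndarray, max_gap: float) -> np.ndarray:
    """Single-linkage clustering: points within max_gap are linked together.

    max_gap plays the role of the user-selected granularity: a smaller
    value yields more, finer clusters.
    """
    n = len(points)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    for i in range(n):
        for j in range(i + 1, n):
            if dists[i, j] <= max_gap:
                parent[find(i)] = find(j)

    # Relabel cluster roots as 0..k-1 in order of first appearance.
    ids = {}
    return np.array([ids.setdefault(find(i), len(ids)) for i in range(n)])


# Made-up 4-dimensional "fingerprints" forming two well-separated groups.
features = np.array([
    [0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0],
    [5.0, 5.0, 5.0, 5.0], [5.1, 5.0, 5.0, 5.0], [5.0, 5.1, 5.0, 5.0],
])
points = project_2d(features)
coarse = cluster(points, max_gap=1.0)   # coarse granularity
fine = cluster(points, max_gap=0.01)    # fine granularity
```

On this toy input, the coarse setting merges each group into a single cluster, while the fine setting leaves every event in its own cluster, which mirrors how the granularity choice determines the number of intermediate classes.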
By assigning labels to entire clusters instead of individual samples, the time required to annotate the dataset is significantly reduced.
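The saving can be made concrete with a small sketch: the annotator makes one decision per cluster, and that decision is copied to every member. The function `propagate_labels`, the cluster ids and the class names are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List, Optional


def propagate_labels(cluster_ids: List[int],
                     cluster_labels: Dict[int, str]) -> List[Optional[str]]:
    """Copy each cluster's label to all of its events.

    Events in clusters that have not been annotated yet stay unlabeled (None).
    """
    return [cluster_labels.get(cid) for cid in cluster_ids]


# Hypothetical cluster assignment for six events, and two annotated clusters.
cluster_ids = [0, 0, 1, 2, 1, 0]
cluster_labels = {0: "vehicle", 1: "footsteps"}  # made-up class names

labels = propagate_labels(cluster_ids, cluster_labels)
decisions = len(cluster_labels)                     # annotations made
labeled = sum(lab is not None for lab in labels)    # events labeled
```

Here two annotation decisions label five of the six events; the remaining cluster stays unlabeled until a user inspects it.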
We evaluate the outcomes of this framework on a dataset of more than 50,000 events initially detected on an inner-city measurement line, and discuss how more sophisticated models could be bootstrapped from this approach.
How to cite: Dumitru, V., Strasser, L., and Lienhart, W.: A Framework for Bootstrapping Event Classification Datasets, Galileo conference: Fibre Optic Sensing in Geosciences, Catania, Italy, 16–20 Jun 2024, GC12-FibreOptic-18, https://doi.org/10.5194/egusphere-gc12-fibreoptic-18, 2024.