EGU25-1534, updated on 14 Mar 2025
https://doi.org/10.5194/egusphere-egu25-1534
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Monday, 28 Apr, 10:45–12:30 (CEST), Display time Monday, 28 Apr, 08:30–12:30
 
Hall X4, X4.63
Features from Multispectral Drone Data: Curating, training and distributing Transformers for all
Jurrian Doornbos1 and Önder Babur1,2
  • 1Wageningen University, Information Technology, Wageningen, Netherlands (jurrian.doornbos@wur.nl)
  • 2Department of Mathematics and Computer Science, Eindhoven University of Technology, Netherlands (onder.babur@wur.nl)

Uncrewed aerial vehicles (UAVs) have been identified as an important tool for more detailed remote sensing than satellite-based platforms allow, with applications ranging from agriculture to forest and mountain monitoring. Thanks to their flexibility, high precision and sensor variety, UAVs offer insights well beyond previous approaches to measuring forest health, field-crop yield and even rockfall risk. That flexibility, however, also poses a problem: flight conditions, sensor types, flight heights and viewing angles all affect how well approaches developed with UAVs generalize. These supervised approaches also rely on large amounts of human-labelled data. One pathway to reducing the labelling requirements is unsupervised pretraining of Vision Transformers (ViTs). ViTs pretrained on large datasets generalize well to unseen data, with only a few supervised samples required to specialize them for an application. However, these models are typically trained on massive web-scraped RGB datasets; RGB-only ViTs lack the infrared domain that carries crucial vegetation information, and the aerial perspective of UAV imagery is missing from existing pretraining datasets.

We present an openly available Vision Transformer pretrained specifically on UAV multispectral imagery across various domains. Several downstream applications, such as canopy height modelling and semantic segmentation, are evaluated and compared against RGB baselines. The main contributions are the openly available training dataset and the pretrained models, together with recipes for finetuning a task-specific head.
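As an illustration of the finetuning recipe, the following minimal Python sketch attaches a task-specific head to a frozen DINOv2 backbone. The backbone here is the public RGB DINOv2-s checkpoint from torch.hub, used as a stand-in; the released multispectral weights, their four-channel input and the number of output classes are assumptions that would replace the values shown. Only the linear segmentation head is trained.

# Minimal sketch of the "task-specific head" recipe (illustrative, not the authors' exact code).
import torch
import torch.nn as nn

# Stand-in backbone: public RGB DINOv2-s (the multispectral release would be loaded analogously,
# differing mainly in its 4-channel patch embedding).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # keep the foundation model frozen; only the head is trained

class LinearSegHead(nn.Module):
    """1x1 convolution over frozen patch tokens, upsampled to input resolution."""
    def __init__(self, embed_dim: int = 384, num_classes: int = 4):  # num_classes is hypothetical
        super().__init__()
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Patch tokens reshaped to a (B, embed_dim, H/14, W/14) feature map.
        feats = backbone.get_intermediate_layers(x, n=1, reshape=True)[0]
        logits = self.classifier(feats)
        return nn.functional.interpolate(
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False
        )

head = LinearSegHead()
dummy = torch.randn(2, 3, 224, 224)   # 224x224 chips, 3 channels for the RGB stand-in
print(head(dummy).shape)              # torch.Size([2, 4, 224, 224])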

The dataset is built around multispectral image contributions from the ICAERUS Drone Data Analytics Library, complemented by a database search on Zenodo and Data in Brief. Each contribution then passes a quality check, including radiometric calibration and spectral alignment. All data are quantized to 16-bit float and sliced into smaller 224x224 chips with four channels (Green, Red, Red Edge and NIR). A summary of the included datasets is presented in Table 1. DINOv2-s and DINOv2-b were chosen as architectures because they are well documented and provide state-of-the-art vision foundation models. Training was done in minibatches of size 32, for 6 days on two V100 GPUs.
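To illustrate the chipping step, the Python sketch below cuts a calibrated, band-aligned reflectance array into non-overlapping 224x224 float16 chips. The (bands, H, W) array layout, the helper name and the nodata filtering are illustrative assumptions, not the authors' exact pipeline.

# Sketch of slicing a 4-band orthomosaic into 224x224 float16 chips (assumed layout: bands, H, W).
import numpy as np

def chip_multispectral(raster: np.ndarray, chip_size: int = 224) -> np.ndarray:
    """Cut a (4, H, W) reflectance raster into non-overlapping float16 chips."""
    bands, h, w = raster.shape
    assert bands == 4, "expects Green, Red, Red Edge, NIR"
    raster = raster.astype(np.float16)               # 16-bit float quantization
    chips = []
    for top in range(0, h - chip_size + 1, chip_size):
        for left in range(0, w - chip_size + 1, chip_size):
            chip = raster[:, top:top + chip_size, left:left + chip_size]
            if np.isfinite(chip).all():              # drop chips with nodata gaps
                chips.append(chip)
    if not chips:
        return np.empty((0, 4, chip_size, chip_size), dtype=np.float16)
    return np.stack(chips)

# Example: a synthetic 4-band mosaic of 1000x1200 pixels yields 4x5 = 20 chips.
mosaic = np.random.rand(4, 1000, 1200).astype(np.float32)
print(chip_multispectral(mosaic).shape)              # (20, 4, 224, 224)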

Early experiments suggest that the pretrained models outperform the existing DINOv2-s and DINOv2-b foundation models, both in the clarity of the extracted features and when finetuned on UAV-specific tasks (canopy height modelling and semantic segmentation).

Table 1. Datasets included for pretraining (total size on disk: 399 GB)

How to cite: Doornbos, J. and Babur, Ö.: Features from Multispectral Drone Data: Curating, training and distributing Transformers for all, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-1534, https://doi.org/10.5194/egusphere-egu25-1534, 2025.