EGU25-8714, updated on 14 Mar 2025
https://doi.org/10.5194/egusphere-egu25-8714
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Friday, 02 May, 15:15–15:25 (CEST)
Room -2.92
Leveraging multi-modal classification of historical aerial images and topographic maps to derive past land cover
Mareike Dorozynski1, Franz Rottensteiner1, Thorsten Dahms2, and Michael Hovenbitzer2
  • 1Institute of Photogrammetry and GeoInformation, Leibniz University Hannover, Hanover, Germany ({dorozynski, rottensteiner}@ipi.uni-hannover.de)
  • 2Bundesamt für Kartographie und Geodäsie, Frankfurt am Main, Germany ({thorsten.dahms, michael.hovenbitzer}@bkg.bund.de)

Analysing the evolution of landscapes requires knowledge not only of the current state of the Earth’s surface, but also of past states. Sources of information on historic land cover are historic remote sensing imagery and scanned historic topographic maps. To make the contained information explicitly available for subsequent computer-aided spatio-temporal analysis, classification techniques can be exploited. Against this background, a multi-modal land cover classification from maps and aerial orthoimages is developed in the context of the Gauss Centre (Gauss Centre, 2025), aiming to benefit both from the textural and geometrical details contained in aerial images and from the small intra-class variability in topographic maps.

The proposed deep learning-based classifier is a variant of a UPerNet (Xiao et al., 2018) with four down-sampling stages that takes aerial orthoimagery and topographic maps of the same epoch as input. Each input modality is processed by an individual encoder, either based on convolutions, e.g. a ResNet (He et al., 2016), or on attention mechanisms, e.g. a Swin Transformer (Liu et al., 2022). This results in uni-modal map features and uni-modal aerial image features at four levels of detail. As the aerial images provide finer details about the texture and boundaries of the land cover objects, the aerial features of the first three stages are presented directly to the decoder, while the highest-level aerial image features are fused with those of the topographic maps in a mid-level fusion. To focus on the most relevant features of the two modalities in both the spatial and the feature dimension, locally as well as globally, the features are weighted by attention weights that are learned following the strategy in (Song et al., 2022). The lower-level aerial features and the high-level multi-modal features are presented to the decoder to predict the multi-modal land cover.
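The attention-weighted mid-level fusion of the highest-level aerial and map features can be illustrated with a minimal PyTorch sketch. The module below is an illustrative assumption on our part (layer choices, names and the exact weighting scheme are hypothetical), not the implementation of the abstract or the precise mechanism of Song et al. (2022); it only shows the idea of combining learned local (per-pixel) and global (per-channel) attention weights over two modalities.

```python
import torch
import torch.nn as nn


class MidLevelFusion(nn.Module):
    """Illustrative sketch: fuse highest-level aerial and map features
    using learned spatial (local) and channel (global) attention weights.
    Hypothetical design, not the authors' exact architecture."""

    def __init__(self, channels: int):
        super().__init__()
        # Local weighting: one weight per pixel and modality,
        # normalized over the two modalities via softmax.
        self.spatial = nn.Conv2d(2 * channels, 2, kernel_size=1)
        # Global weighting: one sigmoid gate per feature channel,
        # computed from globally pooled features.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Project the globally re-weighted stack back to `channels`.
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_aerial: torch.Tensor, f_map: torch.Tensor) -> torch.Tensor:
        stacked = torch.cat([f_aerial, f_map], dim=1)       # (B, 2C, H, W)
        # Local, per-pixel trade-off between the two modalities.
        w = torch.softmax(self.spatial(stacked), dim=1)     # (B, 2, H, W)
        fused_local = w[:, 0:1] * f_aerial + w[:, 1:2] * f_map
        # Global, per-channel re-weighting of the stacked features.
        g = self.channel(stacked)                           # (B, 2C, 1, 1)
        fused_global = self.project(g * stacked)            # (B, C, H, W)
        return fused_local + fused_global
```

The fused output has the same resolution and channel count as each uni-modal input, so it can be handed to the decoder alongside the lower-level aerial features.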

Experiments are conducted on two multi-modal datasets: one for binary building classification and one for multi-class vegetation classification. Both datasets consist of pixel-aligned aerial orthoimages, topographic maps and reference data at a ground sampling distance of 1 m. For all experiments, the two encoder branches are initialized with weights obtained in a pre-training on ImageNet (Russakovsky et al., 2015), while all remaining network weights are randomly initialized based on variance scaling (He et al., 2015). Training is performed using the ADAM optimizer (Kingma & Ba, 2015) with standard parameters and a learning rate of 10⁻² until the validation F1-score does not improve for 30 epochs. For both datasets, multi-modal predictions are compared to uni-modal predictions. Furthermore, attention-based feature extraction is compared to convolution-based feature extraction. The achieved mean F1-scores are highest for the multi-modal variants of the classifier: on the building dataset, convolutions yield the higher score of 90.1% (multi-modal, attention: 86.9%; aerial, convolution: 89.2%; map, convolution: 84.6%), whereas attention is to be preferred for vegetation classification, resulting in a mean F1-score of 83.0% (multi-modal, convolution: 82.2%; aerial, attention: 82.1%; map, attention: 54.0%).
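The stopping criterion described above (training ends once the validation F1-score has not improved for 30 epochs) can be sketched as a small helper class. The class name and interface below are illustrative assumptions, not code from the authors:

```python
class EarlyStopping:
    """Signal that training should stop once the monitored validation
    F1-score has not improved for `patience` consecutive epochs
    (30 in the experiments described above)."""

    def __init__(self, patience: int = 30):
        self.patience = patience
        self.best = float("-inf")      # best validation F1 seen so far
        self.since_best = 0            # epochs since the last improvement

    def step(self, val_f1: float) -> bool:
        """Record one epoch's validation F1; return True to stop training."""
        if val_f1 > self.best:
            self.best = val_f1
            self.since_best = 0
        else:
            self.since_best += 1
        return self.since_best >= self.patience
```

In a training loop, `stopper.step(val_f1)` would be evaluated after each validation pass, breaking out of the loop when it returns `True`.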

How to cite: Dorozynski, M., Rottensteiner, F., Dahms, T., and Hovenbitzer, M.: Leveraging multi-modal classification of historical aerial images and topographic maps to derive past land cover, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-8714, https://doi.org/10.5194/egusphere-egu25-8714, 2025.