EGU24-6107, updated on 08 Mar 2024
https://doi.org/10.5194/egusphere-egu24-6107
EGU General Assembly 2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

A multi-modal high spatial resolution aerial imagery scene classification model with visual enhancement

Lin He1,2, Yi Lin1,2, and Yufei Song1,2
  • 1College of Surveying and Geo-Informatics, Tongji University
  • 2Research Center for Remote Sensing Technology Application of Tongji University

Remote sensing image scene classification annotates semantic categories for image regions that cover multiple land cover types, reflecting the spatial aggregation of social resources among ground objects; it is one of the more challenging remote sensing interpretation tasks, as it requires algorithms to understand images at the scene level. Extracting scene-level semantic information with deep neural networks is currently an active research direction. Compared with other algorithms, deep neural networks capture semantic information in images more effectively and thus achieve higher classification accuracy in applications such as urban planning. In recent years, multi-modal models, typified by image-text models, have achieved satisfactory performance on downstream tasks. Introducing "multi-modality" into remote sensing research should not be limited to the use of multi-source data; more importantly, it concerns the encoding of diverse data and the deep features extracted from large data volumes.

Therefore, building on an image-text matching model, we establish a multi-modal scene classification model (Fig. 1) for high spatial resolution aerial images, in which image features dominate and text facilitates the representation of those image features. The algorithm first applies self-supervised learning to the visual model, aligning the representation domain of image features learned from natural images with that of our particular dataset, which improves the visual model's feature extraction on aerial survey images. The features produced by the pre-trained image encoder and the text encoder are then further aligned, and a subset of the image encoder's parameters is iteratively updated during training. A classifier at the end of the model performs the scene classification task.

Experiments show that our algorithm significantly improves scene classification of aerial survey images compared with single visual models. The proposed model achieved precision and recall above 90% on the test split of the 27-category high spatial resolution aerial survey image dataset we built (Fig. 2).
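The abstract gives no implementation details, but the described setup (a frozen text branch, a partially unfrozen image encoder, and a classification head over 27 scene classes) can be illustrated with a minimal PyTorch-style sketch. All names, the embedding dimension, and the choice of which encoder layers remain trainable below are illustrative assumptions, not the authors' code.

import torch.nn as nn
import torch.nn.functional as F

class MultiModalSceneClassifier(nn.Module):
    """Sketch of an image-feature-dominated multi-modal classifier,
    assuming a CLIP-like image-text backbone."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512, num_classes=27):
        super().__init__()
        self.image_encoder = image_encoder  # visual backbone, adapted via self-supervision
        self.text_encoder = text_encoder    # text branch of the image-text model
        # The text branch stays frozen; only part of the image encoder is updated.
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        for name, p in self.image_encoder.named_parameters():
            # Assumption: unfreeze only the last transformer blocks.
            p.requires_grad = name.startswith(("blocks.10", "blocks.11"))
        # Classification head over the (text-facilitated) image features.
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, images, class_prompt_tokens):
        img = F.normalize(self.image_encoder(images), dim=-1)              # (B, D)
        txt = F.normalize(self.text_encoder(class_prompt_tokens), dim=-1)  # (C, D)
        align_logits = img @ txt.t()         # image-text alignment (auxiliary signal)
        class_logits = self.classifier(img)  # image-feature-dominated prediction
        return class_logits, align_logits

In such a setup, the alignment logits would serve as an auxiliary supervision signal during training, while the linear head produces the final scene prediction; only the unfrozen image-encoder parameters and the classifier receive gradient updates.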

Fig. 1. Diagram of the proposed model structure. Blue boxes are associated with the image, green boxes with the text, and red boxes with both image and text.

Fig. 2. Samples from our high spatial resolution aerial survey image dataset.

How to cite: He, L., Lin, Y., and Song, Y.: A multi-modal high spatial resolution aerial imagery scene classification model with visual enhancement, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-6107, https://doi.org/10.5194/egusphere-egu24-6107, 2024.