- Indian Institute of Technology Kharagpur, Energy and Urban Research Group, Ranbir and Chitra Gupta School of Infrastructure Design and Management, Kharagpur, India
Semantic segmentation underpins a wide range of practical applications, such as urban planning, climate modeling, and environmental protection, all of which have direct socio-economic implications. However, the accelerating densification of metropolitan regions in developing countries complicates accurate mapping of fine-scale urban land uses: three-band optical imagery often fails to capture spectral variability, and CNN-based models have limited capacity to establish spatial and inter-band relationships. To address these limitations, we propose a multi-modal architecture built on a SegFormer-B2 backbone. The pipeline integrates auxiliary datasets, namely a digital elevation model (DEM) for surface information and shortwave infrared (SWIR) bands for capturing water absorption characteristics, with an ancillary dataset of built-up layers for enhanced urban boundary delineation, along with multi-temporal false-color composites from LISS-4 and Sentinel-2 over the Bengaluru region. The proposed framework combines convolutional feature extraction with transformer attention to jointly learn local spectral–spatial patterns and global cross-band dependencies. Attention-guided up-sampling, a hybrid loss function, and cross-attention modules are incorporated to strengthen feature fusion across heterogeneous modalities by linking the complementary multi-band information in the auxiliary and ancillary datasets. Empirical evaluation shows consistent qualitative improvement and higher overall accuracy, with substantial gains for barren land when SWIR is incorporated and for vegetation when the DEM is integrated. These results indicate that the proposed framework mitigates spectral insufficiency and spatial ambiguity, and it outperforms baseline models. Overall, the proposed approach offers a scalable and transferable solution for private developers and government agencies seeking robust, fine-resolution mapping to support a sustainable and structured urban environment.
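The cross-attention fusion mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify which modality supplies the queries, the projection sizes, or how the fused features are combined. Here we assume that tokens from the optical composite act as queries, tokens from an auxiliary stack (for example SWIR and DEM features) act as keys and values, and the attended result is added back to the optical tokens as a residual. All names (`cross_attention`, `w_q`, `w_k`, `w_v`) and the random weights are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(optical, auxiliary, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    optical:   (N, C_opt) tokens from the optical branch (queries, assumed)
    auxiliary: (M, C_aux) tokens from the auxiliary branch (keys and values, assumed)
    Returns the attended features (N, d) and the attention weights (N, M).
    """
    q = optical @ w_q                           # (N, d)
    k = auxiliary @ w_k                         # (M, d)
    v = auxiliary @ w_v                         # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (N, M) similarity scores
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v, weights


# Toy example: 16 spatial tokens, 8 optical channels, 3 auxiliary channels.
rng = np.random.default_rng(0)
n_tokens, c_opt, c_aux, d = 16, 8, 3, 8
optical = rng.normal(size=(n_tokens, c_opt))
auxiliary = rng.normal(size=(n_tokens, c_aux))

# Hypothetical projection matrices; in a trained model these are learned.
w_q = rng.normal(size=(c_opt, d))
w_k = rng.normal(size=(c_aux, d))
w_v = rng.normal(size=(c_aux, d))

attended, weights = cross_attention(optical, auxiliary, w_q, w_k, w_v)

# Residual fusion (an assumption): requires d == c_opt so the shapes match.
fused = optical + attended
```

In this sketch each optical token gathers information from every auxiliary token, which is one way a model could relate surface or SWIR cues to the spectral pattern at a given location.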
Keywords: Urban mapping, Deep learning architecture, Spectral feature extraction, Performance optimization
How to cite: Kumar, V. and Haridas Aithal, B.: Contextual Aware Hybrid Deep learning framework: Assessment with Auxiliary and Ancillary Data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-1347, https://doi.org/10.5194/egusphere-egu26-1347, 2026.