- 1Department of Civil Engineering, National Taiwan University, Taipei City, Taiwan (timfrog0521@gmail.com)
- 2Department of Civil Engineering, National Taiwan University, Taipei City, Taiwan
Extreme rainfall events have become more frequent and intense under climate change, presenting increasing challenges for hydrological monitoring and flood risk management. High-resolution rainfall observations are essential for capturing the spatial and temporal variability of storm events, yet conventional rain-gauge networks suffer from limited spatial coverage and cannot resolve rapidly evolving convective structures. Moreover, high-intensity rainfall events are inherently rare in natural settings, leaving data gaps in the upper rainfall categories. To address this limitation, we integrate natural rainfall observations with controlled artificial rainfall experiments to construct a comprehensive and balanced multi-class dataset covering 0–70 mm/hr at 5 mm/hr intervals.

We develop a multimodal deep learning framework that jointly leverages rainfall imagery and acoustic measurements for rainfall-intensity estimation. The two sensing modalities provide complementary physical information: imagery captures streak morphology, drop density, and spatial distribution patterns, while acoustics encode drop momentum, kinetic energy, and impact signatures. Because neither modality alone fully characterizes rainfall processes across all intensity ranges, combining them gives the model richer and more discriminative features. Two-second audio segments are converted into log-mel spectrograms, and a cross-attention fusion mechanism enables the network to selectively emphasize the most informative cues from each modality for different rainfall categories. Image-based data augmentation, such as horizontal flipping, further expands the training space and improves model generalization.
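To make the processing pipeline concrete, the sketch below illustrates the two ingredients named above: log-mel spectrogram conversion of a two-second audio clip and a cross-attention block that fuses image and audio features. It assumes PyTorch and torchaudio; all names, dimensions, and hyperparameters (sampling rate, mel bands, token sizes) are illustrative placeholders, not the authors' exact configuration.

```python
# Minimal sketch, assuming PyTorch/torchaudio; values are illustrative only.
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 16000  # assumed microphone sampling rate

# 1) Two-second audio segment -> log-mel spectrogram
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=256, n_mels=64
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 2 * SAMPLE_RATE)   # placeholder 2-second clip
log_mel = to_db(mel(waveform))                # shape: (1, 64, time_frames)

# 2) Cross-attention fusion: image tokens query the audio tokens,
#    letting the network weight whichever modality is more informative.
class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, audio_tokens):
        # Queries from the image stream; keys/values from the audio stream.
        fused, _ = self.attn(img_tokens, audio_tokens, audio_tokens)
        return self.norm(img_tokens + fused)  # residual + layer norm

img_tokens = torch.randn(8, 49, 256)     # e.g. a CNN feature map flattened to tokens
audio_tokens = torch.randn(8, 126, 256)  # e.g. projected spectrogram frames
fused = CrossAttentionFusion()(img_tokens, audio_tokens)  # (8, 49, 256)
```

The direction of attention (image as query, audio as key/value) is one plausible design; the reverse, or a symmetric bidirectional variant, would fit the same description in the abstract.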
Compared with previous studies that relied on single-modality inputs or coarse categorical schemes, our framework achieves a substantially finer classification resolution (0–70 mm/hr in 5 mm/hr bins) and discriminates more reliably between adjacent intensity levels. The multimodal architecture consistently outperforms single-modality baselines, with gains that are particularly notable in the moderate-to-heavy rainfall range, underscoring the benefits of genuine cross-modal complementarity. The integration of artificial and natural rainfall further produces a balanced and physically representative dataset that captures both controlled high-intensity scenarios and real-world variability.

Overall, this study demonstrates the potential of multimodal sensing and deep learning to advance rainfall monitoring capabilities. The proposed non-contact, low-cost, and high-resolution approach offers a promising pathway for enhancing rainfall observation in regions with sparse gauge coverage, strengthening flood early-warning systems, and supporting real-time hydrological applications under a changing climate.
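As a note on the class scheme, a 0–70 mm/hr range at 5 mm/hr resolution corresponds to 14 half-open bins. The mapping below is a hypothetical illustration; the exact edge handling is our assumption and is not specified in the abstract.

```python
# Hypothetical intensity-to-class mapping for 0-70 mm/hr at 5 mm/hr resolution.
# Half-open bins [0,5), [5,10), ..., with 70 mm/hr folded into the top class;
# the true edge handling is an assumption, not stated in the abstract.
def intensity_to_class(mm_per_hr: float) -> int:
    return min(int(mm_per_hr // 5), 13)

assert intensity_to_class(0.0) == 0    # bin [0, 5)
assert intensity_to_class(37.5) == 7   # bin [35, 40)
assert intensity_to_class(70.0) == 13  # top bin
```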
How to cite: Lin, C.-C. and Ho, H.-C.: Cross-Attention Multimodal Learning Using Image and Audio for Rainfall Intensity Estimation, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-2660, https://doi.org/10.5194/egusphere-egu26-2660, 2026.