EGU25-2634, updated on 14 Mar 2025
https://doi.org/10.5194/egusphere-egu25-2634
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Wednesday, 30 Apr, 16:50–17:00 (CEST)
Room 2.17
Comparison of Models for Missing Data Imputation in Environmental Data: A Case Study of PM-2.5 in Seoul
Ju-Yong Lee1, Seung-Hee Han1, Kwon Jang2, Kyung-Hui Wang1, Hui-Young Yun2, and Dae-Ryun Choi2
  • 1Department of Environmental Engineering, Anyang University, Anyang, Gyeonggi, Republic of Korea. (juyong214@naver.com)
  • 2Department of Environmental and Energy Engineering, Anyang University, Anyang, Gyeonggi, Republic of Korea.

PM-2.5 is a critical pollutant for air quality evaluation and public health policymaking, necessitating accurate data for reliable analysis. However, environmental data often contain missing values due to equipment malfunctions or extreme weather conditions, which undermine the credibility of analysis and predictions. In particular, the frequent fluctuations of PM-2.5 levels in Seoul highlight the importance of addressing missing data issues.

This study systematically compares the performance of several missing data imputation methods for PM-2.5 data in Seoul, aiming to identify the optimal approach for medium- and long-term predictions. Unlike prior studies, it artificially generates missing data during both high- and low-concentration periods and evaluates the imputation results against the withheld observations, which enhances practical applicability.

A range of statistical and machine learning-based methods were applied to impute missing data: forward fill (FFILL), k-nearest neighbors (KNN), multiple imputation by chained equations (MICE), SARIMAX, a deep neural network (DNN), and long short-term memory (LSTM). The performance of each method was evaluated over 6-hour, 12-hour, and 24-hour intervals using metrics such as root mean square error (RMSE), mean absolute error (MAE), and correlation coefficients. The experimental design incorporated real-world air quality conditions by selecting data from periods of significant PM-2.5 variation.
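The evaluation idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the hourly series is synthetic (a hypothetical sine-shaped daily cycle standing in for PM-2.5), the gap position is arbitrary, the function names are illustrative, and only FFILL is shown, scored with RMSE and MAE on the artificially removed span. The correlation coefficient is left out because a single forward-filled gap is constant, so its correlation with the truth is undefined.

```python
import math

def make_series(n_hours=240):
    """Synthetic hourly series with a daily cycle (hypothetical stand-in for PM-2.5)."""
    return [25.0 + 10.0 * math.sin(2 * math.pi * h / 24.0) for h in range(n_hours)]

def mask_gap(values, start, length):
    """Return a copy with `length` consecutive values from `start` set to None."""
    masked = list(values)
    for i in range(start, start + length):
        masked[i] = None
    return masked

def forward_fill(values):
    """Replace each None with the most recent observed value (FFILL)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def rmse(truth, pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def mae(truth, pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def evaluate_ffill(values, start, length):
    """Score FFILL on the artificially removed span only."""
    filled = forward_fill(mask_gap(values, start, length))
    truth = values[start:start + length]
    pred = filled[start:start + length]
    return rmse(truth, pred), mae(truth, pred)

series = make_series()
# Gap lengths of 6, 12, and 24 hours; the start index 100 is an arbitrary choice.
results = {gap: evaluate_ffill(series, start=100, length=gap) for gap in (6, 12, 24)}
for gap, (r, m) in results.items():
    print(f"gap={gap:2d} h  RMSE={r:.2f}  MAE={m:.2f}")
```

Other imputation methods could be compared by swapping `forward_fill` for another function with the same input and output shape and scoring it on the same masked spans.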

KNN demonstrated balanced performance across all time intervals and yielded the best results for medium- and long-term predictions. FFILL showed excellent accuracy over short time intervals but exhibited declining performance as the interval length increased. Conversely, deep learning-based models, such as DNN and LSTM, showed relatively poor performance, indicating the need for further optimization to account for the characteristics of time-series data.

This study confirms that KNN is the most suitable method for PM-2.5 missing data imputation due to its simplicity and computational efficiency. These findings enhance the reliability of air quality data analysis and provide a valuable foundation for effective air quality management and policymaking. Furthermore, the results underscore the importance of selecting appropriate imputation methods to improve predictive accuracy and analytical reliability.

This research was supported by the Particulate Matter Management Specialized Graduate Program through the Korea Environmental Industry & Technology Institute (KEITI), funded by the Ministry of Environment (MOE).

 

How to cite: Lee, J.-Y., Han, S.-H., Jang, K., Wang, K.-H., Yun, H.-Y., and Choi, D.-R.: Comparison of Models for Missing Data Imputation in Environmental Data: A Case Study of PM-2.5 in Seoul, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-2634, https://doi.org/10.5194/egusphere-egu25-2634, 2025.