Correcting PM2.5 data from low-cost sensors using machine learning techniques

Pratyush Agrawal; Srishti Srishti; Padmavati Kulkarni; Hrishikesh Gautam; Meenakshi Kushwaha; Sreekanth Vakacherla; Pratima Singh

doi:https://doi.org/10.5194/egusphere-egu23-3110

Session AS5.13

EGU23-3110

EGU General Assembly 2023

© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

Pratyush Agrawal¹, Srishti Srishti¹, Padmavati Kulkarni¹, Hrishikesh Gautam¹, Meenakshi Kushwaha², Sreekanth Vakacherla¹, and Pratima Singh¹

¹Center for Study of Science, Technology, and Policy, Bengaluru, India
²ILK Labs, Bengaluru, India

Low-cost sensors (LCSs) for measuring air quality have become popular because of their portability, affordability, and ease of operation. However, LCS data often have accuracy and bias issues that need to be addressed before the data can be used for research. LCSs are therefore collocated with reference-grade instruments, and various statistical and machine learning (ML) approaches are used to correct the observed bias. In this study, collocation experiments were conducted in Bengaluru, India, for about nine months (December 2021 to August 2022). We used nine PM2.5 LCSs that were collocated with a beta attenuation monitor (BAM) certified by the United States Environmental Protection Agency (USEPA). Hourly averaged data from the LCSs and the BAM were used to train various ML correction models. The LCSs included in the study (Airveda, Atmos, Prana Air, BlueSky, Aurassure, Aerogram, PurpleAir, and Prkruti) are widely available in the Indian market. The ML models include support vector regression (SVR), decision tree (DT), random forest (RF), and eXtreme gradient boosting (XGBoost). For the LCSs used in the study, a total of 170 ML models were built to identify the best-performing correction model for each sensor. Model performance was evaluated using the following metrics: mean absolute error (MAE), root mean square error (RMSE), and normalised RMSE (NRMSE). During the study period, the average hourly BAM concentration was ~32 µg/m³. Hourly averaged PM2.5 from the LCSs and the BAM exhibited a linear relationship. The NRMSE of the raw (uncorrected) LCS PM2.5 with respect to BAM PM2.5 varied between 0.26 and 0.89 across sensors. The Plantower-based LCSs (Atmos I, PurpleAir, and Aerogram) outperformed the other sensors, with the lowest RMSE and NRMSE values. SVR was the best-performing model for correcting raw LCS PM2.5 data for most of the sensors. Compared with the uncorrected LCS PM2.5, the NRMSE of the ML-corrected LCS PM2.5 was 46% to 74% lower, depending on the sensor. As a case study, we also added black carbon (BC) data to our ML models, but performance did not change significantly (RMSE improved by 6%).
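
The evaluation metrics named above can be illustrated with a minimal sketch. The abstract does not state how RMSE is normalised; the version below assumes division by the mean reference (BAM) concentration, which is one common convention. The function names and the sample values are hypothetical and are not taken from the study.

```python
import math


def mae(reference, estimate):
    """Mean absolute error between paired reference and sensor values."""
    return sum(abs(e - r) for r, e in zip(reference, estimate)) / len(reference)


def rmse(reference, estimate):
    """Root mean square error between paired reference and sensor values."""
    squared = sum((e - r) ** 2 for r, e in zip(reference, estimate))
    return math.sqrt(squared / len(reference))


def nrmse(reference, estimate):
    """RMSE divided by the mean reference value (assumed normalisation)."""
    return rmse(reference, estimate) / (sum(reference) / len(reference))


def percent_reduction(before, after):
    """Relative reduction of a metric, in percent."""
    return 100.0 * (before - after) / before


# Made-up hourly PM2.5 values in µg/m³, for illustration only.
bam = [20.0, 30.0, 40.0, 50.0]        # reference instrument
raw = [30.0, 45.0, 60.0, 75.0]        # uncorrected sensor
corrected = [22.0, 29.0, 41.0, 48.0]  # sensor after a hypothetical correction

print("MAE raw:", mae(bam, raw))
print("NRMSE raw:", round(nrmse(bam, raw), 3))
print("NRMSE corrected:", round(nrmse(bam, corrected), 3))
print("NRMSE reduction (%):",
      round(percent_reduction(nrmse(bam, raw), nrmse(bam, corrected)), 1))
```

Normalising by the mean reference concentration makes errors comparable across periods with different pollution levels; normalising by the range or standard deviation of the reference would give different NRMSE values for the same data.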

How to cite: Agrawal, P., Srishti, S., Kulkarni, P., Gautam, H., Kushwaha, M., Vakacherla, S., and Singh, P.: Correcting PM2.5 data from low-cost sensors using machine learning techniques, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-3110, https://doi.org/10.5194/egusphere-egu23-3110, 2023.