- University of Naples Federico II, Department of Earth Sciences, Environment and Resources, Naples, Italy (nicola.scafetta@unina.it)
Air pollutants (PM2.5, PM10, O3, NO2, SO2, and CO) have become a significant environmental concern, particularly in Asian metropolises, and the Indian metropolis of Delhi is a prime example. A key challenge in addressing urban atmospheric pollution is that pollution levels vary significantly over short distances. Various factors, including nearby industries, vehicle traffic, and population density, influence this heterogeneity. Thus, to assess urban pollution accurately, it is essential to install multiple monitoring stations that comprehensively cover an entire city, from its center to its periphery. However, the number of stations monitoring an urban area typically grows only gradually. For instance, the World Air Quality Historical Database currently lists only four stations in Delhi that have been operational since 2014. The number of stations has since increased, and in 2024 the same database recorded 45 operational stations, providing more comprehensive coverage of the city. However, the periods covered by the available pollution records vary from station to station, and within these periods there are numerous missing data points. These inconsistencies between stations introduce statistical artifacts when local pollution data are averaged to produce decade-long records meant to represent the whole city area, making it difficult to assess the actual urban air quality and the effects of policies aimed at reducing urban air pollution. In this study, we propose statistical reconstructions of the daily concentrations of six atmospheric pollutants for all 45 stations in the city of Delhi from 2014 to the present. These reconstructions aim to produce a more consistent database that could better represent the entire city area over 11 years (from January 1, 2014, to January 1, 2025).
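The averaging artifact described above can be illustrated with a minimal sketch. The two stations, their values, and the helper `naive_network_mean` are hypothetical and chosen only for illustration; they are not taken from the Delhi records. Both stations are held constant, so any change in the naive city-wide mean is produced purely by the change in network coverage.

```python
# Hypothetical two-station network; all numbers are made up for illustration.
# Station A (polluted centre) reports in all six periods at a constant level.
# Station B (cleaner periphery) only starts reporting in the fourth period.
station_a = [100.0, 100.0, 100.0, 100.0, 100.0, 100.0]
station_b = [None, None, None, 40.0, 40.0, 40.0]


def naive_network_mean(*stations):
    """Average whichever stations happen to report in each period."""
    means = []
    for values in zip(*stations):
        present = [v for v in values if v is not None]
        means.append(sum(present) / len(present))
    return means


# Naive average over the incomplete network: shows an apparent drop
# although neither station changed.
naive = naive_network_mean(station_a, station_b)

# Fill station B's gap first (here trivially with its own constant level,
# standing in for a proper statistical reconstruction), then average.
station_b_filled = [40.0 if v is None else v for v in station_b]
filled = naive_network_mean(station_a, station_b_filled)

print("naive :", naive)
print("filled:", filled)
```

In this toy case the naive mean steps from 100 to 70 when the second station comes online, while the mean over the gap-filled network stays flat at 70, which is why the records are reconstructed before they are averaged.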
This reconstructed network is then used to compute an ensemble average record that could more realistically represent the daily evolution of air pollution concentrations in the city of Delhi since 2014. To accomplish this network reconstruction, we apply the Regression Learner tool in MATLAB to assess 35 machine learning (ML) regression techniques. We select only those that perform best in modeling the available records and use them to estimate the missing data. The ML regression models that demonstrated superior performance include the Fine Tree (Regression Trees family), the Bagged Trees (Ensembles of Trees family), the Optimizable Ensemble (Ensembles of Trees family), the Fine Gaussian SVM (Support Vector Machine family), the Rational Quadratic (Gaussian Process Regressions family), and the Exponential (Gaussian Process Regressions family). In contrast, our analysis revealed that the commonly used multi-linear regression model underperforms compared to 20 other ML regression models. More generally, the proposed methodology can be applied to all situations that the literature typically addresses with the multi-linear regression model simply because its algorithm is readily available. However, the physical relationships between a given observable and its potential predictors are often nonlinear, rendering the multi-linear regression model suboptimal for such tasks. In the case of the city of Delhi, we demonstrate that the proposed methodology corrects significant biases in the decadal trend for all six network pollution records, and we show that air quality slightly improved from 2014 to 2024.
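The select-then-fill idea can be sketched in a few lines. This is not the authors' MATLAB Regression Learner workflow: the data are synthetic, there is a single hypothetical reference station, and a simple k-nearest-neighbour regressor stands in for the nonlinear learners named above, none of which it reproduces. All function and variable names are assumptions introduced for illustration.

```python
import math
import random


def fit_linear(xs, ys):
    """Ordinary least squares with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regression: a simple nonlinear stand-in."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted model on (xs, ys)."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Synthetic data (assumption): the target station depends nonlinearly
# on a complete reference station, plus Gaussian noise.
rng = random.Random(0)
reference = [float(i) for i in range(200)]
target = [0.005 * x ** 2 + rng.gauss(0.0, 2.0) for x in reference]

# Pretend some target days are missing, then split the observed days
# into training and validation subsets.
missing_days = set(range(10, 200, 25))
observed = [i for i in range(200) if i not in missing_days]
val_days = set(observed[::4])
train_days = [i for i in observed if i not in val_days]

train_x = [reference[i] for i in train_days]
train_y = [target[i] for i in train_days]
val_x = [reference[i] for i in sorted(val_days)]
val_y = [target[i] for i in sorted(val_days)]

# Score every candidate on the held-out days and keep the best one.
candidates = {"linear": fit_linear, "knn": fit_knn}
scores = {name: rmse(fit(train_x, train_y), val_x, val_y)
          for name, fit in candidates.items()}
best_name = min(scores, key=scores.get)

# Refit the selected model on all observed days and fill the gaps.
obs_x = [reference[i] for i in observed]
obs_y = [target[i] for i in observed]
best_model = candidates[best_name](obs_x, obs_y)
filled = {day: best_model(reference[day]) for day in sorted(missing_days)}

print(scores, best_name)
```

Because the synthetic relationship is quadratic, the linear model leaves a large systematic residual on the validation days and the nonlinear candidate is selected. In a real network each station would have several predictor stations and many more candidate models, but the selection logic would be analogous.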
How to cite: Scafetta, N. and Shafi, S.: Optimal reconstruction of incomplete urban pollution records with machine learning regression models: a case study for Delhi, India, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-5078, https://doi.org/10.5194/egusphere-egu25-5078, 2025.