- 1Institute of Astronomy, Geophysics and Atmospheric Sciences, University of São Paulo, São Paulo, Brazil (maria.andrade@iag.usp.br)
- 2State Climate Office of North Carolina, North Carolina State University, Raleigh, USA (kdeavil@ncsu.edu)
- 3Surrey Institute for People-Centred Artificial Intelligence, University of Surrey, Guildford, UK (erick.sperandio@surrey.ac.uk)
- 4Global Centre for Clean Air Research, University of Surrey, Guildford, UK (p.kumar@surrey.ac.uk)
Air pollution is one of the main environmental and public health challenges in urban and rural areas, and it is influenced by a wide range of factors, including traffic, biomass burning, and meteorology. In Brazil, an estimated 326,478 deaths between 2019 and 2021 were attributed to air pollution exposure. About 8,400 deaths per year occur in the Metropolitan Area of São Paulo (MASP), the largest metropolitan area in South America. Mitigating the effects of air pollution requires a detailed understanding of the spatial and temporal distributions of air pollutants at high resolution. We employed a machine learning framework based on Extreme Gradient Boosting (XGBoost) to map particulate matter concentrations (PM2.5 and PM10) across the MASP at a spatial resolution of 300 m × 300 m. In addition, we developed a Ridge regression model to control multicollinearity and ensure stable estimates. We used this model to examine monthly hospitalizations associated with air pollution and heat exposure in the MASP during 2023–2024, a period marked by severe biomass burning and heat waves. The study integrated data from the Environmental Company of the State of São Paulo (CETESB), ERA5 reanalysis, land use and land cover (MapBiomas), emission inventories, terrain roughness and altitude, and hospitalizations (National Health Data Network, DATASUS) from 2022 to 2024. The XGBoost model proved robust on the test set (30% of the data), with R² values of 0.85 for PM2.5 and 0.88 for PM10 and RMSE values of 3.3 µg/m³ and 5.2 µg/m³, respectively. The analysis showed higher pollution levels in densely populated and industrialized areas, such as Guarulhos-Pimentas and Parque Dom Pedro, while less urbanized regions, such as Pico do Jaraguá, had lower concentrations due to meteorological and topographical factors. The Ridge distributed-lag hospitalization model exhibited high explanatory power (R² = 0.88; RMSE = 214 hospitalizations per month).
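The hold-out evaluation reported above (30% of the data reserved for testing, scored with R² and RMSE) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the data are synthetic, the helper names `split_indices`, `rmse`, and `r2` are hypothetical, and an ordinary least-squares fit stands in for the XGBoost model so that the example needs only NumPy. It is not the authors' pipeline.

```python
import numpy as np

def split_indices(n, test_fraction, seed):
    """Shuffle row indices and reserve test_fraction of them for testing."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(n * test_fraction))
    return perm[n_test:], perm[:n_test]  # (train, test)

def rmse(y_true, y_pred):
    """Root-mean-square error, in the units of the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic stand-in data (hypothetical): three predictors, noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = 20.0 + X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=1000)

train_idx, test_idx = split_indices(len(y), test_fraction=0.30, seed=0)

# Stand-in model: least squares with an intercept (NOT XGBoost).
A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

# Score only on rows the model never saw during fitting.
A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
y_pred = A_test @ coef
test_r2 = r2(y[test_idx], y_pred)
test_rmse = rmse(y[test_idx], y_pred)
print(f"test R2 = {test_r2:.2f}, test RMSE = {test_rmse:.2f}")
```

Scoring on held-out rows rather than training rows is what makes the reported R² and RMSE an estimate of performance on unseen data; a random row split like this one does not account for spatial or temporal autocorrelation, which may matter for station data.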
Analysis of chronic cumulative exposure over three months showed that ozone and nitrogen dioxide were the dominant drivers of hospitalizations, associated with increases of approximately 65% and 57%, respectively, in monthly hospitalizations, while PM10 showed a moderate effect (~16%). Carbon monoxide showed no significant association. These findings indicate that photochemical pollution, combined with seasonal and thermal variability, plays a critical role in respiratory morbidity in the MASP, providing a robust quantitative basis for environmental health surveillance and urban air-quality management.
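To make the idea of a Ridge distributed-lag model and a three-month cumulative effect concrete, the following is a minimal sketch under stated assumptions. The abstract does not give the model specification, so the single-pollutant setup, the three monthly lags, the penalty `alpha=1.0`, the synthetic series, and the helper names `lag_matrix` and `ridge_fit` are all hypothetical choices made for illustration.

```python
import numpy as np

def lag_matrix(x, n_lags):
    """Columns are x[t], x[t-1], ..., x[t-n_lags+1]; rows without full history are dropped."""
    x = np.asarray(x, dtype=float)
    rows = len(x) - n_lags + 1
    start = n_lags - 1
    return np.column_stack([x[start - k : start - k + rows] for k in range(n_lags)])

def ridge_fit(X, y, alpha):
    """Closed-form ridge on standardized predictors; returns coefficients on the original scale."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sd
    y_mean = y.mean()
    # Solve (Z'Z + alpha * I) b = Z'(y - mean(y)); the intercept is not penalized.
    beta_z = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ (y - y_mean))
    beta = beta_z / sd
    intercept = y_mean - mu @ beta
    return beta, intercept

# Synthetic monthly series (hypothetical): 36 months of one pollutant.
rng = np.random.default_rng(0)
n_months, n_lags = 36, 3
pollutant = rng.normal(60.0, 10.0, n_months)
X = lag_matrix(pollutant, n_lags)

# Simulated hospitalizations responding to the current and two previous months.
true_w = np.array([2.0, 1.5, 1.0])
y = 1000.0 + X @ true_w + rng.normal(0.0, 5.0, X.shape[0])

beta, intercept = ridge_fit(X, y, alpha=1.0)

# Cumulative effect: change in monthly hospitalizations per unit increase
# in the pollutant sustained over all three lagged months.
cumulative_effect = beta.sum()
print("lag coefficients:", np.round(beta, 2))
print("cumulative effect:", round(float(cumulative_effect), 2))
```

Lagged copies of a real pollutant series are usually strongly correlated with each other and with other pollutants, which is why a penalty such as Ridge helps stabilize the individual lag coefficients; summing them gives one way to express a cumulative association over the lag window.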
How to cite: Franco, M. A., Cruz, D. D., Santibañez, J. A. P., Fernandes, K., Nascimento, E. G. S., Kumar, P., and Andrade, M. D. F.: Machine Learning and Statistical Modeling of Air Pollution and Hospitalizations in South America’s Largest Metropolitan Area, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-13242, https://doi.org/10.5194/egusphere-egu26-13242, 2026.