Spatio-temporal clustering on a high-performance computing platform for high-resolution monitoring network analysis
- 1Government of Canada, Environment and Climate Change Canada, North York, Canada (colin.lee@canada.ca)
- 2NILU – Norwegian Institute for Air Research
Air quality monitoring networks provide invaluable data for studying human health, environmental impacts, and the effects of policy changes, but obtaining high-quality data can be costly, because each site in a monitoring network requires instrumentation and skilled operator time. It is therefore important to ensure that each monitor in the network provides unique data, so that the value of the entire network is maximized. Differences in measurement approaches for the same chemical between monitoring stations may also result in discontinuities in the network data. Both of these factors suggest the need for objective, machine-learning methodologies for monitoring network data analysis.
Air quality models are another valuable tool to augment monitoring networks. The models simulate air quality over a large region where monitoring may be sparse. The gridded output from air-quality models thus contains inherent information on the similarity of sources, chemical oxidation pathways and removal processes for chemicals of interest, provided appropriate tools are available to identify these similarities on a gridded basis. The output from these models can be immense, again requiring special, highly optimized tools for post-processing analysis.
Spatiotemporal clustering is a family of techniques that has seen widespread use in air quality research, whereby time series taken at different locations are grouped according to their similarity to one another. Hierarchical clustering is one such algorithm, and it has the advantage of not requiring an a priori assumption about how many clusters there might be (unlike K-means). However, traditional approaches to hierarchical clustering become computationally expensive as the number of time series increases, and the cost becomes prohibitive when the total number of time series to be compared rises above 30,000, even on a supercomputer. Similarly, the comparison and clustering of large numbers of discrete data (such as data from multiple mass spectrometers sampled at high time resolution from a moving laboratory platform) becomes computationally prohibitive using conventional methods.
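To make the grouping step concrete, the following is a minimal sketch of conventional (serial) agglomerative hierarchical clustering on a handful of synthetic time series. It is not the algorithm presented in this study. The dissimilarity measure (1 minus the Pearson correlation), the average-linkage rule, the stopping threshold, and the function names `pearson_distance` and `agglomerate` are all assumptions chosen for illustration. Note that no cluster count is supplied in advance, and that the table of all pairwise distances is what makes this naive approach expensive as the number of series grows.

```python
import math
import random


def pearson_distance(x, y):
    """Dissimilarity of two equal-length series: 1 - Pearson correlation (assumed metric)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)


def agglomerate(series, threshold):
    """Naive average-linkage clustering.

    Merging stops once the two closest clusters are farther apart than
    `threshold`. Returns a list of clusters, each a sorted list of indices.
    """
    n = len(series)
    # Every pair is compared once: n * (n - 1) / 2 distances.
    dist = {
        (i, j): pearson_distance(series[i], series[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
    clusters = [[i] for i in range(n)]

    def average_linkage(a, b):
        total = sum(dist[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters (p < q).
        d, p, q = min(
            (average_linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


# Synthetic example: three noisy sine-like "sites" and three noisy cosine-like "sites".
random.seed(0)
times = [4.0 * math.pi * k / 199 for k in range(200)]
series = [[math.sin(t) + random.gauss(0.0, 0.1) for t in times] for _ in range(3)]
series += [[math.cos(t) + random.gauss(0.0, 0.1) for t in times] for _ in range(3)]

clusters = agglomerate(series, threshold=0.5)
print(clusters)
```

In this toy case the sine-like and cosine-like series are nearly uncorrelated with each other, so the procedure separates them into two groups without being told how many groups to expect.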
In this study we present a high-performance hierarchical clustering algorithm that runs in parallel over many nodes on massively parallel computer systems, allowing efficient clustering of very large monitoring network and model output datasets. The new high-performance program is able to cluster 290,000 annual time series (from either monitoring network data or gridded model output) in 13 hours on 800 nodes. We present here some example results showing how the algorithm can be used to analyse very large datasets, providing new insights into “airsheds” depicting regions of similar chemical origin and history, and into different spatial regimes for nitrogen, sulphur, and base cation deposition. These analyses show how different processes control each species at different potential monitoring site locations, via cluster-generated airshed maps for each species. The efficiency and flexibility of the algorithm allow extremely large datasets to be analysed in hours of wall-clock time instead of weeks or months. The new algorithm is being used as the numerical engine for a new tool for the analysis of EU monitoring network data.
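As a rough illustration of the scale involved, the short calculation below counts the pairwise comparisons implied by the two dataset sizes mentioned above: roughly 4.5 × 10^8 pairs for 30,000 series and roughly 4.2 × 10^10 pairs for 290,000 series. The memory estimate assumes that every pairwise distance is stored once as an 8-byte floating-point value. That storage layout is an assumption made for this sketch and says nothing about how the algorithm presented here actually manages its data.

```python
def pair_count(n):
    """Number of distinct pairs among n time series: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def pairwise_storage_gib(n, bytes_per_value=8):
    """Storage for one distance per pair, in GiB (assumes 8-byte values by default)."""
    return pair_count(n) * bytes_per_value / 2**30


for n in (30_000, 290_000):
    print(f"{n} series: {pair_count(n)} pairs, "
          f"{pairwise_storage_gib(n):.1f} GiB at 8 bytes per distance")
```

Because the pair count grows with the square of the number of series, a roughly tenfold increase in series implies close to a hundredfold increase in comparisons, which is consistent with the need to distribute the work over many nodes.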
How to cite: Lee, C., Makar, P., and Soares, J.: Spatio-temporal clustering on a high-performance computing platform for high-resolution monitoring network analysis, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-8841, https://doi.org/10.5194/egusphere-egu23-8841, 2023.