Automatic identification of ensembles of critical futures in large datasets

Amal Sarfraz; Charles Rougé; Lyudmila Mihaylova; Jonathan Lamontagne; Abigail Birnbaum; Flannery Dolan

doi:https://doi.org/10.5194/egusphere-egu24-15767

Session NH10.6

EGU24-15767, updated on 09 Mar 2024


EGU General Assembly 2024

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.


Amal Sarfraz¹,³,⁵, Charles Rougé¹, Lyudmila Mihaylova², Jonathan Lamontagne³, Abigail Birnbaum³, and Flannery Dolan⁴


¹Department of Civil and Structural Engineering, The University of Sheffield, Sheffield, United Kingdom of Great Britain (asarfraz1@sheffield.ac.uk)
²Department of Automatic Control and Systems Engineering, The University of Sheffield, Sheffield, United Kingdom of Great Britain
³Department of Civil and Environmental Engineering, Tufts University, Medford, Massachusetts, United States
⁴RAND Corporation, Santa Monica, California, United States
⁵Institute of Environmental Sciences and Engineering, School of Civil and Environmental Engineering, National University of Sciences and Technology, Islamabad, Pakistan

In climate risk modelling, the growing trend of simulating large ensembles is driven by the need to understand a wide range of possible future scenarios. This approach generates vast datasets, which presents a challenge: identifying the most critical scenarios, those that could have significant impacts. While mainstream data patterns offer general insights, outliers provide unique perspectives and point to areas for further investigation. However, focusing on single outliers is not optimal. Instead, analysing groups of outliers enables a more comprehensive exploration of patterns across multiple plausible future outcomes.

In this context, we introduce the term "ensemble of outliers" to describe a group of data points that deviate significantly from the mean of the dataset. An ensemble of outliers can help uncover underlying patterns and highlight areas for deeper exploration. Once identified, these ensembles can possess distinct properties and indicate phenomena that are not represented in the rest of the dataset.

Our research proposes a new method to address the challenge of identifying these ensembles of outliers within large datasets. Our methodology, Mahalanobis distance-based Ensemble of Outlier Detection (MEOD), combines Gaussian Mixture Models for probabilistic clustering with enhanced Mahalanobis distance-based statistical analysis to identify an ensemble of outliers in complex, large datasets. MEOD's efficiency is validated through extensive testing on thousands of synthetic datasets, encompassing diverse configurations of both the dataset and the characteristics of the ensemble of outliers. The results indicate a high degree of accuracy for MEOD, with an average purity of 99.65% and an average F1 score of 0.92.
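As a rough illustration of how Gaussian mixture clustering can be paired with a Mahalanobis distance test, the sketch below fits a mixture with scikit-learn, treats the most populated cluster as the mainstream of the data, and flags every cluster whose centre lies beyond a chi-square cutoff from it. This is not the MEOD algorithm: the helper name `flag_outlier_clusters`, the "largest cluster is the bulk" rule, the component count and the 0.999 quantile are all assumptions made for the example, and the "enhanced" distance analysis mentioned above is not reproduced.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import f1_score
from sklearn.mixture import GaussianMixture


def flag_outlier_clusters(X, n_components=3, quantile=0.999, random_state=0):
    """Flag points in GMM clusters whose centre lies far from the bulk.

    Hypothetical helper for illustration only; all defaults are assumptions.
    Returns a boolean mask (True = member of a flagged cluster) and the
    hard cluster labels.
    """
    X = np.asarray(X, dtype=float)
    gmm = GaussianMixture(
        n_components=n_components, n_init=5, random_state=random_state
    )
    labels = gmm.fit_predict(X)

    # Assumption: the most populated cluster represents the mainstream data.
    counts = np.bincount(labels, minlength=n_components)
    bulk = int(np.argmax(counts))
    mu = X[labels == bulk].mean(axis=0)
    cov = np.atleast_2d(np.cov(X[labels == bulk], rowvar=False))
    cov_inv = np.linalg.inv(cov)

    # Empirical centre of each cluster (empty clusters fall back to the bulk).
    centres = np.array(
        [X[labels == k].mean(axis=0) if counts[k] else mu
         for k in range(n_components)]
    )

    # Squared Mahalanobis distance of each centre from the bulk distribution.
    diff = centres - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

    # For Gaussian data the squared distance follows a chi-square law with
    # as many degrees of freedom as there are features.
    cutoff = chi2.ppf(quantile, df=X.shape[1])
    outlier_clusters = np.flatnonzero(d2 > cutoff)
    return np.isin(labels, outlier_clusters), labels


if __name__ == "__main__":
    # Synthetic example: 950 mainstream points plus a compact group of 50.
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(0.0, 1.0, (950, 2)),
                   rng.normal(10.0, 0.3, (50, 2))])
    truth = np.r_[np.zeros(950, dtype=bool), np.ones(50, dtype=bool)]

    flagged, _ = flag_outlier_clusters(X, n_components=2)
    print("flagged points:", int(flagged.sum()))
    print("F1 score:", f1_score(truth, flagged))
```

Flagging whole clusters rather than individual points is what makes the result a group of outliers in this sketch; a pointwise Mahalanobis test would instead return isolated extremes scattered around the main distribution.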

To demonstrate the utility of MEOD for climate risk assessment, we apply our method to a large dataset of future agricultural production scenarios for the Indus River Basin (IRB). This dataset was generated using an Integrated Assessment Model, the Global Change Analysis Model (GCAM), and encompasses 3,000 scenarios outlining potential socioeconomic, water supply-demand, and land use changes up to the century's end. Our goal is to use MEOD to identify and analyse a critical ensemble of outliers that significantly drives water scarcity in the IRB's agricultural sector. We identified 150 scenarios as an ensemble of outliers, characterised by their unique socioeconomic attributes and agricultural practices.

These scenarios predominantly fall into two categories: 1) those involving increased competition for resources due to regional disparities and 2) those incorporating a mix of sustainable and conventional agricultural practices. This dichotomy highlights both overuse and intensive water resource utilisation scenarios, signalling significant agricultural withdrawals and high scarcity risks.

Our findings demonstrate MEOD's efficiency as a robust, versatile tool for analysing complex, large-scale datasets, providing nuanced insights into intricate data patterns.

How to cite: Sarfraz, A., Rougé, C., Mihaylova, L., Lamontagne, J., Birnbaum, A., and Dolan, F.: Automatic identification of ensembles of critical futures in large datasets, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-15767, https://doi.org/10.5194/egusphere-egu24-15767, 2024.

Supplementary materials version 1, uploaded on 17 Apr 2024.