Virtual aggregations to improve scientific ETL and data analysis for datasets from the Earth System Grid Federation
- Instituto de Física de Cantabria (IFCA), CSIC - Universidad de Cantabria, Santander, Spain
The ESGF Virtual Aggregation (EVA) is a new data workflow approach that aims to advance the sharing and reuse of scientific climate data stored in the Earth System Grid Federation (ESGF). The ESGF is a global infrastructure and network of internationally distributed research centers that together work as a federated data archive, supporting the distribution of global climate model simulations of past, present and future climate. The ESGF provides modeling groups with nodes for publishing and archiving their model outputs, making them accessible to the climate community at any time. The standardization of model output in a specified format, together with its collection, archival and access through the ESGF data replication centers, has facilitated multi-model analyses. As a result, the ESGF has become the most relevant distributed data archive for climate data, hosting the data for international projects such as CMIP and CORDEX. As of 2022, it holds more than 30 PB of data distributed across research institutes around the globe, and it is the reference archive for the Assessment Reports (AR) on Climate Change produced by the Intergovernmental Panel on Climate Change (IPCC). However, explosive data growth has confronted the climate community with a scientific scalability issue. Conceived as a distributed data store, the ESGF infrastructure is designed to keep file sizes manageable for both sysadmins and end users. Use cases in scientific research, by contrast, often involve calculations on datasets spanning multiple variables, the whole time period and multiple model ensembles, and therefore many individual files. In this context, EVA extends the federation's capabilities beyond file search and download by providing out-of-the-box remote climate data analysis over analysis-ready, virtually aggregated climate datasets, built on top of the federation's existing software stack.
In this work we show an analysis that serves as a test case for the viability of the data workflow and provides the basis for discussions on the future of the ESGF infrastructure, contributing to the debate on the set of reliable core services upon which the federation should be built.
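The notion of a virtual aggregation can be illustrated with a minimal sketch. It is not the EVA implementation: the class names, the file URLs and the restriction to a single time axis are assumptions made for illustration only. The sketch shows how several per-period files can be presented as one logical dataset, with a request on the global time axis translated into per-file index ranges, without copying or merging any data.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FileSegment:
    """One physical file holding a contiguous chunk of the time axis (hypothetical)."""
    url: str
    n_times: int


class VirtualTimeAggregation:
    """Presents many files as one logical time axis; no data is read or copied."""

    def __init__(self, segments: Sequence[FileSegment]) -> None:
        self.segments = list(segments)
        self.offsets: List[int] = []  # global index of each file's first time step
        total = 0
        for seg in self.segments:
            self.offsets.append(total)
            total += seg.n_times
        self.n_times = total

    def resolve(self, start: int, stop: int) -> List[Tuple[str, int, int]]:
        """Translate the global range [start, stop) into (url, local_start, local_stop)."""
        if not 0 <= start <= stop <= self.n_times:
            raise IndexError("requested range is outside the aggregated time axis")
        plan: List[Tuple[str, int, int]] = []
        if start == stop:
            return plan
        # Locate the file that contains the first requested time step.
        i = bisect_right(self.offsets, start) - 1
        while i < len(self.segments) and self.offsets[i] < stop:
            seg = self.segments[i]
            base = self.offsets[i]
            lo = max(start, base) - base
            hi = min(stop, base + seg.n_times) - base
            if lo < hi:
                plan.append((seg.url, lo, hi))
            i += 1
        return plan


# Hypothetical file names; a real aggregation would list actual ESGF files.
aggregation = VirtualTimeAggregation([
    FileSegment("tas_1850-1859.nc", 120),
    FileSegment("tas_1860-1869.nc", 120),
    FileSegment("tas_1870-1874.nc", 60),
])
plan = aggregation.resolve(100, 250)
```

Here `plan` lists, for each file touched by the request, the local index range to read. A client working against the aggregated view only sees a single time axis of length `aggregation.n_times`, while the underlying files keep their original, manageable sizes.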
Acknowledgements
This work has been developed with support from IS-ENES3, which is funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 824084.
This work has also been developed with support from CORDyS (PID2020-116595RB-I00), funded by MCIN/AEI/10.13039/501100011033.
How to cite: Cimadevilla, E., Iturbide, M., and Cofiño, A. S.: Virtual aggregations to improve scientific ETL and data analysis for datasets from the Earth System Grid Federation, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-16117, https://doi.org/10.5194/egusphere-egu23-16117, 2023.