ESSI2.7 | EDI
Meeting Exascale Computing Challenges with Compression and Pangeo
Convener: Charles Zender | Co-conveners: Tina Odaka, Mario Echeverri, Denise Degen, Daniel Caviedes-Voullième

Presentations: Wed, 25 May, 13:20–18:30 (CEST) | Room 0.51

Chairpersons: Denise Degen, Tina Odaka, Mario Echeverri
13:20–13:25
13:25–13:31 | EGU22-5709 | Virtual presentation
Anne Fouilloux, Yvan Le Bras, and Adele Zaini

Pangeo has been deployed on a number of diverse infrastructures, and learning resources are available, for instance the Pangeo Tutorial Gallery (http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/index.html). However, knowledge of Python is necessary to develop or reuse applications with the Pangeo ecosystem, which hinders its wider adoption and reduces potential interdisciplinary collaborations.

Our main objective is to lower the barriers to using the Pangeo ecosystem, allow everyone to understand the fundamental concepts behind Pangeo, and offer a Pangeo deployment for teaching and developing reproducible, reusable and fully automated workflows.

Most Pangeo tutorials and examples use Jupyter notebooks but the gap between these “toy examples” and real complex applications is still huge: adopting best software practices for Jupyter notebooks and big applications is essential for reuse and automation of workflows.

The Galaxy project is a worldwide community dedicated to making tools, workflows and infrastructures open and accessible to everyone. Each tool in Galaxy has a wrapper describing the tool itself along with the input and output parameters, citations, and possible annotations thanks to the EDAM ontology. Galaxy workflows are also annotated and can contain any kind of Galaxy tools, including interactive tools such as Pangeo notebooks.

Galaxy is also accessible via a web-based interface. The platform is designed to be community and technology agnostic and has gained adoption in various communities, ranging from Climate Science and Biodiversity to Biology and Medicine. 

By combining Pangeo and Galaxy, we provide access to the Pangeo ecosystem to everyone, including those who are not familiar with Python, and we offer fully automated and annotated Pangeo “tools”.

Two main sets of tools are currently available in Galaxy:

  • Pangeo notebooks (synced with the corresponding Pangeo Docker images, https://github.com/pangeo-data/pangeo-docker-images);
  • Xarray tools to manipulate and visualise netCDF data from the Galaxy graphical user interface.

Training material is being developed and  included in the Galaxy Training Network (https://training.galaxyproject.org/):

  • “Pangeo ecosystem 101 for everyone - Introduction to Xarray Galaxy Tools”, where anyone can learn about Pangeo and its main concepts and try it out without using any command lines;
  • “Pangeo Notebook in Galaxy - Introduction to Xarray”: it is very similar to the “Xarray Tutorial” from Pangeo (http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/xarray.htm) but makes use of Galaxy Pangeo notebooks and offers a different entry point to Pangeo.

Galaxy Training Infrastructure as a Service (https://galaxyproject.eu/tiaas.html) is provided at no cost by Galaxy Europe for teachers and instructors. It was used for the FORCeS eScience course “Tools in Climate Science: Linking Observations with Modeling” (https://galaxyproject.eu/posts/2021/11/13/tiaas-anne/), where about 30 students learned about Pangeo (see https://nordicesmhub.github.io/forces-2021/intro.html).

Galaxy Pangeo also contributes to the worldwide online training “GTN Smörgåsbord” (last event 14-18 March 2022, https://gallantries.github.io/posts/2021/12/14/smorgasbord2-tapas/) where everyone is welcome as a trainee, trainer or just observer! This will contribute to democratising Pangeo.

How to cite: Fouilloux, A., Le Bras, Y., and Zaini, A.: Pangeo for everyone with Galaxy, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-5709, https://doi.org/10.5194/egusphere-egu22-5709, 2022.

13:31–13:37 | EGU22-3739 | ECS | On-site presentation
Alejandro Coca-Castro, Scott Hosking, and The Environmental Data Science Community

With the plethora of open data and computational resources available, environmental data science research and applications have accelerated rapidly. There is therefore an opportunity for community-driven initiatives that compile and classify open-source research and applications across environmental systems (polar, oceans, forests, agriculture, etc.). Building upon the Pangeo Gallery, we propose The Environmental Data Science Book (https://the-environmental-ds-book.netlify.app), a community-driven online resource showcasing and supporting the publication of data, research and open-source developments in environmental sciences. The target audience and early adopters are i) anyone interested in open-source tools for environmental science; and ii) anyone interested in reproducible, inclusive, shareable and collaborative AI and data science for environmental applications. Following the FAIR principles, the resource provides features such as guidelines, templates, persistent URLs and Binder to facilitate fully documented, shareable and reproducible notebooks. The quality of the published content is ensured by a transparent reviewing process supported by GitHub-related technologies. To date, the community has successfully published five Python-based notebooks: two on forests, two on wildfires/savannas and one on polar research. The notebooks use the common Pangeo stack, e.g. intake, iris, xarray and hvplot, for interactive visualisation and modelling of environmental sensor data. In addition to continuous feature enhancements of the GitHub repository (https://github.com/alan-turing-institute/environmental-ds-book), we expect to increase inclusivity (multiple languages), diversity (multiple backgrounds) and activity (collaboration and coworking sessions) towards improving scientific software practices in the environmental science community.

How to cite: Coca-Castro, A., Hosking, S., and Community, T. E. D. S.: Environmental Data Science Book: a community-driven resource showcasing open-source Environmental science, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-3739, https://doi.org/10.5194/egusphere-egu22-3739, 2022.

13:37–13:43 | EGU22-13028 | ECS | Virtual presentation
Timothy Lam, Alberto Arribas, Gavin Shaddick, Theo McCaie, and Jennifer Catto

The Pangeo project enables interactive, reproducible and scalable environmental research to be carried out on an integrated data-computational platform. Here we demonstrate a few examples that utilise a Pangeo platform on Microsoft Azure, supported by the Met Office, where global environmental challenges are explored and tackled collaboratively. They include: (1) analysing and quantifying drivers of low rainfall anomalies during boreal summer in Indonesian Borneo using causal inference and causal networks to identify key teleconnections, and their possible changes under a warming climate, which will contribute to seasonal forecasting efforts to strengthen the prevention and control of drought and fire multihazards over peatlands in the study region; (2) quantifying and communicating uncertainty in volcanic ash forecasts; and (3) exploring the cascading effects that follow the degradation and recovery of Caribbean coral reefs.

How to cite: Lam, T., Arribas, A., Shaddick, G., McCaie, T., and Catto, J.: Using a Pangeo platform on Azure to tackle global environmental challenges, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13028, https://doi.org/10.5194/egusphere-egu22-13028, 2022.

13:43–13:49 | EGU22-11729 | Presentation form not yet defined
Julien Le Sommer and Takaya Uchida and the SWOT Adopt-A-Crossover Ocean Model Intercomparison Project Team

With an increase in computational power, ocean models with kilometer-scale resolution have emerged over the last decade. Using these realistic simulations, we have been able to quantify the energetic exchanges between spatial scales and inform the design of eddy parametrizations. The increase in resolution, however, has drastically increased model outputs, making it difficult to transfer and analyze the data. The realism of individual models in representing the energetics down to numerical dissipation has also come into question. Here, we showcase a cloud-based analysis framework proposed by the Pangeo Project that aims to tackle such distribution and analysis challenges. We analyze seven submesoscale-permitting simulations, all on the cloud, at a crossover region of the upcoming SWOT altimeter mission near the Gulf Stream separation. The models used in this study are based on the NEMO, CROCO, MITgcm, HYCOM, FESOM and FIO-COM code bases. The cloud-based analysis framework: i) minimizes the cost of duplicating and storing ghost copies of data, and ii) allows for seamless sharing of analysis results amongst collaborators. In this poster, we will describe the framework and provide preliminary results (e.g. spectra, vertical buoyancy flux, and how it compares to predictions from the mixed-layer instability parametrization). Basin-to-global scale, submesoscale-permitting models are still at an early stage of development; their cost and carbon footprints are also rather large. It would, therefore, benefit the community to compile the different model configurations for future best practices. We also believe that an emphasis on data analysis strategies will be crucial for improving the models themselves.



How to cite: Le Sommer, J. and Uchida, T. and the SWOT Adopt-A-Crossover Ocean Model Intercomparison Project Team: Intercomparison of basin-to-global scale submesoscale-permitting ocean models at SWOT cross-overs, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11729, https://doi.org/10.5194/egusphere-egu22-11729, 2022.

13:49–13:55 | EGU22-7593 | ECS | Virtual presentation
Basile Goussard

NetCarbon, a new French startup, offers farmers a free solution for measuring and monetizing their sequestered carbon to contribute towards carbon neutrality. The solution relies on satellite data (Sentinel-2, Landsat 8 and PlanetScope) and open-source ecosystems such as the Pangeo software stack.

 

The challenge in NetCarbon’s solution is deploying Earth observation insights at scale, while being able to shift between cloud providers or on-premise architectures if needed. Up to now, the best tool for us is Pangeo.

 

Our Pangeo usage will be illustrated through the following three steps:

1) Connecting to satellite data / Extract

2) Processing satellite data at scale / Transform

3) Saving the data within a data warehouse / Load

 

First, some of the building blocks used to search for satellite data through STAC will be shown. The stackstac package will then be used to convert the STAC items into an xarray datacube, allowing researchers and companies to create their own datacubes with all the metadata included.

 

The second part of the presentation covers the computation layer. Computation steps such as filtering by cloud cover, applying cloud masks, computing the land surface temperature, and interpolating will be run; land surface temperature is one of the inputs needed by the NetCarbon algorithm. These steps produce a Dask computation graph, which will be executed at scale in the cloud using Dask and Coiled.

 

To conclude, the output of the processing part (the spatial and temporal mean of the land surface temperature) will be displayed within a notebook and the data will finally be loaded into a data warehouse (Google BigQuery).

 

All the steps will be demonstrated in a reproducible notebook.
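As an illustration of the extract and transform steps, the sketch below turns a STAC search into a lazy, Dask-backed xarray datacube with stackstac and reduces it to a spatio-temporal mean. It is a hedged, minimal example: the STAC endpoint, bounding box, date range and band are illustrative assumptions, not NetCarbon’s production pipeline.

```python
# Minimal sketch (assumed endpoint and area of interest); not NetCarbon's pipeline.
import pystac_client
import stackstac

# Extract: search a public STAC API for Sentinel-2 L2A items (illustrative endpoint)
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[0.9, 44.6, 1.1, 44.8],           # hypothetical area of interest
    datetime="2021-06-01/2021-08-31",
    query={"eo:cloud_cover": {"lt": 20}},   # filter by cloud cover at search time
).item_collection()

# Transform: build a lazy xarray datacube, keeping the STAC metadata as coordinates
cube = stackstac.stack(items, assets=["red", "nir"], resolution=20)

# Spatial and temporal mean as a Dask graph; .compute() runs it on a cluster
mean_red = cube.sel(band="red").mean(dim=("time", "x", "y")).compute()
print(float(mean_red))
```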

How to cite: Goussard, B.: How to turn satellite data to insights at scale, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7593, https://doi.org/10.5194/egusphere-egu22-7593, 2022.

13:55–14:01 | EGU22-4566 | ECS | On-site presentation
Franziska Hellmuth, Anne Claire Mireille Fouilloux, Trude Storelvmo, and Anne Sophie Daloz

Cloud feedbacks are a major contributor to the spread of climate sensitivity in global climate models (GCMs) [1]. Among the most poorly understood cloud feedbacks is the one associated with the cloud phase, which is expected to be modified with climate change [2]. Cloud phase bias, in addition, has significant implications for the simulation of radiative properties and glacier and ice sheet mass balances in climate models.  

In this context, this work aims to expand our knowledge on how the representation of the cloud phase affects snow formation in GCMs. Better understanding this aspect is necessary to develop climate models further and improve future climate predictions. 

This study will compare surface snowfall, ice, and liquid water content from the Coupled Model Intercomparison Project Phase 6 (CMIP6) climate models (accessed through Pangeo) to the European Centre for Medium-Range Weather Forecasts Reanalysis 5 (ERA5) data from 1985 to 2014. We conduct statistical analysis at annual and seasonal timescales to determine the biases in cloud phase and precipitation (liquid and solid) in the CMIP6 models and the potential connection between them.

For the analysis, we use a Jupyter notebook on the CMIP6 analysis (https://github.com/franzihe/eosc-nordic-climate-demonstrator/blob/master/work/), which guides the user step by step. The Pangeo intake package makes it possible to browse the CMIP6 online catalog for the required variables, models, and experiments and to store them as xarray/Dask datasets. Vertical variables on sigma-pressure levels had to be interpolated to the standard pressure levels provided in ERA5. We also interpolated the horizontal and vertical variables to the same horizontal grid resolution before calculating the climatology.
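A minimal sketch of this catalog-browsing step is shown below, assuming the public Pangeo CMIP6 catalog on Google Cloud; the search keys (variables, experiment, member) are illustrative assumptions and may differ from those used in the study.

```python
# Sketch of browsing the Pangeo CMIP6 catalog with intake-esm (illustrative search keys)
import intake

col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
cat = col.search(
    experiment_id="historical",
    table_id="Amon",
    variable_id=["clivi", "prsn"],   # ice water path and snowfall flux (assumed choices)
    member_id="r1i1p1f1",
)

# One lazy xarray/Dask dataset per model and grid
dsets = cat.to_dataset_dict()
ds = dsets[list(dsets)[0]]
print(ds)
```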

A global comparison between the reanalysis (ERA5) and the CMIP6 models shows that the models tend to underestimate the ice water path compared to the reanalysis, even though most of them can reproduce some of the characteristics of liquid water content and snowfall. To better understand the link between biases in cloud phase and surface snowfall rate, we try to find a relationship between ice water path and surface snowfall in the GCMs. Linear regressions within extratropical areas show a positive relationship between ice water content and surface snowfall in the reanalysis data, whereas the CMIP6 models do not show this relationship.

  

[1] Zelinka, M. D., Myers, T. A., McCoy, D. T., Po-Chedley, S., Caldwell, P. M., Ceppi, P., et al. (2020). Causes of higher climate sensitivity in CMIP6 models. Geophysical Research Letters, 47, e2019GL085782. https://doi.org/10.1029/2019GL085782

[2] Bjordal, J., Storelvmo, T., Alterskjær, K., et al. (2020). Equilibrium climate sensitivity above 5 °C plausible due to state-dependent cloud feedback. Nature Geoscience, 13, 718–721. https://doi.org/10.1038/s41561-020-00649-1

 

Github: https://github.com/franzihe 

How to cite: Hellmuth, F., Fouilloux, A. C. M., Storelvmo, T., and Daloz, A. S.: Is there a correlation between the cloud phase and surface snowfall rate in GCMs?, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-4566, https://doi.org/10.5194/egusphere-egu22-4566, 2022.

14:01–14:07 | EGU22-2746 | On-site presentation
Derek O'Callaghan and Sheila McBreen

The expansion of renewable energy portfolios to utilise offshore wind resources is a key objective of energy policies focused on the generation of low carbon electricity. Wind atlases have been developed to provide energy resources maps, containing information on wind speeds and related variables at multiple heights above sea level for offshore regions of interest (ROIs). However, these atlases are often associated with legacy projects, where access to corresponding data products may be restricted preventing further development by third parties. Reliable, long-term observations are crucial inputs to the offshore wind farm area assessment process, with observations typically measured close to the ocean surface using in situ meteorological masts. Remote sensing techniques have been proposed to address resolution and coverage issues associated with in situ measurements, in particular, the use of space-borne Earth Observation (EO) instruments for ocean and sea surface wind estimations. In recent years, a variety of initiatives have emerged that provide public access to wind speed data products, which have potential for application in wind atlas development and offshore wind farm assessment. Combining products from multiple data providers is challenging due to differences in spatial and temporal resolution, product access, and product formats. In particular, the associated large dataset sizes are significant obstacles to data retrieval, storage, and subsequent computation. The traditional process of retrieval and local analysis of a relatively small number of ROI products is not readily scalable to accommodate longitudinal studies of multiple ROIs. 

This work presents a case study that demonstrates the utility of the Pangeo software ecosystem to address these issues in the development of offshore wind speed and power density estimations, increasing wind measurement coverage of offshore renewable energy assessment areas in the Irish Continental Shelf region. The Intake library is used to manage a new data catalog created for this region, consisting of a collection of analysis-ready, cloud-optimized (ARCO) datasets generated using the Zarr format. This ARCO catalog features up to 21 years of available in situ, reanalysis, and satellite observation data products. The xarray and Dask libraries enable scalable catalog processing, including analysis of provided data variables and derivation of new variables as required for candidate wind farm ROIs, avoiding redundant storage and processing requirements for regions not under assessment. Individual catalog datasets have been regridded to relevant spatial grids, or appropriately chunked in time and space, by means of the xESMF and Rechunker libraries respectively. A set of Jupyter notebooks has been created to demonstrate catalog visualization and processing, following the conventions of notebooks in the current Pangeo Gallery. These notebooks provide detailed descriptions of each ARCO dataset, along with an evaluation of wind speed extrapolation and power density estimation methods. The employment of new approaches such as Pangeo Forge for future catalog and dataset creation is also explored. This case study has determined that the Pangeo ecosystem approach is extremely beneficial in the development of open architectures operating on large volumes of disparate data, while also contributing to the objectives of scientific code sharing and reproducibility.
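As an illustration of the kind of operations described above, the following sketch regrids wind components with xESMF and rechunks a Zarr store with Rechunker for time-series access. It is a hedged example: the dataset and variable names, target grid and chunk sizes are assumptions, not the actual ARCO catalog contents.

```python
# Illustrative sketch only: dataset names, variables and chunk sizes are assumptions.
import numpy as np
import xarray as xr
import xesmf as xe
from rechunker import rechunk

ds = xr.open_zarr("arco_wind.zarr")          # hypothetical ARCO catalog entry

# Regrid 100 m wind components onto a common 0.25 degree grid with xESMF
target_grid = xr.Dataset(
    {"lat": (["lat"], np.arange(48.0, 57.0, 0.25)),
     "lon": (["lon"], np.arange(-16.0, -5.0, 0.25))})
regridder = xe.Regridder(ds, target_grid, "bilinear")
ds_regridded = regridder(ds[["u100", "v100"]])   # assumed variable names

# Rechunk the store so that long time series at candidate ROIs load efficiently
plan = rechunk(
    ds[["u100", "v100"]],
    target_chunks={"u100": {"time": ds.sizes["time"], "lat": 25, "lon": 25},
                   "v100": {"time": ds.sizes["time"], "lat": 25, "lon": 25},
                   "time": None, "lat": None, "lon": None},
    max_mem="2GB",
    target_store="arco_wind_timeseries.zarr",
    temp_store="rechunk_tmp.zarr")
plan.execute()
```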

How to cite: O'Callaghan, D. and McBreen, S.: Scalable Offshore Wind Analysis With Pangeo, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-2746, https://doi.org/10.5194/egusphere-egu22-2746, 2022.

14:07–14:13 | EGU22-13556 | Virtual presentation
Mario Echeverri Bautista, Maximilian Maahn, Anton Verhoef, and Ad Stoffelen

Modern machine learning (ML) techniques applied in atmospheric modelling rely heavily on two aspects: good quality and good coverage of observations. Among others, satellite radiometer (SR) measurements (radiances or brightness temperatures) offer an excellent trade-off between these aspects; moreover, SR observations have been providing quite stable Fundamental Climate Data Records (FCDR) for years and are expected to continue to do so in the following decades. This work presents a framework for SR retrievals that uses standard ML packages from the SciPy and Pangeo ecosystems; moreover, our retrieval scheme leverages the powerful capabilities provided by NWPSAF's RTTOV and its Python wrapper.
In terms of retrievals, we stand on the shoulders of Bayesian estimation by using Optimal Estimation (OE), popularized by Rodgers for 1D atmospheric retrievals; we use pyOptimalEstimation (pyOpEst), an open-source package developed by Maahn. pyOptimalEstimation follows an object-oriented design, which makes it portable and highly maintainable.
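For readers unfamiliar with OE, the core update can be illustrated as a single linearised Bayesian step. The NumPy sketch below uses an assumed toy linear forward model and synthetic numbers; it is a conceptual illustration, not the pyOptimalEstimation API or the authors' retrieval setup.

```python
# Conceptual 1D Optimal Estimation step (Rodgers-style); toy linear forward model,
# not the pyOptimalEstimation API used by the authors.
import numpy as np

K = np.array([[1.0, 0.5],
              [0.2, 1.5],
              [0.8, 0.3]])           # assumed Jacobian (3 channels, 2 state variables)
x_a = np.array([280.0, 0.005])       # a priori state
S_a = np.diag([25.0, 1e-6])          # a priori covariance
S_y = np.diag([0.5, 0.5, 0.5])       # observation error covariance
y = np.array([281.2, 279.8, 280.5])  # synthetic brightness temperatures

def forward(x):
    """Toy linear radiative-transfer stand-in (RTTOV would be used in practice)."""
    return K @ x

# Linear OE / single Gauss-Newton step: posterior mean and covariance
S_a_inv, S_y_inv = np.linalg.inv(S_a), np.linalg.inv(S_y)
S_post = np.linalg.inv(K.T @ S_y_inv @ K + S_a_inv)
x_hat = x_a + S_post @ K.T @ S_y_inv @ (y - forward(x_a))
print(x_hat, np.sqrt(np.diag(S_post)))
```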

The contribution presented here covers scientific software design aspects, algorithmic choices, open-source contributions, processing speed and scalability. Furthermore, simple but efficient techniques such as cross-validation were used to evaluate different metrics; for initial testing we have used NWPSAF's model data and observation error covariances from the SR literature.

The open-source and community-development philosophies are two pillars of this work. Open source allows transparent, concurrent and continuous development, while community development brings together domain experts, software developers and scientists in general; these two ideas allow us both to profit from already developed and well-supported tools (e.g. SciPy and Pangeo) and to contribute to others whose applications might benefit. This methodology has been used successfully all over the data science and ML universe, and we believe that the Earth Observation (EO) community would benefit greatly in terms of streamlining the development and benchmarking of new solutions. Practical examples of success can be found in the Pytroll community.

Our work in progress is directly linked to present and near-future requirements of Earth Observation; in particular, the incoming SR data streams (for operational purposes) are increasing fast and by orders of magnitude. Missions like the EUMETSAT Polar System-Second Generation (EPS-SG, 2023) or the Copernicus Imaging Microwave Radiometer (CIMR, 2026) will require scalability and flexibility from the tools that digest such flows of data. We will discuss and show how operational tools can take advantage of the enormous community-based developments and standards and become game changers for EO.

How to cite: Echeverri Bautista, M., Maahn, M., Verhoef, A., and Stoffelen, A.: Atmospheric Retrievals in a Modern Python Framework, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13556, https://doi.org/10.5194/egusphere-egu22-13556, 2022.

14:13–14:19 | EGU22-13193 | On-site presentation
Florian Pinault, Aaron Spring, Frederic Vitart, and Baudouin Raoult

As machine learning algorithms are used more and more prominently in the meteorology and climate domains, the need for reference datasets has been identified as a priority. Moreover, boilerplate code for data handling is ubiquitous in scientific experiments. In order to focus on science, climate/meteorology/data scientists need generic and reusable domain-specific tools. To achieve these goals, we used the plugin-based CliMetLab Python package along with many packages listed by Pangeo.


Our use case consists of providing data for machine learning algorithms in the context of the sub-seasonal to seasonal (S2S) prediction challenge 2021. The data amount to about 2 terabytes of model predictions from three different models. We experimented with providing data in multiple formats: GRIB, NetCDF and Zarr. A Pangeo recipe (using the Python package pangeo_forge_recipes) was used to generate the Zarr data, relying heavily on xarray and Dask for parallelisation. All three versions of the S2S data have been stored in an S3 bucket located on the ECMWF European Weather Cloud (ECMWF-EWC).


CliMetLab aims at providing a simple interface to access climate and meteorological datasets: seamlessly downloading and caching data, converting it to xarray datasets or pandas dataframes, plotting it, and feeding it into machine learning frameworks such as TensorFlow or PyTorch. CliMetLab is open source and still in beta (https://climetlab.readthedocs.io). The main target platform of CliMetLab is Jupyter notebooks. Additionally, a CliMetLab plugin allows shipping dataset-specific code along with a well-defined published dataset. Taking advantage of the CliMetLab tools to minimize boilerplate code, a plugin has been developed for the S2S data as a companion Python package of the dataset.
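A hedged sketch of the intended user experience is given below; the dataset name and plugin arguments are plausible placeholders for the S2S plugin and may not match the published package exactly.

```python
# Sketch of the CliMetLab user experience; dataset name and arguments are assumptions.
import climetlab as cml

# The plugin ships the dataset-specific code; CliMetLab handles download and caching.
ds = cml.load_dataset(
    "s2s-ai-challenge-training-input",   # hypothetical plugin dataset name
    origin="ecmwf", parameter="t2m",     # assumed plugin-specific arguments
)

xds = ds.to_xarray()   # hand the data to xarray/Dask for further processing
print(xds)
```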

How to cite: pinault, F., Spring, A., Vitart, F., and Raoult, B.: CliMetLab and Pangeo use case: Machine learning data pipeline for sub-seasonal To seasonal prediction (S2S), EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13193, https://doi.org/10.5194/egusphere-egu22-13193, 2022.

14:19–14:25 | EGU22-9152 | ECS | On-site presentation
Edouard Gauvrit, Jean-Marc Delouis, Marie-Noëlle Bouin, and François Boulanger

The ocean plays a key role in regulating climate through the dynamical coupling between the sea surface and the atmosphere. Understanding this coupling is a key issue in climate change modeling, but an adapted statistical representation is still lacking. A strong limitation comes from the non-Gaussianities existing inside a wind-over-waves surface layer, where wind flows are constrained by the sea state and the swell. We seek an approach to describe the couplings across scales statistically, which is poorly captured by the power spectrum. Recent developments in data science provide new tools such as the Wavelet Scattering Transform (WST), which gives a low-variance statistical description of non-Gaussian processes and makes it possible to go beyond the power spectrum representation; the latter is blind to position consistency between scales. To develop the methodology, we applied the WST to 1D anemometer time series and 2D atmospheric simulations (LES) and compared the results with well-known statistics. These analyses were made possible thanks to the development of the WOAST (Wavelet Ocean-Atmosphere Scattering Transform) software. Computation of the WST is mathematically embarrassingly parallel, and the time consumption is mainly dominated by data access and memory management. Our preliminary geophysical analysis using WOAST, and its efficiency in extracting unknown properties of intermittent processes, will be shown through a Jupyter notebook example. This work is part of the Astrocean project supported by 80Prime grants (CNRS).
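WOAST itself is not documented in this abstract; as a conceptual stand-in, first-order scattering coefficients can be approximated as time-averaged moduli of wavelet coefficients. The sketch below uses the PyWavelets CWT on a synthetic anemometer series and is only an illustration of that idea, not the WOAST API.

```python
# Conceptual stand-in for first-order scattering coefficients (not the WOAST API):
# time-averaged modulus of continuous wavelet coefficients of a 1D wind series.
import numpy as np
import pywt

fs = 10.0                                  # assumed sampling frequency [Hz]
t = np.arange(0, 600, 1 / fs)
wind = 8 + np.random.randn(t.size)         # synthetic anemometer time series

scales = 2 ** np.arange(1, 8)              # dyadic scales
coeffs, freqs = pywt.cwt(wind, scales, "morl", sampling_period=1 / fs)

# First-order "scattering-like" statistic: modulus averaged over time, one value per scale
s1 = np.abs(coeffs).mean(axis=1)
for f, s in zip(freqs, s1):
    print(f"{f:8.4f} Hz  S1 = {s:.3f}")
```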

How to cite: Gauvrit, E., Delouis, J.-M., Bouin, M.-N., and Boulanger, F.: WOAST : an Xarray package applying Wavelet Scattering Transform to geophysical data, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-9152, https://doi.org/10.5194/egusphere-egu22-9152, 2022.

14:25–14:31 | EGU22-6350 | Virtual presentation
Valerie Garnier, Jean-Francois Le Roux, Justus Magin, Tina Odaka, Pierre Garreau, Martial Boutet, Stephane Raynaud, Claude Estournel, and Jonathan Beuvier

OSDYN (Observations and Simulations of the DYNamics) is a Python library that proposes diagnostics to explore the dynamics of the ocean and its interactions with the atmosphere and waves. Its main strengths are its genericity concerning the different types of netCDF files and its ability to handle large volumes of data.

Dedicated to large data sets such as in-situ and satellite observations and numerical model outputs, OSDYN is particularly powerful at managing different types of Arakawa-C grids and vertical coordinates (NEMO, CROCO, MARS, Symphonie, WW3, Meso-NH). Based on the common Pangeo stack (xarray, dask, xgcm), OSDYN provides data readers that standardize the dimensions, coordinates, and variable names and properties of the datasets. Thus, all Python diagnostics can be shared regardless of the model outputs.

Thanks to progress made with kerchunk and to efforts on transforming metadata at Ifremer's HPC centre (auto-kerchunk), reading a large number of netCDF files is fast and the selection of sub-domains or specific variables is almost immediate.
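The kerchunk-based access pattern can be sketched as follows, assuming a pre-generated combined reference file (e.g. produced by auto-kerchunk); the file and variable names are illustrative assumptions.

```python
# Sketch of reading many netCDF files through kerchunk references (names are assumptions).
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "model_run.json",      # combined kerchunk reference file (hypothetical)
            "remote_protocol": "file",
        },
    },
    chunks={},
)

# Sub-domain and variable selection is lazy and nearly instantaneous
sst = ds["temperature"].isel(level=0).sel(longitude=slice(3, 6), latitude=slice(42, 44))
```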

Jupyter notebooks will detail the implementation of three kinds of analyses. The first one focuses on climatologic issues. In order to compare modeled and satellite sea surface temperatures, the second one addresses spatial interpolation and comparison of data when some may be missing. Lastly, the third analysis provides an overview of how diagnostics describing the formation of deep water masses can be used from different data sets.

How to cite: Garnier, V., Le Roux, J.-F., Magin, J., Odaka, T., Garreau, P., Boutet, M., Raynaud, S., Estournel, C., and Beuvier, J.: OSDYN: a new python tool for the analysis of high-volume ocean outputs., EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-6350, https://doi.org/10.5194/egusphere-egu22-6350, 2022.

14:31–14:37 | EGU22-11595 | ECS | Highlight | Virtual presentation
Justus Magin, Mathiew Woillez, Antoine Queric, and Tina Odaka

In biologging, a small device attached to an animal is used to track its behaviour and environment. This data enables biologists to gain a better understanding of its movement, its preferred habitats, and the environmental conditions it needs to thrive, all of which is essential for the future protection of natural resources. For that, it is crucial to have georeferenced data of biological processes, such as fish migration, over a spatial and temporal range.

Since it is challenging to track fish directly in the water, models have been developed to geolocate fish from the high resolution temperature and pressure time series obtained from the data storage tag. In particular, reconstructing the trajectories of seabass using the temporal temperature changes obtained from biologging devices has been studied since 2010 (https://doi.org/10.1016/j.ecolmodel.2015.10.024). These fish tracks are computed based on the likelihood of the temperature data obtained from the fish tag and reference geoscience data such as satellite observations and ocean physics model output. A high temporal and spatial resolution of the reference data plays a key role in the quality of the fish trajectories. However, the size and accessibility of these data sets as well as the computing power required to process high resolution data remain technical barriers.

As the Pangeo ecosystem has been developed to solve such challenges in geoscience, we can take advantage of it in biologging. We use libraries such as intake, kerchunk, and fsspec to quickly load the data, xarray, pint, and dask to compute, and hvplot and Jupyter to display the results. The Pangeo software stack enables us to easily access the data and compute high-resolution fish tracks in a scalable and interactive manner.
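A heavily simplified, hypothetical sketch of the likelihood step is shown below: for one tag record, a Gaussian likelihood map is computed against a reference temperature field using xarray broadcasting. The variable names, error model and file names are assumptions for illustration, not the published geolocation model.

```python
# Hypothetical sketch of one likelihood step of tag-based geolocation (not the published model).
import numpy as np
import xarray as xr

ref = xr.open_zarr("reference_sst.zarr")["sst"]   # reference temperature field (assumed name)
tag_temp = 14.2                                   # temperature recorded by the fish tag [degC]
sigma = 0.5                                       # assumed observation + model error [degC]

# Gaussian likelihood of the tag record at every grid point, for one time step
field = ref.sel(time="2020-06-15", method="nearest")
likelihood = np.exp(-0.5 * ((field - tag_temp) / sigma) ** 2)
likelihood = likelihood / likelihood.sum()        # normalise to a probability map
```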

How to cite: Magin, J., Woillez, M., Queric, A., and odaka, T.: Pangeo for geolocating fish using biologging data, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11595, https://doi.org/10.5194/egusphere-egu22-11595, 2022.

14:37–14:43 | EGU22-7610 | Virtual presentation
Jacob Tomlinson

There are many powerful libraries in the Python ecosystem for accelerating the computation of large arrays with GPUs. We have CuPy for GPU array computation, Dask for distributed computation, cuML for machine learning, Pytorch for deep learning and more. We will dig into how these libraries can be used together to accelerate geoscience workflows and how we are working with projects like Xarray to integrate these libraries with domain-specific tooling. Sgkit is already providing this for the field of genetics and we are excited to be working with community groups like Pangeo to bring this kind of tooling to the geosciences.
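A minimal sketch of combining these libraries is shown below, using the documented pattern of building a Dask array from CuPy-backed chunks; the array sizes are arbitrary.

```python
# Minimal sketch: a Dask array whose chunks are CuPy (GPU) arrays.
import cupy as cp
import dask.array as da

rs = da.random.RandomState(RandomState=cp.random.RandomState)   # GPU-backed random chunks
x = rs.normal(10, 1, size=(20_000, 20_000), chunks=(2_000, 2_000))

y = (x - x.mean(axis=0)) / x.std(axis=0)   # the task graph is built lazily
result = y.sum().compute()                 # each chunk is computed on the GPU
print(float(result))
```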

How to cite: Tomlinson, J.: Distributing your GPU array computation in Python, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7610, https://doi.org/10.5194/egusphere-egu22-7610, 2022.

14:43–14:50
Coffee break
Chairpersons: Daniel Caviedes-Voullième, Tina Odaka
15:10–15:15
15:15–15:21 | EGU22-13542 | Highlight | Virtual presentation
Jenni Kontkanen, Pekka Manninen, Francisco Doblas-Reyes, Sami Niemelä, and Bjorn Stevens

Climate change will have far-reaching impacts on human and natural systems during the 21st century. To increase the understanding of present and future climate impacts and to build resilience, improved Earth system modelling is required. The European Commission Destination Earth (DestinE) initiative aims to contribute to this by developing high-precision digital twins (DTs) of the Earth. We present our solution for the climate change adaptation DT, one of the two DTs developed during the first phase of DestinE. The objective of the climate change adaptation DT is to improve the assessment of the impacts of climate change and of different adaptation actions at regional and national levels over multi-decadal timescales. This will be achieved by using two storm- and eddy-resolving global climate models, ICON (Icosahedral Nonhydrostatic Weather and Climate Model) and IFS (Integrated Forecasting System). The models will be run at a resolution of a few kilometres on the pre-exascale LUMI and MareNostrum 5 supercomputers, which are flagship systems of the European High Performance Computing Joint Undertaking (EuroHPC JU) network. Following a radically different approach, climate simulations will be combined with a set of impact models, which enables assessing impacts on different sectors and topics, such as forestry, hydrology, cryosphere, energy, and urban areas. The end goal is to create a new type of climate simulation in which user requirements are an integral part of the workflow, so that adaptation solutions can be deployed effectively.

How to cite: Kontkanen, J., Manninen, P., Doblas-Reyes, F., Niemelä, S., and Stevens, B.: Climate change adaptation digital twin to support decision making, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13542, https://doi.org/10.5194/egusphere-egu22-13542, 2022.

15:21–15:27 | EGU22-3285 | ECS | Virtual presentation
Chen Yang, Carl Ponder, Bei Wang, Hoang Tran, Jun Zhang, Jackson Swilley, Laura Condon, and Reed Maxwell

Unprecedented climate change and anthropogenic activities have induced increasing ecohydrological issues. Large-scale hydrologic modeling of water quantity is developing rapidly to seek solutions to those issues. Water-parcel transport (e.g., water age, water quality) is as important as water quantity for understanding the changing water cycle. However, current scientific progress in water-parcel transport at large scales is far behind that in water quantity. The known cause is the lack of powerful tools to handle observations and/or modeling of water-parcel transport at large scales with high spatiotemporal resolution. Lagrangian particle tracking based on integrated hydrologic modeling stands out among other methods because it accurately captures water-parcel movements. Nonetheless, the Lagrangian approach is computationally expensive, hindering its broad application in hydrologic modeling, particularly at large scales. EcoSLIM, a grid-based particle tracking code, calculates water ages (e.g., evapotranspiration, outflow, and groundwater) and identifies source water composition (e.g., rainfall, snowmelt, and initial subsurface water), working seamlessly with the integrated hydrologic model ParFlow-CLM. EcoSLIM is written in Fortran and was originally parallelized with OpenMP (Open Multi-Processing) using shared CPU memory. Here, we accelerate EcoSLIM by implementing it on a distributed, multi-GPU platform using CUDA (Compute Unified Device Architecture) Fortran.

We decompose the modeling domain into subdomains, with each GPU responsible for one subdomain. Particles moving out of a subdomain continue moving temporarily in halo grid cells around the subdomain and are then transferred to the neighboring subdomains. Different transfer schemes are built to balance simulation accuracy and computing speed. Particle transfer leverages CUDA-aware MPI (Message Passing Interface) to improve parallel efficiency. Load imbalance among GPUs, induced by irregular domain boundaries and the heterogeneity of flow paths, is observed. A load-balancing scheme, borrowed from Particle-In-Cell codes and modified based on the characteristics of EcoSLIM, is established: the simulation starts on fewer GPUs than the total scheduled; the manager MPI process activates an idle GPU for a subdomain once the particle number on its current GPU(s) exceeds a specified threshold; finally, all scheduled GPUs are enabled. Tests of the new code from catchment scale (the Little Washita watershed), to regional scale (the North China Plain), and to continental scale (the continental US), using millions to billions of particles, show significant speedups and good parallel performance. The parallelized EcoSLIM is a promising tool for the hydrologic community to accelerate our understanding of the terrestrial water cycle beyond the water balance in a changing world.

How to cite: Yang, C., Ponder, C., Wang, B., Tran, H., Zhang, J., Swilley, J., Condon, L., and Maxwell, R.: Accelerating the Lagrangian particle tracking in hydrologic modeling at continental-scale, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-3285, https://doi.org/10.5194/egusphere-egu22-3285, 2022.

15:27–15:33 | EGU22-7869 | ECS | On-site presentation
Mario Acosta, Venkatramani Balaji, Sergi Palomas, and Stella Paronuzzi

The increase in Earth System Model (ESM) capabilities is strongly linked to the amount of computing power and data storage capacity available. The scientific community requires increased model resolution, large numbers of experiments and ensembles to quantify uncertainty, increased complexity of ESMs (including additional components), and longer simulation periods compared to the current state of climate models. HPC is currently undergoing a major change with the arrival of the next generation of computing systems ('exascale systems'). These challenges cannot be met by mere extrapolation but require radical innovation in several computing technologies and numerical algorithms. Most applications targeting exascale machines require some degree of rewriting to expose more parallelism, and many face severe strong-scaling challenges if they are to effectively progress to exascale, as demanded by their science goals.

 

However, the performance evaluation of the new models along the path to exascale will also become more complex. We need new approaches to ensure that the computational evaluation of this new generation of models is done correctly. Moreover, this evaluation will support the computational analysis during model development and ensure the maximum possible throughput when operational configurations such as CMIP are run.

 

CPMIP metrics are a universal set of metrics easy to collect, which provide a new way to study ESMs from a computational point of view. Thanks to the H2020 project IS-ENES3, we had a unique opportunity to exploit this new set of metrics to create a novel database based on CMIP6 experiments, using the different models and platforms available all across Europe.

 

The results and analysis are presented here, where both differences and similarities among the models can be observed on a variety of different hardware. Moreover, the current database supports different studies, such as comparing different models running similar configurations, or the same model and configuration executed on different platforms. All these possibilities create a unique context that should be exploited by the community to improve the evaluation of the computational performance of ESMs, using this information for future optimizations and to prepare our models for the new exascale platforms. Finally, general prescriptions on how to disseminate the work are given, and the need for the community to adopt CPMIP metrics on both current and next-generation platforms is presented.

How to cite: Acosta, M., Balaji, V., Palomas, S., and Paronuzzi, S.: CPMIP: Computational evaluation of the new era of complex Earth System Models. Multi-model results from CMIP6 and challenges for the exascale computing., EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7869, https://doi.org/10.5194/egusphere-egu22-7869, 2022.

15:33–15:39 | EGU22-3095 | ECS | Presentation form not yet defined
Milan Klöwer, Samuel Hatfield, Matteo Croci, Peter D. Düben, and Tim Palmer

Most Earth-system simulations run on conventional CPUs in 64-bit double-precision floating-point numbers (Float64), although the need for high-precision calculations in the presence of large uncertainties has been questioned. Fugaku, currently the world’s fastest supercomputer, is based on A64FX microprocessors, which also support the 16-bit low-precision format Float16. We investigate the Float16 performance on A64FX with ShallowWaters.jl, the first fluid circulation model that runs entirely with 16-bit arithmetic. The model implements techniques that address precision and dynamic range issues in 16 bits. The precision-critical time integration is augmented to include compensated summation to minimise rounding errors. Such a compensated time integration is as precise as, but faster than, mixed precision with 16 and 32-bit floats. As subnormals are inefficiently supported on A64FX, the very limited range available in Float16 is about 6·10⁻⁵ to 65,504. We develop the analysis number format Sherlogs.jl to log the arithmetic results during the simulation. The equations in ShallowWaters.jl are then systematically rescaled to fit into Float16, using 97% of the available representable numbers. Consequently, we benchmark speedups of up to 3.8x on A64FX with Float16. Adding a compensated time integration, speedups reach up to 3.6x. Although ShallowWaters.jl is simplified compared to large Earth-system models, it shares essential algorithms and therefore shows that 16-bit calculations are indeed a competitive way to accelerate Earth-system simulations on available hardware.
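The compensated (Kahan) summation used for the time integration can be illustrated conceptually in a few lines. The NumPy sketch below accumulates small increments in Float16 and compares naive and compensated sums; it is an illustration of the technique, not the ShallowWaters.jl implementation.

```python
# Conceptual Kahan-compensated accumulation in Float16 (not the ShallowWaters.jl code).
import numpy as np

dt_increment = np.float16(1e-3)          # small tendency added every "time step"
n_steps = 10_000

naive = np.float16(1.0)
comp, c = np.float16(1.0), np.float16(0.0)
for _ in range(n_steps):
    naive += dt_increment                # rounding error accumulates and the sum stalls

    y = dt_increment - c                 # compensated step: re-inject previous rounding error
    t = np.float16(comp + y)
    c = (t - comp) - y                   # new rounding error, carried to the next step
    comp = t

# Exact result is about 11.0; the naive Float16 sum drifts, the compensated one stays close.
print(naive, comp)
```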

How to cite: Klöwer, M., Hatfield, S., Croci, M., Düben, P. D., and Palmer, T.: Fluid simulations accelerated with 16 bits: Approaching 4x speedup on A64FX by squeezing ShallowWaters.jl into Float16, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-3095, https://doi.org/10.5194/egusphere-egu22-3095, 2022.

15:39–15:45 | EGU22-8094 | ECS | On-site presentation
Jan Streffing, Xavier Yepes-Arbós, Mario C. Acosta, and Kim Serradell

Current Earth System Models (ESMs) produce large amounts of data due to the increasing complexity of the simulated processes and the rising spatial resolution demanded of the models. With the exascale era approaching rapidly, efficient I/O will be critical to sustain model throughput. The most commonly adopted approach in ESMs is the use of scalable parallel I/O solutions intended to minimize the overhead of writing data to the storage system. However, I/O servers with inline diagnostics introduce more complexity and many parameters that need to be tuned. This means that it is necessary to achieve an optimal trade-off between throughput and resource usage.

ESMs are usually run on different platforms which might have different architectural specifications: latency, bandwidth, number of cores and memory per node, file system, etc. In addition, a single ESM can run different configurations which require different amounts of resources, resolution, output frequency, number of fields, etc. Since each individual case is particular, the I/O server should be tuned accordingly to each platform and model configuration.

We present an approach to identify and tune a series of important parameters that should be considered in an I/O server. In particular, we focus on the XML Input/Output Server (XIOS) integrated with OpenIFS, an atmospheric general circulation model, as a case study. We tune not only basic parameters such as the number of XIOS servers, the number of servers per node, and the type and frequency of post-processing operations, but also specific ones such as the XIOS buffer size, the splitting of NetCDF files across I/O servers, Lustre striping, and the 2-level server mode of XIOS.

The evaluation of different configurations on different machines proves that it is possible and necessary to find a proper setup for XIOS to achieve a good throughput using an adequate consumption of computational resources. In addition, the results show that the OpenIFS-XIOS integration is performant on the platforms evaluated. This suggests that the integration is portable, though it was initially developed for a specific platform.

How to cite: Streffing, J., Yepes-Arbós, X., C. Acosta, M., and Serradell, K.: Approach to make an I/O server performance-portable across different platforms: OpenIFS-XIOS integration as a case study, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-8094, https://doi.org/10.5194/egusphere-egu22-8094, 2022.

15:45–15:51 | EGU22-2431 | ECS | Presentation form not yet defined
Dmitrii Tolmachev, Andrew Jackson, Philippe Marti, and Giacomo Castiglioni

QuICC is a code designed to solve the equations of magnetohydrodynamics in a full sphere and other geometries. The aim is to provide understanding of the dynamo process that sustains planetary magnetic fields for billions of years by thermally driven convective motion of an electrically conducting fluid. It also aims to provide the first clues as to how and why magnetic fields can undergo reversals. The code must solve the coupled equations of conservation of momentum (the Navier-Stokes equation), Maxwell's equations of electrodynamics and the equation of heat transfer. For accuracy, and to facilitate the imposition of boundary conditions, a fully spectral method is used in which angular variables in a spherical polar coordinate system are expanded in spherical harmonics, and radial variables are expanded in a special polynomial expansion in Jones-Worland polynomials. As a result, the coordinate singularities at the north and south poles and at the origin disappear. The code is designed to run on upward of 10^4 processors using MPI and shows excellent scaling. At the heart of the method is the ability to move between physical and spectral space by a variety of exact transforms: these involve the well-known Fast Fourier Transform (FFT) as well as the Legendre and Jones-Worland transforms. In this talk we will focus on the latest advancements in fast GPU algorithms for these types of discrete transforms. We present an extension to the publicly released VkFFT library (a GPU Fast Fourier Transform library for Vulkan, CUDA, HIP and OpenCL) that allows the calculation of the Discrete Cosine Transforms of types I-IV. This is a very exciting addition to what VkFFT can do, as DCTs are often used in image processing, data compression and numerous other scientific tasks. So far, this is the first publicly available optimized GPU implementation of DCTs. We also present our progress in creating efficient Spherical Harmonic Transforms (SHTs) and radial transforms on GPUs. This talk will present Jones-Worland and associated Legendre polynomial transforms for modern GPU architectures, implemented based on the VkFFT runtime kernel optimization model. Combined, they can be used to create a new era of full-sphere models for planetary simulations in geophysics.
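As a point of reference for the DCT types I-IV mentioned above, the short sketch below builds a CPU reference with SciPy against which a GPU implementation could be validated; it is an illustration only, not the VkFFT API.

```python
# CPU reference for DCT types I-IV (SciPy); useful for validating a GPU implementation.
import numpy as np
from scipy.fft import dct, idct

x = np.random.rand(256)
for dct_type in (1, 2, 3, 4):
    X = dct(x, type=dct_type, norm="ortho")           # forward transform
    x_back = idct(X, type=dct_type, norm="ortho")     # inverse recovers the input
    print(dct_type, np.allclose(x, x_back))
```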

How to cite: Tolmachev, D., Jackson, A., Marti, P., and Castiglioni, G.: Exploiting GPU capability in the fully spectral magnetohydrodynamics code QuICC, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-2431, https://doi.org/10.5194/egusphere-egu22-2431, 2022.

15:51–15:57 | EGU22-10006 | On-site presentation
Daniel Caviedes-Voullième, Jörg Benke, Ghazal Tashakor, Stefan Poll, and Ilya Zhukov

Multiphysics Earth system models are potentially good candidates for the progressive porting of modules to run on accelerator hardware. Typically, these models have an inherently modular design to cope with the variety of numerical formulations and computational implementations required for the range of physical processes they represent. Progressively porting modules or submodels to accelerators such as GPUs implies that models must run on heterogeneous hardware. Foreseeably, exascale systems will make use of heterogeneous hardware, and therefore exploring such heterogeneous configurations early on is both important and challenging.

The Terrestrial Systems Modelling Platform (TSMP) is a scale-consistent, highly modular, massively parallel, fully integrated soil-vegetation-atmosphere modelling system. Currently, TSMP is based on the COSMO atmospheric model, the CLM land surface model, and the ParFlow hydrological model, linked together by means of the OASIS3-MCT library.

Recently, ParFlow was ported to GPUs, enabling the possibility of running TSMP in a heterogeneous configuration, that is, COSMO and CLM running on CPUs and ParFlow running on GPUs. The different computational demands of each submodel inherently result in non-trivial load balancing across the submodels. This has been addressed by studying the performance and scaling properties of the system for specific problems of interest. The new heterogeneous configuration prompts a re-assessment of load balancing, performance and scaling, in order to identify optimal computational resource configurations and to re-evaluate the bottlenecks and inefficiencies that the heterogeneous model system can have.

In this contribution, we present first results on performance and scaling assessment of the heterogeneous TSMP, compared to its performance under homogeneous (CPU-only) configurations. We study strong and weak scaling, for different problem sizes, and evaluate parallel efficiency and power consumption, for homogeneous and heterogeneous jobs on the JUWELS supercomputer, and on the experimental DEEP-Cluster, both at the Jülich Supercomputing Centre. Additionally, we explore profiles and traces of selected cases, both on homogeneous and heterogeneous runs, to identify MPI communication bottlenecks and root causes of the load balancing issue.  

How to cite: Caviedes-Voullième, D., Benke, J., Tashakor, G., Poll, S., and Zhukov, I.: Scaling and performance assessment of TSMP under CPU-only and CPU-GPU configurations, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10006, https://doi.org/10.5194/egusphere-egu22-10006, 2022.

15:57–16:03 | EGU22-11212 | On-site presentation
Zbigniew Piotrowski, Daniel Caviedes Voullieme, Jaro Hokkanen, Stefan Kollet, and Olaf Stein

On the map of ESM research and operational software efforts, a notable area is occupied by mid-size codes that benefit from an established code design and user base and are developed by domain scientists. Contrary to the major operational frameworks and newly established software projects, however, developers of such codes cannot easily benefit from novel solutions providing performance portability, nor do they have access to software engineering teams capable of performing a full code rewrite aimed at novel hardware architectures. While evolving accelerator programming paradigms like CUDA or OpenACC enable reasonably fast progress towards execution on heterogeneous architectures, they do not offer universal portability and immediately impair code readability and maintainability. In this contribution we report on a lightweight embedded Domain-Specific Language (eDSL) approach that enables legacy CPU codes to execute on GPUs. In addition, it is minimally invasive and maximizes code readability and developer productivity. In the implementation, the eDSL serves as a front end for hardware-dependent programming models, such as CUDA. Performance portability can also be achieved efficiently by implementing parallel-execution and memory-abstraction programming models, such as Kokkos, as a backend. We evaluate the adaptation process and computational performance of two established geophysical codes: the ParFlow hydrologic model written in C, and the Fortran-based dwarf encapsulating the MPDATA transport algorithm. Performance portability is demonstrated in the case of ParFlow. We present scalability results on state-of-the-art AMD CPUs and NVIDIA GPUs of the JUWELS Booster supercomputer. We discuss the advantages and limitations of the proposed approach in the context of other direct and DSL-based strategies allowing for the exploitation of modern accelerator-based computing platforms.

How to cite: Piotrowski, Z., Caviedes Voullieme, D., Hokkanen, J., Kollet, S., and Stein, O.: Lightweight embedded DSLs for geoscientific models., EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11212, https://doi.org/10.5194/egusphere-egu22-11212, 2022.

16:03–16:09 | EGU22-10919 | Virtual presentation
Catrin I. Meyer and the PilotLab ExaESM Team

The Pilot Lab Exascale Earth System Modelling (PL-ExaESM) is a “Helmholtz-Incubator Information & Data Science” project and explores specific concepts to enable exascale readiness of Earth System models and associated workflows in Earth System science. PL-ExaESM provides a new platform for scientists of the Helmholtz Association to develop scientific and technological concepts for future generation Earth System models and data analysis systems. Even though extreme events can lead to disruptive changes in society and the environment, current generation models have limited skill, particularly with respect to the simulation of these events. Reliable quantification of extreme events requires models with unprecedentedly high resolution and timely analysis of huge volumes of observational and simulation data, which drastically increase the demand on computing power as well as data storage and analysis capacities. At the same time, the unprecedented complexity and heterogeneity of exascale systems will require new software paradigms for next generation Earth System models as well as fundamentally new concepts for the integration of models and data. Specifically, novel solutions for the parallelisation and scheduling of model components, the handling and staging of huge data volumes and a seamless integration of information management strategies throughout the entire process-value chain from global Earth System simulations to local-scale impact models are being developed in PL-ExaESM. The potential of machine learning to optimize these tasks is investigated. At the end of the project, several program libraries and workflows will be available, which provide the basis for the development of next generation Earth System models.

In the PL-ExaESM, scientists from 9 Helmholtz institutions work together to address 5 specific problems of exascale Earth system modelling:

  • Scalability: models are being ported to next-generation GPU processor technology and the codes are modularized so that computer scientists can better help to optimize the models on new hardware.
  • Load balancing: asynchronous workflows are being developed to allow for more efficient orchestration of the increasing model output while preserving the necessary flexibility to control the simulation output according to the scientific needs.
  • Data staging: new emerging dense memory technologies allow new ways of optimizing I/O operations of data-intensive applications running on HPC clusters and future Exascale systems.
  • System design: the results of dedicated performance tests of Earth system models and Earth system data workflows are analysed in light of potential improvements of the future exascale supercomputer system design.
  • Machine learning: modern machine learning approaches are tested for their suitability to replace computationally expensive model calculations and speed up the model simulations or make better use of available observation data.

How to cite: Meyer, C. I. and the PilotLab ExaESM Team: The Pilot Lab Exascale Earth System Modelling, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10919, https://doi.org/10.5194/egusphere-egu22-10919, 2022.

16:09–16:15 | EGU22-12099 | Virtual presentation
Karsten Peters-von Gehlen, Ivonne Anders, Daniel Heydebreck, Christopher Kadow, Florian Ziemen, and Hannes Thiemann

The German Climate Computing Center (DKRZ) is an established topical IT service provider serving the needs of the German climate science community and their associated partners. At DKRZ, climate researchers have the means available to cover every aspect of the research life cycle, ranging from planning, model development and testing, model execution on the in-house HPC cluster (16 PFlops mainly CPU-based, 130 PB disk storage), data analysis (batch jobs, Jupyter, Freva), data publication and dissemination via the Earth System Grid Federation (ESGF) as well as long-term data preservation either at the project-level (little curation) or in the CoreTrustSeal certified World Data Center for Climate (WDCC) (extensive curation along the FAIR data principles). A plethora of user support services offered by domain-expert staff complement DKRZ’s portfolio.


With the new HPC system coming online in early 2022 and a number of funded and to-be-funded projects exploiting the available computational resources for conducting, e.g., global storm-resolving (grid spacing O(1-3 km)) simulations on climatic timescales, the current interplay of DKRZ’s services needs to be revisited to devise a unified workflow that can handle the upcoming challenges.


This is why the above-mentioned projects will supply a significant amount of funding to conceive a framework that efficiently orchestrates the entire model development, model execution and data handling workflow at DKRZ, in close collaboration with the climate science community.


In this contribution, we will detail our vision of a revamped and versatile ESM orchestration framework at DKRZ. Currently, this vision is based on having the orchestration performed by the Freva System (http://doi.org/10.5334/jors.253), in which users will be able to kick off model compilation, compute and analysis jobs. Furthermore, Freva enables seamless provenance tracking of the entire workflow. Together with the implementation of data publication, long-term archiving and data dissemination workflows, the envisioned system provides a complete package of FAIR Digital Objects (FDOs) to researchers and allows for reproducibility, transparency and reduction of data redundancy.

How to cite: Peters-von Gehlen, K., Anders, I., Heydebreck, D., Kadow, C., Ziemen, F., and Thiemann, H.: A vision and strategy to revamp ESM workflows at DKRZ, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-12099, https://doi.org/10.5194/egusphere-egu22-12099, 2022.

16:15–16:40
Coffee break
Chairpersons: Charles Zender, V. Balaji
17:00–17:05
17:05–17:15
|
EGU22-3109
|
ECS
|
solicited
|
Presentation form not yet defined
Milan Klöwer, Miha Razinger, Juan J. Dominguez, Peter D. Düben, and Tim Palmer

Hundreds of petabytes are produced annually at weather and climate forecast centres worldwide. Compression is essential to reduce storage and to facilitate data sharing. Current techniques do not distinguish the real from the false information in data, leaving the level of meaningful precision unassessed or often subjectively chosen. Many of the trailing mantissa bits in floating-point numbers occur independently with high information entropy, reducing the efficiency of compression algorithms. Here we define the bitwise real information content from information theory as the mutual information of bits in adjacent grid points. The analysis automatically determines a precision from the data itself, based on the separation of real and false information bits. Applied to data from the Copernicus Atmospheric Monitoring Service (CAMS), most variables contain fewer than 7 bits of real information per value and are highly compressible due to spatio-temporal correlation. Rounding bits without real information to zero facilitates lossless compression algorithms and encodes the uncertainty within the data itself. The removal of bits with high entropy but low real information allows us to minimize information loss while maximizing the efficiency of the compression algorithms. Compressed in the longitudinal dimension only, all CAMS data are reduced 17x relative to 64-bit floats while preserving 99% of the real information. Combined with four-dimensional compression using the floating-point compressor Zfp, factors beyond 60x are achieved with no significant increase of the forecast error. For multidimensional compression it is generally advantageous to include as many highly correlated dimensions as possible. A data compression Turing test is proposed to optimize compressibility while minimizing information loss for the end use of weather and climate forecast data.
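
The core idea can be sketched in a few lines of Python: reinterpret the float bit patterns as integers and compute, for every bit position, the mutual information between adjacent grid points. The function below is only a minimal illustration of that definition, not the authors' reference implementation (BitInformation.jl); the array name and choice of axis are placeholders.

    import numpy as np

    def bitwise_mutual_information(field, axis=-1):
        # Mutual information (in bits) of each of the 32 bit positions of a
        # float32 field between adjacent grid points along `axis`.
        x = np.ascontiguousarray(field, dtype=np.float32)
        ints = x.view(np.uint32)                       # reinterpret bit patterns
        a = np.moveaxis(ints, axis, -1)
        b0 = a[..., :-1].ravel()                       # pairs of adjacent points
        b1 = a[..., 1:].ravel()

        mi = np.empty(32)
        for k in range(32):                            # k = 0 is the sign bit (MSB)
            x0 = (b0 >> np.uint32(31 - k)) & np.uint32(1)
            x1 = (b1 >> np.uint32(31 - k)) & np.uint32(1)
            p = np.array([[np.mean((x0 == i) & (x1 == j)) for j in (0, 1)]
                          for i in (0, 1)])            # joint bit distribution
            px = p.sum(axis=1, keepdims=True)
            py = p.sum(axis=0, keepdims=True)
            with np.errstate(divide="ignore", invalid="ignore"):
                terms = np.where(p > 0, p * np.log2(p / (px * py)), 0.0)
            mi[k] = terms.sum()
        return mi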

How to cite: Klöwer, M., Razinger, M., Dominguez, J. J., Düben, P. D., and Palmer, T.: Compressing atmospheric data into its real information content, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-3109, https://doi.org/10.5194/egusphere-egu22-3109, 2022.

17:15–17:21
|
EGU22-8762
|
Virtual presentation
Allison H. Baker, Dorit M. Hammerling, Alex Pinard, and Haiying Xu

Climate models such as the Community Earth System Model (CESM) typically produce enormous amounts of output data, and storage capacities have not increased as rapidly as processor speeds over the years. As a result, the cost of storing huge data volumes has become increasingly problematic and has forced climate scientists to make hard choices about which variables to save, data output frequency, simulation lengths, or ensemble sizes, all of which can negatively impact science objectives.  Therefore, we have been investigating lossy data compression techniques as a means of reducing data storage for CESM.  Lossy compression, by definition, does not exactly preserve the original data, but it achieves higher compression rates and subsequently smaller storage requirements. However, as with any data reduction approach, we must exercise extreme care when applying lossy compression to climate output data to avoid introducing artifacts in the data that could affect scientific conclusions.  Our focus has been on better understanding the effects of lossy compression on spatio-temporal climate data and on gaining user acceptance via careful analysis and testing. In this talk, we will describe the challenges and concerns that we have encountered when compressing climate data from CESM and will discuss developing appropriate climate-specific metrics and tools to enable scientists to evaluate the effects of lossy compression on their own data and facilitate optimizing compression for each variable.  In particular, we will present our Large Data Comparison for Python (LDCPy) package for visualizing and computing statistics on differences between multiple datasets, which enables climate scientists to discover potentially relevant compression-induced artifacts in their data.  Additionally, we will demonstrate the usefulness of an alternative to the popular SSIM that we developed, called the Data SSIM (DSSIM), that can be applied directly to the floating-point data in the context of evaluating differences due to lossy compression on large volumes of simulation data.
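
As a rough illustration of an SSIM-style index computed directly on floating-point fields, the sketch below applies the standard SSIM formula with windowed means and variances from SciPy. It is a hedged approximation of the idea only, not the DSSIM implementation shipped with LDCPy; the window size and constants are assumptions.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def dssim_like(orig, comp, window=7):
        # SSIM-style similarity computed directly on 2-D floating-point fields.
        a = np.asarray(orig, dtype=np.float64)
        b = np.asarray(comp, dtype=np.float64)
        rng = max(a.max() - a.min(), np.finfo(float).eps)   # data range
        c1, c2 = (0.01 * rng) ** 2, (0.03 * rng) ** 2

        mu_a, mu_b = uniform_filter(a, window), uniform_filter(b, window)
        var_a = uniform_filter(a * a, window) - mu_a ** 2
        var_b = uniform_filter(b * b, window) - mu_b ** 2
        cov = uniform_filter(a * b, window) - mu_a * mu_b

        ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
            (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
        return float(ssim.mean())   # 1.0 means the fields are indistinguishable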

How to cite: Baker, A. H., Hammerling, D. M., Pinard, A., and Xu, H.: Lossy Data Compression and the Community Earth System Model, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-8762, https://doi.org/10.5194/egusphere-egu22-8762, 2022.

17:21–17:27
|
EGU22-10774
|
ECS
|
Presentation form not yet defined
|
Robert Underwood, Sheng Di, and Franck Cappello

Large-scale climate simulations such as the Community Earth System Model (CESM) produce enormous volumes of data per run. Transferring and storing this volume of data can be challenging, leading researchers to consider data compression in order to mitigate the performance, monetary and environmental costs. In this work, we survey 8 methods ranging from higher-order SVD, multigrid, transform, and prediction based lossy compressors as well as specialized floating-point lossless and lossy compressors and general lossless compressors to determine which methods are most effective at reducing the storage footprint. We consider four components (atmosphere, ice, land, and ocean) within CESM, taking into account the stringent quality thresholds required to preserve the integrity of climate research data. Our work goes beyond existing studies of compressor performance by considering these newer compression techniques, and by accounting for the candidate quality thresholds identified in prior work by Hammerling et al. This provides a more realistic picture of the performance of lossy compression methods relative to lossless compression methods subject to each of these constraints, with up to a 5.2x improvement over the leading lossless compressor and 21x over no compression. Our work features a method to automatically identify a configuration that satisfies the quality requirements for the lossy compressors and that is agnostic to compressor implementations.
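
A compressor-agnostic configuration search can be pictured as a loop over candidate error bounds that keeps the loosest bound whose reconstruction still meets the quality threshold. The sketch below is a simplified stand-in for the authors' automated method; the compress/decompress callables, the candidate bounds and the single absolute-error criterion are assumptions for illustration.

    import numpy as np

    def loosest_passing_bound(data, compress, decompress, candidate_bounds, max_abs_err):
        # Try candidate error bounds from loosest to tightest and keep the first
        # whose reconstruction meets the quality threshold (here a single
        # maximum absolute error, for illustration).
        for bound in sorted(candidate_bounds, reverse=True):
            buf = compress(data, bound)                  # user-supplied callable
            rec = decompress(buf, data.shape, data.dtype)
            if np.max(np.abs(rec - data)) <= max_abs_err:
                return bound, data.nbytes / len(buf)     # bound and compression ratio
        return None, 1.0                                 # nothing passed; fall back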

How to cite: Underwood, R., Di, S., and Cappello, F.: Understanding the effects of Modern Lossless and Lossy Compressors on the Community Earth Science Model, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10774, https://doi.org/10.5194/egusphere-egu22-10774, 2022.

17:27–17:33
|
EGU22-9741
|
Presentation form not yet defined
|
Franck Cappello, Sheng Di, and Robert Underwood

The projection into 2030 of the climate data volume increase brings an important challenge to the climate science community. This is particularly true for CMIP7, which is projected to need about an exabyte of storage capacity. Error-bounded lossy compression is explored as a potential solution to the above problem by different climate research teams. Several lossy compression schemes have been proposed leveraging different forms of decorrelation (transforms, prediction, HoSVD, DNN), quantization (linear, non-linear, vector), and encoding (dictionary-based, variable length, etc.) algorithms. Our experience with different applications shows that the compression methods often need to be customized and optimized to fit the specificities of the datasets to compress and the user requirements on the compression quality, ratio, and throughput. However, none of the existing lossy compression software packages for scientific data has been designed to be customizable. To address this issue, we developed SZ3, an innovative customizable, modular compression framework. SZ3 is a full C++ refactoring of SZ2 enabling the specialization, addition, or removal of each stage of the lossy compression pipeline to fit the specific characteristics of the datasets to compress and the use-case requirements. This extreme flexibility allows adapting SZ3 to many different use cases, from ultra-high compression for visualization to ultra-high-speed compression between the CPU (or GPU) and the memory. Thanks to its unique set of features: customization, high compression ratio, high compression throughput, and excellent accuracy preservation, SZ3 won a 2021 R&D100 award. In this presentation, we present SZ3 and a new data prediction-based decorrelation method that significantly improves the compression ratios for climate datasets over the state-of-the-art lossy compressors, while preserving the same data accuracy. Experiments based on CESM datasets show that SZ3 can lead to up to 300% higher compression ratios than SZ2 with the same compression error bound and similar compression throughput.
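
To make the decorrelation/quantization/encoding structure concrete, the toy example below implements a 1-D prediction-based, error-bounded pipeline in Python: predict from the previously reconstructed value, linearly quantize the residual, then encode losslessly. It only illustrates the pipeline stages that SZ3 makes customizable; it is not SZ3 itself, and zlib merely stands in for the real encoding stage.

    import zlib
    import numpy as np

    def toy_compress(x, abs_err):
        # Lorenzo-style prediction from the previously *reconstructed* value,
        # so that the absolute error bound holds after decompression.
        x = np.asarray(x, dtype=np.float32).ravel()
        q = np.empty(x.size, dtype=np.int32)
        prev = 0.0
        for i, v in enumerate(x):
            q[i] = int(round((float(v) - prev) / (2 * abs_err)))  # quantization
            prev += q[i] * 2 * abs_err                            # decoder sees the same value
        return zlib.compress(q.tobytes())                         # lossless encoding stage

    def toy_decompress(buf, abs_err):
        q = np.frombuffer(zlib.decompress(buf), dtype=np.int32)
        return (np.cumsum(q.astype(np.float64)) * 2 * abs_err).astype(np.float32)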

How to cite: Cappello, F., Di, S., and Underwood, R.: Improving lossy compression for climate datasets with SZ3, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-9741, https://doi.org/10.5194/egusphere-egu22-9741, 2022.

17:33–17:39
|
EGU22-9948
|
Presentation form not yet defined
|
Julie Bessac, David Krasowska, Robert Underwood, Sheng Di, Jon Calhoun, and Franck Cappello

Lossy compression plays a growing role in geophysical and other computer-based simulations, where the cost of storing output data on large-scale systems can span terabytes and even petabytes in some cases. Using error-bounded lossy compression reduces the amount of storage for each simulation; however, there is no known upper bound on the lossy compressibility of a given dataset. Correlation structures in the data, the choice of compressor and the error bound are factors that allow larger compression ratios and improved quality metrics. Analyzing these three factors provides one direction towards quantifying the limits of lossy compressibility. As a first step, we explore statistical methods to characterize the correlation structures present in several climate simulations and their relationships, through functional regression models, to compression ratios. In particular, for climate simulations from the Community Earth System Model (CESM) as well as hurricane simulations from Hurricane ISABEL (IEEE Visualization 2004 contest), the compression ratios of the widely used lossy compressors for scientific data SZ, ZFP and MGARD exhibit a logarithmic dependence on the global and local correlation ranges when combined with information on the variability of the considered fields through the variance or gradient magnitude. Further work will focus on providing a unified characterization of these relationships across compressors and error bounds. This constitutes a first step towards evaluating the theoretical limits of lossy compressibility, to be used eventually to predict compression performance and adapt compressors to the correlation structures present in the data.
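
The two predictors named above can be estimated with simple diagnostics; the sketch below computes a crude e-folding correlation range and a mean gradient magnitude with NumPy. It illustrates the kind of inputs used in such a regression, not the statistical models of the study, and the estimators themselves are assumptions.

    import numpy as np

    def correlation_range(field, axis=-1):
        # Lag at which the mean autocorrelation along `axis` first drops below
        # 1/e; a crude stand-in for a fitted correlation range.
        f = np.moveaxis(np.asarray(field, dtype=float), axis, -1)
        f = f - f.mean(axis=-1, keepdims=True)
        var = (f ** 2).mean()
        for lag in range(1, f.shape[-1]):
            if (f[..., :-lag] * f[..., lag:]).mean() / var < 1.0 / np.e:
                return lag
        return f.shape[-1]

    def mean_gradient_magnitude(field):
        # Mean magnitude of the spatial gradient of a 2-D (or higher) field,
        # one of the variability measures mentioned above.
        grads = np.gradient(np.asarray(field, dtype=float))
        return float(np.mean(np.sqrt(sum(g ** 2 for g in grads))))

    # With measured compression ratios one could then fit, e.g.,
    #   CR ~ a * log(correlation_range) + b * mean_gradient_magnitude + c
    # using np.linalg.lstsq or a functional regression model.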

How to cite: Bessac, J., Krasowska, D., Underwood, R., Di, S., Calhoun, J., and Cappello, F.: Exploring Lossy Compressibility through Statistical Correlations of Geophysical Datasets, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-9948, https://doi.org/10.5194/egusphere-egu22-9948, 2022.

17:39–17:45
|
EGU22-9153
|
ECS
|
On-site presentation
|
Xavier Yepes-Arbós, Sheng Di, Kim Serradell, Franck Cappello, and Mario C. Acosta

Earth system models (ESMs) have increased their spatial resolution to achieve more accurate solutions. As a consequence, the number of grid points increases dramatically, so an enormous amount of data is produced as simulation results. In addition, if ESMs manage to take advantage of the upcoming exascale computing power, their current data management systems will become a bottleneck as data production grows exponentially.

The XML Input/Output Server (XIOS) is an MPI parallel I/O server designed for ESMs to efficiently post-process data inline as well as read and write data in NetCDF4 format. Although it offers good performance in terms of computational efficiency at current resolutions, this could change at higher resolutions since XIOS performance is very dependent on the output size. To address this problem we test HDF5 compression in order to reduce the size of the data so that both I/O time and storage footprint can be improved. However, the default lossless compression filter of HDF5 does not provide a good trade-off between size reduction and computational cost.

Alternatively, we consider using lossy compression filters that may reach high compression ratios with enough compression speed to considerably reduce the I/O time while keeping high accuracy. In particular, we are exploring the feasibility of using the SZ lossy compressor developed by Argonne National Laboratory (ANL) to write highly compressed NetCDF files through XIOS. As a case study, we use the Open Integrated Forecast System (OpenIFS), an atmospheric general circulation model that can use XIOS to output data.
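
For reference, writing a NetCDF4 variable with the default lossless (zlib/deflate) filter from Python looks like the sketch below; lossy filters such as SZ are instead registered as HDF5 dynamic filter plugins and are not shown here. File, dimension and variable names as well as the chunking and compression level are illustrative.

    import numpy as np
    from netCDF4 import Dataset

    # Write one NetCDF4 variable with shuffle + zlib lossless compression.
    with Dataset("compressed_example.nc", "w", format="NETCDF4") as nc:
        nc.createDimension("lat", 180)
        nc.createDimension("lon", 360)
        var = nc.createVariable("t2m", "f4", ("lat", "lon"),
                                zlib=True, complevel=4, shuffle=True,
                                chunksizes=(90, 180))
        var[:] = np.random.rand(180, 360).astype("f4")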

How to cite: Yepes-Arbós, X., Di, S., Serradell, K., Cappello, F., and C. Acosta, M.: Exploring the SZ lossy compressor use for the XIOS I/O server, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-9153, https://doi.org/10.5194/egusphere-egu22-9153, 2022.

17:45–17:51
|
EGU22-946
|
ECS
|
Virtual presentation
|
Samuel Li and John Clyne

Much of the research in lossy data compression has focused on minimizing the average error for a given storage budget. For scientific applications, the maximum point-wise error is often of greater interest than the average error. This paper introduces an algorithm that encodes outliers—data points exceeding a specified point-wise error tolerance—produced by a lossy compression algorithm optimized for minimizing average error. These outliers can then be corrected to be within the error tolerance when decoding. We pair this outlier coding algorithm with an in-house implementation of SPECK, a lossy compression algorithm based on wavelets that exhibits excellent rate-distortion performance (where distortion is measured by the average error), and introduce a new lossy compression product that we call SPERR. Compared to two leading scientific data compressors, SPERR uses less storage to guarantee an error bound and produces better overall rate-distortion curves at a moderate cost of added computation. Finally, SPERR facilitates interactive data exploration by exploiting the multiresolution properties of wavelets and their ability to reconstruct coarsened data volumes on the fly.
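
The outlier-coding idea can be sketched generically: after any average-error-optimized compressor has produced a decoded field, record the indices and corrections of the points that violate the point-wise tolerance and re-apply them at decode time. The snippet below captures only that generic idea, not the SPERR codec; storing exact values (rather than quantized corrections) is a simplifying assumption.

    import numpy as np

    def encode_outliers(original, decoded, tol):
        # Record the locations and exact values of points whose point-wise
        # error exceeds `tol` after lossy decoding.
        err = np.abs(decoded - original)
        idx = np.flatnonzero(err > tol)
        return idx, original.flat[idx]

    def apply_outliers(decoded, idx, values):
        # Overwrite the flagged points so every value is within the tolerance.
        out = np.array(decoded, copy=True)
        out.flat[idx] = values
        return out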

How to cite: Li, S. and Clyne, J.: Lossy Scientific Data Compression With SPERR, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-946, https://doi.org/10.5194/egusphere-egu22-946, 2022.

17:51–17:57
|
EGU22-3230
|
Presentation form not yet defined
|
Leigh Orf and Kelton Halbert

Here we discuss our experiences with ZFP lossy floating-point compression in eddy-resolving cloud modeling simulations of violent thunderstorms executed on the Blue Waters and Frontera supercomputers. Lossy compression has reduced our simulation data load by a factor of 20-100 relative to uncompressed output. This reduction enables us to save data at extremely high temporal resolution, up to the model's time step, the smallest possible temporal discretization. Further data savings are realized by only saving a subdomain of the entire simulation, and this has opened the door to new approaches to analysis. We will discuss the Lack Of a File System (LOFS) compressed format that model data is saved in, as well as conversion routines to create individual ZFP-compressed NetCDF4 files for sharing with collaborators and for archiving. Further, we will discuss the effect of lossy compression on offline Lagrangian parcel analysis from LOFS data. Preliminary results suggest that high compression does not alter parcel paths considerably in cloud model simulation data over several minutes of integration, as compared to uncompressed data.
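
For conversions of this kind, ZFP compression can be applied when writing HDF5/NetCDF4-style files from Python via the hdf5plugin package, roughly as sketched below; the file and variable names, chunk sizes and accuracy (absolute error tolerance) are illustrative, and this is not the LOFS conversion tool itself.

    import h5py
    import hdf5plugin  # registers the ZFP filter with HDF5
    import numpy as np

    # Write a ZFP-compressed, chunked HDF5 dataset.
    data = np.random.rand(64, 256, 256).astype("f4")
    with h5py.File("w_zfp_example.h5", "w") as f:
        f.create_dataset("w", data=data, chunks=(16, 64, 64),
                         **hdf5plugin.Zfp(accuracy=0.01))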

How to cite: Orf, L. and Halbert, K.: Lossy compression in violent thunderstorm simulations: Lessons learned and future goals, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-3230, https://doi.org/10.5194/egusphere-egu22-3230, 2022.

17:57–18:03
|
EGU22-7151
|
ECS
|
Virtual presentation
|
Ezequiel Cimadevilla and Antonio S. Cofiño

Climate datasets are usually provided in separate files that facilitate dataset management in climate data distribution systems. In the ESGF [1] (Earth System Grid Federation), the time series of a variable is split into smaller pieces of data in order to reduce file size. Although this enhances usability for data management in the ESGF distribution system (i.e. file publishing, download, …), it reduces usability for data analysis. Workflows usually need to pre-process and rearrange multiple files into a single data source in order to obtain a data analysis dataset, which involves data rewriting and duplication, with the corresponding storage growth.

Storage growth can be mitigated by creating virtual views, which allow a number of actual datasets to be multidimensionally mapped together into a single multidimensional dataset without rewriting data or consuming additional storage. Due to the increasing interest in offering climate researchers suitable single data analysis datasets, several mechanisms have been or are being developed to tackle this issue, such as NcML (netCDF Markup Language), xarray/netCDF-4 multiple-file datasets and H5VDS. HDF5 Virtual Datasets [3] (H5VDS) provide researchers with different views of interest of a compound dataset, without the cost of duplicating information, facilitating data analysis in an easy and transparent way.

In the climate community and in ESGF, netCDF is the standard data model and format for climate data exchange. The default netCDF-4 storage format is HDF5, which brings HDF5 features into the netCDF library, including chunking [2], compression [2], virtual datasets and many other capabilities. H5VDS introduces a new dataset storage type that allows multiple HDF5 (and netCDF-4) datasets to be mapped together into a single sliceable dataset via an interface layer. The datasets can be mixed in arbitrary combinations, based on mapping range selections on the virtual dataset to range selections on the sources. The mapping also allows conversion between different data types and the addition, removal or modification of existing metadata (i.e. dataset attributes), which is often an obstacle when accessing the data.

In this work, H5VDS features are applied to CMIP6 climate simulation datasets from ESGF in order to provide analysis-ready virtual datasets. Examples with common tools/libraries (i.e. netcdf-c, xarray, nco, cdo, …) illustrate the convenience of the proposed approach. Using H5VDS facilitates data analysis workflows by enabling climate researchers to focus on data analysis rather than data engineering tasks. Also, since the H5VDS is created at the storage layer, these datasets are transparent to the netCDF-4 library, and existing applications can benefit from this feature.
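
A minimal h5py sketch of the approach: yearly files of a variable are mapped into one virtual dataset that can be sliced as a single time series. File names, dataset paths and shapes are illustrative assumptions, not the actual CMIP6 layout.

    import h5py

    # Map ten yearly files of a variable into one sliceable virtual dataset.
    years = range(2000, 2010)
    nt, ny, nx = 12, 180, 360

    layout = h5py.VirtualLayout(shape=(len(years) * nt, ny, nx), dtype="f4")
    for i, year in enumerate(years):
        src = h5py.VirtualSource(f"tas_{year}.nc", "tas", shape=(nt, ny, nx))
        layout[i * nt:(i + 1) * nt] = src          # place each file along time

    with h5py.File("tas_2000-2009_virtual.nc", "w") as f:
        f.create_virtual_dataset("tas", layout, fillvalue=-9999.0)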

References

[1] L. Cinquini et al., “The Earth System Grid Federation: An open infrastructure for access to distributed geospatial data,” Future Generation Computer Systems, vol. 36, pp. 400–417, 2014, doi: 10.1016/j.future.2013.07.002. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0167739X13001477. [Accessed: 16-Jan-2020]
[2] The HDF Group, “Chunking in HDF5”, 11-Feb.-2019. [Online]. Available: https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5. [Accessed: 12-Jan.-2022]

[3] The HDF Group, “Virtual Dataset VDS”, 06-Apr.-2018. [Online]. Available: https://portal.hdfgroup.org/display/HDF5/Virtual+Dataset++-+VDS. [Accessed: 12-Jan.-2022]

Acknowledgements

This work has been developed with support from IS-ENES3, which is funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 824084.

How to cite: Cimadevilla, E. and Cofiño, A. S.: Storage growth mitigation through data analysis ready climate datasets using HDF5 Virtual Datasets, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7151, https://doi.org/10.5194/egusphere-egu22-7151, 2022.

18:03–18:09
|
EGU22-1149
|
On-site presentation
|
Rostislav Kouznetsov

Lossy compression methods are extremely efficient in terms of space and performance and allow for reduction of the network bandwidth and disk space needed to store data arrays without sacrificing the number of stored values. Lossy compression involves an irreversible transformation of data that reduces the information content of the data. The transformation introduces a distortion that is normally measured in terms of absolute or relative error. The error is higher for higher compression ratios. A good choice of lossy compression parameters maximizes the compression ratio while keeping the introduced error within acceptable margins. Negligence or failure to choose the right compression method or its parameters leads to a poor compression ratio or to loss of data.

A good strategy for lossy compression involves specification of the acceptable error margin and the choice of compression parameters and storage format. We will discuss specific techniques of lossy compression and illustrate pitfalls in the choice of error margins and of tools for lossy/lossless compression. The following specific topics will be covered:

1. Packing of floating-point data to integers in NetCDF is sub-optimal in most cases, and for some quantities leads to severe errors (see the packing sketch after this list).
2. Keeping relative vs absolute precision: a false alternative.
3. The acceptable error margin depends on both the origin and the intended application of the data.
4. Smart algorithms to decide on compression parameters have a limited area of applicability, which has to be considered in each individual case.
5. Choice of a format for compressed data (NetCDF, GRIB2, Zarr): a trade-off between size, speed and precision.
6. What "number_of_significant_digits" and "least_significant_digit" mean in terms of relative/absolute error.
7. Bit-Shuffle is not always beneficial.
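
As an illustration of topic 1, the sketch below reproduces the classic 16-bit scale_factor/add_offset packing and shows how a field spanning several orders of magnitude loses essentially all relative precision near its small values; the packing formula is the conventional one, used here only for demonstration.

    import numpy as np

    def pack16(data):
        # Classic NetCDF-style 16-bit packing via scale_factor/add_offset.
        vmin, vmax = float(data.min()), float(data.max())
        scale = (vmax - vmin) / (2 ** 16 - 2)      # one code reserved for _FillValue
        offset = vmin + scale * (2 ** 15 - 1)
        packed = np.round((data - offset) / scale).astype(np.int16)
        return packed, scale, offset

    def unpack16(packed, scale, offset):
        return packed.astype(np.float32) * scale + offset

    x = np.array([1e-6, 1e-3, 1.0, 1000.0], dtype=np.float32)
    print(unpack16(*pack16(x)))   # 1e-6 and 1e-3 become indistinguishable: the
                                  # absolute error is uniform (~scale/2), so the
                                  # relative precision of small values is lost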

How to cite: Kouznetsov, R.: Practical notes on lossy compression of scientific data, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-1149, https://doi.org/10.5194/egusphere-egu22-1149, 2022.

18:09–18:15
|
EGU22-13259
|
Presentation form not yet defined
|
Edward Hartnett and Charles Zender

The increasing volume of Earth science data sets continues to present challenges for large data producers. In order to support lossy compression in the netCDF C and Fortran libraries, we have added a quantize feature for netCDF floating-point variables. When the quantize feature is enabled, the data creator specifies the number of significant digits. As data are written, the netCDF libraries apply a quantization algorithm which guarantees that the number of significant digits (for the BitGroom and Granular BitRound algorithms) or bits (for the BitRound algorithm) will be preserved, while setting unneeded bits to a constant value. This allows zlib (or any other lossless compressor) to achieve better and faster compression.
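
From Python, this feature is exposed through netCDF4-python when it is built against netcdf-c 4.9.0 or later; the sketch below shows the intended usage, with the caveat that the keyword names and accepted quantize modes should be checked against your installed version.

    import numpy as np
    from netCDF4 import Dataset

    # Quantize, then losslessly compress, a float variable.
    with Dataset("quantized.nc", "w", format="NETCDF4") as nc:
        nc.createDimension("x", 1000)
        v = nc.createVariable("temp", "f4", ("x",),
                              zlib=True, complevel=4,
                              significant_digits=3,             # keep ~3 significant digits
                              quantize_mode="GranularBitRound")  # or "BitGroom", "BitRound"
        v[:] = 273.15 + np.random.rand(1000).astype("f4")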

How to cite: Hartnett, E. and Zender, C.: Adding Quantization to the NetCDF C and Fortran Libraries to Enable Lossy Compression, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13259, https://doi.org/10.5194/egusphere-egu22-13259, 2022.

18:15–18:30