EGU General Assembly 2020
© Author(s) 2020. This work is distributed under
the Creative Commons Attribution 4.0 License.

Performance gains in an ESM using parallel ad-hoc file systems

Stefan Versick, Ole Kirner, Jörg Meyer, Holger Obermaier, and Mehmet Soysal
  • Karlsruhe Institute of Technology, Steinbuch Centre for Computing, Eggenstein-Leopoldshafen, Germany

Earth System Models (ESMs) have become much more demanding in recent years. The modelled processes have become more complex, and more and more processes are included in the models. In addition, model resolutions have increased to improve weather and climate forecasts. This requires faster high-performance computing (HPC) systems and better I/O performance.

Within our Pilot Lab Exascale Earth System Modelling (PL-EESM), we analyse the performance of the ESM EMAC using a standard Lustre file system for output and compare it to the performance using a parallel ad-hoc overlay file system. We will show the impact for two scenarios: one with today's standard amount of output and one with artificially heavy output simulating future ESMs.

An ad-hoc file system is a private parallel file system that is created on demand for an HPC job using the node-local storage devices, in our case solid-state disks (SSDs). It exists only during the runtime of the job. Therefore, output data have to be moved to a permanent file system before the job finishes. Quasi in-situ data analysis and post-processing can improve performance because they may reduce the amount of data that has to be stored, saving disk space and time during the transfer of data to permanent storage. We will show first tests of quasi in-situ post-processing.

How to cite: Versick, S., Kirner, O., Meyer, J., Obermaier, H., and Soysal, M.: Performance gains in an ESM using parallel ad-hoc file systems, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-18121, 2020

Comments on the presentation

AC: Author Comment | CC: Community Comment

Presentation version 1 – uploaded on 30 Apr 2020
  • CC1: Data copying from BeeOND to normal file system, Daniel Heydebreck, 05 May 2020

    The slides are very interesting, and the overall decrease in runtime achieved with the BeeOND approach is great. A few questions:

    • For clarification: the BeeOND concept is that the model output is first stored on the SSDs of the individual nodes and later merged and copied to the Lustre file system. Did I understand the concept correctly?
    • When is the data copied from the local SSD storage to the Lustre file system: during the next calculation phase (before the next output phase starts)?
    • Is the model aware of what BeeOND does, or does BeeOND just look like one large virtual file system? In other words: how much work is it to adapt another model to BeeOND?
    • What happens if the model simulation crashes and the job ends while the BeeOND data is still being copied from the local SSDs to the Lustre file system?
    • AC1: Reply to CC1, Stefan Versick, 05 May 2020

      Points 1 and 2: Yes, you understand it correctly. Model output is stored directly on the local SSDs during the model run. After the model run has finished (the model is an executable called from a shell runscript), the runscript executes some "post-processing" routines. One of these routines copies the data back. It is called after the other post-processing routines and before the next job in the chain is submitted (for many ESM simulations, the calculation time you can get in the HPC queueing system is not enough for the whole simulation, so you need to run it in several steps). Depending on how much post-processing you do and how much data you have to transfer back, it might be better to use half (or any other share) of the nodes to start copying data back while the other half is still doing post-processing.
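
The order of steps in such a runscript can be sketched as follows. This is only an illustration in Python, not the actual shell runscript used for EMAC; the function name `run_job_step` and all command names are hypothetical placeholders.

```python
import subprocess


def run_job_step(model_cmd, postprocess_cmds, copy_back_cmd, submit_next_cmd,
                 runner=subprocess.run):
    """Run one job of a chained simulation in the order described in the reply.

    Every command is an argument list such as ["./model.exe"]. All command
    names used with this function are hypothetical placeholders.
    """
    ordered = [model_cmd, *postprocess_cmds, copy_back_cmd, submit_next_cmd]
    for cmd in ordered:
        # check=True raises CalledProcessError if a step fails, so the next
        # job is not submitted after a failed copy-back.
        runner(cmd, check=True)
    return ordered


# Dry run with a recording stand-in for subprocess.run (nothing is executed).
log = []
run_job_step(
    model_cmd=["./model.exe"],
    postprocess_cmds=[["./postprocess.sh"]],
    copy_back_cmd=["./copy_back.sh"],
    submit_next_cmd=["./submit_next_job.sh"],
    runner=lambda cmd, check: log.append(cmd[0]),
)
print(log)
```

The copy-back step sits between the other post-processing and the submission of the next job, so the next job only starts once the data has reached the permanent file system.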

      Point 3: The model is not aware of BeeOND. For the model it is just one large virtual file system. So no adaptation is needed for the actual output, but of course you have to implement the copy commands somewhere. Additionally, in our case the BeeOND file system does not have a fixed name for all jobs, so depending on how you handle the name of your output directory you may have to introduce a few lines of code (for me it was one line in the runscript). Most of the work was figuring out which copy command is fastest. dcp should always be among the best-performing tools, no matter what your output looks like.
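
A minimal sketch of the two adaptations mentioned here, again in Python for illustration only. It assumes the job environment exposes the per-job mount point in an environment variable; the variable name `ADHOC_FS_ROOT`, the subdirectory name, all paths, and the use of `mpirun -np` as the MPI launcher are assumptions, not details from the original setup. The only fact relied on is that dcp (from mpiFileUtils) is an MPI program that takes a source and a destination.

```python
import os
from pathlib import PurePosixPath


def resolve_output_dir(env=None, var="ADHOC_FS_ROOT", subdir="model_output"):
    """Build the output directory from a per-job mount point.

    The variable name and subdirectory are hypothetical; a real system may
    expose the mount point of the ad-hoc file system differently.
    """
    env = os.environ if env is None else env
    root = env.get(var)
    if not root:
        raise RuntimeError(f"{var} is not set; is the ad-hoc file system mounted?")
    return PurePosixPath(root) / subdir


def build_copy_command(src, dest, ranks=4):
    """Return an argument list for a parallel copy with dcp.

    The launcher ("mpirun -np") is an assumption; many clusters use a
    different launcher such as the one provided by the batch system.
    """
    return ["mpirun", "-np", str(ranks), "dcp", str(src), str(dest)]


# Example with made-up paths; the command is only built, not executed.
out_dir = resolve_output_dir(env={"ADHOC_FS_ROOT": "/mnt/adhoc_job123"})
cmd = build_copy_command(out_dir, "/lustre/project/run01")
print(cmd)
```

Because the mount point is resolved at run time, the same runscript works for every job in the chain even though the name of the ad-hoc file system changes.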

      Point 4: If the model crashes in a way that ends the complete job, all the data is lost. That means if it crashes due to node failure or similar things, there is probably nothing you can do about it. If your code crashes, it depends on what the model did so far. In our case the data is already written, and if the post-processing part of the runscript can be reached, the already written data will be transferred back. If a model error stops the complete job, you had better change this part in your code.
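
One way to make sure the copy-back part is still reached after a model error is sketched below, under the assumption that the job itself is still alive (this cannot help with node failures). The function and callback names are hypothetical, and the sketch is in Python rather than in the shell runscript mentioned above.

```python
def run_model_then_copy_back(run_model, copy_back):
    """Call copy_back even if the model step fails.

    run_model is a callable returning an exit code; copy_back is a callable
    that transfers already written output to permanent storage. Both are
    placeholders supplied by the caller.
    """
    try:
        exit_code = run_model()
    except Exception:
        # Treat an unexpected error like a failed model run.
        exit_code = 1
    finally:
        # Runs on success, on a non-zero exit code, and on an exception.
        copy_back()
    return exit_code


# Example: a model step that reports failure still triggers the copy-back.
calls = []


def failing_model():
    calls.append("model")
    return 2


code = run_model_then_copy_back(failing_model, lambda: calls.append("copy_back"))
print(code, calls)
```

The exit code is returned rather than swallowed, so the caller can still decide not to submit the next job in the chain after a failed run.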