Performance gains in an ESM using parallel ad-hoc file systems
- Karlsruhe Institute of Technology, Steinbuch Centre for Computing, Eggenstein-Leopoldshafen, Germany
Earth System Models (ESMs) have become much more demanding over the last years. The modelled processes have grown more complex, and more and more processes are included in the models. In addition, model resolutions have increased to improve weather and climate forecasts. This requires faster high-performance computers (HPC) and better I/O performance.
Within our Pilot Lab Exascale Earth System Modelling (PL-EESM) we analyse the performance of the ESM EMAC using a standard Lustre file system for output and compare it to the performance using a parallel ad-hoc overlay file system. We will show the impact for two scenarios: one with today's standard amount of output and one with artificially heavy output simulating future ESMs.
An ad-hoc file system is a private parallel file system that is created on demand for an HPC job from node-local storage devices, in our case solid-state disks (SSDs). It exists only during the runtime of the job, so output data have to be moved to a permanent file system before the job finishes. Quasi in-situ data analysis and post-processing can improve performance, as it may reduce the amount of data that has to be stored, saving disk space and transfer time to permanent storage. We will show first tests of quasi in-situ post-processing.
How to cite: Versick, S., Kirner, O., Meyer, J., Obermaier, H., and Soysal, M.: Performance gains in an ESM using parallel ad-hoc file systems, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-18121, https://doi.org/10.5194/egusphere-egu2020-18121, 2020
Comments on the display
AC: Author Comment | CC: Community Comment
The slides are very interesting and the overall decrease in runtime by using the BeeOND approach is great. A few questions:
Points 1 and 2: Yes, you understand it correctly. Model output is written directly to the local SSDs during the model run. After the model run finishes (the executable is called in a shell runscript), some "post-processing" routines in the runscript are executed. One of these routines copies the data back. It is called after the other post-processing routines and before the next job in the chain is submitted (for a lot of ESM simulations, the wall time you can get in the HPC queueing system is not enough for the whole simulation, so you have to split it into several steps). Depending on how much post-processing you do and how much data you have to transfer back, it might be better to use half (or some other fraction) of the nodes to start copying data back while the other half is still post-processing.
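One step of such a job chain could look roughly like the following batch-script fragment. This is a sketch, not the authors' actual runscript: the paths, the mount point `BEEOND_DIR`, and the names `emac.exe` and `next_step.sh` are assumptions; `dcp` is the parallel copy tool from mpiFileUtils mentioned below.

```shell
#!/bin/bash
#SBATCH --nodes=16
#SBATCH --time=08:00:00
# Sketch of one step in the job chain (hypothetical names and paths).
BEEOND_DIR=/mnt/beeond               # ad-hoc BeeOND FS mounted for this job
PERM_DIR=/lustre/project/emac/run01  # permanent Lustre directory

srun ./emac.exe                      # model writes its output to $BEEOND_DIR

# ... post-processing routines on the ad-hoc file system ...

# copy results back before the ad-hoc file system disappears with the job
srun dcp "$BEEOND_DIR/output" "$PERM_DIR/output"

sbatch next_step.sh                  # submit the next step of the chain
```

The copy-back step has to run inside the job, because the BeeOND file system is unmounted as soon as the job ends.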
Point 3: The model is not aware of BeeOND; for the model it is just one large virtual file system. So no adaptation is needed for the actual output, but of course you have to implement the copy commands somewhere. Additionally, in our case the BeeOND file system does not have a fixed name across jobs, so depending on how you handle the name of your output directory you may have to introduce a few lines of code (for me it was one line in the runscript). The largest amount of work was figuring out which copy command is fastest. dcp should always be among the best-performing tools, no matter what your output looks like.
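The "one line" adaptation could be a fallback of this kind (a sketch under assumptions: `BEEOND_DIR` is taken to be exported by the batch prolog when the ad-hoc file system is mounted, and the Lustre path is hypothetical):

```shell
# Choose the model's output directory: the per-job BeeOND mount if the
# batch prolog exported one, otherwise the permanent file system.
pick_outdir() {
    printf '%s\n' "${BEEOND_DIR:-/lustre/project/emac/run01}"
}
```

The runscript would then pass `$(pick_outdir)` to the model as its output directory, so the model itself needs no change at all.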
Point 4: If the model crashes in a way that ends the complete job, all the data is lost. That means if it crashes due to a node failure or something similar, there is probably nothing you can do about it. If your code crashes, it depends on what the model has done so far. In our case data is already written, and if the post-processing part of the runscript is reached, the already written data will be transferred back. If a model error stops the complete job, you had better change that part of your code.
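One way to make sure the runscript reaches the copy-back step even when the model aborts is to capture the model's exit code instead of letting it kill the script. A minimal sketch (not the authors' script; in the real setup the copy would be done with `dcp` rather than `cp`):

```shell
# Run the model, then stage output back to permanent storage even if
# the model failed; return the model's exit code to the caller.
run_and_stage() {
    model_cmd=$1; src=$2; dst=$3
    rc=0
    $model_cmd || rc=$?                      # model may crash mid-run
    cp -r "$src" "$dst" 2>/dev/null || true  # best-effort copy-back
    return $rc
}
```

This way a model error still propagates (so the job chain can stop), but whatever output was already written to the ad-hoc file system survives the job.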