EGU23-12785, updated on 22 Apr 2023
https://doi.org/10.5194/egusphere-egu23-12785
EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

A Scalable Near Line Storage Solution for Very Big Data

Neil Massey1, Jack Leland1, and Bryan Lawrence2
  • 1Science and Technology Facilities Council, Rutherford Appleton Laboratory, Harwell, Didcot, United Kingdom of Great Britain – England, Scotland, Wales (neil.massey@stfc.ac.uk)
  • 2NCAS, Department of Meteorology, University of Reading, and Department of Computer Science, University of Reading, Reading, UK

Managing huge volumes of data is already a problem, and it will only become worse with the advent of exascale computing and next-generation observational systems. A key requirement is that data can be migrated more easily between storage tiers. Here we present a new solution, the Near-Line Data Store (NLDS), for managing data migration between user-facing storage systems and tape by using an object storage cache. NLDS builds on lessons learned from developing the ESIWACE-funded Joint Data Migration App (JDMA) and deploying it at the Centre for Environmental Data Analysis (CEDA).
 
CEDA currently has over 50 PB of data stored on a range of disk-based storage systems. These systems are chosen on the basis of cost, power usage and accessibility via a network, and include three different types of POSIX disk and object storage. Tens of PB of additional data are also stored on tape. Each of these systems has different workflows, interfaces and latencies, causing difficulties for users.

NLDS, developed with ESIWACE2 and other funding, is a multi-tiered storage solution using object storage as a front end to a tape library. Users interact with NLDS via an HTTP API, with a Python library and command-line client provided to support both programmatic and interactive use. Files transferred to NLDS are first written to the object storage, and a backup is made to tape. When the object storage is approaching capacity, a set of policies is interrogated to determine which files will be removed from it. When a user retrieves a file that the policies have removed from the object storage, NLDS first has to transfer it back from tape to the object storage. This provides a multi-tier system of hot (disk), warm (object storage) and cold (tape) storage via a single interface. While systems like this are not novel, NLDS is open source, designed for ease of redeployment elsewhere, and for use from both local storage and remote sites.
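
The put, evict and recall behaviour described above can be illustrated with a minimal in-memory sketch. This is not NLDS code: the class name `TieredStore`, its methods, and the least-recently-used eviction rule are all assumptions made for illustration, since the abstract does not say which policies NLDS applies.

```python
from collections import OrderedDict


class TieredStore:
    """Toy model of a warm object-storage cache in front of cold tape.

    Hypothetical illustration only; names and policy are not from NLDS.
    """

    def __init__(self, cache_capacity_bytes):
        self.cache_capacity_bytes = cache_capacity_bytes
        self.cache = OrderedDict()  # warm tier, least recently used first
        self.tape = {}              # cold tier, keeps a copy of every file

    def _cache_usage(self):
        return sum(len(data) for data in self.cache.values())

    def _evict(self):
        # Assumed policy: drop least recently used files until the cache
        # fits, but always keep the most recently used file.
        while (self._cache_usage() > self.cache_capacity_bytes
               and len(self.cache) > 1):
            self.cache.popitem(last=False)

    def put(self, name, data):
        """Write to the object cache first, then back up to tape."""
        self.cache[name] = data
        self.cache.move_to_end(name)
        self.tape[name] = data
        self._evict()

    def get(self, name):
        """Return (data, recalled); recalled is True if tape was needed."""
        recalled = name not in self.cache
        if recalled:
            # The file was evicted, so stage it back from tape.
            self.cache[name] = self.tape[name]
        self.cache.move_to_end(name)
        data = self.cache[name]
        self._evict()
        return data, recalled
```

With a 10-byte cache, writing three 5-byte files evicts the oldest from the cache while its tape copy remains, and a later `get` on that file reports that a recall from tape was required.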

NLDS is based around a microservice architecture, with a message exchange brokering communication between the microservices, the HTTP API and the storage solutions. The system is deployed via Kubernetes, with each microservice in its own Docker container, allowing the number of services to be scaled up or down depending on the current load of NLDS. This provides a scalable, power-efficient system while ensuring that no messages between microservices are lost. OAuth is used to authenticate and authorise users via a pluggable authentication layer. The use of object storage as the front end to the tape allows both local and remote cloud-based services to access the data via a URL, so long as the user has the required credentials.
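
The idea of scaling consumers behind a shared message queue can be sketched in a single process. This is an analogy only: the abstract does not name the message broker, and the function `run_workers`, the use of threads in place of containers, and the sentinel-based shutdown are assumptions for illustration.

```python
import queue
import threading


def run_workers(messages, handler, n_workers):
    """Process every message using n_workers consumers on one shared queue.

    Hypothetical stand-in for microservices behind a message exchange.
    The handler is assumed not to raise.
    """
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def consume():
        while True:
            msg = work.get()
            if msg is None:
                # Sentinel: this worker is being scaled down.
                work.task_done()
                return
            out = handler(msg)
            with lock:
                results.append(out)
            # Acknowledge only after the message has been handled.
            work.task_done()

    threads = [threading.Thread(target=consume) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for msg in messages:
        work.put(msg)
    for _ in threads:
        work.put(None)  # one shutdown sentinel per worker
    work.join()         # returns once every message is acknowledged
    for t in threads:
        t.join()
    return results
```

Changing `n_workers` alters how many consumers share the load without changing the producer, and `work.join()` only returns once every queued message has been acknowledged.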

NLDS is a scalable solution for storing very large data for many users, with a user-friendly front end that is easily accessed via cloud computing. This talk will detail the architecture and discuss how the design meets the identified use cases.

How to cite: Massey, N., Leland, J., and Lawrence, B.: A Scalable Near Line Storage Solution for Very Big Data, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-12785, https://doi.org/10.5194/egusphere-egu23-12785, 2023.