Vast targets AI checkpointing write performance with distributed RAID

Vast Data will boost write performance in its storage by 50% with an operating system upgrade in April, and a further OS upgrade later in 2024 is expected to take that gain to 100%. Both moves are aimed at checkpointing operations in artificial intelligence (AI) workloads.

That roadmap pointer comes after Vast recently announced it would support Nvidia BlueField-3 data processing units (DPUs) to create an AI architecture. Handily, it also struck a deal with Super Micro, whose servers are often used to build out graphics processing unit (GPU)-equipped AI compute clusters.

Vast’s core offer is based on bulk, relatively cheap and rapidly accessible QLC flash with fast cache to smooth reads and writes. It is file storage, mostly suited to unstructured or semi-structured data, and Vast envisages it as large pools of datacentre storage, an alternative to the cloud.

Last year, Vast – which is HPE’s file storage partner – announced the Vast Data Platform that aims to provide customers with a distributed net of AI and machine learning-focused storage.

To date, Vast’s storage operating system has been heavily biased towards read performance. That’s not unusual, however, as most of the workloads it targets are dominated by reads rather than writes.

Vast therefore focused on that side of the input/output equation in its R&D, said John Mao, global head of business development. “For nearly all our customers, all they have needed are reads rather than writes,” he said. “So, we pushed the envelope on reads.”

Until now, writes have been handled by simple RAID 1 mirroring. As soon as data landed in the storage, it was mirrored to duplicate media. “It was an easy win for something not many people needed,” said Mao.

The release of version 5.1 of the Vast OS in April will bring the 50% improvement in write performance, with the 100% gain to follow later in the year in version 5.2.

The first of these – dubbed SCM RAID – comes from a change that sees writes distributed across multiple media, said Mao, with data RAIDed (in a 6+2 configuration) as soon as it hits the write buffer. “To boost performance here, we have upgraded to distributed RAID,” said Mao. “So, instead of the entirety of a write going to one storage target, it is now split between multiple SCM drives in parallel, cutting down on time taken per write.”
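To make the mechanism concrete, here is a minimal, illustrative sketch in Python of a 6+2 stripe write: the buffer is split into six data chunks, P (XOR) and Q (Reed-Solomon over GF(2^8)) parity chunks are computed, and all eight pieces are written out in parallel. The file targets, chunk sizes and helper names are invented for the example and say nothing about how Vast actually lays data out across its SCM devices.

```python
# Illustrative only: a toy 6+2 stripe writer, not Vast's implementation.
# Splits a write buffer into 6 data chunks, computes P (XOR) and Q
# (Reed-Solomon over GF(2^8), generator 2) parity, then issues all 8
# chunk writes in parallel. "Drives" here are just local files.
from concurrent.futures import ThreadPoolExecutor

DATA_CHUNKS = 6    # the "6" in the 6+2 layout
PARITY_CHUNKS = 2  # P and Q

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) with the RAID-6 polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def split_stripe(buf: bytes) -> list:
    """Zero-pad and split a write buffer into 6 equal data chunks."""
    chunk_len = -(-len(buf) // DATA_CHUNKS)          # ceiling division
    buf = buf.ljust(chunk_len * DATA_CHUNKS, b"\0")  # pad to a full stripe
    return [buf[i * chunk_len:(i + 1) * chunk_len] for i in range(DATA_CHUNKS)]

def parity(chunks: list) -> tuple:
    """Compute P (plain XOR) and Q (XOR weighted by 2**i in GF(2^8))."""
    p = bytearray(len(chunks[0]))
    q = bytearray(len(chunks[0]))
    for i, chunk in enumerate(chunks):
        coeff = 1
        for _ in range(i):               # coeff = 2**i in GF(2^8)
            coeff = gf_mul(coeff, 2)
        for j, byte in enumerate(chunk):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

def write_chunk(target: str, chunk: bytes) -> None:
    """Stand-in for a write to one storage target; here it is a file."""
    with open(target, "wb") as f:
        f.write(chunk)

def write_stripe(buf: bytes, targets: list) -> None:
    """Write one buffer as a 6+2 stripe across 8 targets in parallel."""
    chunks = split_stripe(buf)
    p, q = parity(chunks)
    pieces = chunks + [p, q]
    assert len(targets) == DATA_CHUNKS + PARITY_CHUNKS
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        list(pool.map(write_chunk, targets, pieces))

if __name__ == "__main__":
    write_stripe(b"checkpoint shard" * 1024,
                 [f"/tmp/scm_target_{n}.bin" for n in range(8)])
```

The point of the sketch is the last step: because the eight chunk writes are issued concurrently, the time per write is bounded by the slowest individual device rather than by one device absorbing the whole buffer.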

Later in the year, version 5.2 will detect more sustained bursts of write activity – such as checkpoint writes – and automatically offload those writes to QLC flash, in a set of functionality known as Spillover. “The one case where it will be very useful is in [write operations in] checkpointing in AI workloads,” he said. “You can have, for example, clusters of tens of thousands of GPUs. It can get very complex. You don’t want that many GPUs running and something goes wrong.”
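As a rough illustration of the idea only, and not Vast’s implementation, the sketch below routes writes based on a sliding-window byte count: sustained bursts above a threshold are sent straight to a bulk QLC tier, while everything else lands in the fast write buffer. The window length, threshold and tier names are all invented for the example.

```python
# Illustrative only: a toy burst detector in the spirit of the "Spillover"
# behaviour described above; window size, threshold and tier labels are
# assumptions made for the example.
import time
from collections import deque

class WriteRouter:
    """Send normal writes to the SCM buffer, but divert sustained bursts
    (for example, checkpoint dumps) straight to the QLC tier."""

    def __init__(self, window_s: float = 2.0, burst_bytes: int = 1 << 30):
        self.window_s = window_s        # sliding window length in seconds
        self.burst_bytes = burst_bytes  # bytes per window that count as a burst
        self.recent = deque()           # (timestamp, size) of recent writes

    def _windowed_bytes(self, now: float) -> int:
        # Drop entries that have fallen out of the window, then total the rest.
        while self.recent and now - self.recent[0][0] > self.window_s:
            self.recent.popleft()
        return sum(size for _, size in self.recent)

    def route(self, size: int) -> str:
        now = time.monotonic()
        self.recent.append((now, size))
        if self._windowed_bytes(now) > self.burst_bytes:
            return "qlc"   # sustained burst: spill directly to bulk flash
        return "scm"       # normal case: land in the fast write buffer

router = WriteRouter()
tier = router.route(256 << 20)  # e.g. a 256 MB checkpoint shard write
```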

Checkpointing periodically saves model state during AI training, so that a run can be rolled back to the last saved state should a disruption occur during processing.
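For readers unfamiliar with the pattern, a minimal PyTorch-style sketch is below; the model, optimiser, checkpoint interval and directory are placeholders, not anything specific to Vast.

```python
# Illustrative only: the standard PyTorch pattern for periodic checkpointing.
# The model, optimiser, interval and paths are placeholders.
import os
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

CKPT_DIR = "checkpoints"  # in practice this would sit on shared storage
os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(step: int) -> None:
    """Dump model and optimiser state so training can resume from here."""
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        os.path.join(CKPT_DIR, f"step_{step}.pt"),
    )

def restore_checkpoint(path: str) -> int:
    """Roll the run back to a saved state after a disruption."""
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]

for step in range(1, 5_001):
    # ... forward pass, backward pass and optimizer.step() would go here ...
    if step % 1000 == 0:  # write a checkpoint every 1,000 steps
        save_checkpoint(step)
```

Each checkpoint is a burst of large sequential writes from every node in the cluster at once, which is exactly the traffic pattern the Spillover behaviour described above is meant to absorb.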

Vast recently announced it will support Nvidia BlueField-3 DPUs, a move intended to position it as storage for large-scale AI workloads.

BlueField-3 is a smart NIC with a 16-core Arm processor that allows customers to offload security, networking and data services, typically on GPU-equipped servers.

Vast also announced a partnership with Super Micro in which Vast Data software is ported to commodity servers. “We’re talking x86 systems that build out to PB of storage,” said Mao. “Reading what’s between the lines, Super Micro sells a lot of Nvidia GPU-equipped servers that will have Bluefield on board, so it’s a good fit for Vast.”
