Interview: Nvidia on AI workloads and their impacts on data storage

Artificial intelligence (AI) workloads are quite unlike those we’ve seen previously in the enterprise. And across the different phases of AI work, the input/output (I/O) profile and the impact on storage can vary dramatically.

After intense training, we put AI to work inferencing from what it has learned. Also, we must take into account AI frameworks used and their characteristics, plus the demands on storage of retrieval-augmented generation (RAG) referencing and checkpointing.

We asked about all this when we met up with Nvidia’s vice-president and general manager of DGX Systems, Charlie Boyle, at the recent Pure Storage Accelerate event in Las Vegas.

In this first of a two-part series, Boyle talks about the key data challenges for customers embarking on AI projects, practical tips for customers beginning with AI, and differences across AI workload types, such as training, fine-tuning, inference, RAG and checkpointing.

What’s the biggest challenge with regard to data for AI that you see for customers?

The biggest challenge is knowing what data is good for your AI, what data is bad for it, and what maybe doesn’t matter.

Good data is going to provide better insights and more accurate results. Obviously, whether you’re doing a chatbot or anything else, this data is going to provide the right answer to the end user. What I would think of as bad data is data that could cloud the answer and that’s not adding value.

That could be data that’s old. If I’m doing a customer service chatbot, and it’s a support ticket from 15 years ago, is that helpful? Maybe it is, maybe it isn’t. You, in your own enterprise, in your own domain, have to make that distinction.

If it’s a helpdesk question from 15 years ago, it might ask, is your phone line connected to your modem? Not applicable to you anymore.

But in a manufacturing context with infrastructure in a factory, some of those capital assets could have been in use for 20, 30 years. So that one support ticket for an issue that happened 15 years ago, that only happens once a decade on that manufacturing product, may be super useful to you.
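The age judgement Boyle describes can be sketched as a per-domain cutoff applied before data goes anywhere near a model. This is only a minimal illustration: the `Ticket` type, the domain names and the cutoff values are hypothetical, and each enterprise would have to set its own.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical maximum useful age, in years, per domain (illustrative values only).
MAX_AGE_YEARS = {"helpdesk": 2, "manufacturing": 30}


@dataclass
class Ticket:
    domain: str
    opened: date
    text: str


def is_candidate(ticket: Ticket, today: date) -> bool:
    """Keep a ticket only if it is no older than its domain's cutoff."""
    cutoff: Optional[int] = MAX_AGE_YEARS.get(ticket.domain)
    if cutoff is None:
        # Domains without an agreed cutoff are excluded until someone reviews them.
        return False
    age_years = (today - ticket.opened).days / 365.25
    return age_years <= cutoff


today = date(2024, 6, 1)
tickets = [
    Ticket("helpdesk", date(2009, 6, 1), "Is your phone line connected to your modem?"),
    Ticket("manufacturing", date(2009, 6, 1), "Fault that appears once a decade"),
    Ticket("helpdesk", date(2023, 9, 1), "VPN client fails after update"),
]
kept = [t for t in tickets if is_candidate(t, today)]
```

Here the 15-year-old helpdesk ticket is dropped while the 15-year-old manufacturing ticket is kept, which mirrors the distinction made in the answer.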

So a big part of AI for enterprise is understanding the data you have. When we talk about getting started with AI, it’s always easier to start with data you feel is safe.

If I’m going to do a chatbot and my training dataset will be IT trouble tickets from the past 24 months, that’s probably super safe. It’s an internal thing. It was curated by experts because the IT people took the issue and wrote notes. Or it’s corporate data that’s been vetted, it’s information from my press releases, from my SEC filings, for example, things that I know legally had to be accurate.

Or here’s information from all of my publicly available marketing data on the website such as datasheets and product information. A human being looked at that and thought they wrote it correctly. So that’s easy to get started with.

But then as an enterprise, you think, I’ve got 20 years’ worth of data. What should I do with all of these things? Can I create insights? And that’s what you need for that first AI win. You need to show people that it’s useful. And then stepwise, go through, what would be the next most useful thing to my users? Those users could be internal or external users.

Create a hypothesis. It’s easy enough to do AI training by fine-tuning existing models. You don’t need to wait six months to build a foundational model like GPT-3 or GPT-4 anymore.

You can use an off-the-shelf model like Llama, fine-tune that for your domain, and do that in a couple of weeks. Or a day, depending on the model size and your compute infrastructure.

Adshead: What are your key tips for a customer that wants to put AI to work?

The first thing would be, there’s a ton of ready-made AI applications that you just need to add your data to. We’ve got a big catalogue on the Nvidia site. There are sites like Hugging Face, those types of things, where users have not only used the models, but they’ve commented on them.

The most common thing we see is chatbots. Even my most advanced AI users, people who have PhDs in this stuff, as I talk to them, it’s like, guys, you don’t need to code the chatbot. All the chatbot examples exist in the world.

Pick one and start with that. Customise it for your own needs. You don’t need a PhD to get started in AI.

So pick an off-the-shelf model. In many places, including our own site, you can try the off-the-shelf model completely online. You don’t need to put any of your own data into it.

So you can experiment and ask, for example, what does this type of model do for me? What types of questions can I answer with it? You can decide if that’s useful for your business, if it would make a good IT chatbot or a good customer service lookup.

If you’ve got a massive website or product documentation library, that’s an easy, safe thing to put a chatbot in front of.

As an IT user, as an enterprise user, you don’t need to be a chatbot expert to come up with the model. Models exist. You just need to feed it your own data. Pick a model that you think works and put your own data into it.

But put data into it that’s publicly available, because you don’t have any compliance risk there. So it’s not like, oops, I leaked some company confidential information. If I train it on a website that’s all publicly available information, then you’re safe.

And, once you’ve got past those couple of experiments, look at some model catalogues to see if there is an example that would solve a specific pain point in your business, one that you’re willing to dedicate a month or three months’ worth of project effort to.

What are the differences in I/O profile between training, fine-tuning, inference, work with RAG and the different frameworks used in AI? What are the demands of checkpointing? And what do they demand of storage?

If it’s a large model you’re training from scratch, you need very fast storage, because in a lot of AI training all the compute nodes hit the same file at the same time, as everything’s done in parallel. That requires very fast storage, very fast retrieval. It’s mostly read-oriented.
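The access pattern described here, many readers opening the same file at the same moment, can be imitated on a small scale. The sketch below is an assumption-laden toy: threads stand in for compute nodes, a 1 MiB temporary file stands in for a shared training file, and the function names are hypothetical.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List


def read_digest(path: str) -> str:
    """Read the whole file, as one training worker might, and hash its contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def parallel_read(path: str, workers: int) -> List[str]:
    """Have `workers` readers open and read the same file concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_digest, [path] * workers))


# Create a small stand-in for a shared training file (1 MiB of random bytes).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1024 * 1024))
    shared_path = tmp.name

digests = parallel_read(shared_path, workers=8)
os.remove(shared_path)
```

Every reader sees identical bytes, so the load on storage is purely a matter of how many simultaneous reads it can serve, which is why the workload is described as read-oriented.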

Checkpointing is very I/O-intensive because it scales in proportion to the training dataset. If you just had one node doing training, the likelihood of that one node, or its network connection, going down is very small. So, if I can accomplish my training on one node and it’s going to take four hours to do that training run, I probably don’t need to checkpoint.

In the unlikely event something did happen, I can re-do four hours. Then there’s the opposite extreme, which we see a lot in very large language models or self-driving car technology, where the training run may take three weeks, may take three months, and may have thousands of compute nodes on it. You’re guaranteed that with a cluster that big and a training run that long, something is going to happen.

A cosmic ray is going to hit something in that cluster that’s going to cause some error. And if you don’t checkpoint, you could be, for example, two months in and have to start over completely.
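The contrast between one node for four hours and thousands of nodes for weeks can be made concrete with a rough calculation. The sketch below assumes nodes fail independently with exponentially distributed failure times, which is a simplification, and the per-node mean time between failures (MTBF) of 50,000 hours is a hypothetical figure chosen purely for illustration.

```python
import math


def p_any_failure(nodes: int, run_hours: float, node_mtbf_hours: float) -> float:
    """Probability that at least one node fails during the run.

    Assumes independent nodes with exponentially distributed failures.
    """
    return 1.0 - math.exp(-nodes * run_hours / node_mtbf_hours)


NODE_MTBF_HOURS = 50_000.0  # hypothetical per-node figure, not a measured value

# One node for four hours versus 2,000 nodes for three weeks (504 hours).
single_node = p_any_failure(nodes=1, run_hours=4.0, node_mtbf_hours=NODE_MTBF_HOURS)
big_cluster = p_any_failure(nodes=2000, run_hours=504.0, node_mtbf_hours=NODE_MTBF_HOURS)
```

Under these assumptions the single-node run almost never sees a failure, while the large cluster is all but certain to see at least one, which is the point being made about needing checkpoints at scale.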

So then the question is how often do I checkpoint? Because when I’m doing a checkpoint, all compute stops. And it’s all about writes. And it’s everyone writing at the same time.
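One widely cited first-order answer to "how often" comes from the high-performance computing literature rather than from this interview: Young’s approximation, which puts the checkpoint interval at roughly the square root of twice the checkpoint cost multiplied by the system’s mean time between failures. The sketch below applies it with hypothetical inputs, and it only holds when a checkpoint takes much less time than the typical gap between failures.

```python
import math


def young_interval_hours(checkpoint_hours: float, system_mtbf_hours: float) -> float:
    """Young's first-order approximation of the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_hours * system_mtbf_hours)


def checkpoint_overhead(checkpoint_hours: float, interval_hours: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints instead of computing."""
    return checkpoint_hours / (interval_hours + checkpoint_hours)


# Hypothetical inputs: a 30-minute checkpoint on a cluster that sees
# some failure every 25 hours on average.
interval = young_interval_hours(checkpoint_hours=0.5, system_mtbf_hours=25.0)
overhead = checkpoint_overhead(checkpoint_hours=0.5, interval_hours=interval)
```

With those inputs the interval comes out at five hours of compute between checkpoints. Because compute stops during each write, a slower checkpoint pushes the interval out and raises the amount of work at risk.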

When you checkpoint, in an ideal cluster everyone finishes at the exact same time. In a well-tuned cluster, they’re within a few seconds. And then occasionally, on a very large cluster, you may have some nodes that, for whatever reason, might be a little slower than others. Maybe they drift by a couple of minutes.

But when everyone says, I’ve gotten to the 10km mark, everyone stops and everyone writes. Depending on how big your model is, how big your data is, that could be a very long write. Sometimes that write is over an hour.
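How long that write takes is, to a first approximation, the amount of state to be saved divided by the aggregate write bandwidth the storage can sustain. The figures in this sketch are hypothetical and ignore metadata overhead, stragglers and contention, so treat the result as a lower bound.

```python
def checkpoint_write_seconds(checkpoint_bytes: float, aggregate_write_bytes_per_s: float) -> float:
    """Lower-bound estimate of checkpoint write time at a sustained aggregate rate."""
    if aggregate_write_bytes_per_s <= 0:
        raise ValueError("write bandwidth must be positive")
    return checkpoint_bytes / aggregate_write_bytes_per_s


# Hypothetical example: 10 TB of checkpoint state written at 2 GB/s in aggregate.
seconds = checkpoint_write_seconds(checkpoint_bytes=10e12, aggregate_write_bytes_per_s=2e9)
minutes = seconds / 60.0
```

In this made-up case the write takes 5,000 seconds, or a little over 83 minutes, during which all compute is idle. Doubling the sustained write bandwidth would halve that.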
