Intro:
For decades, software has used checkpoints to save system state. A checkpoint can be as simple as dumping a chunk of data from memory to storage, marking that a section of code has completed, or recording that a piece of software executed and was tested properly.
But what about in the AI world? We've been hearing from customers and partners that checkpointing AI systems is becoming a challenge, so in this three-part blog series on AI checkpointing, we first explore some fundamentals and why they are important. In the second blog we will discuss the performance requirements for checkpoints. Lastly, in our third blog, we will show how Micron and Lightbits storage can solve all sorts of checkpoint woes.
What is a checkpoint in AI?
A checkpoint saves the state of an AI training job so that, if something goes wrong, training can restart from the last point at which the state was safely saved rather than from the beginning. It's like using the “save game” option in a video game.
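To make that concrete, here is a minimal sketch of what saving and restoring a checkpoint could look like in a PyTorch-style training script. The checkpoint path, model, and optimizer here are hypothetical placeholders, and real training frameworks ship their own (often more sophisticated) checkpointing utilities.

```python
import os
import torch

CHECKPOINT_PATH = "checkpoints/latest.pt"  # hypothetical location on shared storage

def save_checkpoint(model, optimizer, step):
    """Persist everything needed to resume training from this exact step."""
    state = {
        "step": step,
        "model": model.state_dict(),          # learned weights
        "optimizer": optimizer.state_dict(),  # optimizer state (momentum terms, etc.)
    }
    torch.save(state, CHECKPOINT_PATH)

def load_checkpoint(model, optimizer):
    """Resume from the last saved state, or start from scratch if none exists."""
    if not os.path.exists(CHECKPOINT_PATH):
        return 0  # no checkpoint yet: begin at step 0
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]  # continue from the saved step
```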
AI training has been an important enterprise workload for more than five years, so why are we discussing checkpointing now? The answer is twofold: it's about the types of models being trained and about a fundamental aspect of how training is done.
The rise of transformer models
Transformer models have become ubiquitous in data centers. Explaining how they work is outside the scope of this blog, but the property we care about here is that they enable parallelism.
Transformer models (like ChatGPT, Llama, Gemini, etc.) can take sequence data (like sentences and paragraphs) and turn it into vectors that can be processed in parallel across large amounts of hardware.
With previous model architectures, adding more GPUs to a cluster quickly hit diminishing returns on training performance, but this parallelism has driven the explosion of large AI training clusters. (In this case, “large” means thousands of servers with tens of thousands of GPUs.)
Fundamentals of training
Let’s break training down into a couple of steps:
- Train on new data (learn updated model weights from the data)
- Share the updated model weights across all GPUs
Modern training performs these two steps synchronously, which means that all the GPUs stop training to share updated model weights before continuing to train on new data, as the sketch below illustrates. The standard training process ties everything together in such a way that if any component fails during training, the entire cluster must stop and reload from a previous checkpoint.
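As a rough illustration (not any particular framework's implementation), the sketch below shows a synchronous data-parallel loop: each GPU computes gradients on its own shard of data, an all-reduce averages those gradients across every GPU, and the cluster periodically writes the checkpoint it will reload after a failure. It reuses the hypothetical checkpoint helpers from the earlier sketch and assumes `torch.distributed` has already been initialized and that the loss is a simple cross-entropy.

```python
import torch
import torch.distributed as dist

def train_loop(model, optimizer, data_loader, total_steps, checkpoint_every=100):
    """Synchronous data-parallel training: every rank averages its gradients with
    all other ranks before updating, so all GPUs advance in lockstep."""
    step = load_checkpoint(model, optimizer)        # resume from the last good state
    world_size = dist.get_world_size()              # assumes dist.init_process_group() was called
    data_iter = iter(data_loader)
    while step < total_steps:
        inputs, targets = next(data_iter)           # this rank's shard of the data
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()                             # step 1: learn from new data locally
        for p in model.parameters():                # step 2: share updates across all GPUs
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
        optimizer.step()                            # every rank now holds identical weights
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(model, optimizer, step) # the state the cluster reloads after a failure
```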
Failure rate of components
The next puzzle piece is understanding how failure rates combine when we increase the number of components that can fail.
For the statistics to work, we use the probability that a single GPU does not fail while training on a batch of data. If that probability is p, then the probability that a cluster of N GPUs all survive the batch is p^N.
What all of this means is that when we're estimating the failure rate of a cluster, the cluster's survival probability falls as the Nth power of the single-GPU survival probability, so the expected time between failures shrinks roughly in proportion to the cluster size. We can visualize this as the mean time to failure (MTTF):
We see that as cluster sizes increase, the MTTF reaches surprisingly low values. At 24,000 accelerators, we’re estimating a failure every two hours. The far-right side of the plot shows that for the very large clusters being deployed today, the MTTF is under 30 minutes.
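As a back-of-the-envelope check (our own assumption, not a figure taken from the plot), a simple independent-failure model puts the cluster MTTF at roughly the per-accelerator MTTF divided by the number of accelerators. A per-accelerator MTTF of about 5-6 years is what you would have to assume for a 24,000-GPU cluster to fail about every two hours:

```python
# Rough independent-failure model (an assumption for illustration):
# cluster MTTF ~ per-accelerator MTTF / number of accelerators.
MINUTES_PER_YEAR = 365 * 24 * 60

def cluster_mttf_minutes(per_gpu_mttf_years: float, num_gpus: int) -> float:
    return per_gpu_mttf_years * MINUTES_PER_YEAR / num_gpus

per_gpu_mttf_years = 5.7  # assumed value, back-solved from the ~2-hour figure above
for n in (1_000, 8_000, 24_000, 100_000):
    print(f"{n:>7} GPUs -> cluster MTTF ~ {cluster_mttf_minutes(per_gpu_mttf_years, n):.0f} minutes")
```

Under that same assumption, a 100,000-accelerator cluster works out to an MTTF of roughly 30 minutes, in line with the trend on the far-right side of the plot.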
Why storage considerations are important to checkpointing
Now that we understand more about training and its flow, let's look at some real-life scenarios.
If we look at a large open large language model like Llama 3 with 405 billion parameters, the checkpoint size is about 5TB.
Assume that we want to checkpoint 20 times between failures to achieve 95% good cluster throughput; that way, a failure costs at most about one-twentieth of the work done since the last failure.
This means that for 24,000 accelerators with an MTTF of 124 minutes, we will checkpoint roughly every 6 minutes. That works out to about 233 checkpoints per day at 5TB each, or roughly 1.2 petabytes of checkpoint data written by the cluster every day. Averaged across 24 hours of operation, this requires a sustained write throughput of about 13 GB/s across the cluster.
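The arithmetic is easy to reproduce; the short sketch below simply restates the example's assumptions (a 124-minute MTTF, 20 checkpoints between failures, roughly 5TB per checkpoint) as code.

```python
# Reproduce the worked example above from its stated assumptions.
mttf_minutes = 124            # cluster MTTF at 24,000 accelerators
checkpoints_per_failure = 20  # target cadence to keep goodput around 95%
checkpoint_size_tb = 5        # approximate Llama 3 405B checkpoint size

interval_minutes = mttf_minutes / checkpoints_per_failure          # ~6.2 minutes
checkpoints_per_day = (24 * 60) / interval_minutes                 # ~232 checkpoints
data_per_day_pb = checkpoints_per_day * checkpoint_size_tb / 1000  # ~1.2 PB/day
sustained_write_gb_s = data_per_day_pb * 1_000_000 / 86_400        # ~13 GB/s (1 PB = 1e6 GB)

print(f"checkpoint every {interval_minutes:.1f} minutes")
print(f"{checkpoints_per_day:.0f} checkpoints/day, {data_per_day_pb:.2f} PB/day")
print(f"sustained write throughput ~ {sustained_write_gb_s:.1f} GB/s")
```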
From this example, we can see that checkpointing depends heavily on storage performance. That is why Micron and Lightbits are supporting the MLPerf Storage benchmark suite (part of MLCommons) in creating a benchmark that tests checkpointing processes and helps teams understand the cluster and storage requirements when architecting AI solutions.
Conclusions and next blog
In this first blog, we discussed why checkpointing is an important part of training and why a performant storage solution for checkpointing matters.
Stay tuned for our next blog in this series, where we’ll discuss checkpoint storage requirements at scale (enterprise, cloud providers, and hyperscale).
To learn more about advanced storage and memory solutions from Lightbits and Micron, visit: www.lightbitslabs.com/micron/.