Intro:
For decades, software has used checkpoints to save system state. A checkpoint can be as simple as dumping a chunk of data from memory to storage, marking that a section of code has completed, or recording that a piece of software executed and was tested properly.
But what about in the AI world? We've been hearing from customers and partners that checkpointing AI systems is becoming a challenge, so in this three-part blog series on AI checkpointing, we first explore some fundamentals and why they are important. In the second blog we will discuss the performance requirements for checkpoints. Lastly, in our third blog, we will show how Micron and Lightbits storage can solve all sorts of checkpoint woes.
What is a checkpoint in AI?
A checkpoint saves the state of an AI training job so that, if something goes wrong, training can restart from the last point at which the state was safely saved rather than from the beginning. It's like using the “save game” option in a video game.
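To make that concrete, here is a minimal sketch of what saving and restoring a checkpoint could look like in a PyTorch-style training script. The checkpoint path, model, and optimizer here are hypothetical placeholders, and real training frameworks ship their own (often more sophisticated) checkpointing utilities.

```python
import os
import torch

CHECKPOINT_PATH = "checkpoints/latest.pt"  # hypothetical location on shared storage

def save_checkpoint(model, optimizer, step):
    """Persist everything needed to resume training from this exact step."""
    state = {
        "step": step,
        "model": model.state_dict(),          # learned weights
        "optimizer": optimizer.state_dict(),  # optimizer state (momentum terms, etc.)
    }
    torch.save(state, CHECKPOINT_PATH)

def load_checkpoint(model, optimizer):
    """Resume from the last saved state, or start from scratch if none exists."""
    if not os.path.exists(CHECKPOINT_PATH):
        return 0  # no checkpoint yet: begin at step 0
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]  # continue from the saved step
```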
AI training has been an important enterprise workload for more than five years, so why are we discussing checkpointing now? The answer is twofold: it's about the types of models being trained and about a fundamental aspect of how training is done.
The rise of transformer models
Transformer models have become ubiquitous in data centers. Explaining how they work is outside the scope of this blog, but the property we care about here is that they enable parallelism.
Transformer models (like ChatGPT, Llama, Gemini, etc.) can take sequence data (like sentences and paragraphs) and turn it into vectors that can be processed in parallel across large amounts of hardware.
With previous model architectures, adding more GPUs to a cluster quickly hit diminishing returns on training performance, but this parallelism has driven the explosion of large AI training clusters. (In this case, “large” means thousands of servers with tens of thousands of GPUs.)
Fundamentals of training
Let’s break training down into a couple of steps:
- Train on new data (learn updated model weights from the data)
- Share the updated model weights across all GPUs
Modern training performs these two steps synchronously, which means that all the GPUs stop training to share updated model weights before continuing to train on new data, as the sketch below illustrates. The standard training process ties everything together in such a way that if any component fails during training, the entire cluster must stop and reload from a previous checkpoint.
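As a rough illustration (not any particular framework's implementation), the sketch below shows a synchronous data-parallel loop: each GPU computes gradients on its own shard of data, an all-reduce averages those gradients across every GPU, and the cluster periodically writes the checkpoint it will reload after a failure. It reuses the hypothetical checkpoint helpers from the earlier sketch and assumes `torch.distributed` has already been initialized and that the loss is a simple cross-entropy.

```python
import torch
import torch.distributed as dist

def train_loop(model, optimizer, data_loader, total_steps, checkpoint_every=100):
    """Synchronous data-parallel training: every rank averages its gradients with
    all other ranks before updating, so all GPUs advance in lockstep."""
    step = load_checkpoint(model, optimizer)        # resume from the last good state
    world_size = dist.get_world_size()              # assumes dist.init_process_group() was called
    data_iter = iter(data_loader)
    while step < total_steps:
        inputs, targets = next(data_iter)           # this rank's shard of the data
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()                             # step 1: learn from new data locally
        for p in model.parameters():                # step 2: share updates across all GPUs
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
        optimizer.step()                            # every rank now holds identical weights
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(model, optimizer, step) # the state the cluster reloads after a failure
```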
Failure rate of components
The next puzzle piece is understanding how failure rates combine when we increase the number of components that can fail.
For the statistics to work, we use the probability that a single GPU does not fail while training on a batch of data. If that probability is p, then the probability that a cluster of N GPUs all survive the batch is p^N.
What all of this means is that when we're estimating the failure rate of a cluster, the cluster's survival probability falls as the Nth power of the single-GPU survival probability, so the expected time between failures shrinks roughly in proportion to the cluster size. We can visualize this as the mean time to failure (MTTF):
We see that as cluster sizes increase, the MTTF reaches surprisingly low values. At 24,000 accelerators, we’re estimating a failure every two hours. The far-right side of the plot shows that for the very large clusters being deployed today, the MTTF is under 30 minutes.
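As a back-of-the-envelope check (our own assumption, not a figure taken from the plot), a simple independent-failure model puts the cluster MTTF at roughly the per-accelerator MTTF divided by the number of accelerators. A per-accelerator MTTF of about 5-6 years is what you would have to assume for a 24,000-GPU cluster to fail about every two hours:

```python
# Rough independent-failure model (an assumption for illustration):
# cluster MTTF ~ per-accelerator MTTF / number of accelerators.
MINUTES_PER_YEAR = 365 * 24 * 60

def cluster_mttf_minutes(per_gpu_mttf_years: float, num_gpus: int) -> float:
    return per_gpu_mttf_years * MINUTES_PER_YEAR / num_gpus

per_gpu_mttf_years = 5.7  # assumed value, back-solved from the ~2-hour figure above
for n in (1_000, 8_000, 24_000, 100_000):
    print(f"{n:>7} GPUs -> cluster MTTF ~ {cluster_mttf_minutes(per_gpu_mttf_years, n):.0f} minutes")
```

Under that same assumption, a 100,000-accelerator cluster works out to an MTTF of roughly 30 minutes, in line with the trend on the far-right side of the plot.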
Why storage considerations are important to checkpointing
Now that we understand more about training and its flow, let's look at some real-life scenarios.
If we look at a large open large language model like Llama 3 with 405 billion parameters, the checkpoint size is about 5TB.
Assume that we want to checkpoint 20 times between failures to achieve 95% good cluster throughput; that way, a failure costs at most about one-twentieth of the work done since the last failure.
This means that for 24,000 accelerators with an MTTF of 124 minutes, we will checkpoint roughly every 6 minutes. That works out to about 233 checkpoints per day at 5TB each, or roughly 1.2 petabytes of checkpoint data written by the cluster every day. Averaged across 24 hours of operation, this requires a sustained write throughput of about 13 GB/s across the cluster.
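The arithmetic is easy to reproduce; the short sketch below simply restates the example's assumptions (a 124-minute MTTF, 20 checkpoints between failures, roughly 5TB per checkpoint) as code.

```python
# Reproduce the worked example above from its stated assumptions.
mttf_minutes = 124            # cluster MTTF at 24,000 accelerators
checkpoints_per_failure = 20  # target cadence to keep goodput around 95%
checkpoint_size_tb = 5        # approximate Llama 3 405B checkpoint size

interval_minutes = mttf_minutes / checkpoints_per_failure          # ~6.2 minutes
checkpoints_per_day = (24 * 60) / interval_minutes                 # ~232 checkpoints
data_per_day_pb = checkpoints_per_day * checkpoint_size_tb / 1000  # ~1.2 PB/day
sustained_write_gb_s = data_per_day_pb * 1_000_000 / 86_400        # ~13 GB/s (1 PB = 1e6 GB)

print(f"checkpoint every {interval_minutes:.1f} minutes")
print(f"{checkpoints_per_day:.0f} checkpoints/day, {data_per_day_pb:.2f} PB/day")
print(f"sustained write throughput ~ {sustained_write_gb_s:.1f} GB/s")
```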
From this example, we can see that checkpointing depends heavily on storage performance. That is why Micron and Lightbits are supporting the MLPerf Storage benchmark suite (part of MLCommons) in creating a benchmark that tests checkpointing processes and helps teams understand the cluster and storage requirements when architecting AI solutions.
Conclusions and next blog
In this first blog, we discussed why checkpointing is an important part of training and why a performant storage solution for checkpointing matters.
Stay tuned for our next blog in this series, where we’ll discuss checkpoint storage requirements at scale (enterprise, cloud providers, and hyperscale).
To learn more about advanced storage and memory solutions from Lightbits and Micron, visit: www.lightbitslabs.com/micron/.