The Micron 9550 high-performance SSD is a game-changer for AI workloads. Using our battle-tested Micron G8 NAND, a leading-edge controller and vertical integration of all key components in the SSD, we’ve built a drive that delivers best-in-class performance while using less power.
My team tested the Micron 9550 U.2 7.68TB drive in four cutting-edge AI workloads. Our results prove that the Micron 9550 is the best data center SSD for AI systems.
Four workloads, best in class
In all tested workloads, the Micron 9550 not only gets the work done faster but also uses less average power, meaning large savings in SSD energy used (energy = workload time × average power).
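Since every result below leans on that formula, here it is as a minimal Python sketch. The runtime and power values in the example are hypothetical, not measured results from our testing.

```python
# First-order SSD energy model: energy = workload runtime x average power.
def ssd_energy_wh(runtime_s: float, avg_power_w: float) -> float:
    """Return SSD energy in watt-hours for a workload."""
    return runtime_s * avg_power_w / 3600.0  # W*s -> Wh

# Hypothetical example: a 30-minute workload at a 16.6 W average draw.
print(f"{ssd_energy_wh(30 * 60, 16.6):.2f} Wh")  # ~8.30 Wh
```

Let’s look at each of these workloads in more detail.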
Graph neural network training: Big accelerator Memory
Big accelerator Memory (BaM) and GPU-Initiated Direct Storage (GIDS) replace the NVMe driver and use the GPU’s massive thread parallelism to enhance performance on PCIe® Gen5 SSDs with the NVIDIA® H100. This workload demands the highest-performance small-block input/output (IO) we’ve ever tested.
This synthetic test is similar to FIO (flexible IO) but is initiated from the H100 GPU. Here we see the Micron 9550 hitting 3.4 million input/output operations per second (IOPS). We also graphed IOPS per watt, showing the Micron 9550 is up to two times more energy efficient than the competition.
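For reference, the efficiency metric is simply IOPS divided by average SSD power. Here’s a quick sketch using the real-workload figures reported below (2.9 million IOPS at 16.6 W):

```python
# IOPS-per-watt efficiency metric used in the BaM comparison.
def iops_per_watt(iops: float, avg_power_w: float) -> float:
    return iops / avg_power_w

# Figures from the GIDS training workload described below.
print(f"{iops_per_watt(2.9e6, 16.6):,.0f} IOPS/W")  # ~174,699 IOPS/W
```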
What does this look like in a real AI training workload?
- Higher performance: Training a graph neural network on an H100 with BaM and the Micron 9550 completes 33% faster, driven by the SSD’s 60% higher throughput.
- Lower SSD power: The Micron 9550 draws just 16.6 W while sustaining 2.9 million IOPS, resulting in 43% less SSD energy used to get the work done.
- Less system energy used: Looking at the system power draw, the speed and efficiency of the Micron 9550 reduce the total system energy used by 29%.
For high-performance, storage-bound workloads like BaM, the power efficiency of the Micron 9550 translates directly into lower system energy, saving power and controlling costs in the data center.
Unet3D medical image segmentation with MLPerf Storage
The MLPerf Storage benchmark simulates the Unet3D AI training workload by laying out files of the exact size that the medical image segmentation model uses. It then processes them using TensorFlow and PyTorch, emulating a GPU by inserting a sleep time where the GPU would be running the training operations. This sleep time can be tuned to show how much storage throughput different GPUs require to run a given model.
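Conceptually, the emulation loop looks something like the sketch below. This is our own illustration of the idea, not MLPerf Storage’s actual code; the file path and sleep time are placeholders.

```python
import time

# Illustrative sketch of MLPerf Storage's GPU emulation: each training
# step reads a sample from the SSD, then sleeps for the time a real GPU
# would spend computing on that sample.
GPU_COMPUTE_TIME_S = 0.05  # tuned to match the accelerator being emulated

def load_sample(path: str) -> bytes:
    """The storage-bound part of the step: read one sample from disk."""
    with open(path, "rb") as f:
        return f.read()

def emulated_training_step(path: str) -> None:
    _ = load_sample(path)           # exercises the SSD like a data loader
    time.sleep(GPU_COMPUTE_TIME_S)  # stand-in for the GPU training work
```

Tuning the sleep time up or down emulates faster or slower accelerators, which is how the benchmark reveals how much storage throughput a given GPU demands.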
- Higher performance: Here we see a 5% performance increase. The workload is heavy on large-block reads, an IO pattern where all the tested SSDs perform similarly and one that is typical of many AI training workloads.
- Lower SSD power: What’s different about the Micron 9550 is that it uses 32% less average SSD power while achieving that 5% performance increase.
- Less SSD energy used: Higher performance and lower average SSD power translate to 35% less SSD energy used during this workload.
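As a sanity check, the first-order energy model from earlier reproduces that figure from the other two numbers:

```python
relative_time = 1 / 1.05   # the work finishes 5% faster
relative_power = 1 - 0.32  # 32% less average SSD power
savings = 1 - relative_time * relative_power
print(f"{savings:.0%} less SSD energy")  # ~35%
```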
Saving power at the SSD level allows more flexibility in AI training server power budgets and enables GPU-dense designs.
Large language model inference with DeepSpeed ZeRO-Inference
DeepSpeed ZeRO-Inference enables LLMs that don’t fit in main memory to run by intelligently offloading model data to the SSD.
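For context, a ZeRO stage 3 configuration with NVMe parameter offload looks roughly like the sketch below. The mount path is a placeholder, and the exact keys should be verified against the DeepSpeed documentation for your version.

```python
# Rough sketch of a DeepSpeed ZeRO (stage 3) config that offloads model
# parameters to an NVMe SSD instead of holding them in memory.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme",  # placeholder mount point for the SSD
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}
# The config is then passed to deepspeed.initialize() along with the model.
```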
The first test shows synthetic reads and writes representing the maximum performance an LLM will see with a given SSD.
Reads are more common in inference workloads. Here we see 15% higher throughput and 27% lower SSD power, resulting in SSD and system energy savings of 37% and 19%, respectively.
Writes are far less common but do occur during checkpointing or in retrieval-augmented generation (RAG) workloads. Here we see 78% higher throughput for the Micron 9550 with 22% less SSD power, resulting in 51% less SSD energy used and 43% less system energy used.
What does this look like on Meta Llama 3 70B?
- Slightly higher performance: The Meta Llama 3 70-billion-parameter model on a system with two NVIDIA L40S inference accelerators shows a small increase in tokens per second with the Micron 9550. This workload is 99% 256KB random reads, an IO pattern where all the SSDs tested perform similarly, and it is also GPU compute-bound.
- Lower SSD power: We see 19% lower SSD power on the Micron 9550, resulting in 21% less SSD energy used.
- Less system energy used: System energy wasn’t greatly affected here because the energy use of two L40S GPUs was much higher than that of a single SSD. Deployed at scale, though, even a 2% savings at the system level can still be significant.
The Micron 9550 uses 19% less power and 21% less energy to achieve a similar level of performance in a GPU-bound workload. If the storage subsystem uses less power, system architects will have additional power headroom to fit more GPUs in an inference system.
NVIDIA GPUDirect® Storage
Finally, let’s look at NVIDIA GPUDirect Storage (GDS). Here we’re generating IO from the NVIDIA H100 GPU at different IO sizes, reading data directly from the Micron 9550 and bypassing the CPU+DRAM bounce buffer.
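As a minimal sketch, a GDS-style read might look like the following, assuming the kvikio library (NVIDIA’s Python bindings for cuFile); the path and buffer size are examples:

```python
import cupy as cp
import kvikio

# Read 4MB from the SSD straight into GPU memory via GPUDirect Storage,
# skipping the CPU+DRAM bounce buffer. Path and size are illustrative.
buf = cp.empty(4 * 1024 * 1024, dtype=cp.uint8)
f = kvikio.CuFile("/mnt/nvme/data.bin", "r")
nbytes = f.read(buf)  # DMA directly from the drive into the GPU buffer
f.close()
print(f"read {nbytes} bytes into GPU memory")
```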
- Higher performance: Throughput ranges from 9% to 34% higher than the competition. The Micron 9550 is much faster with small-block IO; as IO sizes increase, the drives’ results converge.
- Lower SSD power: The Micron 9550 draws up to 30% less power.
- Less SSD energy used: The Micron 9550 uses up to 66% less energy while transferring 1TB of data.
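To make that last comparison concrete, the back-of-the-envelope form of the calculation is below; the throughput and power inputs are illustrative placeholders, not the measured values behind the 66% figure.

```python
# Energy to move a fixed amount of data: joules = (bytes / throughput) x power.
def transfer_energy_j(nbytes: float, gb_per_s: float, avg_power_w: float) -> float:
    return (nbytes / (gb_per_s * 1e9)) * avg_power_w

# Hypothetical example: 1TB read at 14 GB/s while drawing 16.6 W.
print(f"{transfer_energy_j(1e12, 14.0, 16.6):,.0f} J")  # ~1,186 J
```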
Higher performance, lower SSD power, less energy used
An obvious pattern emerged from these four AI workloads: The Micron 9550 uses less power while performing better, which translates to significant energy savings at the SSD and system levels.
AI workloads continue to push the limits of data center system performance and are beginning to drive extremely high-performance requirements into data center SSDs. The Micron 9550 was designed to tackle that emerging challenge; the proof is in the workload performance.