Maximize Your Investment in Micron SSDs for AI/ML Workloads With NVIDIA GPUDirect Storage

By Wes Vaske - 2020-07-13

Introduction

Here at Micron, we are excited about the enormous innovations driven by artificial intelligence (AI). For this reason, we have been very active in exploring what effect flash can have on this new class of applications and how we can work with other amazing technologies — both hardware and software — to accelerate innovations through AI.

I recently had the chance to test-drive an early release of the NVIDIA GPUDirect Storage software in our lab, and the results are pretty impressive. Before I dive into the results, however, how about some quick background?

Since 2007, NVIDIA CUDA has enabled GPU-accelerated processing of many different compute-bound tasks. And with the introduction of GPUDirect RDMA in 2014, NVIDIA enabled direct data movement between GPUs and a variety of PCIe adapters such as network interface cards, storage controllers and video capture devices. GPUDirect Storage builds on GPUDirect RDMA and CUDA to move data from the storage device (such as an NVMe SSD) directly into GPU memory. In effect, this removes the copy through CPU system memory (used for I/O buffering) that a legacy storage I/O configuration requires (Figure 1).

Figure 1: Comparison of data path with and without GPUDirect Storage

Until recently, however, GPU acceleration of many standard data science libraries has been lacking. In fact, the most commonly used Python data science libraries (NumPy, Pandas, scikit-learn) do not have native GPU acceleration, and some even have specific plans not to develop such acceleration.

While other libraries are looking to enable GPU support in ways that are compatible with Pandas, NumPy and scikit-learn, there has not been a holistic GPU-enabled set of libraries until now. With the introduction of the RAPIDS open-source software library, based on CUDA, we now have a suite of open-source libraries and APIs that provide the ability to execute end-to-end data science and analytics pipelines entirely on GPUs.

With these new GPU-accelerated RAPIDS libraries, and others, delivering compute performance many times that of non-accelerated libraries, we are finding that storage I/O is becoming a bottleneck. This is the area that GPUDirect Storage is specifically designed to help.

While logically, this is an easy problem to solve (NVMe drives, NICs and HBAs all have DMA engines that can support the direct transfer of data to GPU memory addresses), the actual implementation is a bit more complicated. I refer you to NVIDIA's blogs and corporate communications and specifically an introductory blog about GPUDirect Storage to learn more. Additionally, while RAPIDS is a specific use case for GPUDirect Storage, that's just one of many possible implementations.

Now let's look at how GPUDirect Storage improves performance using some real testing on real systems.

Test Configuration

For all the tests here, I'm using a SuperMicro SYS-4029GP-TVRT with 8x NVIDIA V100 GPUs, 2x Intel Xeon 8180M CPUs (28 cores each) and 8x Micron 9300 Pro 15.36TB NVMe SSDs. Figure 2 shows the specific layout of the system.

Figure 2: PCIe layout of SuperMicro SYS-4029GP-TVRT

One interesting architectural feature of the server used in our testing is that the NVMe SSDs are directly connected to the CPUs. This specific system could have better throughput using remote storage (e.g., accessed via NVMe over Fabrics) versus local NVMe. Why? Twice as many PCIe lanes are used for NICs as for NVMe drives. Since GPUDirect Storage uses DMA to move data from storage devices directly to GPUs, it could also use RDMA and NVMe-oF. If higher-performance network interfaces access those devices via their common PCIe switch, then we can move more data than if those storage devices were directly connected through the CPU as required for this server. In the future, we plan to explore the performance of GPUDirect Storage with remote storage accessed via network adapters connected to the same PCIe switches as the GPUs; the data here covers only local NVMe.

Performance Results

4KB random read performance

We start with a typical 4KB random read workload. Here each GPU is reading from a 1TB file on a Micron 9300 NVMe SSD. There is a one-to-one relationship — each GPU is reading exclusively from a single NVMe drive. This relationship isn't necessarily how you would configure a production system, but in the lab, it makes testing different numbers of GPUs and SSDs easier. For a production system, you would configure all the drives on a single CPU into a single RAID group.

Figure 3: CPU utilization and throughput by worker count per GPU-NVMe pair for 8x GPUs by data path for 4K I/O transfer size

In Figure 3, we see a significant performance advantage for GPUDirect Storage over the legacy data path that uses the CPU “bounce buffer.” At low worker counts, the GPUDirect Storage path is slightly faster and uses slightly less CPU. As worker counts increase, the advantage of GPUDirect Storage grows dramatically. In this configuration, GPUDirect Storage reaches a peak of 8.8 GB/s throughput (blue bar for 64 workers) while keeping CPU utilization equivalent to less than six CPU cores (green line, right axis). In contrast, the legacy path peaks at 2.24 GB/s of throughput (gray bar at 12 workers); as worker counts increase further, it loses performance and its CPU utilization climbs to the equivalent of 52 fully loaded CPU cores at 96 workers (black line, right axis).
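To put the Figure 3 numbers in perspective, the short Python sketch below converts the quoted peak throughputs into 4KB IOPS and a speedup ratio. It is only back-of-the-envelope arithmetic on the figures quoted above. It assumes that "4KB" means 4,096 bytes and that GB/s means 10^9 bytes per second, neither of which the text states, and the helper name `to_iops` is made up for illustration.

```python
# Peak figures quoted in the text for the 4KB random read test (8 GPU-NVMe pairs).
GDS_PEAK_GBPS = 8.8        # GPUDirect Storage path, 64 workers
LEGACY_PEAK_GBPS = 2.24    # legacy bounce-buffer path, 12 workers
GDS_MAX_CORES = 6          # text says "less than six CPU cores", so this is an upper bound

IO_SIZE_BYTES = 4 * 1024   # assumption: "4KB" means 4,096 bytes
BYTES_PER_GB = 1e9         # assumption: GB/s is decimal gigabytes per second


def to_iops(throughput_gbps: float, io_size_bytes: int = IO_SIZE_BYTES) -> float:
    """Convert aggregate throughput into I/O operations per second."""
    return throughput_gbps * BYTES_PER_GB / io_size_bytes


speedup = GDS_PEAK_GBPS / LEGACY_PEAK_GBPS
gds_iops = to_iops(GDS_PEAK_GBPS)
legacy_iops = to_iops(LEGACY_PEAK_GBPS)
# Fewer than six cores were busy, so throughput per core is at least this value.
gds_gbps_per_core_min = GDS_PEAK_GBPS / GDS_MAX_CORES

print(f"peak-to-peak speedup:        {speedup:.2f}x")
print(f"GPUDirect Storage peak IOPS: {gds_iops:,.0f}")
print(f"legacy path peak IOPS:       {legacy_iops:,.0f}")
print(f"GPUDirect Storage GB/s/core: at least {gds_gbps_per_core_min:.2f}")
```

Under these assumptions, the GPUDirect Storage peak is roughly 3.9 times the legacy peak, or a little over 2 million 4KB reads per second across the eight drives. The same per-core figure cannot be computed for the legacy path at 96 workers, because the text gives its CPU utilization there but not its throughput.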

Impact of data transfer size on performance

Next, we examine the impact of data transfer size on performance. For this test, we use 16 workers per GPU-NVMe pair. We chose 16 workers based on the results of the previous 4KB I/O test: 16 workers is just past the peak for the legacy data path and well before the peak for the GPUDirect Storage data path.

Figure 4: CPU utilization and throughput by I/O transfer size for 8x GPUs by data path with 16 workers per GPU-NVMe pair

Looking at Figure 4, we find that, as we scale the I/O transfer size, inherent limitations of the legacy data path are mitigated. At large transfer sizes, the throughput and CPU utilization become effectively equivalent. Unfortunately, it's often challenging to ensure all workloads that access your storage are consistently using large data transfer sizes. Therefore, the story for small to medium transfers is the same as above — substantial throughput improvements and correspondingly better CPU utilization when using GPUDirect Storage.

I/O latency

We look at one more important aspect of performance: I/O latency. For this test, we again use 16 workers per GPU-NVMe pair and scale the I/O transfer size.

Figure 5: Latency and throughput by I/O transfer size for 8x GPUs by data path with 16 workers per GPU-NVMe pair

As with throughput and CPU utilization, latency for the GPUDirect Storage data path sees significant improvements at small to medium transfer sizes and is effectively equivalent at large transfer sizes when compared to the legacy data path (Figure 5). Latency presented here is in microseconds (µs), and the values we see for GPUDirect Storage are astounding. 4KB transfers have an average latency of 116 µs, which increases to only 203 µs at 32KB transfers. For latency-sensitive workloads, this can have a significant effect on application performance. Beyond 32KB transfers, we're in a realm where latency is dominated by the time to transfer the data rather than by the data access overhead that dominates smaller I/O transfers.
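As a rough sanity check on these latency figures, Little's Law relates the number of I/Os in flight to throughput and average latency (in-flight I/Os = IOPS × latency). The Python sketch below applies it to the 16-workers-per-pair test. It assumes each worker keeps exactly one I/O outstanding, which the text does not state, so the throughput values it prints are what the quoted latencies would imply under that assumption, not measured results. The function name `implied_gbps` is made up for illustration.

```python
PAIRS = 8                 # GPU-NVMe pairs in the test system
WORKERS_PER_PAIR = 16     # worker count used for the latency test
# Assumption (not stated in the text): each worker has one I/O outstanding at a time.
IN_FLIGHT = PAIRS * WORKERS_PER_PAIR

# Average GPUDirect Storage latencies quoted in the text, keyed by transfer size in KB.
LATENCY_US = {4: 116.0, 32: 203.0}


def implied_gbps(size_kb: int, latency_us: float, in_flight: int = IN_FLIGHT) -> float:
    """Little's Law: IOPS = in-flight I/Os / latency; throughput = IOPS * I/O size."""
    iops = in_flight / (latency_us * 1e-6)
    # Assumes 1 KB = 1,024 bytes and 1 GB = 10^9 bytes.
    return iops * size_kb * 1024 / 1e9


implied = {size: implied_gbps(size, lat) for size, lat in LATENCY_US.items()}
size_ratio = 32 / 4
latency_ratio = LATENCY_US[32] / LATENCY_US[4]

for size, gbps in implied.items():
    print(f"{size}KB transfers: {gbps:.2f} GB/s implied by the quoted latency")
print(f"{size_ratio:.0f}x larger transfers, {latency_ratio:.2f}x the latency")
```

An 8x increase in transfer size for a 1.75x increase in latency is consistent with the point above: at these sizes, fixed per-I/O overhead rather than data transfer time accounts for most of the latency.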

Conclusion

At small and medium block sizes, GPUDirect Storage supplies considerable increases in total throughput, significant decreases in latency and significant decreases in the number of CPU cores required. Any of these improvements individually would make this technology worth exploring for any application requiring GPU acceleration; experiencing all these improvements simultaneously marks a step-function change in the GPU I/O landscape. Larger block transfers — in my testing — don't see the same benefits, but there are no downsides to using GPUDirect Storage versus the legacy data path.

NVIDIA’s GPUDirect Storage, combined with Micron’s data center SSDs, accelerates your AI solutions, allowing you to do more in less time. To learn how Micron and NVIDIA can help you deploy successful AI solutions, check out these helpful resources:

Wes Vaske

Wes Vaske is a Senior Member of Technical Staff on the Micron Data Center Workloads Engineering team in Austin, Texas. He analyzes enterprise workloads to understand the performance effects of Flash and DRAM devices on applications and provides 'real-life' workload characterization to internal design & development teams. Wes's specific focus is Artificial Intelligence applications and developing the tools for tracing and system observation.
