Artificial Intelligence and Machine Learning Demand High Performance Storage Series, Part Two: Training
In the previous blog in my series on artificial intelligence and machine learning, I introduced what AI and ML are and presented a high-level overview of how data is transformed and used to create "intelligent" responses by an AI system. In that blog, we explained how large amounts of data must be ingested - from multiple, disparate sources - and then transformed into a usable format for the next step in the AI process: training.
And that is where we pick up in Part Two today. As a reminder, the image below represents a typical AI workflow consisting of four primary components: ingest, transform, train, and execute. In the previous blog, we covered the ingest and transform processes, since they typically happen together as data is prepared for the AI training process.
The training step is typically an extremely resource-intensive part of the process, though as we will see in a future blog post, inference can be even more resource intensive. This is where really hefty hardware, typically in the form of graphics processing units (GPUs) with lots of fast memory, is used. The train phase of the workflow repeatedly executes a set of mathematical functions on the ingested data, designed to identify a desired response or result with a high level of probability. The results are then evaluated for accuracy. If the accuracy is not acceptably high - and this typically means in the 95+ percent range - the mathematical functions are adjusted and then tried again against the same data set.
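To make that train-evaluate-adjust cycle concrete, here is a minimal sketch of the loop. It assumes PyTorch (a framework choice of mine, not something the blog specifies) and uses a small synthetic data set purely for illustration.

```python
# Minimal sketch of the train / evaluate / adjust loop described above,
# using PyTorch and a synthetic data set as a stand-in for real training data.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 1,000 samples, 32 features, 10 classes (illustration only).
features = torch.randn(1000, 32)
labels = torch.randint(0, 10, (1000,))
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

target_accuracy = 0.95           # stop once accuracy is "acceptably" high
for epoch in range(20):          # each full pass over the data is one epoch
    for batch_x, batch_y in loader:      # each batch is one training step
        optimizer.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()          # compute how the functions should change
        optimizer.step()         # apply the adjustments to the model

    # Evaluate accuracy against the same data set after each epoch.
    with torch.no_grad():
        accuracy = (model(features).argmax(dim=1) == labels).float().mean().item()
    print(f"epoch {epoch}: accuracy {accuracy:.2%}")
    if accuracy >= target_accuracy:
        break
```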
A classic example of a typical AI use case is simple image recognition. Here, the best-known test data set for image recognition is called ImageNet, and the best-known model is a set of functions called ResNet. I won't go into detail here, but the ImageNet training data set contains 1.2 million images and takes around 145GB of storage. ResNet comes in varying degrees of complexity, but the one typically used is ResNet-50 (there are also ResNet-101 and ResNet-152). The number indicates how many neural network "layers" of mathematical functions called "neurons" the model uses, and is also a rough measure of the AI model's complexity.
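For readers who want to see what this looks like in practice, the sketch below shows how ResNet-50 and an ImageNet-style data set are commonly wired together with torchvision. The framework choice, the preprocessing details, and the data path are my own illustrative assumptions, not part of the blog's setup.

```python
# Hedged sketch: loading an ImageNet-style folder and a ResNet-50 model
# with torchvision. The path below is a hypothetical placeholder.
import torch
from torchvision import datasets, models, transforms
from torch.utils.data import DataLoader

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),   # ImageNet models typically train on 224x224 crops
    transforms.ToTensor(),
])

# Assumes the ~1.2M-image, ~145GB ImageNet training set has been extracted
# into class-labeled subfolders under this (hypothetical) path.
train_set = datasets.ImageFolder("/data/imagenet/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

model = models.resnet50(num_classes=1000)  # 50 layers; resnet101 and resnet152 also exist
print(model)
```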
So, what does this all have to do with a discussion about storage and memory - aspects that are of extreme interest to us here at Micron? Well, the training process - like the ingest/transform stage before it - can be a time-consuming and complex process. But unlike the ingest/transform stage, the train stage depends on high-performance compute to execute the mathematical functions.
In our testing we have found that the amount of fast storage and memory available to the solution directly impacts how long a given training run takes to complete. The faster we can complete each pass through the training data (called an epoch), the more epochs we can execute and the more accurate we can make our AI system while keeping training time relatively low. So, while we could use HDDs for our training data storage, rotating media is simply too slow: the GPUs cannot get data fast enough to complete each epoch in a timely fashion. SSDs are typically orders of magnitude faster than HDDs in terms of IOPS and latency. It follows that if we can feed the training system faster, we can complete the work more quickly.
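One way to see whether storage is the bottleneck is to time how fast batches can be read from disk, independent of the model. The helper below is a hypothetical sketch (it reuses the train_loader from the earlier example): if the images-per-second it reports is lower than what the GPUs can consume, the GPUs sit idle waiting on storage and each epoch takes longer.

```python
# Rough sketch of measuring how fast the storage path can feed batches,
# independent of the model. Assumes the train_loader defined earlier.
import time

def measure_data_throughput(loader, max_batches=100):
    """Time how quickly batches can be read and decoded from storage."""
    start = time.perf_counter()
    images_seen = 0
    for i, (images, _labels) in enumerate(loader):
        images_seen += images.size(0)
        if i + 1 >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return images_seen / elapsed   # images per second the storage path sustains

# Example usage (uncomment once train_loader points at real data):
# print(f"{measure_data_throughput(train_loader):.0f} images/sec from storage")
```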
Also, if we can increase the amount of data fed to the GPUs in each training step - what we call a "batch" - then we can complete each epoch in fewer, larger steps and get the same results faster. So the more memory we can put into the system, the better. But while we could put 2TB or more of DRAM in a server and call it a day, that can be very expensive, and most organizations are constantly balancing cost and efficiency. Based on our testing, we believe there are better results from focusing on faster storage (SSDs), and at a much better price point: SSDs cost less than DRAM on a per-byte basis.
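As a rough illustration of that batch-size trade-off, the sketch below (reusing the hypothetical train_set from the earlier example) contrasts a small-batch and a large-batch data loader. The specific values are examples, not recommendations; larger batches mean fewer steps per epoch but need more memory to hold each batch in flight.

```python
# Illustrative sketch of the batch-size / memory trade-off discussed above.
# Larger batches reduce the number of steps per epoch, at the cost of more
# memory per step; the batch sizes below are examples only.
from torch.utils.data import DataLoader

small_batch_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
large_batch_loader = DataLoader(train_set, batch_size=512, shuffle=True, num_workers=8,
                                pin_memory=True)  # pinned host memory speeds GPU copies

steps_small = len(small_batch_loader)   # many small reads per epoch
steps_large = len(large_batch_loader)   # fewer, larger reads per epoch
print(f"steps per epoch: {steps_small} at batch size 64 vs {steps_large} at batch size 512")
```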
Wes Vaske - a Micron AI engineer - has run tests that demonstrate this. While his blog and recent webinar with Forrester go much deeper into the details of his testing, some of his results illustrate the impact of fast or slow storage and memory on the AI training process. Wes's testing and charts clearly show that fast storage has just as much impact on overall performance as simply adding more memory. This can be seen by looking at the two "Low Memory" values and comparing the "Fast Disk/Low Memory" bar (third bar) to the "Slow Disk/High Memory" bar (second bar). In this instance, buying faster storage, additional memory, or both has a dramatic impact on overall performance. Finding the right balance of these two resources will depend on the data set and models you want to execute.
Past testing has shown us that AI training is directly impacted by compute resources - such as adding GPUs - but this testing proves that memory and storage resources have a direct impact on AI performance even with the same CPU/GPU combination. Micron is uniquely positioned to help you be successful in AI. While we cannot take into account every variable in your specific AI modeling requirements, our testing with ResNet-50 is a great illustration of how important storage and memory can be for AI workloads. I encourage you to get the details from Wes's blog to learn even more.
Visit Micron.com/AI.
Stay up to date by connecting with us on LinkedIn.