
Inference = IOPS: Why AI’s next frontier runs on storage

Jeremy Werner | May 2025

Inference used to be the quiet follow-up act to training, an afterthought even. But everything has changed seemingly overnight. Today, inference is the main event in AI infrastructure — and storage is stepping into the spotlight.

Every time you ask a chatbot a question, generate an image or run a “Copiloted” task, inference is doing the work. These aren’t predictable, repeatable processes like training. Inference is on demand, in real time, and shaped entirely by user behavior. That makes it a lot messier — and much harder to optimize.

Imagine navigating through a busy city during rush hour. Every driver has a unique destination, and the traffic patterns are constantly changing. You need to make real-time decisions based on the current conditions, adjusting your route to avoid congestion and reach your destination efficiently. This unpredictability and need for quick adjustments mirror the randomness of inference in AI. Each of your interactions triggers a unique set of processes and computations, demanding high performance and responsiveness from the system.

Inference = IOPS

The reality is this: Unlike training workloads, inference workloads don’t run in a straight line. They loop back, refine and reprocess. That means each interaction triggers a flurry of reads, writes and lookups. Those input/output operations per second (IOPS) add up fast. Inference doesn’t just need high capacity; it also needs high performance. Compute gets most of the headlines, but it’s storage that’s constantly “feeding the beast.”
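To make that concrete, here is a back-of-envelope sketch in Python. Every figure in it (I/Os per generated token, response length, request rate) is a hypothetical placeholder rather than a measurement from any real deployment; the point is only how quickly the multiplication reaches millions of IOPS.

    # Back-of-envelope estimate of the storage load behind an inference service.
    # All numbers are hypothetical assumptions chosen only for illustration.
    ios_per_token = 4             # assumed reads/writes per generated token (cache lookups, spills, logging)
    tokens_per_response = 500     # assumed average response length
    requests_per_second = 2_000   # assumed load across the serving cluster

    iops = ios_per_token * tokens_per_response * requests_per_second
    print(f"Estimated storage load: {iops:,} IOPS")   # 4,000,000 IOPS with these assumptions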

And as these models scale — serving billions of users like you in near real time — the pressure on infrastructure grows exponentially. AI innovation must move at the speed of light, but it can only move as fast as its slowest component.

Yann LeCun, Meta’s chief AI scientist, said it well: “Most of the infrastructure cost for AI is for inference: serving AI assistants to billions of people.”

That scale translates directly into a need for faster, more responsive storage systems — not just high capacity but also high IOPS. Inference applications can drive hundreds or even thousands of times the concurrent I/O of historical CPU-based computing applications.
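A useful way to reason about that concurrency is Little’s Law: the number of I/Os in flight equals throughput multiplied by latency. The sketch below uses assumed figures purely to illustrate the relationship.

    # Little's Law: outstanding I/Os = IOPS x average latency (in seconds).
    # Both inputs are assumptions for illustration, not measured values.
    target_iops = 1_000_000    # assumed aggregate random-read rate for an inference tier
    avg_latency_s = 100e-6     # assumed 100-microsecond average read latency

    outstanding_ios = target_iops * avg_latency_s
    print(f"I/Os in flight at steady state: {outstanding_ios:,.0f}")   # 100 concurrent I/Os

Sustaining a million IOPS at that latency means keeping on the order of a hundred I/Os in flight at all times, a queue depth that historical CPU-bound applications rarely had to maintain.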

At Micron, we’re seeing this shift play out in real-world deployments. Customers running large language models (LLMs) and other inference-heavy workloads are looking for ways to reduce tail latency and boost responsiveness under unpredictable loads.
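Tail latency is usually reported as a high percentile of the latency distribution, not the mean. The snippet below uses synthetic lognormal samples as a stand-in for real latency traces to show how those percentiles are typically computed.

    import numpy as np

    # Synthetic per-read latencies (lognormal) standing in for a real trace;
    # the distribution parameters are arbitrary assumptions.
    rng = np.random.default_rng(0)
    latencies_us = rng.lognormal(mean=4.0, sigma=0.6, size=100_000)

    p50, p99, p999 = np.percentile(latencies_us, [50, 99, 99.9])
    print(f"p50:   {p50:7.1f} us")    # the median can look healthy...
    print(f"p99:   {p99:7.1f} us")    # ...while the tail is several times slower
    print(f"p99.9: {p999:7.1f} us")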

That’s where drives like the Micron 9550 and our next-gen PCIe Gen6 NVMe SSDs are making a real difference. These aren’t general-purpose storage devices. They’re engineered specifically for data-intensive, low-latency environments like AI inference.
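For a hands-on sense of what random-read IOPS means, here is a deliberately simplified sketch that issues 4 KiB reads at random offsets into a file and reports the achieved rate. It is not a real benchmark: it is single-threaded, synchronous and reads through the page cache (no O_DIRECT), and the file path and counts are placeholders.

    import os, random, time

    PATH = "/tmp/testfile"      # placeholder: a large file on the device under test
    BLOCK = 4096                # 4 KiB reads, the usual unit quoted for random-read IOPS
    NUM_READS = 100_000         # placeholder sample count

    fd = os.open(PATH, os.O_RDONLY)
    size = os.fstat(fd).st_size
    offsets = [random.randrange(0, size - BLOCK) // BLOCK * BLOCK for _ in range(NUM_READS)]

    start = time.perf_counter()
    for off in offsets:
        os.pread(fd, BLOCK, off)    # one synchronous 4 KiB read per iteration
    elapsed = time.perf_counter() - start
    os.close(fd)

    print(f"{NUM_READS / elapsed:,.0f} reads/s (single thread, page cache not bypassed)")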

NVIDIA’s Jensen Huang recently pointed out, “The amount of computation we need … as a result of agentic AI, as a result of reasoning, is easily a 100 times more than we thought we needed this time last year.”

It’s not just the models getting smarter; the infrastructure has to keep up across the entire stack. That includes storage, especially in systems where inference spans a swarm of GPUs, accelerators and memory tiers.

As use cases grow — chatbots, search, Copilots and embedded AI at the edge — the entire I/O pipeline is being reevaluated. What’s the point of a blazing-fast compute fabric if your storage can’t keep pace?

The era of inference is upon us, driving the demand for IOPS — and Micron is leading the charge.

Jeremy Werner

Senior Vice President & General Manager, Storage Business Unit

Jeremy is an accomplished storage technology leader with over 20 years of experience. At Micron, he has a wide range of responsibilities, including product planning, marketing and customer support for the Server, Storage, Hyperscale and Client markets globally. Previously, he was GM of the SSD business at KIOXIA America and spent a decade in sales and marketing roles at the startup companies MetaRAM, Tidal Systems and SandForce. Jeremy earned a B.S.E.E. from Cornell University and holds over 25 patents or patents pending.