CPU compute capabilities have grown exponentially over time, and memory capacity requirements have grown even faster. With CPU and GPU compute capabilities skyrocketing, many workloads, such as AI training, are now limited by the amount of addressable system memory. While virtual memory (SSD-based swap space) can help at the OS layer (it can keep the system from crashing on out-of-memory conditions), it is not a good solution for expanding memory capacity in high-performance workloads.
One technology response to this issue is the introduction of Compute Express Link™, or CXL, which allows pools of memory to be connected and shared between several computing nodes and enables memory scalability by orders of magnitude over local DRAM. It also requires granular tiering of memory based on performance and locality, ranging from processor caches and GPU-local high-bandwidth memory (HBM) to much slower far memory with more esoteric coherence structures.
Generally, this expansion of memory capacity only involves DRAM + CXL; storage is largely unaffected and the operation of NVMe SSDs shouldn’t be impacted, right? Not exactly. SSDs should be aware of tiered memory and can be optimized to improve performance, latencies, or both, for high-performance workloads. One SSD optimization for this use case requires support for Address Translation Services (ATS) and the related Address Translation Cache (ATC), which we’ll discuss here.
What is ATS/ATC?
The most relevant use for ATS/ATC is in virtualized systems. They can also be used in non-virtualized systems, but the simplest way to describe how ATS/ATC works is to trace the data path of a virtual machine (VM) in a direct-assignment virtualized system using standard techniques such as SR-IOV. A reference diagram is shown below:
One of the key tenets of virtualization is that the guest VM does not know it is running in a virtualized environment. The system devices (like SSDs) believe they are communicating with a single host instead of a plethora of VMs, so they function normally without additional logic on the SSD.
One consequence of this approach concerns memory addressing. The guest OS is designed to run on a dedicated system, so it assumes it is using system memory addresses (SVA = system virtual address; SPA = system physical address). In reality, it is running in a VM whose hypervisor provides guest addresses that are local to the VM and entirely different from the system address mapping. The guest OS believes it is using SVAs and SPAs, but it is actually using guest virtual addresses (GVA) and guest physical addresses (GPA).
Care must be taken during the conversion from the local (guest) to the global (system) addressing scheme. There are two translation mechanisms in the processor: the memory management unit (MMU), which translates the memory addressed directly by programs (and is irrelevant for SSDs), and the IOMMU, which translates all DMA transfers and is the critical piece here.
As seen in Figure 1, every time the guest OS in the VM wants to perform a read from the SSD, it must provide a DMA address. It provides what it thinks is an SPA but is actually a GPA, so the DMA address sent to the SSD is a GPA. When the SSD sends out the requested data, it sends PCIe packets (transaction layer packets [TLPs], generally up to 512B in size) using the GPA it knows. The IOMMU looks at the incoming address, recognizes it as a GPA, looks up the corresponding SPA in its translation table, and replaces the GPA with the SPA so that the correct memory address is used.
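To make the mechanism concrete, here is a minimal sketch (our own illustration with made-up addresses, not taken from the article) of the page-granular lookup the IOMMU performs on each inbound TLP:

```python
# Minimal sketch of the per-TLP remapping an IOMMU performs (hypothetical values).
# Real IOMMUs use multi-level page tables and handle permissions, faults, etc.

PAGE_SHIFT = 12  # 4KB pages

# GPA page -> SPA page mapping, as the hypervisor would maintain it for this VM
iommu_table = {
    0x0004_2000 >> PAGE_SHIFT: 0x7F30_5000 >> PAGE_SHIFT,
}

def translate(gpa: int) -> int:
    """Replace the guest physical address carried by a TLP with the system physical address."""
    page, offset = gpa >> PAGE_SHIFT, gpa & ((1 << PAGE_SHIFT) - 1)
    return (iommu_table[page] << PAGE_SHIFT) | offset

print(hex(translate(0x0004_2ABC)))  # -> 0x7f305abc
```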
Why do we care about ATS/ATC for NVMe?
Having many SSDs or CXL devices in a system can cause a storm of address translations that clogs the IOMMU, creating a system bottleneck.
For example, a modern SSD can perform up to 6 million IOPS of 4KB each. With 512B TLPs, each IO is split into 8 TLPs, so there are 48 million TLPs to translate every second. Given that one IOMMU is instantiated for every 4 devices, it must translate 192 million TLPs every second. This is a large number, and things could be worse: TLPs go “up to 512B” but can also be smaller, and smaller TLPs mean proportionally more translations.
We need a way to reduce the number of translations, and that is what ATS provides: a mechanism to request translations ahead of time and reuse them for as long as they are valid. Given a 4KB OS page, each translation covers 8 TLPs, proportionally reducing the number of translations. But pages can be consecutive, and in most virtualized systems the next valid granularity is 2MB, so each translation can cover 8 * 2MB / 4KB = 4096 consecutive TLPs (or more if TLPs are smaller than 512B). This brings the number of translations the IOMMU must provide from ~200M down to well below 100K per second, easing any risk of clogging.
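For concreteness, the arithmetic above can be reproduced in a few lines (all figures are the ones quoted in the text):

```python
# Back-of-the-envelope check of the IOMMU translation load, using the figures above.
iops_per_ssd   = 6_000_000          # 4KB IOs per second from one SSD
tlps_per_io    = 4096 // 512        # a 4KB IO split into 512B TLPs -> 8
ssds_per_iommu = 4                  # one IOMMU serving four devices

tlps_per_sec = iops_per_ssd * tlps_per_io * ssds_per_iommu
print(f"without ATS: {tlps_per_sec / 1e6:.0f}M translations/s")    # ~192M

# With ATS at 2MB granularity, one cached translation covers 512 pages x 8 TLPs = 4096 TLPs.
tlps_per_translation = (2 * 2**20 // 4096) * tlps_per_io
print(f"with ATS (2MB STU): {tlps_per_sec / tlps_per_translation / 1e3:.0f}K translations/s")  # ~47K
```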
Modelling ATS/ATC for NVMe
In NVMe, the addresses of the submission and completion queues (SQ and CQ) are each used once for every command. Shouldn’t we keep a copy of such (static) translations? Absolutely. And this is exactly what ATC does: keep a cached copy of the most common translations.
The main question at this point is what DMA address patterns the SSD would receive, so that ATS/ATC can be designed around them. Unfortunately, no data or literature on this exists.
We approached this problem by building a tool that tracks all addresses received by the SSD and stores them for modeling purposes. To be meaningful, the data needs to come from an acceptable representation of real-life applications, with enough data points to form a valid dataset for modeling our cache implementation.
We chose common workload benchmarks for a range of data center applications, ran them, and took IO traces for sections of 20 minutes each. This resulted in hundreds of millions of data points per trace to feed the modeling effort.
An example of the collected data is shown in the figure below, gathered by running YCSB (Yahoo! Cloud Serving Benchmark) on RocksDB using the XFS file system:
Data ATC evaluation for storage:
- Characterization method:
- Assume a VM running standard workloads
- Trace the unique buffer addresses for each workload
- Map them onto their (2 MB) STU pages
- Build a Python model of the ATC
- Replay the traces against the model to validate the hit rate
- Repeat for as many workloads and configurations as reasonable
Observation: Multiple VMs, as expected, show less locality than a single image, but the scaling is not linear (4x the size for 16x the number of VMs)
To calculate cache requirements, we built a Python model of the cache, replayed the whole trace (all hundreds of millions of entries), and analyzed the cache behavior. This enabled us to model changes to the cache size (number of lines), eviction policies, STU sizes, and any other parameters relevant to the modeling.
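The sketch below is a simplified, self-contained stand-in for that kind of model (the class name, parameters, and synthetic trace are ours, not the actual tool): it maps each DMA address to its STU page, looks the page up in a set-associative cache with a configurable eviction policy and optional pinned pages, and reports the hit rate over a replayed trace.

```python
import random
from collections import deque

class ATCModel:
    """Toy model of an SSD-side Address Translation Cache (illustrative only)."""

    def __init__(self, lines=64, ways=4, stu_bytes=2 * 2**20, policy="rr", pinned=()):
        assert lines % ways == 0
        self.sets = lines // ways
        self.stu_shift = stu_bytes.bit_length() - 1    # e.g. 2MB STU -> shift of 21
        self.policy = policy                           # "lru", "rr" or "rand"
        self.cache = [deque(maxlen=ways) for _ in range(self.sets)]
        self.pinned = {a >> self.stu_shift for a in pinned}  # e.g. SQ/CQ pages
        self.hits = self.accesses = 0

    def access(self, dma_addr):
        self.accesses += 1
        page = dma_addr >> self.stu_shift              # map the DMA address to its STU page
        if page in self.pinned:                        # pinned translations never miss
            self.hits += 1
            return
        line = self.cache[page % self.sets]
        if page in line:                               # hit
            self.hits += 1
            if self.policy == "lru":                   # refresh recency for LRU
                line.remove(page)
                line.append(page)
        else:                                          # miss: insert, evicting if the set is full
            if len(line) == line.maxlen and self.policy == "rand":
                line.remove(random.choice(list(line)))
            line.append(page)                          # deque drops the oldest entry (RR/LRU victim)

    @property
    def hit_rate(self):
        return self.hits / self.accesses if self.accesses else 0.0


# Replay a trace. Here the trace is synthetic (a few thousand "hot" 4KB buffers,
# mimicking the high buffer reuse seen in the recorded traces); in the real flow it
# is the list of DMA addresses captured from the SSD.
hot_buffers = [random.randrange(0, 256 * 2**20, 4096) for _ in range(4_000)]
trace = [random.choice(hot_buffers) for _ in range(1_000_000)]

model = ATCModel(lines=64, ways=4, policy="rr")
for addr in trace:
    model.access(addr)
print(f"hit rate: {model.hit_rate:.1%}")
```

The same replay loop can then be rerun while varying the cache size, associativity, STU size, or eviction policy, which is how the comparisons below were produced in principle.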
We analyzed between ~150 and ~370 million data points per benchmark and found that the number of unique addresses used typically measured in the tens of thousands, a great result for cache sizing. If we further map them onto the most commonly used 2MB page (the smallest translation unit, or STU), the number of pages drops to a few hundred or low thousands.
This indicates a very high reuse of buffers, meaning that even if the system memory is in the TB range, the amount of data buffers used for IOs is in the GB range and the number of addresses used is in the thousands range, making this a great candidate for caching.
We were concerned that the high address reuse was due to locality in our specific system configuration, so we ran additional tests against several different data application benchmarks. The table below compares one of the above YCSB/RocksDB/XFS tests with TPC-H on Microsoft SQL Server using XFS, a fundamentally different benchmark:
Correlation with TPC-H:
- IO distribution is very different:
- 3.2x as many unique addresses…
- … but distributed in 70% of the STUs -> higher locality
- Cache hit rate converges around 64 entries, consistent with RocksDB
The data traces are quite different, but if the cache size is large enough (say, over a paltry 64 lines), they converge to the same hit rate.
Similar findings have been validated with several other benchmarks, not reported here for brevity.
Size dependency:
- Benchmark used: YCSB WL B with Cassandra
- Cache: 4-way set associative
- Algorithm: Round Robin
- Observations:
- As expected, the hit rate is highly dependent on the STU size
- The larger the STU size, the better the hit rate (see the sweep sketched below)
- Not all data are created equal: every NVMe command needs access to the SQ and CQ, so those addresses have an outsized impact on the hit rate
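Using the toy ATCModel and synthetic trace from the earlier sketch (our construction, not the published tool), the STU-size dependence can be reproduced by sweeping the STU size against the same trace:

```python
# Sweep the STU size with the toy ATCModel and trace defined in the earlier sketch.
for stu in (4 * 2**10, 64 * 2**10, 2 * 2**20):     # 4KB, 64KB, 2MB STU
    m = ATCModel(lines=64, ways=4, stu_bytes=stu, policy="rr")
    for addr in trace:
        m.access(addr)
    print(f"STU {stu >> 10:>5} KB -> hit rate {m.hit_rate:.1%}")
```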
Modelling Data Pinning and Eviction Algorithms
We can also model the impact of different algorithms for special data pinning (submission queue and completion queue) and data replacement. The outcome is shown in the two following figures:
The first set of graphs verifies the cache’s dependency on cache size (number of lines), STU size, and whether pinning the SQ and CQ in the ATC makes any difference. The answer is an obvious “yes” for relatively small caches: the two sets of curves are very similar in shape, but the ones with SQ/CQ pinning start from a much higher hit rate at small cache sizes. For example, at STU = 2MB and only 1 cache line (very unlikely in practice, but it helps make the point), the hit rate without any SQ/CQ caching is somewhere below 10%, while with SQ/CQ pinning it is close to 70%. As such, this is a good practice to follow.
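The same effect can be mimicked with the toy model sketched earlier (assuming the ATCModel class from above; the queue addresses and the resulting percentages here are illustrative placeholders, not the measured figures):

```python
import random

# Hypothetical SQ/CQ addresses, deliberately placed in different STU pages.
SQ, CQ = 0x1000_0000, 0x2000_0000

# Synthetic per-command pattern: one SQ fetch, one 4KB data buffer, one CQ write.
cmd_trace = []
for _ in range(200_000):
    buf = random.randrange(0, 64 * 2**30, 4096)
    cmd_trace += [SQ, buf, CQ]

# A deliberately tiny (1-line) cache shows the pinning effect most clearly.
for pinned in ((), (SQ, CQ)):
    m = ATCModel(lines=1, ways=1, policy="rr", pinned=pinned)
    for addr in cmd_trace:
        m.access(addr)
    print("with SQ/CQ pinned:" if pinned else "no pinning:", f"{m.hit_rate:.1%}")
```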
As for the cache’s sensitivity to the selected eviction algorithm, we tested Least Recently Used (LRU), Round Robin (RR), and plain Random (Rand). As shown below, the impact is quite negligible, so the simplest and most efficient algorithm should be chosen.
Algorithms dependency:
- Benchmark used:
- YCSB WL B with Cassandra
- Part of a much larger set
- Associativity: full and 4-way
- Eviction algorithms: LRU, Rand, and Round Robin
- Outcome:
- Replacement algorithms do not make any visible difference (see the replay sketch below)
- Selecting the simplest implementation may be the most effective approach
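As a usage example, the three eviction policies can be replayed against the same trace with the toy model from above (again an assumption built on our earlier sketch, not the published tool):

```python
# Replay one trace under each eviction policy (toy ATCModel and trace from the earlier sketch).
for policy in ("lru", "rr", "rand"):
    m = ATCModel(lines=64, ways=4, policy=policy)
    for addr in trace:
        m.access(addr)
    print(f"{policy:>4}: hit rate {m.hit_rate:.1%}")
```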
Conclusion
So, what can we do with this? What are the gaps in this approach?
This demonstrates a path to quantitatively designing the ATC around measured parameters so that performance and design impacts can be properly tuned.
Some caveats to our approach: this report represents our initial analysis and is not conclusive. For example, the applications that benefit most from ATS/ATC run in virtualized environments, but our traces are not from such a setup. The data shows how this gap can be bridged, but the approach is more a proof of principle than a directly usable result, and every ASIC design should weigh the appropriate tradeoffs. Another gap to close is timing: an ASIC design takes close to 3 years, new products can use it for another 2-3 years, and those products then have a field life of as many years again. It is challenging to project what workloads will look like in 3-7 years. How many VMs will run on how many cores, and how much memory will they use?
In our testing, we have found a path to a quantitative solution, and the unknowns, scary as they may seem, are not unusual in the modeling world; they simply need to be addressed by any new ASIC design.