Abstract: The high demand for memory capacity in modern data centers has spurred multiple lines of innovation in memory expansion and disaggregation, one of the most prominent being memory expansion based on Compute eXpress Link (CXL). To study its behavior and characteristics, researchers have built several simulation and experimental platforms. However, due to the lack of commercially available hardware that supports CXL, the full extent of its capabilities remains unclear. In this work, we characterize the performance of CXL memory on a state-of-the-art experimental platform. First, we use our own micro-benchmarks to study the basic performance characteristics of CXL memory. Based on these observations and comparisons with standard DRAM attached to local and remote NUMA nodes, we then study the impact of CXL memory on end-to-end applications under different offloading and interleaving strategies. Finally, we provide guidelines for programmers to fully exploit the potential of CXL memory.
1. Introduction
The explosive demand for storing and processing data in data centers, coupled with the limited bandwidth and scalability of traditional DDR memory interfaces, requires the adoption of new memory interface technologies and system architectures. Compute eXpress Link (CXL) has become one of the most promising technologies in industry and academia, not only for memory capacity/bandwidth expansion but also for memory disaggregation.
CXL is an open standard developed jointly by major hardware suppliers and cloud providers in 2019 and is still rapidly evolving. Specifically, compared to traditional PCIe interconnects, it provides a set of new features that allow the CPU to communicate with peripheral devices (and their connected memory) in a high-speed cache-coherent manner with load/store semantics. Therefore, device expansion related to memory is one of the main target scenarios for CXL.
As CXL is positioned to become the de facto standard for future data centers, major hardware vendors have announced support for it in their product roadmaps. Given the popularity and prospects of CXL memory, it has received significant attention. However, because commercial hardware supporting CXL (especially CPUs) has been lacking, recent research on CXL memory has relied on simulations using multi-socket NUMA systems, since CXL memory is exposed to software as a NUMA node. Therefore, these studies may not accurately model and characterize CXL memory in the real world.
With the emergence of Intel’s 4th generation Xeon Scalable CPU (Sapphire Rapids, or SPR) and commercial CXL devices, we can begin to understand the actual characteristics of CXL memory and tailor software systems that fully exploit them. In this work, we conducted a comprehensive analysis of CXL memory with multiple micro-benchmarks and end-to-end applications on a testbed consisting of an Intel SPR CPU and CXL memory built on an Intel Agilex-I FPGA (with a CXL controller hardened in the R-Tile). From our micro-benchmark measurements, we found that the behavior of CXL memory differs from that of memory on a remote NUMA node, which is what simulations typically use. Compared to NUMA-based memory, real CXL memory has (1) higher latency, (2) fewer memory channels (resulting in lower throughput), and (3) different data-transfer efficiency under various operations.
Based on the observations above, we applied CXL memory to three real-world applications exhibiting different memory access behaviors and found that they had varying sensitivities to CXL memory offload. Specifically, we found that (1) a microsecond-latency database was susceptible to increased memory latency, (2) a millisecond-latency microservice with an intermediate compute layer was less affected when running on CXL memory, and (3) a memory-intensive ML inference was sensitive to the random access throughput provided by CXL memory. In all cases, interleaving memory between DRAM connected to the CPU and CXL memory could reduce the performance loss from using CXL memory.
Next, after analyzing the performance characteristics of micro-benchmarks and applications running on a system with CXL memory, we provide practical guidelines for users to optimize their software stacks/libraries for maximum performance. For example, bandwidth should be evenly distributed between CXL memory and DRAM to maximize performance; cache-bypassing instructions should be used for data movement to and from CXL memory; for a single CXL memory channel, the number of threads writing to CXL memory should be limited to reduce write interference, as a few threads can quickly saturate the load or store bandwidth; and for read-heavy applications that operate at millisecond-level latency, the higher CXL memory latency can be amortized by intermediate computation.
The remainder of this paper is organized as follows. We briefly introduce CXL in Section 2 and describe our experimental setup in Section 3. We then analyze CXL memory using our micro-benchmarks in Section 4 and present our findings with three representative applications in Section 5. Finally, we provide guidelines for effectively using CXL memory in Section 6.
2. Background
① Compute eXpress Link (CXL)
PCI Express (PCIe) is a standard for high-speed serial computer expansion buses that replaced the older PCI bus. Since 2003, each generation has doubled the bandwidth, and as of PCIe Gen 5 the transfer rate has reached 32 GT/s per lane (i.e., about 64 GB/s with 16 lanes). Its point-to-point topology, coupled with the increase in bandwidth, enables low-latency, high-bandwidth communication with PCIe-connected devices such as graphics processing units (GPUs), network interface cards (NICs), and NVMe solid-state drives (SSDs).
CXL builds a cache-coherent system on top of the PCIe physical layer. Whereas standard PCIe communication transfers data using Transaction Layer Packet (TLP) headers and Data Link Layer Packets (DLLPs), a subset of the CXL protocols transfers data using predefined headers and 16B blocks. In CXL 1.1, depending on the protocol and the data being transferred, the CXL hardware packs headers and data into 68B flits (64B of CXL data + 2B CRC + 2B protocol ID) according to a set of rules described in the CXL specification. Unless otherwise specified, in the rest of this article CXL refers to CXL 1.1.
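As a purely illustrative picture of this packing (field names, types, and ordering are ours, not taken from the CXL specification), a 68B flit can be thought of as:

```c
#include <stdint.h>

/* Conceptual sketch of a 68-byte CXL 1.1 flit. The field order shown here is
 * illustrative only; the actual slot layout is defined by the CXL spec. */
struct cxl_flit_sketch {
    uint16_t protocol_id;   /* 2B: identifies the traffic class (CXL.io/.cache/.mem) */
    uint8_t  payload[64];   /* 64B: four 16B slots carrying headers and/or data */
    uint16_t crc;           /* 2B: link-layer CRC protecting the flit */
} __attribute__((packed));  /* 68 bytes total */
```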
The CXL standard defines three independent protocols: CXL.io, CXL.cache, and CXL.mem. CXL.io uses standard PCIe features such as TLPs and DLLPs, mainly for protocol negotiation and host device initialization. CXL.cache and CXL.mem use the protocol headers described above for device access to host memory and host access to device memory, respectively.
By combining these three protocols, CXL defines three types of devices for different use cases. Type-1 devices use CXL.io and CXL.cache and typically refer to SmartNICs and accelerators that do not manage host memory. Type-2 devices support all three protocols. These devices, such as GPUs and FPGAs, have additional memory (DRAM, HBM) that the host CPU can access and cache, and they also use CXL.cache for device-to-host memory access. Type-3 devices support CXL.io and CXL.mem and are often seen as extensions of existing system memory. In this article, we will focus on Type-3 devices and discuss the lower-level details of CXL.mem.
As the CXL.mem protocol only considers host-to-device memory access, the protocol consists of two simple operations: host reads from and host writes to device memory. Each access is answered by a completion from the device: when reading from device memory, the response packet carries the data, while for writes it carries only a completion header. Figure 1 shows these access paths.
Figure 1: Typical system architecture supporting CXL (left) and memory transaction flow
The CXL.mem protocol is handled between the CPU’s home agent and the CXL controller on the device. Because the home agent handles the protocol, the CPU issues load and store instructions to access the memory in the same way as it accesses DRAM. This is an advantage over other memory-expansion solutions such as Remote Direct Memory Access (RDMA), which involves a DMA engine on the device and therefore has different semantics. Pairing load/store instructions with CXL.mem also means that the CPU caches PCIe-attached memory at all levels of its cache hierarchy, which is impossible for any other memory-expansion solution except persistent memory.
② Hardware Support for CXL
CXL requires support from both the host CPU and the device. As of today, in addition to several research prototypes, major hardware vendors such as Samsung, SK Hynix, Micron, and Montage have announced memory devices that support CXL. To enable more flexible memory functionality and near-memory computing research, Intel has also enabled CXL.mem on its latest Agilex-I series FPGAs, where the CXL and memory-related IP blocks are hardened on a chiplet to achieve high performance. On the host CPU side, Intel’s latest 4th generation Xeon Scalable processor (codenamed Sapphire Rapids, SPR) is the first high-performance commercial CPU to support the CXL 1.1 standard. We expect that in the near future more vendors’ products will offer richer CXL support.
3. Experimental Setup
In this work, we used two testbeds to evaluate the latest commercial CXL hardware, as shown in Table 1. The primary server is equipped with an Intel Gold 6414U CPU, a CXL memory device attached via the CXL.mem protocol, and 128 GB of 4800 MT/s DDR5 DRAM (distributed across 8 memory channels). The 4th generation Intel Xeon CPU is implemented as four separate chiplets. Users can operate these 4 chiplets as a unified processor (i.e., sharing the last-level cache (LLC), integrated memory controllers (IMC), and root complex), or run each chiplet as a smaller NUMA node in Sub-NUMA Clustering (SNC) mode. This flexibility allows users to fine-tune the system to their workload characteristics and achieve fine-grained control over resource sharing and isolation. In our experiments, we explore how memory interleaving between an SNC node and CXL memory impacts application performance. We also conducted some experiments on a dual-socket system with two Intel Xeon Platinum 8460H CPUs and the same DDR5 DRAM to compare conventional NUMA-based memory against CXL memory.
Table 1: Experimental Setup
For CXL memory devices, the system has an Intel Agilex-I development kit. It has 16 GB 2666MT/s DDR4 DRAM as CXL memory and is connected to the CPU via an x16 PCIe Gen 5 interface. It is transparently exposed to the CPU and OS as a NUMA node with 16 GB of memory and no CPU cores, and the use of CXL memory is managed the same as conventional NUMA-based memory.
Note that the CXL protocol itself does not define the underlying memory configuration. Such configurations include, but are not limited to, capacity, medium (DRAM, persistent memory, flash chips, etc.), and the number of memory channels. Therefore, different devices may exhibit different performance characteristics.
4. Microbenchmark-Based Characterization Study
In this section, we present our findings from evaluating CXL memory with our micro-benchmarks. We believe this analysis offers insight into how users can more effectively exploit CXL memory for their use cases. We also compare these results with the assumptions and emulations used in recent work on CXL memory, where CXL memory is simulated by cross-NUMA data access with some additional latency.
① The micro-benchmarks
To thoroughly investigate the capabilities of CXL memory, we developed a microbenchmark called MEMO. This benchmark is designed to target various use cases of CXL memory and runs in the Linux user space. Users can provide command-line arguments to specify the workload that MEMO should perform. We plan to open-source MEMO in the future.
Specifically, MEMO can: (1) allocate memory from different sources using the numa_alloc_onnode function, including local DDR5 memory, CXL memory exposed as a CPU-less NUMA node, or remote DDR5; (2) start a specified number of test threads, pin each thread to a core, and optionally enable or disable hyperthreading; and (3) perform memory accesses using inline assembly and report the access latency or aggregate bandwidth for different instructions such as load, store, and non-temporal store, all using AVX-512 instructions. Additionally, MEMO can perform pointer chasing over a memory region, and by varying the working set size (WSS), the benchmark can show how the average access latency changes across the cache hierarchy.
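As a rough illustration of how these pieces fit together (MEMO itself is not yet open-sourced, so the code below is our own sketch, and the CXL node ID is an assumption to be checked with numactl -H), a worker can allocate its buffer on the CXL NUMA node, pin itself to a core, and touch the buffer with AVX-512 loads:

```c
#define _GNU_SOURCE
#include <numa.h>          /* link with -lnuma */
#include <pthread.h>
#include <sched.h>
#include <immintrin.h>     /* compile with -mavx512f */
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (1UL << 30)   /* 1 GiB test buffer */

/* Allocate the working buffer on a chosen NUMA node (e.g., the CPU-less
 * node that exposes CXL memory). */
static void *alloc_on_node(int node)
{
    void *buf = numa_alloc_onnode(BUF_SIZE, node);
    if (!buf) { perror("numa_alloc_onnode"); exit(1); }
    return buf;
}

/* Pin the calling thread to a single core so the measurement is stable. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int cxl_node = 1;            /* assumption: CXL memory appears as node 1 */
    pin_to_core(0);
    char *buf = alloc_on_node(cxl_node);

    /* Touch the buffer with 64B AVX-512 loads, as a load test would. */
    __m512i acc = _mm512_setzero_si512();
    for (size_t off = 0; off < BUF_SIZE; off += 64)
        acc = _mm512_add_epi64(acc, _mm512_load_si512(buf + off));

    volatile long long sink = _mm512_reduce_add_epi64(acc);  /* keep loads alive */
    (void)sink;
    numa_free(buf, BUF_SIZE);
    return 0;
}
```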
② Latency Analysis
In the latency analysis, MEMO starts by flushing the cache line at the test address and immediately issuing a fence. A series of nop instructions is then issued to drain the CPU pipeline. When testing with load instructions, we record the time taken to access the flushed cache line; when testing with store instructions, we record either the time taken by a temporal store followed by a cache-line write-back (clwb), or the time taken by a non-temporal store followed by a fence. Additionally, we measured the average access latency of pointer chasing over a large memory region with all prefetchers disabled. Figure 2 shows the latency of the four tested instructions, the average pointer-chasing latency over a 1GB region, and the pointer-chasing latency under different working set sizes.
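A minimal sketch of the load-latency measurement described above (our own reconstruction rather than MEMO's actual code; a plain 64-bit load stands in for the AVX-512 load):

```c
#include <immintrin.h>
#include <x86intrin.h>
#include <stdint.h>

/* Sketch of the load-latency test: flush the line, fence, then time one
 * cache-missing load. */
static inline uint64_t timed_miss_load(const void *addr)
{
    unsigned aux;
    _mm_clflush(addr);                 /* evict the test line from the cache hierarchy */
    _mm_mfence();                      /* make sure the flush is globally visible */

    uint64_t start = __rdtscp(&aux);
    uint64_t v = *(volatile const uint64_t *)addr;   /* the timed access */
    _mm_lfence();                      /* keep the load inside the timed window */
    uint64_t end = __rdtscp(&aux);

    (void)v;
    return end - start;                /* cycles; scale by the TSC frequency for ns */
}
```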
Figure 2 shows the average access latency for a single AVX512 load (ld), store and write-back (st+wb), non-temporal store (nt-st), and sequential pointer chasing (ptr-chase) in 1GB space, as well as the pointer chasing latency with varying working set sizes. All levels of prefetching are disabled in both cases.
Our results in Figure 2 show that CXL memory access latency is approximately 2.2 times higher than local DDR5 with 8 channels (DDR5-L8), while remote DDR5 with a single channel (DDR5-R1) is about 1.27 times higher than DDR5-L8. Previous studies on persistent memory have suggested that accessing a recently flushed cache line may incur higher latency than a normal cache miss, due to the extra cache-coherence handshake required by the flush. Pointer chasing reflects a more realistic access latency as experienced by applications. In MEMO, the working set is first brought into the cache hierarchy during a warm-up run. The right part of Figure 2 shows the average memory access time for varying working set sizes; each jump in pointer-chasing latency corresponds to a boundary of the L1, L2, or LLC size. The results show that the pointer-chasing latency of CXL memory is four times higher than DDR5-L8 and 2.2 times higher than DDR5-R1. Interestingly, when the working set size is between 4MB and 32MB, DDR5-R1 exhibits higher latency than CXL memory. We attribute this to the difference in LLC size, as the LLC of the Xeon 8460H is almost twice as large as that of the Xeon 6414U.
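For completeness, the pointer-chase test can be sketched as follows. MEMO's exact chain construction is not public, so a randomly shuffled cyclic chain (one pointer per 64B cache line, working set size a multiple of 64B) is used here as one common way to build it:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Build a random cyclic pointer chain over `wss` bytes and chase it for
 * `hops` steps. Each hop is a dependent load, so the average time per hop
 * approximates the access latency at that working set size. */
static uintptr_t ptr_chase(char *buf, size_t wss, size_t hops)
{
    size_t nlines = wss / 64;
    size_t *order = malloc(nlines * sizeof(*order));
    for (size_t i = 0; i < nlines; i++) order[i] = i;
    for (size_t i = nlines - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < nlines; i++)                /* link line i to the next line */
        *(void **)(buf + order[i] * 64) = buf + order[(i + 1) % nlines] * 64;

    void *p = buf + order[0] * 64;
    free(order);
    for (size_t i = 0; i < hops; i++)                  /* time this loop externally */
        p = *(volatile void **)p;
    return (uintptr_t)p;                               /* keep the chain from being elided */
}
```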
It is worth noting that although the CXL controller and the DDR4 memory controller are hardened (hard IP) on the FPGA, the longer access latency of CXL memory can be partially attributed to its FPGA implementation. Although we expect ASIC implementations of CXL memory devices to improve latency, we believe it will still be higher than regular cross-NUMA access, mainly due to the overheads of the CXL protocol. Furthermore, our application evaluation in Section 5 shows that this latency penalty can be amortized depending on the specific characteristics of the application. It should also be noted that a benefit of FPGA-based CXL devices is that they can add (inline) acceleration logic on the CXL memory datapath and offload memory-intensive tasks in a near-memory fashion.
On the other hand, the latency of a non-temporal store plus a fence on CXL memory is significantly lower than that of storing to a cache line and then writing it back. Both operations move data from the CPU core to CXL memory, yet they differ in latency by about 2.6x. This difference stems from the read-for-ownership (RFO) behavior of the MESI cache-coherence protocol, where each store miss first loads the cache line into the cache. This difference in access latency translates into a bandwidth difference, discussed in Section 4.3.
③ Bandwidth Analysis
In our bandwidth tests, MEMO performs sequential or random block accesses in each test thread. The main program computes the average bandwidth at fixed intervals by summing the number of bytes accessed. For a fair comparison of memory-channel counts, we tested remote DDR5 with only one memory channel (DDR5-R1) alongside the CXL memory.
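A condensed sketch of one bandwidth-test worker is shown below (our own illustration; `buf`, `len`, and `bytes_done` are hypothetical parameters). Each worker thread calls this function in a loop over its own buffer, while the main thread samples the shared byte counter at fixed intervals to compute the aggregate bandwidth:

```c
#include <immintrin.h>
#include <stdatomic.h>
#include <stddef.h>

/* One pass of the sequential AVX-512 load test for a single worker thread. */
static unsigned long seq_load_pass(const char *buf, size_t len,
                                   _Atomic unsigned long *bytes_done)
{
    __m512i acc = _mm512_setzero_si512();
    for (size_t off = 0; off < len; off += 64)           /* 64B per AVX-512 load */
        acc = _mm512_add_epi64(acc, _mm512_load_si512(buf + off));
    atomic_fetch_add_explicit(bytes_done, len, memory_order_relaxed);
    return (unsigned long)_mm512_reduce_add_epi64(acc);  /* sink so loads are not elided */
}
```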
a. Sequential Access Pattern
The sequential access pattern reflects the maximum achievable throughput of each memory scheme under a given operation, as shown in Figure 3. For DDR5-L8, we observed load bandwidth scaling almost linearly with thread count until it peaks at 221 GB/s with approximately 26 threads. In comparison, non-temporal stores achieved a maximum bandwidth of 170 GB/s, lower than the load bandwidth but reached with fewer threads, around 16.
Figure 3: Sequential access bandwidth. The experiment shows the maximum achievable bandwidth of (a) 8-channel local DDR5, (b) CXL memory, and (c) 1-channel remote DDR5. The grey dashed line in (b) marks the theoretical maximum of DDR4-2666MT/s.
Compared to DDR5-L8, CXL memory exhibits markedly different bandwidth trends. Specifically, CXL memory reaches its maximum load bandwidth with approximately 8 threads, but this value drops to 16.8 GB/s when the thread count rises to 12 or more. Non-temporal stores, on the other hand, reach an impressive maximum bandwidth of 22 GB/s with only 2 threads, close to the theoretical maximum of the tested DDR4 DRAM. However, as we increase the thread count further, this bandwidth drops immediately, suggesting contention at the FPGA-side memory controller.
The temporal store bandwidth of CXL memory is significantly lower than that of non-temporal stores, consistent with the high latency reported in Section 4.2. This difference is due to the RFO behavior of temporal stores described in Section 4.2, which significantly reduces the transfer efficiency of CXL memory: loading and evicting cache lines requires additional core resources and extra chip-to-chip round trips compared to non-temporal stores.
Figure 4 shows the measured data-transfer efficiency under different workloads. The abbreviation D2C denotes a copy from local DDR5 (D) to CXL memory (C); D2D, C2D, and C2C are defined analogously. All experiments in (b) were conducted using a single thread.
In addition to the instructions above, we also tested a new x86 instruction, movdir64B, newly available on SPR. This instruction moves 64B of data from a source memory address to a destination memory address; the 64B write to the destination is a direct store that bypasses the cache. As shown in Figure 4a, our results indicate that the D2* operations exhibit similar behavior, while the C2* operations generally exhibit lower throughput. From these results we conclude that slower loads from CXL memory lead to lower movdir64B throughput, especially in the C2C case.
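For reference, MOVDIR64B can be issued from C through the corresponding compiler intrinsic. The sketch below is our own and assumes a compiler with MOVDIR64B support enabled (e.g., GCC/Clang with -mmovdir64b), a 64B-aligned destination, and a length that is a multiple of 64:

```c
#include <immintrin.h>
#include <stddef.h>

/* Copy a buffer in 64B chunks with MOVDIR64B: the read of src goes through
 * the normal cache hierarchy, while the 64B write to dst is a direct
 * (cache-bypassing) store. dst must be 64B-aligned. */
static void copy_movdir64b(void *dst, const void *src, size_t len)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t off = 0; off < len; off += 64)
        _movdir64b(d + off, s + off);
    _mm_sfence();   /* MOVDIR64B is weakly ordered; fence before readers consume dst */
}
```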
Furthermore, Figure 3c shows that the sequential access performance of DDR5-R1 is similar to that of CXL memory. Thanks to the higher transfer rate of DDR5 and the lower latency of the UPI interconnect, DDR5-R1 exhibits higher throughput for loads and non-temporal stores, but similar throughput for temporal stores, compared to CXL memory.
As a new feature of SPR, the Intel Data Streaming Accelerator (Intel DSA) allows memory-movement operations to be offloaded from the host processor. The Intel DSA consists of Work Queues (WQ), which hold offloaded work descriptors, and Processing Engines (PE), which pull descriptors from the WQ and execute them. Descriptors can be submitted synchronously, waiting for each offloaded descriptor to complete before submitting another, or asynchronously, submitting descriptors continuously so that many are in flight in the WQ. Designing programs to use the Intel DSA asynchronously in an optimal way yields higher throughput, and throughput can be raised further by batching operations to amortize the offloading latency. Figure 4b shows the maximum throughput observed when performing memory copies with memcpy or movdir64B on the host processor, and when executing them synchronously/asynchronously on the Intel DSA with different batch sizes (1, 16, and 128). While non-batched synchronous offloads to the Intel DSA only match the throughput of non-batched memory copies on the host processor, any degree of asynchrony or batching brings an improvement. Additionally, splitting the source and destination across memory types yields higher throughput than using only CXL-attached memory, with the C2D case reporting the higher throughput due to the lower write latency of DRAM.
During our bandwidth analysis, we observed a decrease in bandwidth as the number of threads increased. While data access is sequential within each worker thread, as the number of threads grows, the memory controller sitting behind the CXL controller receives request patterns that are less and less sequential. As a result, CXL memory performance suffers.
b. Random Access Pattern
To evaluate random block accesses with MEMO, each block is accessed sequentially with AVX-512 instructions, but the block itself is placed at a random offset each time. This approach allows us to measure performance under more realistic conditions where data access patterns are not fully predictable. As the block size increases, the access pattern converges toward sequential access, where both the CPU cache and the memory controller can boost the overall bandwidth. To preserve ordering at the block level, we issue a fence after each block of non-temporal stores. The results of random block access are shown in Figure 5.
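A condensed sketch of the random-block non-temporal store pattern described above (our own illustration, not MEMO's code; `buf` and `block_size` are hypothetical parameters, with buf assumed 64B-aligned and block_size a multiple of 64): stores are sequential within the block, the block offset is random, and a fence follows each block:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdlib.h>

/* Write one randomly placed block with sequential 64B non-temporal stores,
 * then fence so the block is ordered before the next one. */
static void random_block_nt_store(char *buf, size_t len, size_t block_size)
{
    size_t nblocks = len / block_size;
    size_t base = ((size_t)rand() % nblocks) * block_size;  /* random block offset */
    __m512i pattern = _mm512_set1_epi64(0x5A5A5A5A5A5A5A5ALL);

    for (size_t off = 0; off < block_size; off += 64)        /* sequential inside the block */
        _mm512_stream_si512((__m512i *)(buf + base + off), pattern);

    _mm_sfence();   /* order this block's nt stores before starting the next block */
}
```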
Figure 5: Random block access bandwidth. Row order (top to bottom): local DDR5, CXL memory, remote DDR5. Column order (left to right): load, store, nt store. The thread count is indicated in the legend at the top.
When block sizes are small (1KB), we observe similar patterns for random block loads, where all three memory schemes are equally affected by the random accesses. However, as the block size grows to 16KB, major differences appear between DDR5-L8 and DDR5-R1/CXL memory. DDR5-L8 bandwidth scales sublinearly with thread count, while DDR5-R1 and CXL memory see less benefit from higher thread counts, particularly CXL memory. Memory channel count plays a critical role here: DDR5-R1 and our CXL memory device each have only one memory channel, while DDR5-L8 has eight. Random block stores scale with thread count similarly to loads, but their bandwidth levels off as the block size grows.
In contrast to all other tested workloads, the behavior of non-temporal random block stores in CXL memory displays an interesting trend. Single-threaded nt stores scale well with block size, while as thread count increases, throughput drops off after reaching some optimal point of block size and thread count. For example, at a block size of 32KB, 2 threads reach peak bandwidth, while 4 threads reach peak bandwidth at a block size of 16 KB.
We believe this optimal point is determined by buffering inside the CXL memory device. Unlike regular memory instructions, nt stores do not occupy tracking resources within the CPU core. As a result, more nt store instructions can be in flight simultaneously, which may overflow a buffer in the CXL memory device.
However, the advantage of non-temporal instructions is to avoid RFO (as discussed in section 4.2) and cache pollution, making them more attractive in CXL memory setups. Programmers who wish to use nt stores should be aware of this behavior to fully utilize nt stores in CXL memory.
④ Comparisons with simulation using NUMA systems
Recent studies on CXL memory are often conducted through simulations that emulate CXL memory access latency by imposing an additional delay on main-memory accesses across NUMA nodes. However, based on our observations, cross-NUMA emulation cannot accurately model the following characteristics of CXL memory: (1) the limited bandwidth of current CXL devices (unless the number of channels populated with remote DIMMs matches that of the CXL memory), (2) CXL memory implementations whose latency is higher than cross-NUMA access (where the higher latency hits latency-bound applications harder), and (3) the data-transfer efficiency under different operations (i.e., load and non-temporal store bandwidth).
5. Practical Applications
To study the impact of CXL memory on performance, we explored binding all or part of an application’s memory to CXL memory. Linux provides the numactl program, which allows users to (1) bind a program to a specific memory node (membind mode), (2) preferentially allocate memory to a node and only allocate to other nodes when the memory on the specified node is exhausted (preferred mode), or (3) allocate evenly across a set of nodes (interleave mode).
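The same three policies can also be set programmatically through libnuma, which is convenient when only part of an application's memory should be placed on CXL. The sketch below is ours and assumes local DRAM is node 0 and the CPU-less CXL node is node 1 (check numactl -H on the target machine):

```c
#include <numa.h>      /* link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

/* Reproduce numactl's membind / preferred / interleave modes in-process.
 * Node IDs are assumptions for our testbed: node 0 = local DRAM,
 * node 1 = CPU-less CXL memory. */
static void set_cxl_policy(char mode)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        exit(1);
    }
    struct bitmask *cxl_only     = numa_parse_nodestring("1");
    struct bitmask *dram_and_cxl = numa_parse_nodestring("0-1");

    if (mode == 'm')            /* membind: allocate only from CXL memory */
        numa_set_membind(cxl_only);
    else if (mode == 'p')       /* preferred: CXL first, spill over when it is full */
        numa_set_preferred(1);
    else                        /* interleave: round-robin across DRAM and CXL */
        numa_set_interleave_mask(dram_and_cxl);

    numa_bitmask_free(cxl_only);
    numa_bitmask_free(dram_and_cxl);
}
```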
A recent patch in the Linux kernel now allows fine-grained control over page interleaving rates between memory nodes. This means that, for example, if we set the DRAM: CXL ratio to 4:1, we can allocate 20% of memory to CXL memory. To study the impact of CXL memory on application performance, we adjusted this interleaving ratio for several applications. Additionally, we disabled NUMA balancing to prevent pages from migrating to DRAM.
The performance of applications under this heterogeneous memory scheme should serve as a guideline for most memory-tiering strategies, since any proposed tiering optimization should perform at least as well as a weighted round-robin allocation.
① Redis-YCSB
Redis is a popular, high-performance in-memory key-value store widely used in industry. We used YCSB to test Redis performance under different memory allocation schemes, pinning its memory to CXL memory, to DRAM, or distributing it between the two. To evaluate system performance, we ran multiple workloads on the YCSB client while limiting the queries per second (QPS). We measured two metrics: (1) the 99th-percentile (p99) tail latency of queries, and (2) the maximum sustainable QPS. Except for workload D, all workloads used a uniform request distribution to maximize memory pressure. We also fine-tuned the interleave ratio (DRAM:CXL) to offload a fraction of memory to CXL, using ratios such as 30:1 (3.23%) and 9:1 (10%) in different experiments.
Our results in Figure 6 show that when Redis runs entirely on CXL memory, there is a significant difference in p99 tail latency even at low QPS (20k). This gap remains relatively constant until 55k QPS, at which point the YCSB client can no longer reach the target QPS and tail latency rises sharply. When 50% of Redis memory is allocated to CXL memory, the p99 tail latency falls between that of pure DRAM and pure CXL memory. Although the 50%-CXL configuration does not saturate until about 65k QPS, its tail latency spikes at around 55k. Finally, DRAM-only Redis shows stable tail latency, saturating at around 80k QPS.
Figure 6 shows the p99 latency of Redis, tested with YCSB workload A (50% read, 50% update), with different memory allocation schemes. The legend in the figure indicates R/U, which represents read/update latency when Redis runs with 50%/100% of its memory allocated to CXL memory.
We believe the difference in tail latency stems from the ultra-low response latency of Redis queries, which makes these microsecond-level responses highly sensitive to memory access latency. This matches the latency measurements in Section 4.2, where CXL memory access latency ranges from several hundred nanoseconds to about 1 µs, 2-4 times higher than DRAM. However, intermediate computation and cache hits shrink the difference (in terms of application tail latency) to about 2x before the QPS saturation point.
On the other hand, the maximum sustainable QPS that CXL memory Redis can provide (Figure 7) is related to the observed random block access bandwidth in section 4.3.2, where the single-threaded load/store bandwidth of CXL memory is much lower than local-DDR5 or remote-DDR5.
Figure 7 shows the maximum sustainable Redis QPS for various CXL memory configurations. The legend indicates the percentage of Redis memory allocated to CXL memory. YCSB workload D by default reads the most recently inserted elements (latest), but we also ran this workload with read requests following a Zipfian (zipf) or uniform (uni) distribution to examine the impact of access locality. Workload E (range queries) is omitted here.
Single-threaded random access bandwidth is limited by memory access latency, where data dependencies within a single thread make it difficult for the load-store queues in the CPU to saturate. Additionally, there is a trend in Figure 7 where allocating less memory to CXL can provide higher maximum QPS in all tested workloads, but these still cannot exceed the performance of running Redis purely on DRAM. In this case, memory interleaving cannot improve the performance of a single application because interleaving with CXL memory will always introduce higher access latency. Note that the current CXL memory setup is FPGA-based and its true benefits lie in its flexibility. We expect ASIC-based CXL memory to provide relatively lower access latency, thus improving the performance of latency-sensitive applications.
Figure 8: DLRM embedding reduction in throughput. Tested with 8-channel DRAM and CXL memory; throughput vs thread count (left); normalized throughput relative to DRAM for different memory configurations at 32 threads (right).
② Embedding Reduction in DLRM
The Deep Learning Recommendation Model (DLRM) has been widely deployed in industry. Embedding reduction is a step in DLRM inference known to be memory-intensive, accounting for 50%-70% of inference latency. We tested embedding reduction on DRAM, CXL memory, and interleaved memory, using the same settings as MERCI.
The results in Figure 8 show that DLRM inference throughput scales linearly with thread count under every scheme, though with different slopes. The overall trends of DDR5-R1 and CXL memory are similar, consistent with the observations in Section 4.3.2, where DDR5-R1 and CXL memory have similar random load/store bandwidth at small access granularities. Two interleaving points (3.23% and 50% on CXL memory) are shown in Figure 8. As we decrease the fraction of memory interleaved onto CXL, inference throughput increases; however, even with only 3.23% of memory on CXL, throughput cannot match running purely on DRAM. Also note that pure-DRAM inference throughput scales linearly, and this linear trend appears to extend beyond 32 threads. Combining these two observations, we conclude that 8-channel DDR5 memory can sustain DLRM inference beyond 32 threads.
To demonstrate a scenario where application performance is limited by memory bandwidth, we tested inference throughput in SNC mode. As a reminder, Intel introduced Sub-NUMA Clustering (SNC) in SPR, where the CPU is split into four separate NUMA nodes (one per chiplet) and each node’s memory controller works independently of the others. By running inference on a single SNC node, we effectively limited inference to two DDR5 channels, making it memory-bound.
Figure 9 shows the results of running inference in SNC mode, with CXL memory interleaved in the same way as in all previous experiments. The green bars show the inference throughput on SNC, which stops scaling linearly after 24 threads. At 28 threads, inference is limited by the two memory channels, and interleaving memory onto CXL yields slightly higher throughput. This trend continues: at 32 threads, placing 20% of the memory on CXL results in an 11% increase in inference throughput over the SNC-only case. We expect future CXL devices with bandwidth comparable to local DRAM to further improve the throughput of memory-bound applications.
Figure 9: DLRM Embedding Reduction Throughput
③ DeathStarBench
DeathStarBench (DSB) is an open-source benchmark suite aimed at evaluating the performance of microservices on a system. It uses Docker to launch the components of microservices, including machine learning inference logic, web backends, load balancers, cache, and storage. DSB provides three separate workloads for a social network framework and a mixed workload.
Figure 10 shows the tail latency at the 99th percentile for posting, reading user timelines, and a mixed workload. We omitted the results for reading the home timeline because it doesn’t operate on the database and is thus agnostic to the memory type used for the database. In our experiments, we fixed components with a large working set size (i.e., storage and cache applications) to either DDR5-L8 or CXL memory. We left the compute-intensive parts in DDR5-L8. The memory breakdown of these components is shown in Figure 10.
The results in Figure 10 show tail latency differences in posting, while there is almost no difference in reading user timelines and the mixed workload. Note that the tail latency in DSB is in the millisecond range, much higher than YCSB Redis.
When analyzing the write post and read user timeline workloads, we found that writing posts involved more database operations, which put a greater load on the CXL memory. Meanwhile, most of the response time for reading user timelines was spent on the nginx front end. This allowed longer CXL memory access latencies to be amortized across the compute-intensive components, greatly reducing the dependency of tail latency on database access latency.
Finally, the mixed workload provides a realistic simulation of a social network, where most users read posts written by other users. Although pinning the databases to CXL memory shows slightly higher latency as QPS increases in the mixed workload, the overall saturation point is similar to running the databases on DDR5-L8.
The results of DSB provide an interesting use case for CXL memory, where as long as compute-intensive components are kept in DRAM, cache and storage components running at low demand rates can be allocated to slower CXL memory, and the performance of the application remains largely unchanged.
6. Best Practices for CXL Memory
Given the unique hardware characteristics of CXL memory, we offer the following insights for fully leveraging CXL memory:
When moving data into or out of CXL memory, use non-temporal stores or movdir64B. As shown in Section 4, different x86 instructions exhibit significantly different performance when accessing CXL memory, due to the microarchitecture of the CPU cores and the inherent behavior of CXL memory. Considering typical CXL memory usage patterns (e.g., memory tiering), the likelihood of short-term data reuse is low. To achieve higher data-movement throughput and avoid polluting precious cache resources, we recommend prioritizing nt store or movdir64B instructions in the corresponding software stacks. Note that since both nt store and movdir64B are weakly ordered, memory fences are needed to ensure the data has been written.
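As a concrete sketch of this recommendation (our own code, not taken from any particular library), a cache-bypassing copy into CXL memory can combine regular loads from the source with AVX-512 streaming stores to the destination and a final fence. It assumes dst is 64B-aligned and len is a multiple of 64:

```c
#include <immintrin.h>
#include <stddef.h>

/* Copy `len` bytes into CXL-backed memory without polluting the cache:
 * regular loads from src, non-temporal (streaming) stores to dst. */
static void copy_to_cxl_nt(void *dst, const void *src, size_t len)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t off = 0; off < len; off += 64) {
        __m512i v = _mm512_loadu_si512(s + off);          /* cached read of the source */
        _mm512_stream_si512((__m512i *)(d + off), v);     /* nt store: no RFO, no allocation */
    }
    _mm_sfence();   /* nt stores are weakly ordered; fence before signaling readers */
}
```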
Limit the number of threads concurrently writing to CXL memory. As previously mentioned, CXL memory performance depends on both the CPU and the device-side controller, and concurrent CXL memory accesses can contend at multiple points. Although the current FPGA-based CXL memory controller may have small internal buffers, thereby limiting the number of in-flight memory requests, we anticipate that this problem will persist on ASIC-based CXL memory devices. It is best to have a centralized communication stub on the CPU software side for data movement: we recommend that CXL memory be managed by the operating system or a dedicated software daemon, rather than by every application independently.
Use Intel DSA for large memory transfers to and from CXL memory. When transferring large amounts of data between conventional DRAM and CXL memory, the first two insights may still be insufficient, as they consume many CPU cycles and offer limited instruction/memory-level parallelism. We found that the Intel DSA offers high throughput, flexibility, and fewer limitations than previous-generation accelerators, and can further improve the performance and efficiency of such data movement. This is particularly useful in tiered memory systems, where data movement typically happens at page granularity (i.e., 4KB or 2MB).
Interleave memory using NUMA policies and other tiered-memory mechanisms to evenly distribute memory load across all DRAM and CXL channels. Besides using CXL memory as slower DRAM or faster SSD (e.g., memory tiering), CXL memory can also be interleaved with conventional memory channels to increase total memory bandwidth, especially when CXL memory devices effectively serve as additional memory channels (and therefore offer comparable memory bandwidth). Careful selection of the interleaving percentage/strategy can greatly mitigate the expected performance degradation.
Avoid running applications with µs-level latency on CXL memory. The relatively long access latency of CXL memory can become a major bottleneck for applications that require real-time data access at fine time granularity (the µs level). Redis is one such application: the slower data accesses incurred by CXL memory accumulate and become critical to end-to-end query processing. Such applications should keep their data on faster media.
Microservices may be good candidates for offloading to CXL memory. The microservices architecture has become a mainstream development approach for today’s data-center and cloud services due to its flexibility, ease of development, scalability, and agility. However, its layered and modular design does incur higher runtime overhead than traditional monolithic applications. This characteristic makes it less sensitive to the underlying cache/memory configuration and parameters, as our study of DSB (see Section 5.3) confirms. We envision that a large portion of microservices data can be offloaded to CXL memory without hurting latency or throughput.
Explore online acceleration potential with programmable CXL memory devices. Given the insights above, applications that are suitable for offloading to CXL memory may be less sensitive to data access latency. This provides more design space for in-line acceleration logic within CXL memory devices – although such acceleration may add additional latency to data access, it is invisible from the end-to-end perspective of the target application. Therefore, we still advocate for FPGA-based CXL memory devices, as they offer flexibility and programmability.
① Application Classification
Based on the performance when running on CXL memory, we have identified two types of applications: bandwidth-limited and latency-limited.
Bandwidth-limited applications typically experience sub-linear throughput scaling beyond a certain number of threads. While both Redis and DLRM inference running on CXL memory show lower saturation points, it should be noted that only DLRM inference is a bandwidth-limited application. Single-threaded Redis is limited by the higher latency of CXL memory, which slows down its processing.
Latency-limited applications perceive throughput degradation even when a small working set is assigned to high-latency memory. In the case of databases, they may show tail latency gaps even when QPS is far from saturation when running on CXL memory. Databases like Redis and Memcached, which run at μs-level latencies, suffer the most when running purely on CXL memory. On the other hand, microservices with compute layers at millisecond levels show good use cases for offloading memory to CXL memory.
However, in both cases, interleaving memory between conventional CPU-attached DRAM and CXL memory can reduce the throughput loss (Section 5.2) and tail-latency loss (Section 5.1) caused by slower CXL memory. This round-robin strategy should serve as the baseline for any tiered memory strategy.
7. Related Work
As memory technologies rapidly evolve, many new types of memory have been introduced into data centers, each with different characteristics and trade-offs, including but not limited to persistent memory such as Intel Optane DIMMs, RDMA-based remote/disaggregated memory, and even byte-addressable SSDs. Although CXL memory has been widely discussed and studied, as a new member of the memory hierarchy its performance characteristics and trade-offs are still not well understood.
Since its conceptualization in 2019, CXL has been discussed by many researchers. For example, Meta envisioned using CXL memory for memory tiering and swapping; Microsoft built a CXL memory prototype system for exploring memory disaggregation. Most of them used NUMA machines to simulate CXL memory behavior. Gouk et al. built a CXL memory prototype on an FPGA-based RISC-V CPU. Unlike previous research, we are the first to conduct CXL memory research on commodity CPUs and CXL devices, with both microbenchmarks and real-world applications. This makes our research more realistic and comprehensive.
8. Conclusion
CXL has emerged as a future device interconnect standard with a rich set of useful features, and CXL memory is one of the most important. In this work, based on state-of-the-art real hardware, we performed a detailed characterization of CXL memory using micro-benchmarks and real-world applications. Through unique observations of CXL memory behavior, we also provide useful guidelines for programmers to better leverage CXL memory. We hope this work promotes the development and adoption of the CXL memory ecosystem.