Assessing Uniform Memory Access Performance on AMD’s Turin


Understanding how uniform memory access (UMA) functions in modern server architectures

As hardware interconnects grow increasingly complex and non-uniform, evaluating the efficiency of UMA becomes essential. Non-Uniform Memory Access (NUMA) allows hardware to expose explicit associations between cores and memory controllers to optimize performance. Traditionally, NUMA nodes align with individual socket boundaries, but recent server chips can subdivide a socket into multiple NUMA nodes, reflecting the complexity of non-uniform interconnects. AMD labels their NUMA modes with the NPS (Nodes Per Socket) designation.
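The relationship between NPS mode, NUMA node count, and memory channels per node can be sketched with a small model. This is an illustrative sketch, not an AMD API: it assumes a dual-socket system with 12 DDR5 channels per socket (Turin-class), and the function name is hypothetical.

```python
# Hypothetical model of how AMD's NPS settings partition a dual-socket
# system into NUMA nodes. Channel counts assume 12 DDR5 channels per
# socket; the function name is illustrative, not a real API.

def numa_layout(nps: int, sockets: int = 2, channels_per_socket: int = 12):
    """Return (node_count, channels_per_node) for a given NPS mode.

    NPS0 is the special case: both sockets are fused into one node
    that interleaves across every channel in the system.
    """
    if nps == 0:
        return 1, sockets * channels_per_socket
    return sockets * nps, channels_per_socket // nps

for mode in (0, 1, 2, 4):
    nodes, ch = numa_layout(mode)
    print(f"NPS{mode}: {nodes} node(s), {ch} channels interleaved per node")
```

Under this model, higher NPS values trade interleaving width for locality: NPS4 gives eight small nodes with three channels each, while NPS0 gives one giant node spanning all 24 channels.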

The NPS0 mode is an exception. Instead of subdividing a socket into multiple nodes, NPS0 presents a dual-socket system as a single unified pool, interleaving memory accesses evenly across all available memory controllers to provide a uniform memory access experience much like a desktop system's. Optimizing for NUMA takes real effort: programmers must pin memory allocations to specific nodes and minimize cross-node traffic, and code confined to one node can be limited by that node's core count, memory bandwidth, or memory capacity, potentially constraining scalability. NPS0 offers a simplified alternative that sidesteps those constraints.
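The even distribution NPS0 performs can be modeled as simple address interleaving: successive chunks of the physical address space rotate across all controllers. The 256-byte granularity below is an assumption for illustration, not AMD's documented value.

```python
# Illustrative model of NPS0-style interleaving. The granularity is an
# assumed value; real hardware interleaving parameters may differ.
INTERLEAVE_BYTES = 256
NUM_CONTROLLERS = 24  # 12 per socket, dual socket, fused under NPS0

def controller_for(addr: int) -> int:
    """Map a physical address to the memory controller that serves it."""
    return (addr // INTERLEAVE_BYTES) % NUM_CONTROLLERS

# A linear scan at cache-line (64 B) strides touches every controller
# evenly, which is what gives NPS0 its uniform behavior.
hits = [0] * NUM_CONTROLLERS
for addr in range(0, NUM_CONTROLLERS * INTERLEAVE_BYTES * 4, 64):
    hits[controller_for(addr)] += 1
print(hits)
```

Because every controller sees the same share of traffic, no placement decisions are needed, but by the same token no access can be made "local" either.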

A recent analysis of AMD's EPYC 9005 series architecture offers insight into how a cutting-edge server behaves in NPS0 mode, where a dual-socket setup exposes 24 memory controllers with uniform access. Testing showed DRAM latency rising above 220 nanoseconds in NPS0 mode, roughly a 90-nanosecond penalty versus NPS1, and far above older systems like dual-socket Broadwell, which landed between 75.8 and 104.6 nanoseconds depending on configuration.
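Latency figures like these come from dependent-load tests, where each memory access must complete before the next address is known. The sketch below shows the pointer-chasing idea in miniature; it is not the tool used in the analysis, and in Python the interpreter overhead dominates, so the absolute numbers only illustrate the technique.

```python
# Minimal pointer-chasing sketch of a dependent-load latency test.
# Real tools chase pointers through a buffer far larger than the CPU
# caches; sizes here are scaled down and timings are interpreter-bound.
import random
import time

N = 1 << 20
perm = list(range(N))
random.shuffle(perm)

# Build one random cycle so every load depends on the previous result.
chain = [0] * N
for i in range(N - 1):
    chain[perm[i]] = perm[i + 1]
chain[perm[-1]] = perm[0]

start = time.perf_counter()
idx = perm[0]
for _ in range(N):
    idx = chain[idx]   # each iteration is a dependent load
elapsed = time.perf_counter() - start
print(f"~{elapsed / N * 1e9:.1f} ns per hop (Python overhead dominates)")
```

The random cycle defeats hardware prefetchers, so with a buffer well past L3 size each hop pays close to full DRAM latency, which is what makes the NPS0-versus-NPS1 gap visible.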

While NPS0 offers bandwidth advantages thanks to the larger pool of memory controllers, those advantages do not translate into lower latency unless bandwidth demand approaches 400 GB/s. In mixed-workload tests the EPYC 9575F struggled: latency-sensitive threads were delayed by bandwidth-heavy ones. High bandwidth (up to 479 GB/s in linear read tests) was achievable when threads streamed memory sequentially, but concurrent mixed loads dragged down overall throughput.
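It is worth putting the 479 GB/s figure next to theoretical DRAM peak. The arithmetic below assumes DDR5-6000 DIMMs; the actual memory speed of the test system is not stated above, so the efficiency percentage is only a rough sanity check.

```python
# Back-of-envelope check on the 479 GB/s linear-read figure.
# DDR5-6000 is an assumed transfer rate, not a stated test condition.
channels = 24                 # 12 DDR5 channels per socket, two sockets
bytes_per_transfer = 8        # 64-bit channel width
transfers_per_sec = 6.0e9     # assumed DDR5-6000

peak_gbps = channels * bytes_per_transfer * transfers_per_sec / 1e9
measured = 479.0
print(f"theoretical peak ~{peak_gbps:.0f} GB/s, "
      f"measured read ~{measured / peak_gbps:.0%} of peak")
```

Under that assumption the system would be achieving a bit over 40% of theoretical peak, a reminder that interleaving across two sockets' worth of controllers leaves efficiency on the table compared to what locality-aware placement can reach.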

Per-CCD (Core Complex Die) bandwidth remains relatively unaffected by the mode change because both NPS0 and NPS1 use GMI-Wide links, which offer 64 bytes per cycle at the Infinity Fabric clock. This results in substantial bandwidth to each CCD—more than comparable GMI-Narrow configurations, which typically deliver less than 64 GB/s.
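The per-CCD bandwidth claim follows from simple arithmetic: bytes per cycle times fabric clock. The ~2.0 GHz Infinity Fabric clock below is an assumed figure for illustration (actual FCLK varies by configuration), and GMI-Narrow is modeled as half the width of GMI-Wide.

```python
# Worked example of per-CCD link bandwidth: bytes/cycle * GHz = GB/s.
# The 2.0 GHz fabric clock is an assumption; real FCLK values vary.
def gmi_bandwidth_gbps(bytes_per_cycle: int, fclk_ghz: float) -> float:
    return bytes_per_cycle * fclk_ghz

wide = gmi_bandwidth_gbps(64, 2.0)    # GMI-Wide: 64 B per fabric cycle
narrow = gmi_bandwidth_gbps(32, 2.0)  # assumed half-width GMI-Narrow
print(f"GMI-Wide ~{wide:.0f} GB/s per CCD, GMI-Narrow ~{narrow:.0f} GB/s")
```

A half-width link at 2 GHz tops out at 64 GB/s, consistent with the "less than 64 GB/s" figure for GMI-Narrow configurations, since real fabric clocks typically run below that.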

Performance in CPU-bound workloads, such as those measured by SPEC CPU2017, holds up well despite the higher memory latency. The EPYC 9575F, boosting to 5 GHz, masks part of the latency penalty with sheer clock speed. Overall, the impact of UMA on single-threaded tasks is context-dependent: bandwidth-heavy scenarios expose the limitations of uniform access, while high core counts and clock speeds carry the CPU elsewhere.

In conclusion, while uniform memory access modes provide a simplified and balanced approach, they come with latency costs that can affect performance in latency-sensitive applications. The choice between UMA and NUMA configurations depends on workload characteristics and system design priorities.

Frequently Asked Questions (FAQs):

Q: What is uniform memory access (UMA) in modern servers?
A: UMA is a memory architecture where all processors have equal access latency and bandwidth to all memory locations, simplifying programming and application scaling.

Q: Why does NPS0 mode provide uniform memory access?
A: NPS0 treats a dual socket system as a single resource, evenly distributing memory access across all controllers, reducing the complexity of NUMA tuning.

Q: What are the performance trade-offs of using UMA?
A: UMA simplifies programming but incurs higher latency, especially under memory-intensive workloads, which can impact single-threaded performance.

Q: How does memory latency affect server workloads?
A: Increased latency can slow down applications that rely on frequent memory access, especially those that are single-threaded or sensitive to delays.

Q: Is uniform memory access suitable for all workloads?
A: No; workloads with high bandwidth demands might benefit from UMA, but latency-sensitive tasks may perform better under NUMA configurations with optimized local memory access.
