Analysis of the EPYC 145% performance gain in Cloudflare Gen 12 servers
Source: blog.cloudflare.com

2024-10-15


Cloudflare's network spans more than 330 cities in over 120 countries, serving over 60 million HTTP requests per second and 39 million DNS queries per second on average. These numbers will continue to grow, and at an accelerating pace, as will Cloudflare’s infrastructure to support them. While we can continue to scale out by deploying more servers, it is also paramount for us to develop and deploy more performant and more efficient servers.

At the heart of each server is the processor (central processing unit, or CPU). Even though many aspects of a server rack can be redesigned to improve the cost to serve a request, CPU remains the biggest lever, as it is typically the primary compute resource in a server, and the primary enabler of new technologies.

Cloudflare’s 12th Generation server with the AMD EPYC 9684X (codenamed Genoa-X) is 145% more performant and 63% more efficient than our Gen 11 server. These are big numbers, but where do the performance gains come from? Cloudflare’s hardware system engineering team ran a sensitivity analysis on three variants of the 4th generation AMD EPYC processor to understand the contributing factors.

For the 4th generation AMD EPYC Processors, AMD offers three architectural variants: 

  1. mainstream classic Zen 4 cores, codenamed Genoa

  2. efficiency optimized dense Zen 4c cores, codenamed Bergamo

  3. cache optimized Zen 4 cores with 3D V-cache, codenamed Genoa-X


Figure 1 (from left to right): AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), AMD EPYC 9684X (Genoa-X)

Key features common across the 4th Generation AMD EPYC processors:

  • Up to 12x Core Complex Dies (CCDs)

  • Each core has a private 1MB L2 cache

  • The CCDs connect to memory, I/O, and each other through an I/O die

  • Configurable Thermal Design Power (cTDP) up to 400W

  • Support up to 12 channels of DDR5-4800 1DPC

  • Support up to 128 lanes PCIe Gen 5

Classic Zen 4 Cores (Genoa):

  • Each Core Complex (CCX) has 8x Zen 4 Cores (16x Threads)

  • Each CCX has a shared 32 MB L3 cache (4 MB/core)

  • Each CCD has 1x CCX

Dense Zen 4c Cores (Bergamo):

  • Each CCX has 8x Zen 4c Cores (16x Threads)

  • Each CCX has a shared 16 MB L3 cache (2 MB/core)

  • Each CCD has 2x CCX

Classic Zen 4 Cores with 3D V-cache (Genoa-X):

  • Each CCX has 8x Zen 4 Cores (16x Threads)

  • Each CCX has a shared 96MB L3 cache (12 MB/core)

  • Each CCD has 1x CCX

For more information on 4th generation AMD EPYC Processors architecture, see: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf 

The following table is a summary of the specification of the AMD EPYC 7713 CPU in our Gen 11 server against the three CPU candidates, one from each variant of the 4th generation AMD EPYC Processors architecture:

| CPU Model | AMD EPYC 7713 | AMD EPYC 9654 | AMD EPYC 9754 | AMD EPYC 9684X |
| --- | --- | --- | --- | --- |
| Series | Milan | Genoa | Bergamo | Genoa-X |
| # of CPU Cores | 64 | 96 | 128 | 96 |
| # of Threads | 128 | 192 | 256 | 192 |
| Base Clock | 2.0 GHz | 2.4 GHz | 2.25 GHz | 2.4 GHz |
| All Core Boost Clock | ~2.7 GHz* | 3.55 GHz | 3.1 GHz | 3.42 GHz |
| Total L3 Cache | 256 MB | 384 MB | 256 MB | 1152 MB |
| L3 cache per core | 4 MB / core | 4 MB / core | 2 MB / core | 12 MB / core |
| Maximum configurable TDP | 240W | 400W | 400W | 400W |

* AMD EPYC 7713 all core boost clock is based on Cloudflare production data, not the official specification from AMD
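As a quick cross-check of the table against the per-CCX figures listed earlier, the sketch below recomputes core counts and L3 totals from the chip topology. The `CpuTopology` class and its field names are illustrative only. The CCD counts are an assumption inferred from the core counts (96 cores / 8 per CCD = 12 for Genoa and Genoa-X, 128 cores / 16 per CCD = 8 for Bergamo); the text above only states "up to 12" CCDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuTopology:
    """Hypothetical helper describing one 4th Gen EPYC part."""
    name: str
    ccds: int            # Core Complex Dies per socket (inferred, see lead-in)
    ccx_per_ccd: int     # Core Complexes per CCD
    cores_per_ccx: int   # Cores per CCX
    l3_mb_per_ccx: int   # Shared L3 per CCX, in MB

    @property
    def cores(self) -> int:
        return self.ccds * self.ccx_per_ccd * self.cores_per_ccx

    @property
    def total_l3_mb(self) -> int:
        return self.ccds * self.ccx_per_ccd * self.l3_mb_per_ccx

    @property
    def l3_mb_per_core(self) -> float:
        return self.total_l3_mb / self.cores


TOPOLOGIES = [
    CpuTopology("Genoa 9654", ccds=12, ccx_per_ccd=1, cores_per_ccx=8, l3_mb_per_ccx=32),
    CpuTopology("Bergamo 9754", ccds=8, ccx_per_ccd=2, cores_per_ccx=8, l3_mb_per_ccx=16),
    CpuTopology("Genoa-X 9684X", ccds=12, ccx_per_ccd=1, cores_per_ccx=8, l3_mb_per_ccx=96),
]

for cpu in TOPOLOGIES:
    print(f"{cpu.name}: {cpu.cores} cores, {cpu.total_l3_mb} MB L3, "
          f"{cpu.l3_mb_per_core:g} MB L3/core")
```

Under those assumptions the computed totals line up with the table: 384 MB, 256 MB, and 1152 MB of L3, at 4, 2, and 12 MB per core.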

cf_benchmark

Readers may remember that Cloudflare introduced cf_benchmark when we evaluated Qualcomm's ARM chips, using it as our first pass benchmark to shortlist AMD’s Rome CPU for our Gen 10 servers and to evaluate our chosen ARM CPU Ampere Altra Max against AWS Graviton 2. Likewise, we ran cf_benchmark against the three candidate CPUs for our 12th Gen servers: AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), and AMD EPYC 9684X (Genoa-X). The majority of cf_benchmark workloads are compute bound, and given more cores or higher CPU frequency, they score better. The graph and the table below show the benchmark performance comparison of the three CPU candidates with Genoa 9654 as the baseline, where > 1.00x indicates better performance.

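[Figure: cf_benchmark performance relative to Genoa 9654]

|  | Genoa 9654 (baseline) | Bergamo 9754 | Genoa-X 9684X |
| --- | --- | --- | --- |
| openssl_pki | 1.00x | 1.16x | 1.01x |
| openssl_aead | 1.00x | 1.20x | 1.01x |
| luajit | 1.00x | 0.86x | 1.00x |
| brotli | 1.00x | 1.11x | 0.98x |
| gzip | 1.00x | 0.87x | 1.01x |
| go | 1.00x | 1.09x | 1.00x |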

Bergamo 9754 with 128 cores scores better in openssl_pki, openssl_aead, brotli, and go benchmark suites, and performs less favorably in luajit and gzip benchmark suites. Genoa-X 9684X (with significantly more L3 cache) doesn’t offer a significant boost in performance for these compute-bound benchmarks.

These benchmarks are representative of some of the common workloads Cloudflare runs, and are useful in identifying software scaling issues, system configuration bottlenecks, and the impact of CPU design choices on workload-specific performance. However, the benchmark suite is not an exhaustive list of all workloads Cloudflare runs in production, and in reality, a workload from the benchmark suite is almost never the only workload running on the CPU. In short, though benchmark results can be informative, they are not a good indication of production performance when a mix of these workloads runs on the same processor.

Performance simulation

To get an early indication of production performance, Cloudflare has an internal performance simulation tool that exercises our software stack to fetch a fixed asset repeatedly. The simulation tool can be configured to fetch an asset of a specified fixed size, and to include or exclude services like WAF or Workers in the request path. Below, we show the simulated performance of the three candidate CPUs relative to Milan 7713 for an asset size of 10 KB, where >1.00x indicates better performance.

 

|  | Milan 7713 | Genoa 9654 | Bergamo 9754 | Genoa-X 9684X |
| --- | --- | --- | --- | --- |
| Lab simulation performance multiplier | 1.00x | 2.20x | 1.95x | 2.75x |

Based on these results, Bergamo 9754, which has the highest core count, but smallest L3 cache per core, is least performant among the three candidates, followed by Genoa 9654. The Genoa-X 9684X with the largest L3 cache per core is the most performant. This data suggests that our software stack is very sensitive to L3 cache size, in addition to core count and CPU frequency. This is interesting and worth a deep dive into a sensitivity analysis of our workload against a few (high level) CPU design points, especially core scaling, frequency scaling, and L2/L3 cache sizes scaling.

Sensitivity analysis

Core sensitivity

Number of cores is the headline specification that practically everyone talks about, and one of the easiest improvements CPU vendors can make to increase performance per socket. The AMD Genoa 9654 has 96 cores, 50% more than the 64 cores available on the AMD Milan 7713 CPUs that we used in our Gen 11 servers. Is more always better? Does Cloudflare’s primary workload scale with core count and effectively utilize all available cores?

The figure and table below show the result of a core scaling experiment performed on an AMD Genoa 9654 configured with 96, 80, 64, and 48 cores, done by disabling two more CCDs (8 cores per CCD) at each step. The result is great: Cloudflare’s simulated primary workload scales linearly with core count on AMD Genoa CPUs.

| Core count | Core increase | Performance increase |
| --- | --- | --- |
| 48 | 1.00x | 1.00x |
| 64 | 1.33x | 1.39x |
| 80 | 1.67x | 1.71x |
| 96 | 2.00x | 2.05x |
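One way to read "scales linearly" off this table is to divide each measured speedup by the ideal speedup implied by the core count. The sketch below does that with the values from the table; `scaling_efficiency` is a name chosen here for illustration, not part of any Cloudflare tool.

```python
BASELINE_CORES = 48

# (core count, measured performance multiplier vs. 48 cores), from the table above
CORE_SCALING = [(48, 1.00), (64, 1.39), (80, 1.71), (96, 2.05)]


def scaling_efficiency(cores: int, perf: float, baseline_cores: int = BASELINE_CORES) -> float:
    """Measured speedup divided by the ideal (perfectly linear) speedup."""
    ideal_speedup = cores / baseline_cores
    return perf / ideal_speedup


for cores, perf in CORE_SCALING:
    ideal = cores / BASELINE_CORES
    print(f"{cores} cores: ideal {ideal:.2f}x, measured {perf:.2f}x, "
          f"efficiency {scaling_efficiency(cores, perf):.3f}")
```

An efficiency of 1.0 means perfectly linear scaling. Every step in the table lands at or slightly above 1.0, which is consistent with the claim that the simulated workload uses the additional cores effectively.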

TDP sensitivity

Thermal Design Power (TDP) is the maximum amount of heat generated by a CPU that the cooling system is designed to dissipate, but it more commonly refers to the power consumption of the processor under maximum theoretical load. The AMD Genoa 9654’s default TDP is 360W, but it can be configured up to 400W. Is more always better? Does Cloudflare continue to see meaningful performance improvement up to 400W, or does performance stagnate at some point?

The chart below shows the result of sweeping the TDP of the AMD Genoa 9654 (in power determinism mode) from 240W to 400W. (Note: x-axis step size is not linear).

Cloudflare’s simulated primary workload continues to see incremental performance improvements up to the maximum configurable 400W, albeit at a less favorable perf/watt ratio.

Looking at TDP sensitivity data is a quick and easy way to identify if performance stagnates at some power point, but what does power sensitivity actually measure? There are several factors contributing to CPU power consumption, but let's focus on one of the primary factors: dynamic power consumption. Dynamic power consumption is approximately CV²f, where C is the switched load capacitance, V is the regulated voltage, and f is the frequency. In modern processors like the AMD Genoa 9654, the CPU dynamically scales its voltage along with frequency, so theoretically, CPU dynamic power is loosely proportional to f³. In other words, measuring TDP sensitivity is measuring the frequency sensitivity of a workload. Does the data agree? Yes!

| cTDP (W) | All core boost frequency (GHz) | Perf (rps) / baseline |
| --- | --- | --- |
| 240 | 2.47 | 0.78x |
| 280 | 2.75 | 0.87x |
| 320 | 2.93 | 0.93x |
| 340 | 3.13 | 0.97x |
| 360 | 3.3 | 1.00x |
| 380 | 3.4 | 1.03x |
| 390 | 3.465 | 1.04x |
| 400 | 3.55 | 1.05x |
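To make the two observations above concrete (performance tracks frequency, and perf/watt gets worse as cTDP rises), here is a minimal sketch over the table values. It assumes the configured cTDP can stand in for actual power draw, which is a simplification; the function and constant names are illustrative.

```python
# (cTDP in watts, all-core boost in GHz, performance vs. the 360 W baseline)
TDP_SWEEP = [
    (240, 2.47, 0.78), (280, 2.75, 0.87), (320, 2.93, 0.93), (340, 3.13, 0.97),
    (360, 3.30, 1.00), (380, 3.40, 1.03), (390, 3.465, 1.04), (400, 3.55, 1.05),
]
BASE_WATTS, BASE_GHZ = 360, 3.30


def relative_metrics(watts: float, ghz: float, perf: float) -> tuple[float, float]:
    """Return (frequency ratio, perf-per-watt ratio), both vs. the 360 W row.

    Assumption: cTDP is used as a proxy for power consumed.
    """
    freq_ratio = ghz / BASE_GHZ
    perf_per_watt = perf / (watts / BASE_WATTS)
    return freq_ratio, perf_per_watt


for watts, ghz, perf in TDP_SWEEP:
    freq_ratio, ppw = relative_metrics(watts, ghz, perf)
    print(f"{watts} W: freq {freq_ratio:.2f}x, perf {perf:.2f}x, perf/watt {ppw:.2f}x")
```

With these numbers the performance multiplier stays within a few percentage points of the frequency ratio at every step, while perf/watt relative to the 360W row falls steadily as cTDP increases.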

Frequency sensitivity

Instead of relying on an indirect measure through the TDP, let’s measure frequency sensitivity directly by sweeping the maximum boost frequency.

Above 3 GHz, the data shows that Cloudflare’s primary workload sees roughly 2% incremental improvement for every 0.1 GHz increase in all core average frequency. We hit the 400W power cap at 3.545 GHz. This is notably higher than the typical all core boost frequency that Cloudflare Gen 11 servers with AMD Milan 7713 see (2.7 GHz in production, or 2.4 GHz in our performance simulation), which is amazing!
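The frequency sweep data itself is only shown as a chart, but the cTDP table in the previous section gives a rough cross-check of the "2% per 0.1 GHz" figure. The sketch below fits a least-squares line through the rows of that table above 3 GHz. Using the cTDP rows as a stand-in for the frequency sweep is an assumption made here for illustration.

```python
# (all-core boost in GHz, performance vs. baseline), rows above 3 GHz from the cTDP table
POINTS = [(3.13, 0.97), (3.30, 1.00), (3.40, 1.03), (3.465, 1.04), (3.55, 1.05)]


def least_squares_slope(points: list[tuple[float, float]]) -> float:
    """Slope of the ordinary least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


slope_per_ghz = least_squares_slope(POINTS)
gain_per_100mhz = slope_per_ghz * 0.1  # performance gain per 0.1 GHz
print(f"Estimated gain per 0.1 GHz: {gain_per_100mhz:.1%}")
```

The fitted slope comes out close to the 2% per 0.1 GHz quoted above, so the two sweeps tell a consistent story.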

L3 cache size sensitivity

What about L3 cache size sensitivity? L3 cache size is one of the primary design choices and major differences between the trio of Genoa, Bergamo, and Genoa-X. Genoa 9654 has 4 MB L3/core, Bergamo 9754 has 2 MB L3/core, and Genoa-X has 12 MB L3/core. L3 cache is the last and largest “memory” bank on-chip before having to access memory on DIMMs outside the chip that would take significantly more CPU cycles.

We ran an experiment on the Genoa 9654 to check how performance scales with L3 cache size. We reduced the L3 cache size per core through MSR writes (this could also be done using Intel RDT), and increased it by disabling physical cores in a CCD, which reduces the number of cores sharing the fixed 32 MB L3 cache per CCD and so effectively grows the L3 cache per core. Below is the result of the experiment, where >1.00x indicates better performance:

| L3 cache size increase vs. baseline 4 MB per core | 0.25x | 0.5x | 0.75x | 1x | 1.14x | 1.33x | 1.60x | 2.00x |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| rps/core / baseline | 0.67x | 0.78x | 0.89x | 1.00x | 1.08x | 1.15x | 1.25x | 1.31x |
| L3 cache miss rate per CCD | 56.04% | 39.15% | 30.37% | 23.55% | 22.39% | 19.73% | 16.94% | 14.28% |
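The columns above 1x follow directly from the core-disabling method: with a fixed 32 MB of L3 per CCD, each disabled core leaves more cache for the rest. The sketch below reproduces those ratios. It assumes the experiment stepped from 8 down to 4 active cores per CCD, which is inferred from the ratios in the table rather than stated in the text.

```python
from fractions import Fraction

L3_PER_CCD_MB = 32   # fixed shared L3 per CCD on Genoa 9654
CORES_PER_CCD = 8    # all cores enabled (baseline: 4 MB L3 per core)


def l3_per_core_mb(active_cores: int) -> Fraction:
    """L3 available per core when only `active_cores` share one CCD's cache."""
    return Fraction(L3_PER_CCD_MB, active_cores)


baseline = l3_per_core_mb(CORES_PER_CCD)

# Assumed sweep: 8, 7, 6, 5, 4 active cores per CCD
ratios = []
for active in range(CORES_PER_CCD, 3, -1):
    per_core = l3_per_core_mb(active)
    ratio = per_core / baseline
    ratios.append(round(float(ratio), 2))
    print(f"{active} active cores/CCD: {float(per_core):.2f} MB L3/core "
          f"({float(ratio):.2f}x baseline)")
```

Because this method also reduces the number of cores competing for the cache, some of the measured gain may come from lower contention rather than from cache capacity alone.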

Even though the expectation was that the impact of a different L3 cache size would be diminished by faster DDR5 and larger memory bandwidth, Cloudflare’s simulated primary workload is quite sensitive to L3 cache size. The L3 cache miss rate dropped from 56.04% with only 1 MB L3/core to 14.28% with 8 MB L3/core. Reducing the L3 cache size by 25% reduces performance by approximately 11%, and we continue to see performance increase up to 2x the baseline L3 cache per core, though the gains start to diminish as we approach that point.

Do we see the same behavior when comparing Genoa 9654, Bergamo 9754 and Genoa-X 9684X? We ran an experiment comparing the impact of L3 cache size, controlling for core count and all core boost frequency, and we also saw significant deltas. Halving the L3 cache size from 4 MB/core to 2 MB/core reduces performance by 24%, roughly matching the experiment above. However, increasing the cache 3x from 4 MB/core to 12 MB/core only increases performance by 25%, less than the previous experiment would suggest. This is likely because part of the performance gain we saw in the experiment above could be attributed to less cache contention from the reduced number of cores, given how we set up the test. Nevertheless, these are significant deltas!

| L3/core | 2 MB/core | 4 MB/core | 12 MB/core |
| --- | --- | --- | --- |
| Perf (rps) / baseline | 0.76x | 1x | 1.25x |

Putting it all together

The table below summarizes how each factor from the sensitivity analysis above contributes to the overall performance gain. An additional 6% to 14% of performance improvement is not accounted for by these factors, and is contributed by other factors like larger L2 cache, higher memory bandwidth, and miscellaneous CPU architecture changes that improve IPC.

 

|  | Milan 7713 | Genoa 9654 | Bergamo 9754 | Genoa-X 9684X |
| --- | --- | --- | --- | --- |
| Lab simulation performance multiplier | 1x | 2.2x | 1.95x | 2.75x |
| Performance multiplier due to core scaling | 1x | 1.5x | 2x | 1.5x |
| Performance multiplier due to frequency scaling* | 1x | 1.32x | 1.21x | 1.29x |
| Performance multiplier due to L3 cache size scaling | 1x | 1x | 0.76x | 1.25x |
| Performance multiplier due to other factors like larger L2 cache, higher memory bandwidth, miscellaneous CPU architecture changes that improve IPC | 1x | 1.11x | 1.06x | 1.14x |

* Note: Milan 7713 all core frequency is ~2.4 GHz when running the simulated workload at 100% CPU utilization
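The rows of this table appear to combine multiplicatively: the product of the four per-factor multipliers lands on the lab simulation multiplier for each CPU. The sketch below checks that, assuming the factors are independent and multiply (the table implies this, since the "other factors" row reads as the residual). The dictionary layout is illustrative.

```python
from math import prod

# Per-factor multipliers vs. Milan 7713, copied from the table above
FACTORS = {
    "Genoa 9654":    {"cores": 1.5, "frequency": 1.32, "l3_cache": 1.00, "other": 1.11},
    "Bergamo 9754":  {"cores": 2.0, "frequency": 1.21, "l3_cache": 0.76, "other": 1.06},
    "Genoa-X 9684X": {"cores": 1.5, "frequency": 1.29, "l3_cache": 1.25, "other": 1.14},
}

# Lab simulation performance multipliers from the same table
LAB_SIMULATION = {"Genoa 9654": 2.2, "Bergamo 9754": 1.95, "Genoa-X 9684X": 2.75}


def combined_multiplier(factors: dict[str, float]) -> float:
    """Product of the per-factor multipliers (assumes they are independent)."""
    return prod(factors.values())


for cpu, factors in FACTORS.items():
    combined = combined_multiplier(factors)
    print(f"{cpu}: product of factors {combined:.2f}x, "
          f"lab simulation {LAB_SIMULATION[cpu]:.2f}x")
```

The products agree with the measured lab multipliers to within rounding, which is what one would expect if the last row was derived as the remainder after the three measured factors.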

Performance evaluation in production

How do these CPU candidates perform with real-world traffic and an actual production workload mix? The table below summarizes the performance of the three CPUs in lab simulation and in production. Genoa-X 9684X continues to outperform in production.

In addition, the Gen 12 server equipped with Genoa-X offered outstanding performance while consuming only 1.5x the power per system of our Gen 11 server with Milan 7713. In other words, we see a 63% increase in performance per watt. Genoa-X 9684X provides the best TCO improvement among the three options, and was ultimately chosen as the CPU for our Gen 12 server.

 

|  | Milan 7713 | Genoa 9654 | Bergamo 9754 | Genoa-X 9684X |
| --- | --- | --- | --- | --- |
| Lab simulation performance multiplier | 1x | 2.2x | 1.95x | 2.75x |
| Production performance multiplier | 1x | 2x | 2.15x | 2.45x |
| Production performance per watt multiplier | 1x | 1.33x | 1.38x | 1.63x |
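The headline figures follow from the last two rows of this table. Dividing the production performance multiplier by the performance per watt multiplier gives the implied system power relative to Gen 11, and subtracting one from a multiplier gives the percentage gain. The sketch below works through that arithmetic; the helper names are illustrative.

```python
# (production performance multiplier, production perf-per-watt multiplier) vs. Milan 7713
PRODUCTION = {
    "Genoa 9654":    (2.00, 1.33),
    "Bergamo 9754":  (2.15, 1.38),
    "Genoa-X 9684X": (2.45, 1.63),
}


def implied_power_multiplier(perf: float, perf_per_watt: float) -> float:
    """System power relative to Gen 11, derived as perf / (perf per watt)."""
    return perf / perf_per_watt


def percent_gain(multiplier: float) -> float:
    """Convert a multiplier such as 2.45x into a percentage gain such as 145%."""
    return (multiplier - 1.0) * 100.0


for cpu, (perf, ppw) in PRODUCTION.items():
    power = implied_power_multiplier(perf, ppw)
    print(f"{cpu}: +{percent_gain(perf):.0f}% performance, "
          f"+{percent_gain(ppw):.0f}% perf/watt, ~{power:.2f}x power")
```

For Genoa-X 9684X this gives the 145% performance gain and 63% efficiency gain quoted at the top of the post, at roughly 1.5x the system power of Gen 11.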

The Gen 12 server with AMD Genoa-X 9684X is the most powerful and the most power efficient server Cloudflare has built to date. It serves as the underlying platform for all the incredible services that Cloudflare offers to our customers globally, and will help power the growth of Cloudflare infrastructure for the next several years with improved cost structure. 

Hardware engineers at Cloudflare work closely with our infrastructure engineering partners and externally with our vendors to design and develop world-class servers to best serve our customers. 

Come join us at Cloudflare to help build a better Internet!


Article source: https://blog.cloudflare.com/analysis-of-the-epyc-145-performance-gain-in-cloudflare-gen-12-servers