Why does a server CPU perform tasks faster than a MacBook Pro CPU with the same benchmark score?

Solution 1:

Given the task you describe, compiling a Babel project, and the CPUs involved, I think I know the source of the difference in performance. I wanted to answer earlier but had to do a bit of research to confirm my hunch.

First, let's characterize the load you're putting on your system.

Babel.js is written as a single-threaded, single-process compiler that relies mostly on asynchronous I/O for parallelism (at least nothing I've found suggests it uses worker threads). Since it is a compiler that reads its input from disk, a large part of its execution is spent waiting for data from disk. This gives us the following workload:

  1. It is single-threaded, so multiple cores or hyperthreading have no significant effect on compilation, with one caveat:

  2. Node.js uses a small pool of worker threads (libuv's thread pool) to handle disk I/O, but beyond two to four hardware threads there is no additional advantage to having more cores (see: https://nodejs.org/en/docs/guides/dont-block-the-event-loop/)

  3. Most of the parallelism takes place at the I/O level: Babel will try to read as many files in parallel as possible (a minimal sketch of what this looks like in Node.js follows this list).
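
To make point 3 concrete, here is a minimal sketch (my own illustration, not Babel's actual source) of how a single-threaded Node.js tool reads many files "in parallel": the JavaScript stays on one thread while libuv's worker pool (4 threads by default, tunable via the UV_THREADPOOL_SIZE environment variable) services the reads.

```typescript
// Minimal sketch, not Babel's actual code: dispatch every read at once from a
// single JavaScript thread. The reads themselves are serviced by libuv's
// worker pool (default size 4, tunable via UV_THREADPOOL_SIZE), so how much
// I/O truly overlaps is decided below the JavaScript level.
import { readFile } from "node:fs/promises";

async function readSources(paths: string[]): Promise<Map<string, string>> {
  const sources = new Map<string, string>();
  await Promise.all(
    paths.map(async (p) => {
      sources.set(p, await readFile(p, "utf8"));
    })
  );
  return sources;
}
```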

Both the i5 and the Xeon are fairly comparable with regard to points 1 and 2. So let's look at how each CPU can handle point 3: servicing Babel's parallel file read requests.

Here's the first big difference between the two systems:

  • The Core i5-8259U has 16 PCIe lanes

  • The Xeon Platinum 8151 has 48 PCIe lanes

So the Xeon can clearly handle more parallel I/O than the i5. When there is more I/O in flight than there are lanes to carry it, the OS handles it the same way it handles having more runnable tasks than hardware threads: it queues them up and forces them to take turns.

Next I wanted to know whether NVMe can actually use multiple lanes, and this is where I hit another interesting fact. An NVMe drive can connect over up to 4 PCIe lanes (the slot physically provides that many), but some drives use only 2 while others use all 4, so not all NVMe drives are created equal. With PCIe 3.0 delivering roughly 1 GB/s per lane, an x4 drive has nearly double the bandwidth of an x2 drive, which alone roughly doubles the rate at which Babel can pull files into RAM.

It also depends on how the NVMe drive is connected to the CPU. With only 16 PCIe lanes, the Core i5 will likely reserve at least 8 of them for the GPU, leaving only 8 to be shared among the other devices. That means your NVMe drive may sometimes have to share bandwidth with your Wi-Fi or other hardware, which slows it down a bit more.

And your NVMe drive may not even be connected directly to the CPU's PCIe lanes. The MacBook may reserve all 16 lanes for the GPU and connect the NVMe drive through its south bridge (which can provide additional PCIe lanes). I don't know whether the MacBook does this, but it would again reduce performance a bit more.

In contrast, the Xeon's large number of lanes gives the motherboard designer much more freedom to build a really fast I/O platform. In addition, an AWS server does not normally have a GPU installed, so it does not need to reserve any lanes for one. Again, I don't personally know the actual architecture of AWS's servers, but it is certainly possible to build one that outperforms a MacBook at compiling Babel projects.

So in the end, the main factors that enable the EC2 instance to outperform the MacBook are:

  1. The number of PCIe lanes directly supported by the CPU

  2. The number of PCIe lanes used by the NVMe drive

  3. How the NVMe drive's lanes are connected to the CPU

Additional factors that may contribute include:

  1. The speed of the I/O bus (PCIe 2.0 vs PCIe 3.0, etc.)

  2. The speed of RAM

  3. The number of DMA channels available (this alone would require a long answer, so I have mostly skipped it, but the reasoning is similar to the PCIe lanes)
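
If you want to sanity-check this explanation yourself, a crude way is to time how fast each machine can pull a whole source tree into memory with unbounded concurrency. Below is my own rough sketch (the "./src" default path is a placeholder); note that the OS page cache makes repeat runs much faster, so compare the first run on each machine.

```typescript
// Rough sketch: read every .js file under a directory concurrently and report
// throughput. Run it on both the MacBook and the EC2 instance against the same
// project tree. Point it at a source directory; "./src" is only a placeholder.
import { readFile, readdir } from "node:fs/promises";
import { join } from "node:path";

async function listJsFiles(dir: string): Promise<string[]> {
  const entries = await readdir(dir, { withFileTypes: true });
  const files: string[] = [];
  for (const entry of entries) {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) files.push(...(await listJsFiles(full)));
    else if (entry.name.endsWith(".js")) files.push(full);
  }
  return files;
}

async function main(): Promise<void> {
  const root = process.argv[2] ?? "./src";
  const files = await listJsFiles(root);
  const start = process.hrtime.bigint();
  const blobs = await Promise.all(files.map((f) => readFile(f)));
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  const megabytes = blobs.reduce((total, b) => total + b.length, 0) / 1e6;
  console.log(`${files.length} files, ${megabytes.toFixed(1)} MB in ${(seconds * 1000).toFixed(0)} ms`);
}

main();
```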

Solution 2:

Benchmarks are vague handwaves at some very specific performance characteristics (peak instruction rate) that often do not take other factors in the system into account.

A non-exhaustive list of things that can make a big difference to programs but not peak instruction rates:

  • Memory. Type, bandwidth, channels. These all make a difference in how fast data can get to the CPU for it to do work. Servers typically have more memory channels, higher capacities and much higher peak bandwidth than desktop or laptop CPUs. A high single-core instruction rate wins you nothing if you can't feed the CPU data fast enough to hit that rate.
    As a simple check, the Xeon 8180 (the closest I could find) has 6 memory channels, while your laptop CPU should have 2 channels set up (or, if poorly designed, only one). The server has roughly 3 times the memory bandwidth of your laptop, which makes a massive difference for memory-intensive tasks (a crude way to see this for yourself is sketched after this list).
  • Hard disk. Faster hard disks, SSDs and so on can make a big difference in getting data into memory for the CPU to work on. An SSD is orders of magnitude faster at seeking for small bits of data, and its bulk transfer rate is much higher too; NVMe is faster again. Servers often use RAID for redundancy or can be set up for raw speed. While both machines may use NVMe, a server farm may well have enterprise-class disks in RAID 0 or 0+1 and be faster than your single disk, which is particularly likely on shared machines where minimal impact across VMs is desirable.
  • Thermal limiting. Benchmarks, especially on laptops and ultra-portable machines, tend to last only long enough to capture the initial ramp-up of performance. Over time the thermal headroom is used up as the fans fail to keep pace with the heat output, and that initial turbo-boost speed drops down to the "normal" peak clock frequency. This can skew benchmark results and make a laptop look a lot better than it will perform under long-term load. Servers tend to have over-specified (and loud) cooling systems to guarantee performance, while laptops are designed for quiet home comfort and their fans are far less powerful. The machine behind a published benchmark may not be thermally limited the way the one in front of you is, so yours may not perform as well and may throttle sooner.
  • Bottlenecks. Servers have far more I/O than laptops: more PCIe lanes, more dedicated I/O ports and much higher bandwidth to peripherals, meaning more data in flight down uncontested paths. Multiple PCIe devices contending for time on a multiplexer connected to a 16-lane CPU will be slower than the same devices on a CPU with 40+ dedicated lanes.
  • Cores. Having more cores not only helps the task you are running on one core, it also means other tasks are not fighting it for time. The tradeoff is that it is easier to hit memory bandwidth limits with more cores vying for bus time.
  • Caches. Server CPUs tend to have much larger caches. While this is more of an optimisation, a larger cache means fewer trips to memory and lets the CPU stay closer to its peak performance. A single-core benchmark's working set is probably small enough to fit in most caches, so it tells you little about the rest of the system.
  • Graphics. Related to PCIe/memory bus contention, your laptop will be doing graphics work, most likely with an iGPU. That means your system memory is being used (and memory bandwidth stolen) to drive a graphical display. The server would likely have none of that, being a headless node in a compute cluster, so it carries far less graphical overhead.
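
As a crude illustration of the memory point above (my own sketch, nothing like a rigorous STREAM benchmark), you can stream-copy a large typed array on both machines and compare the effective copy bandwidth:

```typescript
// Very rough memory-bandwidth probe: repeatedly copy a large Float64Array and
// report effective GB/s (counting both the read and the write traffic). The
// sizes are arbitrary; the point is only to compare the two machines.
const N = 32 * 1024 * 1024;              // 32M doubles = 256 MB per array
const src = new Float64Array(N);
const dst = new Float64Array(N);
src.fill(1.0);

const reps = 5;
const start = process.hrtime.bigint();
for (let r = 0; r < reps; r++) {
  dst.set(src);                          // bulk copy, essentially memory-bound
}
const seconds = Number(process.hrtime.bigint() - start) / 1e9;
const gigabytes = (reps * 2 * N * 8) / 1e9; // bytes read + bytes written
console.log(`~${(gigabytes / seconds).toFixed(1)} GB/s effective copy bandwidth`);
```

On a dual-channel laptop you would expect this to plateau well below what a six-channel server can sustain, though the exact numbers depend heavily on cache effects and how the runtime implements the copy.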

Consumer-class CPUs are indeed powerful, but server-class parts have far more logic, control and bandwidth to the wider system. Generally, though, that is fine. We don't expect a 15-watt processor to perform the same as a CPU that costs 10 times as much and has a 140-watt power budget. That extra power budget buys a lot more freedom.

If server CPUs had the same performance as a desktop or laptop CPU, then there wouldn't be a distinction between the two.

Just to drive the point home: a similar single-core score only tells you that the cores are reasonably comparable under ideal conditions. They may be close in theoretical performance, but that says nothing about the wider system or what the CPU can do when tied to its other components. Single-core speed artificially focuses on one small point in the system, far narrower than what most real uses of the machine will exercise.

For more information on why one system is "better" than another, you need to look at so-called "real world" benchmarks, which (while still artificial) show more comparable system performance metrics and hopefully give some idea of where the bottlenecks might lie. Better yet, do the kind of test you did, which shows that for that workload a server-class system, with its underlying architecture and components, is much better.