Ensuring consistency and minimal interference during benchmarking?

I am attempting to measure the performance of a common piece of software on the current lineup of 13" and 16" MacBook Pros for the purpose of comparing models.

Analysing the specs "on paper" may not be ideal, since CPUs and memory can perform differently depending on how the software uses them, and the obvious metrics (clock speed, core count, memory type and memory size) don't always correlate with actual performance.

I am working on some simple, reproducible tests to benchmark performance across MBPs and generate some real, empirical data. They will resemble real-life workflows and focus on heavy compute and heavy memory use. All up they will take under two hours per machine, and they don't require Internet access.
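For concreteness, this is roughly the shape of harness I have in mind; the workload functions and iteration counts below are placeholders, not the real tests:

```python
import statistics
import time


def heavy_compute():
    # Placeholder: stands in for the real compute-bound workflow.
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total


def heavy_memory():
    # Placeholder: stands in for the real memory-bound workflow.
    data = list(range(20_000_000))
    return sum(data[::7])


def run_benchmark(name, workload, repeats=5):
    # Run each workload several times and report the median and spread,
    # so a one-off background hiccup doesn't decide the result.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    print(f"{name}: median {statistics.median(timings):.3f}s, "
          f"stdev {statistics.stdev(timings):.3f}s, runs {timings}")


if __name__ == "__main__":
    run_benchmark("heavy compute", heavy_compute)
    run_benchmark("heavy memory", heavy_memory)
```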

Question: how should each MBP be prepared before each test is run? I want to ensure maximum consistency and accuracy across tests, and therefore a fair comparison.

Some things I have considered so far:

  • Wi-Fi and Bluetooth off
  • Consistent room temperature and (sun) lighting conditions
  • Close as many unnecessary applications as possible, leaving open only what is necessary for the OS and for the testing software
  • I am not sure what else, or if there is any established best practice. I am particularly concerned about background tasks I may not be aware of that could run during a test and affect the results. (One idea I had is to diff a process-list snapshot taken before and after each run; see the sketch below.)
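The sketch I have in mind for catching those background tasks: snapshot the process list immediately before and after each run and diff the two, so anything that started or exited during the test stands out. This assumes Python and the standard `ps` tool are available:

```python
import subprocess


def process_snapshot():
    # Capture the command names of everything currently running.
    out = subprocess.run(["ps", "-axo", "comm"], capture_output=True, text=True)
    return set(out.stdout.splitlines()[1:])  # drop the "COMM" header line


before = process_snapshot()
input("Run the benchmark now, then press Enter...")
after = process_snapshot()

print("Started during the test:", sorted(after - before))
print("Exited during the test:", sorted(before - after))
```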

Solution 1:

The items you've considered are already good! Disabling the Internet stops a lot of stray processes.

You can try running the tests in Safe Mode, provided they don't rely on any of the services that Safe Mode disables. Also consider creating a new, clean user account just for the testing.

https://support.apple.com/HT201262

Safe mode prevents your Mac from loading certain software as it starts up, including login items, system extensions not required by macOS, and fonts not installed by macOS. It also does a basic check of your startup disk, similar to using First Aid in Disk Utility. And it deletes some system caches, including font caches and the kernel cache, which are automatically created again as needed.

That's the closest you can get to a brand new Mac.

Also, if the tests produce large output files, it's best to delete them after every run: the amount of free storage affects virtual memory (swap) usage and can increase memory pressure.
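If you script that cleanup, you can also log the free space before every run so you can spot when low storage might be skewing a result. A rough sketch in Python; the results directory is just a placeholder:

```python
import shutil
from pathlib import Path

# Hypothetical location where the benchmark writes its output.
RESULTS_DIR = Path.home() / "benchmark-results"


def clean_and_report():
    # Delete the previous run's output so leftover files don't
    # eat into free space (and therefore swap headroom).
    if RESULTS_DIR.exists():
        shutil.rmtree(RESULTS_DIR)
    RESULTS_DIR.mkdir(parents=True)

    usage = shutil.disk_usage("/")
    print(f"Free space before run: {usage.free / 1e9:.1f} GB "
          f"of {usage.total / 1e9:.1f} GB")


clean_and_report()
```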

Solution 2:

To do any sort of benchmarking, consistency is key.

When benchmarking your software, whether or not the results will be published as part of your marketing material, it needs to be consistent across technical specifications, not just across a product "line" (e.g. MacBook Pro 13").

You wrote:

"I am attempting to measure the performance of a common piece of software on the current lineup of 13" and 16" MacBook Pros for the purpose of comparing models."

The problem is that technical configurations vary significantly within a line. My 2020 13" MacBook Pro may be vastly different from your 13" MacBook Pro. Even if you choose to describe the performance statistics with a wide brush, it may be too wide a brush, because the CPU, GPU, and memory configurations can vary by such a wide degree. Because of this, saying "we get X operations on the 13" model and Y operations on the 16" model" won't have much relevance, because the natural question will be "which one of the 13" models did you test?"
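One practical way to head off the "which 13" model did you test?" question is to capture the exact configuration alongside every result. A minimal sketch using Python to call standard macOS tools (the CPU sysctl applies to Intel machines like these; adjust as needed):

```python
import subprocess


def read(cmd):
    # Run a command and return its trimmed output.
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()


config = {
    "model": read(["sysctl", "-n", "hw.model"]),                # e.g. MacBookPro16,1
    "cpu": read(["sysctl", "-n", "machdep.cpu.brand_string"]),  # exact CPU, not just "i5" or "i9"
    "ram_gb": int(read(["sysctl", "-n", "hw.memsize"])) // (1024 ** 3),
    "macos_version": read(["sw_vers", "-productVersion"]),
    "macos_build": read(["sw_vers", "-buildVersion"]),          # same build number everywhere
}

for key, value in config.items():
    print(f"{key}: {value}")
```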

Critical & Non-Critical Factors

  • CPU: The CPU can vary from a 1.4GHz i5 to a 2.3GHz i9. That alone can affect your performance (and a 2.3GHz i7 and a 2.3GHz i9 are very different in terms of speed).

  • Memory: It's not as critical as you might think unless there's memory pressure because you're maxing out what you have. For example, if you're using a total of 4GB of RAM, having 8 or 16GB installed isn't going to matter; but if you're pushing 8GB and you're comparing machines with 8GB and 16GB, you will see a performance difference. (You can watch memory pressure and thermal state during a run; see the sketch after this list.)

However, from a marketing standpoint (what the customer reads), you want to remain consistent. You don’t want them asking “why did you use 8 here and 16 there?” You want to avoid giving the customer “cans of worms to open.”

  • GPU: Critical if your machine makes use of the GPU in some way - for graphics rendering or for number crunching. However, like memory, you want to remain consistent here too as it’s also “a can of worms you don’t want them opening.”

  • Storage: Not critical at all. Whether you run this on a MacBook Pro with a 256GB SSD or a 512GB SSD won't make any difference; storage performance isn't measured in size but in IOPS (Input/Output Operations Per Second). So, unless your app depends on the speed of reading from or writing to the drive, you can feel confident ignoring this factor.

  • Touch Bar: Yep, the Touch Bar. While it has its own "processor" to drive it, it still has to sync up with the rest of the OS, and that requires CPU cycles (as few as they may be). You don't want something leeching CPU cycles on one test machine and not another, so make sure you're consistent about having one or not having one.

  • WiFi: Generally not a concern if your app doesn't need network connectivity. If it doesn't need it, turn it off on every machine for consistency. Like storage, unless it's critical to the app, you can safely ignore it.

  • Peripheral devices: mice, keyboards, external drives, etc. Most of these use very few CPU cycles, but again, you don't want a performance leech affecting your data. If you must use one, such as a USB-C to HDMI or Ethernet adapter, make sure you use identical ones on each machine. There is no single USB bridge (the interface between USB and the device), SATA controller (for drives), or Ethernet controller (for networking); manufacturers are notorious for turning off features (even when the same chip is used) or for using poorly performing chips that require more CPU cycles to compensate. So if you must use an adapter, make sure they are all the same brand and model.

  • OS: This is very critical. You want a clean (i.e. fresh) install with no third-party anything on it. Safe mode is good in a pinch, but it doesn't account for any System Preferences customizations you may have made. The OS needs to be the same on every machine, right down to the build number.

  • Environment: Somewhat critical. Ideally, you want to run in a temperature-controlled room where heat doesn't change significantly (i.e. avoid a room with lots of windows where sunlight shifts the temperature). Unless you're testing in the Arctic or in a desert, any temperature-controlled room is fine.
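As mentioned under Memory, if you want evidence that memory pressure and thermal throttling behaved the same way on every machine, you can log a couple of built-in tools periodically during a run and keep the raw output with your results. A rough sketch; `pmset -g therm` output differs between models, so it's logged as-is rather than parsed:

```python
import subprocess
import time

SAMPLES = 10           # how many snapshots to take
INTERVAL_SECONDS = 30  # time between snapshots


def snapshot(cmd):
    # Capture the raw output of a monitoring command.
    return subprocess.run(cmd, capture_output=True, text=True).stdout


with open("system-log.txt", "w") as log:
    for _ in range(SAMPLES):
        log.write(f"--- {time.strftime('%H:%M:%S')} ---\n")
        log.write(snapshot(["vm_stat"]))               # paging / swap activity
        log.write(snapshot(["pmset", "-g", "therm"]))  # thermal throttling state
        log.flush()
        time.sleep(INTERVAL_SECONDS)
```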

TL;DR

Benchmarking requires repeatable consistency. However you configure your devices, you must make every attempt to ensure a consistent test environment in which you can perform multiple iterations of the benchmark without worrying that your factors have changed between runs.

The worst-case scenario is that you get the guy who likes to play "stump the tech" in a public forum and your data is inconsistent. And as mentioned earlier, you don't want to give your customers "cans of worms" to open, because ultimately you'll be the one who has to clean them all up.