How do I find the cause for a huge difference in performance between two identical Ubuntu servers?

Solution 1:

Two ideas, depending on how far you want to go with this:

  1. Swap the disks between the two servers and see whether the slowdown stays with the hardware or moves with the disks (and therefore with the software installation).

  2. Compare the output of /opt/dell/toolkit/bin/syscfg -o complete-bios-config.out from both servers, if you can somehow trick this package into installing.

Solution 2:

More possibilities to output and diff:

  • sysctl -a (make sure the kernel tunables are the same)
  • cat /proc/interrupts (maybe some other piece of hardware is misbehaving?)
  • ipmitool sensor list (a long shot, but check for lower-level differences: overheating, voltage problems, etc.)
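
If eyeballing a plain diff of the sysctl output gets noisy, a small script can reduce it to the keys that actually differ. This is only a sketch: it assumes you saved the output of sysctl -a from each server to a text file, that every line has the usual "key = value" form, and the function names parse_settings and diff_settings are made up for illustration.

```python
def parse_settings(text):
    """Parse 'key = value' lines (the format sysctl -a prints) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(text_a, text_b):
    """Return {key: (value_a, value_b)} for every key whose value differs.

    A key that is missing on one side is reported with None for that side.
    """
    a = parse_settings(text_a)
    b = parse_settings(text_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }
```

Read both saved files, pass their contents to diff_settings, and print the result. Expect some harmless noise: values such as kernel.random.boot_id and kernel.random.uuid are different on every machine by design.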

Solution 3:

This sounds like it might be load-balancer related to me. When you say "same workload", how are you measuring this?
Are you directly benchmarking each server by applying a test load in isolation?
Or are you applying some load to the load balancer and looking at the results on both servers?

If you're doing the latter (measuring the load placed on both servers through the load balancer), your load balancer may not be splitting the workload exactly evenly between the servers. A 20% skew for a pair of servers is not uncommon, depending on how your load balancer decides who gets which requests. That would cause one server to take more load and thus perform worse.
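
To check for that kind of skew, count the requests each server handled over the same time window (from access logs or the load balancer's stats) and compare. A rough sketch follows; the server names and counts are made up, and "skew" is taken here to mean how much more traffic the busiest server received than the least busy one, which is just one possible definition.

```python
def request_shares(counts):
    """Return each server's fraction of the total request count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests recorded")
    return {server: n / total for server, n in counts.items()}


def relative_skew(counts):
    """Return how much more load the busiest server got than the least busy.

    0.2 means the busiest server handled 20% more requests.
    """
    lowest = min(counts.values())
    highest = max(counts.values())
    if lowest == 0:
        raise ValueError("at least one server received no requests")
    return highest / lowest - 1.0


# Hypothetical counts for the same time window on both servers.
counts = {"web1": 6000, "web2": 5000}
print(request_shares(counts))
print(round(relative_skew(counts), 3))
```

If the skew is large, compare per-request latency rather than overall load on each box before concluding that one server is slower.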

(If you're directly benchmarking each server in isolation, without using the load balancer as an intermediary, and you've verified that every component is identical between both systems, down to manufacturer revisions, then I'm at a loss. I can't think of any other measurable reason for this kind of performance difference between otherwise identical servers.)
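
For the isolated case, it helps to run exactly the same benchmark command several times on each server and compare the spread, not just a single run. Below is a minimal timing harness as a sketch; the function names are illustrative, and the benchmark command is whatever you already use.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=5):
    """Run cmd (a list of argv strings) `runs` times.

    Returns the wall-clock duration of each run in seconds.
    Raises CalledProcessError if the command exits non-zero.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of durations to median, min and max."""
    return {
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }
```

Call summarize(time_command([...])) with your benchmark's argv on both machines. A consistent gap in the median points at the machine itself; a wide min/max spread on one box suggests something intermittent.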

Solution 4:

Try some profiling tools, either system profiling like perf or Java profiling like VisualVM.

With perf you can profile either the running Java process by PID or a benchmark run. Do this on both systems and see where the slow system is spending its time.

apt-get install linux-tools-common linux-tools

Then something like:

perf record -e cpu-cycles -p <pid>

or

perf record -a -g <benchmark command>

then

perf report

A couple of ideas about how systems can perform differently:

Environment: Is the air temperature or airflow different? Are they in racks? I have seen systems perform differently in different rack positions because of vibration; vibration levels vary throughout a rack. Vibration is an unlikely cause here, since you said there is almost no I/O, but I have seen disks slow down to 2 MB/sec sequential writes due to vibration in parts of a rack.

Hardware Faults: Any of the hardware could be faulty. Use the profiling to see what is slow. It could be a bad CPU or chipset, a heatsink that is not attached properly, out-of-balance fans causing vibration, failed fans, or even a bad PSU. Try swapping the things that are easy to swap.