Localhost TCP throughput performance differences

I have been using psping to measure bandwidth on localhost on different computers: laptops, home computers and servers. All of them reach between 100 and 200 MB/s, but my PowerPc at work manages 800 MB/s.

What can cause such huge differences when a machine communicates with itself over localhost? The PowerPc outperforms every other device I've tested by a factor of 4 to 8.

PowerPc configuration

  • Windows 7
  • Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz [Family 6 Model 60 Stepping 3]
  • 3.78 GFLOPS/core
  • Symantec SEP

A Home computer configuration

  • Windows 8.1
  • Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Intel64 Family 6 Model 26 Stepping 5] 4
  • 2.64 GFLOPS/core
  • BitDefender

psping command

psping -4 -b -l 8k -n 20000 localhost:1234
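
For anyone who wants to reproduce a comparable measurement without psping, here is a minimal sketch in Python. It is not what psping does internally; it only mirrors the parameters of the command above (8 KB buffers, 20000 sends, IPv4 loopback). The function name measure_loopback_throughput is my own, and the interpreter adds overhead, so absolute numbers will not match psping. It is mainly useful for comparing machines against each other.

```python
import socket
import threading
import time


def measure_loopback_throughput(buffer_size=8 * 1024, count=20000, host="127.0.0.1"):
    """Send `count` buffers of `buffer_size` bytes over loopback TCP; return MB/s."""
    total = buffer_size * count
    received = []

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def receiver():
        # Accept one connection and drain it until all bytes have arrived.
        conn, _ = listener.accept()
        got = 0
        with conn:
            while got < total:
                chunk = conn.recv(65536)
                if not chunk:
                    break
                got += len(chunk)
        received.append(got)

    thread = threading.Thread(target=receiver)
    thread.start()

    payload = b"\0" * buffer_size
    sender = socket.create_connection((host, port))
    start = time.perf_counter()
    with sender:
        for _ in range(count):
            sender.sendall(payload)
    thread.join()  # stop the clock only after the receiver has read everything
    elapsed = time.perf_counter() - start
    listener.close()

    got = received[0] if received else 0
    if got != total:
        raise RuntimeError("receiver got %d of %d bytes" % (got, total))
    return total / elapsed / 1e6  # MB/s, with 1 MB = 10**6 bytes


if __name__ == "__main__":
    print("%.1f MB/s" % measure_loopback_throughput())
```

With the defaults this transfers 20000 x 8192 bytes, roughly 164 MB, in one direction.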

A few questions I expect to get and would like to address up front

I might be way off but this shows you my current understanding of things, feel free to set me straight.
  1. Antivirus related
    I have turned off the antivirus component on my Home Computer without any noticeable difference. Further, I have captured a WPA trace (XperfScripts), and the modules with the most CPU activity are ntoskrnl.exe, netio.sys, tcpip.sys, ndis.sys and afd.sys. The first AV module to show up in terms of CPU usage is avcuf32.dll, accounting for 0.17% of total CPU.

  2. Localhost vs. 127.0.0.1
    I have tried both and got the same results on all tested computers.

  3. Up-to-date drivers
    The drivers on my Home computer are up-to-date. The drivers on the PowerPc are managed by our IT staff and lag behind somewhat but not that much (and the PowerPc is 4x faster on the tests anyway)

  4. netsh int tcp show global
    There are some differences between the two PCs. Chimney Offload State and NetDMA State are both disabled on my Home Computer, while on the PowerPc they are automatic and enabled, respectively.
    My networking-fu is not good enough to know whether that could explain the differences but, from reading up on the subject, I doubt it does.
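
    To compare the settings of two machines systematically, something like the sketch below could be used. It assumes the netsh output consists of "name : value" lines. The two sample strings are hand-written from the values described above, not captured output, and the function names are hypothetical.

```python
def parse_netsh_globals(text):
    """Parse 'name : value' lines into a dict; other lines are ignored."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() and value.strip():
            settings[name.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {name: (value_in_a, value_in_b)} for every setting that differs."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Hand-written samples (assumption): only the two settings mentioned above.
HOME = """\
Chimney Offload State               : disabled
NetDMA State                        : disabled
"""
POWERPC = """\
Chimney Offload State               : automatic
NetDMA State                        : enabled
"""

differences = diff_settings(parse_netsh_globals(HOME), parse_netsh_globals(POWERPC))
for name, (home_value, powerpc_value) in differences.items():
    print("%s: home=%s, powerpc=%s" % (name, home_value, powerpc_value))
```

    In practice the two strings would come from saving the output of netsh int tcp show global on each machine.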


Edit

RAM details PowerPC

  capacity speed memorytype totalwidth datawidth typedetail
  -------- ----- ---------- ---------- --------- ----------
4294967296  1600          0         64        64        128
4294967296  1600          0         64        64        128
4294967296  1600          0         64        64        128
4294967296  1600          0         64        64        128

RAM details Home computer

  capacity speed memorytype totalwidth datawidth typedetail
  -------- ----- ---------- ---------- --------- ----------
2147483648  1333          1         72        64          2
4294967296  1333          1         72        64          2
2147483648  1333          1         72        64          2
4294967296  1333          1         72        64          2
2147483648  1333          1         72        64          2
4294967296  1333          1         72        64          2
   4194304    33         11          8         8       4096

I believe the PowerPC on Windows 7 is much faster on localhost loopback throughput because it can use NetDMA.

The Microsoft article NetDMA (Windows Drivers) defines NetDMA as:

The NetDMA interface provides a generic interface for memory-to-memory direct memory access (DMA) transfers. Although the interface is designed to copy packets that are received from high-performance network interface cards (NICs), you can also use the interface for other applications. There is no direct relationship between NetDMA and NDIS.

When using localhost loopback, it stands to reason that memory copy operations are the main factor of throughput, as frames are copied from the source-application memory, then between TCP layers and finally to the memory of the target-application.

NetDMA can have an impact because it allows network adapters to transfer data directly to your application, which may reduce the number of memory copies even for the trivial loopback adapter.

Enabling NetDMA can be done in two ways:

  1. Enter netsh int tcp set global netdma=enabled in Command Prompt (cmd) that is run as Administrator, then reboot.
  2. Regedit to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters and create a new DWORD item named EnableTCPA with the value 1, then reboot.

However, there are two prerequisites to enabling NetDMA:

  1. The Microsoft article Enabling NetDMA has this:

NetDMA must be enabled in the BIOS before performing this procedure. NetDMA support is often labeled IOAT support.

  2. The Microsoft article NetDMA (Windows Drivers) has this note:

The NetDMA interface is not supported in Windows 8 and later.

Putting these two requirements together, I can hazard the guess that, as NetDMA is a BIOS function, it was not implemented in UEFI, which is used with Windows 8/2012.

Microsoft therefore had to improve localhost loopback throughput in another way, especially for use in Hyper-V, and so introduced Fast TCP Loopback in Windows 8/2012, defined as:

TCP Loopback Fast Path is a new feature introduced in Windows Server 2012 and Windows 8. If you use the TCP loopback interface for inter-process communications (IPC), you may be interested in the improved performance, improved predictability, and reduced latency the TCP Loopback Fast Path can provide. This feature preserves TCP socket semantics and platform capabilities including the Windows Filtering Platform (WFP), and works on both non-virtualized and virtualized operating system instances.

The TCP loopback interface provides a simple local IPC mechanism for processes on the same operating system instance, and it can easily be switched to a remote IPC mechanism by simply changing the destination IP address.

Unfortunately, Fast TCP Loopback is not transparent: applications must issue a WSAIoctl system call on the sockets of both sender and receiver, so it is not backward-compatible with existing bandwidth-measuring applications such as PsPing and PCATTCP.
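
To illustrate the opt-in, here is a minimal sketch in Python, whose socket module exposes the control code as socket.SIO_LOOPBACK_FAST_PATH (Windows only, Python 3.6 and later). The helper name try_enable_loopback_fast_path is my own. I am assuming the call has to be made before connect() on the client and before listen() on the server, and that it simply fails on systems without the feature, such as Windows 7.

```python
import socket


def try_enable_loopback_fast_path(sock):
    """Try to opt a TCP socket into the loopback fast path.

    Returns True if the ioctl succeeded, False if the platform or the
    Python build does not offer it or the OS rejected the request.
    """
    control = getattr(socket, "SIO_LOOPBACK_FAST_PATH", None)
    if control is None or not hasattr(sock, "ioctl"):
        return False  # not Windows, or a Python version without the constant
    try:
        sock.ioctl(control, True)
    except OSError:
        return False  # e.g. an OS older than Windows 8 / Server 2012
    return True


if __name__ == "__main__":
    # Assumption: both ends opt in before the connection is established.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("listener fast path:", try_enable_loopback_fast_path(listener))
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("client fast path:", try_enable_loopback_fast_path(client))
    client.connect(listener.getsockname())

    client.close()
    listener.close()
```

A benchmark would need this call on both of its sockets to measure the fast path, which is why unmodified tools do not benefit from it.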

In my own tests on Windows 7, I have not fathomed all the mysteries surrounding NetDMA, but I have managed to briefly turn it on, with the immediate benefit of doubling my bandwidth as measured by PsPing. But as NetDMA did not survive a reboot on that computer, I do not recommend depending on it for throughput even on computers that theoretically support it.