Is it possible to process millions of datagrams per second with Windows, using a vendor-agnostic API? [closed]

I am investigating whether I can implement an HPC app on Windows that receives small UDP multicast datagrams (mostly 100-400 bytes) at a high rate, subscribing to a dozen or maybe up to 200 multicast groups (i.e. using MSI-X and RSS I can scale out to multiple cores), does some processing per packet, and then sends it out. Sending via TCP, I managed to go as fast as I needed (6.4 Gb/sec) without hitting a wall, but receiving datagrams at high packet rates turned out to be a problem.
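For context, the baseline receive path I am measuring is nothing more exotic than a plain Winsock multicast subscription with a blocking receive loop; a minimal sketch (group address and port are placeholders, error handling reduced to asserts) looks like this:

```cpp
// Minimal Winsock multicast receive loop (baseline, no RIO / kernel bypass).
#include <winsock2.h>
#include <ws2tcpip.h>
#include <cassert>
#pragma comment(lib, "ws2_32.lib")

int main() {
    WSADATA wsa;
    assert(WSAStartup(MAKEWORD(2, 2), &wsa) == 0);

    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    assert(s != INVALID_SOCKET);

    sockaddr_in local = {};
    local.sin_family = AF_INET;
    local.sin_port = htons(12345);             // placeholder port
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    assert(bind(s, (sockaddr*)&local, sizeof(local)) == 0);

    // Join one multicast group; the real app would loop over up to ~200 groups.
    ip_mreq mreq = {};
    inet_pton(AF_INET, "239.1.1.1", &mreq.imr_multiaddr);  // placeholder group
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    assert(setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                      (char*)&mreq, sizeof(mreq)) == 0);

    char buf[1500];
    for (;;) {
        // "Early drop": receive and discard, to measure the pure receive rate.
        int n = recv(s, buf, sizeof(buf), 0);
        if (n == SOCKET_ERROR) break;
    }
    closesocket(s);
    WSACleanup();
}
```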

In a recent test on a high-spec NUMA machine with a dual-port 10GbE NIC, running Windows Server 2012 R2, I was only able to receive a few hundred thousand UDP datagrams per second. I was dropping the datagrams early, i.e. without actually processing the data, to take my app's processing overhead out of the equation and see how fast pure receiving gets. The test used 2x12 cores, and the kernel-side handling of the 12 multicast groups under test seemed to be spread across 8 or 10 cores of one NUMA node (the maximum number of RSS queues was set to 16). Admittedly this was with a .NET app, so native apps should be able to go somewhat faster.

But even Len Holgate only managed to receive UDP packets at around 500 kpps in his high-performance Windows RIO tests, using a UDP payload of 1024 bytes.
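For reference, the RIO receive path those tests exercise looks roughly like the following. This is only a sketch of the API shape (single socket, polled completion queue, minimal error handling), not Len Holgate's actual test code, and the queue and buffer sizes are illustrative rather than tuned:

```cpp
// Sketch of a polled RIO (Registered I/O) UDP receive path, Windows 8 / 2012+.
#include <winsock2.h>
#include <mswsock.h>
#include <cassert>
#pragma comment(lib, "ws2_32.lib")

int main() {
    WSADATA wsa;
    assert(WSAStartup(MAKEWORD(2, 2), &wsa) == 0);

    SOCKET s = WSASocket(AF_INET, SOCK_DGRAM, IPPROTO_UDP,
                         nullptr, 0, WSA_FLAG_REGISTERED_IO);
    assert(s != INVALID_SOCKET);
    // bind() and multicast joins omitted; same as the plain Winsock case.

    // Fetch the RIO extension function table via WSAIoctl.
    RIO_EXTENSION_FUNCTION_TABLE rio = {};
    GUID id = WSAID_MULTIPLE_RIO;
    DWORD bytes = 0;
    assert(WSAIoctl(s, SIO_GET_MULTIPLE_EXTENSION_FUNCTION_POINTER,
                    &id, sizeof(id), &rio, sizeof(rio),
                    &bytes, nullptr, nullptr) == 0);

    // One completion queue, polled (no event/IOCP notification object).
    RIO_CQ cq = rio.RIOCreateCompletionQueue(4096, nullptr);
    RIO_RQ rq = rio.RIOCreateRequestQueue(s, 2048, 1, 16, 1, cq, cq, nullptr);

    // Register one large buffer once and slice it; amortizing buffer
    // registration across all receives is the point of RIO.
    const DWORD kSlice = 2048, kCount = 2048;
    char* mem = new char[kSlice * kCount];
    RIO_BUFFERID bufId = rio.RIORegisterBuffer(mem, kSlice * kCount);

    RIO_BUF* slices = new RIO_BUF[kCount];
    for (DWORD i = 0; i < kCount; ++i) {
        slices[i] = { bufId, i * kSlice, kSlice };
        rio.RIOReceive(rq, &slices[i], 1, 0, &slices[i]);
    }

    RIORESULT results[256];
    for (;;) {
        ULONG n = rio.RIODequeueCompletion(cq, results, 256);
        for (ULONG i = 0; i < n; ++i) {
            RIO_BUF* b = (RIO_BUF*)results[i].RequestContext;
            // results[i].BytesTransferred bytes sit at mem + b->Offset;
            // drop the payload and immediately repost the slice.
            rio.RIOReceive(rq, b, 1, 0, b);
        }
    }
}
```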

In QLogic's whitepaper (the OS under test is not mentioned), the limit for "multi-threaded super-small packet routing" (so presumably including both receiving and subsequent sending?) is put at 5.7 Mpps. In articles on Linux networking, the limits are put at 1 Mpps to 2 Mpps per core (reportedly scaling up more or less linearly), or even 15 Mpps with special solutions that bypass the kernel.
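The per-core Linux figures are typically reached by batching at the syscall boundary, e.g. with recvmmsg(), which drains many datagrams per system call; a minimal sketch of such a batched drop loop (port is a placeholder):

```cpp
// Batched UDP receive on Linux with recvmmsg(): one syscall returns up to
// BATCH datagrams, which is how per-core Mpps figures are usually reached.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/socket.h>
#include <netinet/in.h>
#include <cstring>
#include <cassert>

int main() {
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    assert(s >= 0);

    sockaddr_in local = {};
    local.sin_family = AF_INET;
    local.sin_port = htons(12345);             // placeholder port
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    assert(bind(s, (sockaddr*)&local, sizeof(local)) == 0);

    constexpr int BATCH = 64;
    static char bufs[BATCH][1500];
    mmsghdr msgs[BATCH];
    iovec iovecs[BATCH];
    for (int i = 0; i < BATCH; ++i) {
        iovecs[i] = { bufs[i], sizeof(bufs[i]) };
        std::memset(&msgs[i], 0, sizeof(msgs[i]));
        msgs[i].msg_hdr.msg_iov = &iovecs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    for (;;) {
        // Blocks until at least one datagram arrives, returns up to BATCH.
        int n = recvmmsg(s, msgs, BATCH, 0, nullptr);
        if (n < 0) break;
        // msgs[i].msg_len holds each datagram's size; drop them all here.
    }
}
```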

One example of such a kernel-bypass solution is netmap, whose authors claim that it

can generate traffic at line rate (14.88 Mpps) on a 10GigE link with just a single core running at 900 MHz. That equates to about 60-65 clock cycles per packet, and scales well with cores and clock frequency (with 4 cores, line rate is achieved at less than 450 MHz). Similar rates are reached on the receive side.
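For comparison only (netmap runs on Linux/FreeBSD, not Windows), the netmap user-space receive path reduces to a poll() plus direct ring access; a minimal sketch using the netmap_user.h helpers, with "netmap:eth0" as a placeholder interface name:

```cpp
// Minimal netmap receive loop (Linux/FreeBSD only; shown for comparison).
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>

int main() {
    nm_desc* d = nm_open("netmap:eth0", nullptr, 0, nullptr);
    if (d == nullptr) return 1;

    pollfd pfd = { d->fd, POLLIN, 0 };
    nm_pkthdr h;
    for (;;) {
        poll(&pfd, 1, -1);
        // Drain all packets currently in the RX rings, zero-copy.
        while (unsigned char* buf = nm_nextpkt(d, &h)) {
            (void)buf;  // h.len bytes of raw Ethernet frame; drop it.
        }
    }
    nm_close(d);
}
```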

So how far can I take Windows Server 2012 R2, with good, standard Ethernet NICs, doing standard Ethernet (rather than e.g. Converged Ethernet), using only vendor-agnostic APIs?


Solution 1:

It's possible to bypass the kernel and use NetworkDirect with the HPC Pack installed; see https://msdn.microsoft.com/en-us/library/cc904344(v=vs.85).aspx. I cannot locate any performance data (I suspect it would vary per vendor, since it uses the NIC hardware more directly than the other APIs), but it should be on par with other kernel-bypass solutions on Linux (kernel bypass is kernel bypass).

EDIT: If you are going to refuse to actually use the hardware features provided, don't expect more performance than you can get from a standard NIC using standard drivers. No Converged Ethernet is required (I'm not sure how that came up), but drivers are how vendors expose their hardware features to the OS. I'm not sure why you even refer to the QLogic paper, which specifically refers to using their NIC hardware (something you are saying you don't want to do); the same goes for the netmap paper (which uses modified drivers).