Windows Server 2008 UDP multicast performance problem

I am encountering a strange performance problem on Windows Server 2008 R2 Enterprise SP1.

Here is the setup:

  • Many processes listening to distinct multicast UDP streams (5 multicast groups joined per process), all bound to a single NIC
  • Across processes, all multicasts use the same port range but different multicast group IPs (an important detail, since every receiver on a given port has to bind a reused, SO_REUSEADDR-style, server socket; a minimal sketch of one listener follows this list)
  • Each process receives about 10 Mbit/s of multicast traffic
  • RSS enabled on the NIC, maximum offload settings enabled on both the NIC and the OS, MSI enabled
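
For reference, here is a minimal sketch of what one listener process does at the socket level. The group addresses, port and NIC address are placeholders, Python is used only to keep the sketch short, and it shows a single socket joining all five groups on one port for brevity:

    import socket
    import struct

    GROUPS = ["239.1.1.1", "239.1.1.2", "239.1.1.3", "239.1.1.4", "239.1.1.5"]  # placeholder groups
    PORT = 5000            # placeholder port from the shared range
    NIC_IP = "0.0.0.0"     # replace with the address of the single receiving NIC

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # SO_REUSEADDR so other processes can bind the same port for their own groups
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    for group in GROUPS:
        # ip_mreq = group address + local interface address
        mreq = struct.pack("=4s4s", socket.inet_aton(group), socket.inet_aton(NIC_IP))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, addr = sock.recvfrom(65535)   # the process does nothing but read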

Behaviour:

  • Below 17 listening processes (about 85 joined UDP multicast groups), the kernel CPU impact is negligible.
  • Between 17 and 22 listeners (about 110 joined groups), kernel CPU usage begins to grow slowly but remains acceptable.
  • Above 25 listeners, each additional joined group has a huge impact on kernel CPU time, and the impact hits all RSS-bound CPUs.
  • CPU time per listening process stays near zero (expected, since the processes do nothing but read the multicast data), so the real cost lies in the OS networking components.

What we found:

  • Changing NIC hardware has no impact on the behaviour (tested on an HP NC382i, a Broadcom-based NIC, and an HP NC365T, a quad-gigabit Intel-based NIC).
  • Overall receive bandwidth is not the limiting factor (a single 500 Mbit/s stream does not trigger the CPU load).
  • Reading from the multicast sockets does not seem to be the limiting factor either (we repeated the test with dumb JOIN-only processes that never read the streams and reproduced the CPU load; see the sketch after this list).
  • Splitting the multicast traffic across two NICs seems to limit the CPU load and spread it better; however, this is not a workable setup for us.
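
A JOIN-only test process can be as simple as the following sketch (again with placeholder groups and port): it joins the groups and then just sleeps, never reading.

    import socket
    import struct
    import time

    GROUPS = ["239.1.1.%d" % i for i in range(1, 6)]   # placeholder groups
    PORT = 5000                                        # placeholder port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    for group in GROUPS:
        mreq = struct.pack("=4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    # Never read: the socket buffer simply overflows, yet the kernel CPU
    # cost of the joined groups still shows up.
    time.sleep(3600)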

Problem:

  • We need to be able to listen to at least about 500 multicast streams, and possibly up to 750.
  • The same hardware running Windows XP does not show this kernel CPU behaviour.

Suspected component:

  • NDIS.sys seems to be a good candidate for explaining the CPU usage increase.

Has anyone encountered such problems and can point me in a direction to investigate? I've read everything I could find about Windows Server 2008 network performance enhancements, but it all seems tied to TCP traffic. I've also tried every optimization that can be applied via the registry or netsh.


That's a lot of multicast streams. NICs typically have a fairly low limit on the number of multicast addresses they can filter in hardware; when you exceed it they either drop everything (a poor implementation found on cheap NICs) or pass everything up to the operating system to filter instead. When the operating system is doing that filtering, your processor usage is going to skyrocket.

Aside from investigating different hardware (you list some; you could extend the search to 10GigE-based NICs too), the only real option is to use proxy servers.

By experimentation, find the number of multicast streams a single box can manage reliably, then forward those streams on via TCP to a central server or set of servers. That central server can then use TCP segmentation offload or a full TCP offload engine (TOE) to make the incoming network load insignificant to the processor.
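
A rough sketch of such a proxy, assuming hypothetical group and port numbers and a hypothetical central server address; each multicast datagram is length-prefixed and forwarded over a single TCP connection:

    import select
    import socket
    import struct

    GROUPS = [("239.1.1.%d" % i, 5000 + i) for i in range(1, 6)]   # hypothetical groups/ports
    CENTRAL = ("10.0.0.10", 9000)                                  # hypothetical aggregation server

    # One TCP connection to the central server.
    tcp = socket.create_connection(CENTRAL)

    socks = []
    for group, port in GROUPS:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", port))
        mreq = struct.pack("=4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
        s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        socks.append(s)

    while True:
        ready, _, _ = select.select(socks, [], [])
        for s in ready:
            data, _ = s.recvfrom(65535)
            # Length-prefix each datagram so the central server can re-frame
            # the TCP byte stream back into individual messages.
            tcp.sendall(struct.pack("!I", len(data)) + data)

The central server then terminates only a handful of TCP connections, which offload-capable NICs can handle cheaply.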

I cannot get decent multicast rates with Broadcom hardware at all, due to very poor Windows drivers. It would be interesting to see how Linux performs on the same hardware; that should give you a good indication of the hardware and IP stack quality.

You list Windows XP as working fine; the major difference between Windows Server and Windows XP is the scheduler quantum. Windows Server uses longer quanta, so it might be worth investigating forcing a shorter quantum (if you can even set it).