Is it possible to process millions of datagrams per second with Windows?
Solution 1:
According to Microsoft, tests in their lab showed that "on a particular server in early testing" of RIO, they were able to handle
- 2Mpps without loss in Windows Server 2008R2, i.e. without RIO
- 4Mpps on (pre-release) Windows Server 8 using RIO
Screenshot from that video (44:33):
So the answer to my question, "Is it possible to process millions of datagrams per second with Windows?", would be: Yes, and apparently it was even before RIO, in Windows Server 2008R2.
But official figures, especially for unreleased software, have to be taken with a pinch of salt. With only the sparse information given in this presentation, many questions remain about the test, and hence about how to properly interpret the results. The most relevant ones are:
- Are the figures for Sending? Receiving? Or maybe for Routing (i.e. Receive + Send)?
- What packet size? -> Probably the lowest possible, as is generally done when trying to get pps figures to brag about
- How many connections (if TCP) / packet streams (if UDP)? -> Probably as many as necessary to distribute the workload so all cores present can be used
- What test setup? Machine and NIC specs and wiring
The first one is crucial, because sends and receives require different steps and can show substantial differences in performance. For the other questions, we can probably assume that the smallest packet size, with at least one connection/packet stream per core, was used on a high-spec machine to get the maximum possible Mpps figures.
Edit: I just stumbled upon an Intel document on High Performance Packet Processing on Linux, and according to that, the (Linux)
platform can sustain a transaction rate of about 2M transactions per second
using the standard Linux networking stack (on a physical host with 2x8 cores). A transaction in this request/reply test includes both
- reception of a UDP packet
- subsequent forwarding of that packet
(using netperf's netserver). The test was running 100 transactions in parallel. There are many more details in the paper, for those interested. I wish we had something like this for Windows to compare... Anyway, here's the most relevant chart for that request/reply test:
Solution 2:
tl;dr
To give a definite answer, more tests seem necessary. But circumstantial evidence suggests that Linux is the OS used practically exclusively in the ultra low latency community, which also routinely processes Mpps workloads. That does not mean Mpps numbers are impossible with Windows, but Windows will probably lag behind quite a bit. That needs testing to be ascertained, and e.g. to figure out at what (CPU) cost those numbers can be achieved.
N.B. This is not an answer I intend to accept. It is intended to give anyone interested in an answer to the question some hints about where we stand and where to investigate further.
Len Holgate, who according to Google seems to be the only one who has tested RIO to get more performance out of Windows networking (and published the results), just clarified in a comment on his blog that he was using a single IP/Port combo for sending the UDP packets.
In other words, his results should be somewhat comparable to the single-core figures in tests on Linux. He is using 8 threads, which, without my having checked his code yet, seems harmful for performance when handling just a single UDP packet stream without any heavy processing of the packets; he mentions that only a few threads are actually used, which would make sense. That is despite him saying:
I wasn't trying that hard to get maximum performance just to compare relative performance between old and new APIs and so I wasn't that thorough in my testing.
But what is giving up the (relative) comfort zone of standard IOCP for the more rough RIO world other than "trying hard"? At least as far as a single UDP packet stream is concerned.
I guess what he means, as he did try various design approaches in several tests of RIO, is that he did not, for example, fine-tune NIC settings to squeeze out the last bit of performance. Some of those settings, e.g. the Receive Buffer Size, could potentially have a huge positive impact on UDP receive performance and packet loss figures.
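The Receive Buffer Size mentioned above is a NIC driver setting and is configured outside the application. A related but different knob that an application can turn itself is the per-socket receive buffer (`SO_RCVBUF`). The following is only a minimal sketch of that socket-level option, not of the NIC setting and not of anything Len did; the helper name `udp_socket_with_rcvbuf` and the 4 MiB request are made up for illustration.

```python
import socket


def udp_socket_with_rcvbuf(requested_bytes):
    """Create a UDP socket and ask the OS for a receive buffer of the given size.

    Hypothetical helper for illustration. Returns the socket together with
    the buffer size the OS reports afterwards.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    # The OS is free to clamp or round the request, so read back the value
    # that is actually in effect instead of trusting the requested one.
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted


if __name__ == "__main__":
    requested = 4 * 1024 * 1024  # arbitrary example value
    sock, granted = udp_socket_with_rcvbuf(requested)
    print(f"requested {requested} bytes, OS reports {granted} bytes")
    sock.close()
```

Whether a larger socket buffer actually reduces packet loss at Mpps rates would have to be measured; it only gives the receiver more slack when it briefly falls behind.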
The problem, however, when trying to directly compare his results with those of Linux/Unix/BSD tests is this: most tests, when trying to push the "packets per second" boundary, use the smallest possible packet/frame size, i.e. an Ethernet frame of 64 bytes. Len tested 1024 byte packets (-> a 1070 byte frame), which (especially for No-Nagle UDP) can get you much higher "bits per second" figures, but may not push the pps boundary as far as it could with smaller packets. So it would not be fair to compare these figures as is.
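To make the frame-size point concrete, here is a rough sketch of the arithmetic. It assumes UDP over IPv4 on plain Ethernet (no VLAN tag, no IP options) and a 10 Gbit/s link; the function names are made up for illustration.

```python
# Per-packet overheads in bytes, assuming UDP over IPv4 over plain Ethernet.
UDP_HEADER = 8
IPV4_HEADER = 20       # no IP options
ETH_HEADER = 14        # no VLAN tag
ETH_FCS = 4            # frame check sequence
PREAMBLE_SFD = 8       # sent on the wire, not counted in the frame size
INTERFRAME_GAP = 12    # mandatory idle time between frames, in byte times
MIN_FRAME = 64         # shorter frames are padded up to this size


def frame_size(udp_payload):
    """Ethernet frame size (including FCS) for a given UDP payload size."""
    size = udp_payload + UDP_HEADER + IPV4_HEADER + ETH_HEADER + ETH_FCS
    return max(size, MIN_FRAME)


def line_rate_pps(udp_payload, link_bps=10_000_000_000):
    """Theoretical maximum packets per second on a link of the given speed."""
    wire_bytes = frame_size(udp_payload) + PREAMBLE_SFD + INTERFRAME_GAP
    return link_bps / (wire_bytes * 8)


if __name__ == "__main__":
    for payload in (18, 1024):
        print(payload, frame_size(payload), round(line_rate_pps(payload)))
```

Under these assumptions a 1024 byte payload gives the 1070 byte frame mentioned above, and the theoretical ceiling drops from about 14.88 Mpps for minimum-size frames to about 1.15 Mpps for 1024 byte payloads. A test with large packets therefore cannot show multi-Mpps numbers on such a link, no matter how fast the OS is.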
Summing up the results of my quest into Windows UDP receive performance so far:
- No one is really using Windows when trying to develop ultra low latency and/or high throughput applications; these days they are using Linux
- Practically all performance tests and reports with actual results (i.e. not mere product advertisement) these days are on Linux or BSD (thanks Len for being a pioneer and giving us at least one point of reference!)
- Is UDP (standard sockets) on Windows faster/slower than on Linux? I can't tell yet; I would have to do my own testing
- Is high-performance UDP (RIO vs netmap) on Windows faster/slower than on Linux? Linux easily handles full 10Gb line speed with a single core at 900 MHz. Windows, in the best case published, is able to go up to 43% of line speed, or 492 kpps, for a large UDP packet size of 1024 bytes, i.e. bps figures for smaller sizes will probably be significantly worse, although pps figures will probably rise (unless interrupt handling or some other kernel space overhead is the limiting factor).
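As a sanity check on how the 492 kpps and 43% figures relate, the sketch below converts a packet rate into a share of link capacity. It assumes a 10 Gbit/s link, the 1070 byte frame that a 1024 byte UDP payload produces, and 20 bytes of preamble plus inter-frame gap per frame; the source does not say how the 43% was computed, so treat this as one plausible reading. The function name is made up for illustration.

```python
def link_utilisation(pps, frame_bytes, link_bps=10_000_000_000,
                     wire_overhead_bytes=20):
    """Fraction of the link capacity used by `pps` frames of `frame_bytes`.

    `wire_overhead_bytes` covers preamble/SFD (8) and inter-frame gap (12),
    which occupy the wire but are not part of the frame itself.
    """
    bits_per_second = pps * (frame_bytes + wire_overhead_bytes) * 8
    return bits_per_second / link_bps


if __name__ == "__main__":
    share = link_utilisation(492_000, 1070)
    print(f"{share:.1%} of a 10 Gbit/s link")
```

With these assumptions the result is roughly 43%, which matches the published figure; leaving out the preamble and inter-frame gap gives roughly 42% instead.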
As to why they use Linux: that must be because developing solutions that involve kernel changes, like netmap or RIO, which are necessary when pushing performance to the limits, is near impossible with a closed system like Windows, unless your paychecks happen to come out of Redmond or you have some special contract with Microsoft. That is why RIO is a Microsoft product.
Finally, just to give a few extreme examples of what I discovered was and is going on in Linux land:
Already 15 years ago, some were receiving 680 kpps using an 800 MHz Pentium III CPU with a 133 MHz front-side bus on a 1GbE NIC. Edit: They were using Click, a kernel-mode router that bypasses much of the standard network stack, i.e. they "cheated".
In 2013, Argon Design managed to get
tick to trade latencies as low as 35ns [nano seconds]
Btw they also claim that
The vast majority of existing computing code for trading today is written for Linux on x86 processor architectures.
and Argon use the Arista 7124FX switch, which (in addition to an FPGA) has an OS
built on top of a standard Linux kernel.