Dropped packets on receive only, Server 2008 only, when the network speed is 100 Mb/s

I have a really strange one.

I have packet loss with excessive 'TCP Dup ACK' and 'TCP Fast Retransmission' entries when I download files (and only when I download) from two different Windows 2008 servers. Upload speed is fine.

This ONLY occurs if the client computer (Win7) is connected at 100 Mb/s. At 1 Gb/s there are no errors and I get full speed. If I set the client NIC to 100 Mb/s, I get a lot of 'TCP Dup' errors and the download speed drops to around 2-5MB/s. Upload speed is 10MB/s or above.

This only happens to the Windows 2008 Server boxes (Dell, but different hardware). This problem does not occur if I transmit between the Win7 clients and the Linux servers.

It's like Server 2008 is unable to scale the TCP window properly, overloads the switch or something, then pauses traffic for a bit.
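To make that hunch a bit more concrete, here is a rough back-of-the-envelope sketch (Python) of how much data a switch would have to queue when a sender bursts at 1 Gb/s toward a 100 Mb/s port. The 64 KB burst size is only an assumed figure for illustration, not something measured on this network, and the function name is made up.

```python
def backlog_bytes(burst_bytes, in_rate_mbps, out_rate_mbps):
    """Bytes left queued in the switch once a back-to-back burst has fully arrived.

    While the burst arrives at in_rate, the egress port drains at out_rate,
    so only the difference piles up in the buffer.
    """
    if out_rate_mbps >= in_rate_mbps:
        return 0.0  # egress keeps up, nothing accumulates
    return burst_bytes * (1 - out_rate_mbps / in_rate_mbps)


# Assumed 64 KB burst from a 1 Gb/s server port toward a 100 Mb/s client port.
print(backlog_bytes(64 * 1024, 1000, 100))
# Same burst when both ports run at 100 Mb/s: nothing has to be queued.
print(backlog_bytes(64 * 1024, 100, 100))
```

If the buffer available to that port is smaller than the backlog, the tail of the burst is dropped, which the client would see as duplicate ACKs and fast retransmissions.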

Parts of the network run at 100Mb/s due to older equipment, so this is really causing a problem in some buildings.

I have uploaded a pcap file from the client here. https://dl.dropboxusercontent.com/u/24907255/slow.pcap.gz

It shows a 50MB file being written to the server, then read back from the server with the errors.

Thanks for any help. I am stumped.


11/28/13 More Information.

I shut down the entire network so that only one client and one server are on the network. No change in the problem.

If I set every interface (server, client and Cisco 2960 switch) to 100 Mb/s full duplex, the problem goes away. If I set the server and switch interface to auto or 1 Gb/s, the problem is back.

If I bypass the switch with a Netgear 10/100 switch and set both client and server to auto, I have no problems.

I did discover this. In the normal setup, with the server-to-switch link at 1 Gb/s, if I plug the Netgear 10/100 switch in between the client and the Cisco switch, my speed problem gets even worse. Speeds go from 5-7MB/s to 2-3MB/s, and yes, I have tried fixed and auto network speeds. This would explain why some of the buildings that have a two-switch hop between them and the main Cisco switch have more of a speed problem.

On to pinging. With everything at 1 Gb/s, I can ping with a full-size payload (ping -l 65500) and it works. With the client at 100 Mb/s, the max size I can ping is 17752. Any more and it fails, to the Windows servers only; no problem on the Linux boxes. With the Netgear 10/100 between the server and client, no problems pinging at 65500.
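For what it's worth, the 17752 figure looks less random once you count IP fragments. A quick sketch (Python), assuming a standard 1500-byte MTU, a 20-byte IPv4 header without options, and the 8-byte ICMP echo header; none of these were verified on the wire here, and the function name is made up:

```python
MTU = 1500          # assumed Ethernet MTU
IP_HEADER = 20      # assumed IPv4 header without options
ICMP_HEADER = 8     # ICMP echo header


def fragments_for_ping(payload_bytes):
    """Number of IPv4 fragments needed for a ping with the given -l payload size."""
    per_fragment = MTU - IP_HEADER          # 1480 bytes of IP payload per fragment
    total = payload_bytes + ICMP_HEADER     # the ICMP header is part of the IP payload
    return -(-total // per_fragment)        # integer ceiling division


for size in (17752, 17753, 65500):
    print(size, fragments_for_ping(size))
```

Under those assumptions 17752 bytes fills exactly 12 fragments and one more byte needs a 13th, so the limit may really be a burst of 12 back-to-back frames rather than a byte count.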


Update 3

I swapped in a PowerConnect 2748 switch. Same problem with the server at 1 Gb/s and the client at 100 Mb/s. I can ping over 17752 now, though. Strange. So I don't think it's the Cisco switch.


Update 4. I am trying to get some hard numbers by using iperf. All systems are connected to the same switch, with the client set to 100 Mb/s and running the command iperf.exe -c -u -b 10m, so it is sending to the server. One server is 2008 with no load on it right now; the other is an Ubuntu box with a load average of 0.20.

At 10m

  • Linux: jitter 0.022 ms, packet loss 0/8505
  • Server 2008: jitter 1.859 ms, packet loss 68/8505

Pushing it to 100m

  • Linux: jitter 0.445 ms, packet loss 0/26634
  • Server 2008: jitter 0.542 ms, packet loss 94/26596

Now for stats sending TO the client at 10m

  • Linux: jitter 0.271 ms, packet loss 0/8500 (0%), 1 datagrams received out-of-order
  • Server 2008: jitter 0.063 ms, packet loss 20/8505 (0.24%)

Pushing it to 100m

  • Linux: jitter 0.230 ms, packet loss 4083/85443 (4.8%), 1 datagrams received out-of-order, 95.7 Mb/s
  • Server 2008: jitter 0.237 ms, packet loss 28174/81718 (47%), 51.1 Mb/s

So Server 2008 is poor in general, but you can see the huge packet loss (47%) when the connection is pushed to the client's 100 Mb/s limit.
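The percentages can be rechecked from the lost/total counts. A minimal sketch (Python) using only the counts quoted in the iperf runs above; the labels and the function name are just for illustration:

```python
def loss_percent(lost, total):
    """Datagram loss as a percentage of the datagrams counted by iperf."""
    return 100.0 * lost / total


# (label, lost, total) copied from the iperf results quoted above
runs = [
    ("to server, 10m, Linux", 0, 8505),
    ("to server, 10m, Server 2008", 68, 8505),
    ("to server, 100m, Linux", 0, 26634),
    ("to server, 100m, Server 2008", 94, 26596),
    ("to client, 10m, Linux", 0, 8500),
    ("to client, 10m, Server 2008", 20, 8505),
    ("to client, 100m, Linux", 4083, 85443),
    ("to client, 100m, Server 2008", 28174, 81718),
]

for label, lost, total in runs:
    print(f"{label}: {loss_percent(lost, total):.2f}%")
```

Note that the counts quoted for the last Server 2008 run work out to roughly 34.5% rather than the 47% quoted, so one of those figures may have been copied down wrong. Either way it is far above the Linux result.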


Update 5.

When I tested with the PowerConnect 2748 switch, I used different Cat5 cables between the server and switch and between the client and switch. This should rule out cabling or switch issues.

I have two Windows 2008 Servers in this environment, installed at different times and on different hardware. The only thing they share is a Broadcom-branded NIC, but the chipset is different. Both experience the same problem, but I am doing my main testing on one so that if something goes wrong, the other will still work.

The one server has a built-in BCM5709C with two ports and an add-on card (PCI Express, I think), also with the same BCM5709C chipset and two ports. I have tried all of them and the problem still exists. So this should rule out any hardware problems.


Update 6, 12/3/13. I installed the Intel NIC. No change. I played around with the CTCP settings and no change there. I even turned off SMB2 and no difference.

I did some more testing at 100 Mb/s. Copying a 3GB ISO image TO the server, drag and drop, averages out at 10MB/s. Copying the same 3GB ISO image FROM the server averages out at 6.3MB/s.

With all network interfaces set to auto and at 1 Gb/s: copying the ISO TO the server averages 101MB/s. Copying the ISO FROM the server averages 57MB/s.

So read speeds from the server are almost half the write speeds.
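To put those copy speeds next to the raw line rate, here is a small sketch (Python). It assumes 1 MB = 1,000,000 bytes and ignores Ethernet/IP/TCP/SMB overhead, so the fractions are only approximate; the function name is made up.

```python
def link_utilization(throughput_mbytes_per_s, link_mbits_per_s):
    """Fraction of the raw line rate used by a copy running at the given MB/s."""
    line_rate_mbytes = link_mbits_per_s / 8   # 8 bits per byte, protocol overhead ignored
    return throughput_mbytes_per_s / line_rate_mbytes


# (description, measured MB/s, link speed in Mb/s) from the copy tests above
copies = [
    ("100 Mb/s, TO server", 10, 100),
    ("100 Mb/s, FROM server", 6.3, 100),
    ("1 Gb/s, TO server", 101, 1000),
    ("1 Gb/s, FROM server", 57, 1000),
]

for label, mbytes, link in copies:
    print(f"{label}: {link_utilization(mbytes, link):.0%} of line rate")
```

By this rough measure the writes reach about 80% of line rate at both speeds, while the reads sit near 50% or below.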


Solution 1:

This sounds like a speed/duplex mismatch causing collisions and retransmits. Misconfiguration between the server and the other side could cause this. Another reason for the mismatch could be failing autonegotiation.

Make sure both ends of the connection are configured identically regarding speed and duplex.

Solution 2:

I believe you should investigate if any of the NIC driver/Windows NDIS offload settings relate to your problem. I am most suspicious of the LSO (Large Send Offload) function as I've seen it totally wreck a service (Dell server w. Broadcom NIC) in a manner which defied all troubleshooting book definitions of anything.

The actual effect of LSO, when it disrupts rather than enhances, is that the LSO engine may pass larger data frames than the switch supports. This causes the switch to silently discard those frames. Needless to say, this causes performance degradation and packet loss. The failure can be immediate, but can also be intermittent, making it tremendously difficult to troubleshoot. This is described in detail here: Large Send Offload and Network Performance

Disclaimer: this is just a best-effort set of thoughts on a possible angle on your problem. Implementing any one of the changes below will disrupt your network communication. The computer should be restarted after applying any of the settings. I copy/paste the most interesting settings for reference, but the links contain all the hardcore info and caveats. I most strongly recommend using the official docs as the basis for change and this post at most as a checklist.

Before proceeding with any of this, back up your registry key of:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

One uncool reason is an official bug, described below, which changes some unrelated values when certain settings are set through the command line.

I freely admit that where settings are present both in the NIC driver GUI and in Windows itself, I never really got clarity on whether one has to disable them in both the GUI and through Windows CMD/Registry, or whether one suffices. The blogs I've read which presented an answer have been inconsistent with regard to some minor detail or other, so I never was sure. Nowadays I attempt the change everywhere I find the option for whichever setting I'm focusing on. The GUI options are not presented here, but are described in the official docs.

Also, different NIC drivers for the same card may present varying granularity in the advanced settings in the GUI.

Disabling Task Offloading

This registry setting disables task offloading as defined in Using Registry Values to Enable and Disable Task Offloading.

HKLM\System\CurrentControlSet\Services\TCPIP\Parameters\DisableTaskOffload
Setting this value to one disables all of the task offloads from the TCP/IP
transport. Setting this value to zero enables all of the task offloads.

If the above setting has any effect you could try going granular as specified in the link. There are quite a number of settings governing this so I won't paste them all in.

I'll supply the LSO ones though:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\LsoV1IPv4
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\LsoV2IPv4
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\LsoV2IPv6

For all three: Enabled = 1(default). Disabled = 0.
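As a rough illustration, the sketch below (Python) only builds the `reg` command lines for backing up the key and setting the three LSO values to 0, and prints them for review instead of running them. The value names and their location are taken exactly as listed above and should be checked against the official docs first; the backup file name and the helper names are made-up placeholders.

```python
# Key path as given in this post; verify against the official documentation.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def export_command(backup_file):
    """Command line that exports the whole key to a .reg file (/y overwrites silently)."""
    return f'reg export "{KEY}" "{backup_file}" /y'


def set_dword_command(value_name, data):
    """Command line that sets a REG_DWORD value under KEY (/f skips the prompt)."""
    return f'reg add "{KEY}" /v {value_name} /t REG_DWORD /d {data} /f'


# Back up first, then disable the three LSO values (0 = disabled, per the text above).
commands = [export_command("tcpip-parameters-backup.reg")]  # hypothetical file name
commands += [set_dword_command(name, 0)
             for name in ("LsoV1IPv4", "LsoV2IPv4", "LsoV2IPv6")]

for command in commands:
    print(command)
```

Printing rather than executing is deliberate: it lets you read each line, run it by hand from an elevated prompt, and restart afterwards as noted in the disclaimer.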

Disabling connection offloading

As defined in Using Registry Values to Enable and Disable Connection Offloading.

HKLM\System\CurrentControlSet\Services\TCPIP\Parameters\TCPConnectionOffloadIPv4
Describes whether the device enabled or disabled the offload of TCP connections
over IPv4. Enabled = 1 (Default). Disabled = 0.

HKLM\System\CurrentControlSet\Services\TCPIP\Parameters\TCPConnectionOffloadIPv6
Describes whether the device enabled or disabled the offload of TCP connections
over IPv6. Enabled = 1 (Default). Disabled = 0.

Disabling TCP Chimney, TOE and TSO

As specified in How to Disable TCP Chimney, TCPIP Offload Engine (TOE) or TCP Segmentation Offload (TSO) Note the Win2008 hotfix

and in Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008.

Windows 2008 Server:
If the operating system is Microsoft Windows Server 2008 (any version
including R2), run the following from a Command prompt:

1. netsh int tcp set global chimney=disabled
2. netsh int tcp set global rss=disabled
3. netsh int tcp set global netdma=disabled

Note: To display current global TCP settings, use the net shell command:
netsh int tcp show global

4. Restart the server.

Note: Microsoft has identified an issue running the netsh command to set global
TCP parameters on Windows Server 2008 and Vista machines.  Some global
parameters, such as TCPTimedWaitDelay, can be changed from their default or
manually set values to 0xffffffff.  Before running the above command, Symantec
recommends reviewing Microsoft KB Article 967224 (support.microsoft.com/kb/967224).
Upon completion of the above command's execution, Symantec also recommends
reviewing the TCP Parameters noted in the KB Article and applying the hotfix from
the article if needed.

The hotfix describes the issue thus:

After you run the command, the values of the following unrelated settings are
changed to 0xFFFFFFFF:
KeepAliveInterval
KeepAliveTime
TcpTimedWaitDelay

In addition, the "TcpMaxDataRetransmissions" are changed to 0xFF.
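A small sketch (Python) of how one might check for that pattern after running netsh. It works on a plain dictionary of value names and numbers that you have read out of the Tcpip\Parameters key by whatever means; the sample values and the function name are hypothetical.

```python
# Values the hotfix description says the affected settings end up with.
SENTINELS = {
    "KeepAliveInterval": 0xFFFFFFFF,
    "KeepAliveTime": 0xFFFFFFFF,
    "TcpTimedWaitDelay": 0xFFFFFFFF,
    "TcpMaxDataRetransmissions": 0xFF,
}


def clobbered_values(values):
    """Return the sorted names whose current value matches the pattern from the hotfix text."""
    return sorted(name for name, bad in SENTINELS.items()
                  if values.get(name) == bad)


# Hypothetical snapshot of the key, for illustration only.
sample = {
    "KeepAliveTime": 0xFFFFFFFF,
    "TcpTimedWaitDelay": 30,
    "TcpMaxDataRetransmissions": 0xFF,
}
print(clobbered_values(sample))  # ['KeepAliveTime', 'TcpMaxDataRetransmissions']
```

A match does not prove the bug was hit, since someone could have set those values on purpose, but it is a cheap thing to compare against the backup.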

Again, one may therefore wish to back up the entire registry key before doing anything:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

If you google your problem together with the offloading highlights from above, you'll find no end of posts, articles and blogs describing similar issues due to NIC offloading. But if it still doesn't work, then I guess you can move on up the stack to try other things out, because it isn't due to a half-broken cable, NIC or switchport, right?

Solution 3:

Always look at the networking device for clues. So, if it's Cisco, do a "show interfaces f0/11" or whatever it may be in your case. Retransmits can also be due to a bad Ethernet port, NIC or cable, for example because of crosstalk. If that's the case, "show int" on the switch should show you these error stats, and they will be obviously way too high.

EDIT: as this is Microsoft, it's most likely that's your problem, but other than that, in general, start at layer 1 (make sure the physical cables are good) and work your way up the stack: layer 2 (speed/duplex/MAC address filtering), then layer 3 (IP/UDP/TCP firewalling), etc.

Solution 4:

This can also be caused by "advanced" NIC attributes, like the power management ones or IRQ priority, assuming you have the same version of the drivers. Go to:

Device Manager -> Network adapters -> Properties for the NIC -> Advanced tab.

Check and compare all values here.