What can cause network packets consisting of PUU?

We have a system which is suffering from comms outages on a gigabit Ethernet network. The traffic load would only slightly stress a 100 Mb/s network, but there are gigabit switches, NICs and cables throughout, or so I am told by the customer who built the network we are plugging into.

We plugged in a laptop running Wireshark via a 100baseT hub and found that it reported lots of "Ethernet II" packets where the raw data, when displayed as ASCII, basically looks like this:

PUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

Naturally I immediately named this issue "Network PUU" and many giggles ensued. We're all in our forties or so, but I guess some of us never grow up (guilty!)

Anyway, more seriously, other perfectly valid packets were being corrupted by this data. IPv4 headers were getting bytes replaced with U bytes, and there was also data corruption that caused the software to reject the data even when the IP checksums still matched. We are pretty sure that this data spewing onto the network is causing the comms outages. What we don't know is where it might be coming from.

Has anyone ever seen this happen before? Did you solve it? Did you figure out where it came from?

====EDITED====

Added mention of the hub to the original description since, judging from the comments below, it is the most likely source of the corruption! The tool we used to try and find the network issue appears to have added a new and worse network issue.


Solution 1:

"Anyway, more seriously, other perfectly valid packets were being corrupted by this data. IPv4 headers were getting bytes replaced with U bytes as well as there being data corruption which would cause the software to reject the data, even if the IP checksums didn't fail to match."

It's surprising that plain alternating bits (U is ASCII 0x55, or 01010101b) would make up valid Ethernet frames, let alone valid IP packets. If this corruption also creeps into otherwise intact frames or packets, the most likely cause is a faulty switch (bad buffer memory) or a faulty host (NIC or RAM).
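To make the bit pattern concrete, here is a minimal Python sketch. The helper `longest_run` and the sample payload are made up for illustration; they are not taken from the capture. It shows the bit patterns of `U` and `P` and measures the longest run of 0x55 in a buffer, which is one way you might flag suspicious frames in exported capture data.

```python
def longest_run(data: bytes, value: int = 0x55) -> int:
    """Return the length of the longest run of `value` in `data`."""
    best = current = 0
    for byte in data:
        current = current + 1 if byte == value else 0
        best = max(best, current)
    return best


# 'U' is ASCII 0x55: strictly alternating bits.
u_bits = format(ord("U"), "08b")
# 'P' is ASCII 0x50: the same pattern with the low nibble cleared.
p_bits = format(ord("P"), "08b")

# Hypothetical payload resembling what Wireshark displayed.
sample = b"P" + b"U" * 41

print("U =", u_bits)
print("P =", p_bits)
print("longest 0x55 run:", longest_run(sample))
```

A long run of 0x55 inside what should be an IP header or application payload is a strong hint that the frame was damaged rather than sent that way.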

If frame data is corrupted in transit, on the cable, the FCS will almost certainly fail to verify, and the very next switch will drop the frame. However, if such a frame travels through the network with a valid FCS, it must have been corrupted before that FCS was calculated, which points to a defective switch or host.
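The difference between the two cases can be sketched in Python. This assumes the usual representation of the Ethernet FCS as a CRC-32 (the same one `zlib.crc32` computes) appended least significant byte first. The frame contents and the helper names `append_fcs` and `fcs_ok` are hypothetical.

```python
import struct
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Append a 4-byte FCS (CRC-32, least significant byte first)."""
    return frame + struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC over everything except the last 4 bytes and compare."""
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) == fcs


# Hypothetical frame: destination MAC, source MAC, EtherType 0x0800,
# then a payload padded to the 46-byte minimum.
frame = bytes.fromhex("ffffffffffff" "020000000001" "0800") + b"hello".ljust(46, b"\x00")
good = append_fcs(frame)

# Case 1: a byte is damaged on the wire, after the FCS was calculated.
damaged = bytearray(good)
damaged[20] = 0x55

# Case 2: a byte is damaged inside a device, which then computes a fresh FCS.
corrupt_body = bytearray(frame)
corrupt_body[20] = 0x55
resealed = append_fcs(bytes(corrupt_body))

print("intact frame:       ", fcs_ok(good))
print("damaged on the wire:", fcs_ok(bytes(damaged)))
print("damaged, then FCS:  ", fcs_ok(resealed))
```

Only the second case produces a corrupt frame that downstream switches will accept and forward, which is why a valid FCS on garbage data implicates a device rather than a cable.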

You'll need to trace that traffic back to its source. If the source MAC address isn't valid, or can't be looked up on intermediate (unmanaged) switches, you'll have to work your way back along the cables.

Solution 2:

Sounds like you have a bad NIC. If the source MAC address is valid, you can find it by checking the switch MAC tables. If it is corrupted, you'll just have to start unplugging devices to find it.

Solution 3:

That sounds as if you have a device (probably a 100 Mb/s switch) somewhere that can't cope with the traffic and starts corrupting packets when its internal buffers overflow.
(Or it simply has bad RAM.)

It doesn't notice that the packets are corrupt and happily retransmits them with freshly calculated checksums. The bad packets are therefore accepted by other switches (the checksum is good, and switches don't care that the content is nonsense) and forwarded through the entire network.

It is actually worse than that:
Consider how switches learn which device (MAC address) is behind which port. Any packet destined for a MAC address the switch hasn't learned yet is flooded to all switch ports (except the one it came in on). This effectively turns a packet for an unlearned MAC address into a temporary broadcast.
Because your switches will never learn these MAC addresses (after all, they are corruption, not real MAC addresses), they are ALL treated like broadcasts...
This essentially floods the whole network with undeliverable packets.
(And note that normal broadcast-storm mitigations don't work in this case. They only act on REAL broadcast packets, not on these learning floods.)
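The flooding behaviour described above can be illustrated with a toy model in Python. The class `LearningSwitch`, the port numbers and the MAC strings are all invented for this sketch. Real switches also age out entries and treat group addresses differently, which is ignored here.

```python
class LearningSwitch:
    """Toy model of MAC learning on a single switch."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src, dst):
        """Learn the source address and return the ports the frame goes out of."""
        self.mac_table[src] = in_port
        out_port = self.mac_table.get(dst)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        # Known destination: forward to one port (or drop if it is the ingress port).
        return [] if out_port == in_port else [out_port]


sw = LearningSwitch(ports=[1, 2, 3, 4])

first = sw.receive(1, "host-a", "host-b")   # host-b not learned yet: flooded
reply = sw.receive(2, "host-b", "host-a")   # host-a was learned on port 1
second = sw.receive(1, "host-a", "host-b")  # host-b now known on port 2

# A corrupted destination never sends anything, so it is never learned
# and every such frame is flooded, no matter how many arrive.
garbage = [sw.receive(3, "host-c", "de:ad:be:ef:00:01") for _ in range(3)]

print(first, reply, second)
print(garbage)
```

Normal traffic settles down to single-port forwarding after one exchange, while frames with a garbage destination keep hitting every port on every switch they reach.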

The only way to troubleshoot this is to disable one switch at a time and see whether that makes the problem go away. If you can narrow it down to one switch, the culprit will be that switch itself or a device connected behind it.