How do I diagnose network corruption on an Internet path?
I run a few hosts on network A that make requests to servers (which I don't own) on network B, somewhere across the Internet. Unfortunately, many of these requests get corrupted. If I make the requests over unencrypted HTTP, I get strange errors that hint at a corrupt request. If I make the requests over HTTPS, I get SSL-level errors. I can reproduce the problem by running:
sh -e -c 'while true; do curl -sS "$SERVER" > /dev/null; sleep 1; done'
Usually within 20 requests, curl fails with an error like "Unknown SSL protocol error" or "tlsv1 alert decrypt error". I can reproduce this on multiple hosts in network A, accessing multiple servers on network B. But I cannot reproduce from network A to other servers, or from other hosts to network B. In those cases, the loop runs forever with no errors.
So it's pretty clear my TCP stream is getting corrupted between A and B. This has been going on for over 3 days, by the way.
First question: How can this plausibly happen? TCP has packet-level checksums, and corrupt packets that still pass the checksum should be much rarer than what I am seeing. Also, if I run a network capture, I don't see many retransmits (according to Wireshark's tcp.analysis.retransmission filter), which you would expect if packets were being corrupted and failing the TCP checksum. I guess some router must be doing higher-level data mangling (NAT? transparent proxy?) and corrupting the data but fixing the checksum?
Second question: Are there any tools I can use to isolate the problem? I can't find any. If I knew the network topology and I could find HTTPS servers behind each hop between A and B, I could run my test on them. But I don't. What other test would show up network corruption?
I've contacted the owners of network A and network B, but they haven't been helpful so far.
Update: To anyone suggesting what kind of buggy device might be in the path, is there any way to detect this other than contacting the owner?
First of all, it would be useful to see if you can replicate the data corruption using ping rather than TCP. Ping uses an ICMP echo, sends a known payload (which you can even specify if you need to), and reports when the payload comes back corrupt. At least, this is what the man page tells me.
You'll probably want to use a large packet size (maybe 1400 bytes or so) and a low interval, perhaps 0.1 seconds, so that you can reproduce the error in a reasonable amount of time. These settings generate approximately 15 kB/s of traffic in each direction (1400 bytes / 0.1 seconds, plus overhead).
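As a rough sketch of that test, the helper below counts how many replies ping flagged as corrupt. It assumes a Linux iputils-style ping that prints a line containing "wrong data byte" for each mismatched reply; other implementations may word this differently, so check your own ping's output first. The function name count_corrupt_replies and the TARGET variable are made up for illustration.

```shell
# count_corrupt_replies: read ping output on stdin and print the number of
# lines reporting a payload mismatch.
# ASSUMPTION: ping marks corrupt replies with "wrong data byte" (iputils style).
count_corrupt_replies() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c 'wrong data byte' || true
}

# Example invocation (TARGET is a placeholder for the host under test):
#   ping -c 600 -i 0.2 -s 1400 -p a5 "$TARGET" | count_corrupt_replies
# -s sets the payload size, -i the interval in seconds, and -p the fill pattern.
```

Some ping implementations only allow very short intervals for root, so 0.1 seconds may need sudo; 0.2 seconds is used in the example for that reason.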
So why use a ping instead of the TCP connection? Because you can probably ping most hosts in the path between the server and your client, and you can therefore test only part of the path.
Start by testing the full path (all the way to your server) to confirm that the test reproduces your issue. Armed with a traceroute, you can then test only part of the path. Each test can cut your search space in half, and after a few tests you should be able to find the hop that's causing your problems.
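The halving procedure could be sketched like this. It assumes the hops are listed in traceroute order, that every hop before the faulty one tests clean and every hop from it onward shows corruption, and that the last hop (the full path) has already been confirmed to fail. The names first_bad_hop and check_hop are hypothetical, and the check command is whatever ping test you settled on.

```shell
# first_bad_hop CHECK HOP...
# Binary search for the first hop where CHECK fails (returns non-zero).
# ASSUMPTION: hops before the faulty one pass, hops from it onward fail,
# and the final hop is already known to fail.
first_bad_hop() {
    check=$1; shift
    [ "$#" -gt 0 ] || return 1
    lo=1; hi=$#
    # Invariant: the first failing hop is at a position in lo..hi.
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "hop=\${$mid}"            # fetch the mid-th hop argument
        if "$check" "$hop"; then
            lo=$((mid + 1))            # clean so far: fault lies further out
        else
            hi=$mid                    # corrupt: fault is here or closer
        fi
    done
    eval "hop=\${$lo}"
    printf '%s\n' "$hop"
}

# Example check (succeeds when no corrupt reply was reported):
#   check_hop() { ! ping -c 600 -i 0.2 -s 1400 "$1" | grep -q 'wrong data byte'; }
#   first_bad_hop check_hop 192.0.2.1 198.51.100.7 203.0.113.9
```

Keep in mind that some routers rate-limit or ignore ICMP echo, so a hop that does not answer at all tells you nothing either way; skip it and test its neighbours instead.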
Caveat: This will not work quite the way you expect if the corruption is happening on the return path to the test machine rather than on the forward path. Traceroute can only tell you what route your packets take to the server, not the path the returning packets will take, and those paths are not necessarily the same. Still, it should be enough to get you somewhere.
Good luck!
Is anybody along the line using LAN/WAN accelerators? These pieces of hardware sometimes screw up and have to be restarted, and they can be a source of corruption as well as performance issues.
Could there be a flaky IDS/IPS/proxy at either network that is mangling packets only to/from the other network? That would explain why it's not reproducible from or to different hosts.