Outbound Packets Dropping / Timeouts - Only with Azure

I have an issue with packets dropping to a third party data center in Florida, USA. The issue only occurs on Azure Virtual Machines, no matter which data center the VM is in. I've run the same tests simultaneously from other non-Azure networks, and there is no packet loss. The Azure Virtual Machines were "vanilla" / out of the box, with no software loaded or other customizations / changes.

I've already spoken to the network admins at the data center, and the only packets they are seeing are the ones that don't time out; the packets that time out never reach their firewall. So it sounds like something on the Azure side, especially since the packets consistently drop / time out from multiple Azure data centers / regions. Does anyone know how I might solve this?

The test I was running was a continuous TCP ping (using tcping.exe) to port 80 (since ICMP is blocked on Azure):

tcping -t 216.155.111.149 80
tcping -t 216.155.111.151 80
tcping -t 216.155.111.146 80
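For anyone who wants to repeat the test without tcping.exe, the same idea can be sketched in Python: open a TCP connection, time it, and count the attempts that fail or time out. This is only a rough equivalent, not what tcping does internally; the function names `tcp_probe`, `run_probes` and `loss_percent` and the default timeout are my own assumptions.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None if the attempt fails."""
    start = time.perf_counter()
    try:
        # A completed three-way handshake counts as a successful "ping".
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return None


def run_probes(host, port, count=10, interval=1.0, timeout=2.0):
    """Probe the target `count` times, pausing `interval` seconds after each attempt."""
    results = []
    for _ in range(count):
        results.append(tcp_probe(host, port, timeout))
        time.sleep(interval)
    return results


def loss_percent(results):
    """Percentage of probes that did not complete a connection."""
    if not results:
        return 0.0
    failed = sum(1 for r in results if r is None)
    return 100.0 * failed / len(results)
```

Calling `loss_percent(run_probes("216.155.111.149", 80, count=100))` from an Azure VM and from a non-Azure host at the same time gives a comparable loss figure for both paths.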

Further evidence that the third party data center is not at fault: I can run the same continuous TCP ping from my home computer / work computer and drop no packets. I also set up a VPN tunnel from the Azure VM to a VM at a non-Azure data center, and no packets are dropped. The only time packets are dropped is when the traffic goes out to the internet/WAN directly via Azure.

I know the next step would be some traceroute tests, but since Azure blocks ICMP, I had to use nmap to run a TCP traceroute; the screenshots from those tests are pasted below.

nmap -sS -p 80 -Pn --traceroute 216.155.111.149

[Screenshots: test1, test2, test3, test4]


Solution 1:

As I mentioned in my comment, you're effectively hitting a scenario similar to the one described in this article.

I could easily reproduce your behaviour:

Issue reproduced

And I could easily work around the issue by adding an Instance-Level Public IP to the VM:

Issue solved

It is difficult to say exactly what is going on, as we don't have simultaneous captures, but my understanding is that the edge device (potentially a firewall) on the remote site (www.oandp.com) keeps closed connections in its connection table for longer than Azure does. So when Azure reuses one of the freed (i.e. already used) ports while the remote side still thinks that connection is not fully closed, our SYN packets get dropped.

The ILPIP applies a static NAT, or "one-to-one NAT", so there is no port translation or port reuse (unless your OS does it), which avoids the issue.
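To illustrate the hypothesis, here is a toy simulation in Python. It is not a model of Azure's real SNAT behaviour or of the remote firewall: the `RemoteFirewall` class, the retention time, the port numbers and the source address (taken from the documentation range) are all assumptions made up for the sketch. It only shows that a SYN reusing a recently closed source port is dropped while the remote side still remembers the old connection, and that never reusing a port avoids the drops.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteFirewall:
    """Toy edge device that remembers closed connections for a while."""
    hold_seconds: float  # hypothetical retention time for closed entries
    closed_at: dict = field(default_factory=dict)  # (src_ip, src_port) -> close time

    def accept_syn(self, src_ip, src_port, now):
        closed = self.closed_at.get((src_ip, src_port))
        # Drop the SYN if the previous flow on this tuple is still remembered.
        return closed is None or now - closed >= self.hold_seconds

    def close(self, src_ip, src_port, now):
        self.closed_at[(src_ip, src_port)] = now


def count_dropped_syns(source_ports, interval, firewall, src_ip="203.0.113.10"):
    """Open and immediately close one connection per source port, `interval` seconds apart."""
    dropped = 0
    for i, port in enumerate(source_ports):
        now = i * interval
        if firewall.accept_syn(src_ip, port, now):
            firewall.close(src_ip, port, now)
        else:
            dropped += 1
    return dropped


# Shared NAT that quickly reuses a small set of ports (assumed behaviour).
reused_ports = [1024, 1025] * 3
# One-to-one NAT: no port translation, so each connection gets a fresh port.
fresh_ports = list(range(1024, 1030))

drops_with_reuse = count_dropped_syns(reused_ports, 1.0, RemoteFirewall(hold_seconds=120))
drops_without_reuse = count_dropped_syns(fresh_ports, 1.0, RemoteFirewall(hold_seconds=120))
```

With these made-up numbers, the first two connections succeed and the four that reuse a port are dropped, while the run with fresh ports drops nothing. Only simultaneous captures on both sides could confirm that this is what actually happens.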