Large delay when fetching a page from a particular site

I have the following problem: when I retrieve a page from Hackage, I get a large delay (about 30 seconds). Further requests are fast, but if I don't connect to it for a couple of minutes, the problem comes back.

What's interesting about this problem is:

  • it is specific to this particular site (Hackage) — I don't get a similar problem with any other site (and I visit quite a few);
  • it seems to be specific to my ISP — when I connect from other places, there's no such problem;
  • it's not related to DNS or connectivity problems — in fact, the TCP connection is established quickly; it's the HTTP response that takes too long, as can be seen from the following sample packet capture:

      1 0.000000000 192.168.1.101 -> 66.193.37.204 TCP 66 41518 > http [SYN] Seq=0 Win=13600 Len=0 MSS=1360 SACK_PERM=1 WS=16
      2 0.205708000 66.193.37.204 -> 192.168.1.101 TCP 66 http > 41518 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1440 SACK_PERM=1 WS=128
      3 0.205759000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [ACK] Seq=1 Ack=1 Win=13600 Len=0
      4 0.205846000 192.168.1.101 -> 66.193.37.204 HTTP 158 GET /packages/hackage.html HTTP/1.1 
      5 0.406461000 66.193.37.204 -> 192.168.1.101 TCP 54 http > 41518 [ACK] Seq=1 Ack=105 Win=5888 Len=0
      6 28.433860000 66.193.37.204 -> 192.168.1.101 TCP 1494 [TCP segment of a reassembled PDU]
      7 28.433904000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [ACK] Seq=105 Ack=1441 Win=16480 Len=0
      8 28.434211000 66.193.37.204 -> 192.168.1.101 HTTP 1404 HTTP/1.1 200 OK  (text/html)
      9 28.434228000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [ACK] Seq=105 Ack=2791 Win=19360 Len=0
     10 28.434437000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [FIN, ACK] Seq=105 Ack=2791 Win=19360 Len=0
     11 28.635146000 66.193.37.204 -> 192.168.1.101 TCP 54 http > 41518 [FIN, ACK] Seq=2791 Ack=106 Win=5888 Len=0
     12 28.635191000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [ACK] Seq=106 Ack=2792 Win=19360 Len=0
    

    (packet capture in pcap-ng format). This capture shows what happens during a simple curl http://hackage.haskell.org/packages/hackage.html.
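
For reference, the pause can also be located mechanically from the text summary. This is only a minimal Python sketch, assuming tshark-style lines like the ones above (frame number, then relative time in seconds); the name `largest_gap` is purely illustrative and only three frames of the capture are repeated here.

```python
import re

# Matches the leading "<frame number> <relative time in seconds>" of a
# tshark-style summary line such as those in the capture above.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\d+\.\d+)\s")


def largest_gap(lines):
    """Return (gap_seconds, frame_before, frame_after) for the longest pause.

    Returns None when fewer than two frames could be parsed.
    """
    frames = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            frames.append((int(m.group(1)), float(m.group(2))))
    best = None
    for (n_prev, t_prev), (n_cur, t_cur) in zip(frames, frames[1:]):
        gap = t_cur - t_prev
        if best is None or gap > best[0]:
            best = (gap, n_prev, n_cur)
    return best


# Frames 4 to 6 copied from the capture above.
capture = """\
      4 0.205846000 192.168.1.101 -> 66.193.37.204 HTTP 158 GET /packages/hackage.html HTTP/1.1
      5 0.406461000 66.193.37.204 -> 192.168.1.101 TCP 54 http > 41518 [ACK] Seq=1 Ack=105 Win=5888 Len=0
      6 28.433860000 66.193.37.204 -> 192.168.1.101 TCP 1494 [TCP segment of a reassembled PDU]
""".splitlines()

gap, before, after = largest_gap(capture)
print(f"longest pause: {gap:.3f} s between frames {before} and {after}")
# prints: longest pause: 28.027 s between frames 5 and 6
```

The pause sits between the server's ACK of the GET request and the first byte of the response, which is why I believe the delay is on the server side of the HTTP exchange rather than in connection setup.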

It also doesn't matter that I'm behind a router — it's the same when I connect directly. The connection type is PPPoE.

I reproduced the problem on 3 computers that run Linux and Windows.

How can I diagnose such a problem?


Solution 1:

"30 seconds" and "after two minutes" are a dead ringer for a DNS issue to me.

If we suppose that the page you are connecting to does something like a reverse DNS lookup on the connecting IP address, and that query fails for some reason, you would see:

  • the TCP connection is established almost instantly, since the server does no DNS checks at that stage;
  • the script runs a DNS query and gets stuck;
  • after 30 seconds the default timeout expires and the script carries on (you are now "Unknown");
  • on subsequent requests the negative DNS result is still cached, so the lookup step completes in next to no time;
  • once the negative-cache timeout expires (RFC 2308), which is anything between 2 and 5 minutes, the next connection triggers a new query and the story repeats.

...and these are exactly the symptoms you are describing.
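
To make the timeline concrete, here is a toy model of that hypothesis in Python. It is only a sketch of the reasoning, not of how Hackage actually works: the class name `ReverseLookupModel`, the 30-second lookup timeout and the 120-second negative-cache lifetime are all assumptions chosen to match the symptoms.

```python
class ReverseLookupModel:
    """Toy model of a server that does a reverse DNS lookup per client IP."""

    def __init__(self, lookup_timeout=30.0, negative_ttl=120.0):
        self.lookup_timeout = lookup_timeout  # assumed resolver timeout (s)
        self.negative_ttl = negative_ttl      # assumed negative-cache lifetime (s)
        self.cached_failure_until = None      # failure is cached until this time

    def request_delay(self, now):
        """Return the extra delay (s) seen by a request arriving at time `now`."""
        if self.cached_failure_until is not None and now < self.cached_failure_until:
            return 0.0  # failure still cached: no new lookup, fast response
        # Cache is empty or expired: the lookup runs and times out, and the
        # failure is then cached for `negative_ttl` seconds from that point.
        self.cached_failure_until = now + self.lookup_timeout + self.negative_ttl
        return self.lookup_timeout


model = ReverseLookupModel()
arrivals = [0, 40, 100, 300]  # request arrival times in seconds
delays = [model.request_delay(t) for t in arrivals]
for t, d in zip(arrivals, delays):
    print(f"request at t={t:>3}s -> extra delay {d:.0f}s")
```

With these assumed numbers the first request waits 30 seconds, the requests at 40 s and 100 s are fast, and the request at 300 s is slow again, which is the pattern described in the question.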

You could try running a reverse DNS query from another ISP (say, ISP2) on the IP address you get from ISP1. It is not conclusive proof, but I expect a high likelihood that the query will take 30 seconds to complete. That would mean that ISP1's DNS server is having problems answering queries from the outside.
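
One way to time that query is a short Python script run from a machine on ISP2. This is a minimal sketch: `timed_reverse_lookup` is a hypothetical helper name, it relies on the system resolver (and therefore on that resolver's own timeout settings), and the `resolver` and `clock` parameters exist only so the function can be exercised without network access.

```python
import socket
import time


def timed_reverse_lookup(ip, resolver=socket.gethostbyaddr, clock=time.monotonic):
    """Return (hostname_or_None, elapsed_seconds) for a reverse lookup of `ip`."""
    start = clock()
    try:
        # gethostbyaddr returns (hostname, alias_list, ip_address_list).
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError;
        # a failed or timed-out lookup ends up here.
        hostname = None
    return hostname, clock() - start
```

Call it as `timed_reverse_lookup("203.0.113.7")`, replacing that placeholder (a documentation-only address) with the public IP address ISP1 assigned to you. An elapsed time close to 30 seconds with no hostname returned would support the theory; a fast answer either way would argue against it.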

Another possible cause is that ISP1's DNS is being firewalled out by Hackage for some (likely mistaken) reason (in my outfit, the reason would be "a trigger-happy netadmin", and I could name names). In that case you would have a much harder time diagnosing it, since any tests through ISP2 would return nothing unusual; you'd have to escalate this to Hackage.