nginx clobbering sftp traffic during peak times - is tc the answer?

A couple of things jump out at me...

  • You're not maxing out or approaching your bandwidth limits, are you?
  • Have you looked at system entropy pool levels during the period of slow sftp performance (check /proc/sys/kernel/random/entropy_avail)? E.g. are your nginx sessions doing a lot of SSL requests? That can have a clear effect on other services that use encryption.
  • There are some sysctl.conf tuning parameters that may help (tcp window size?), but sftp isn't terribly efficient. Is scp an option? How large are the files?
  • DNS? Are you encountering reverse-lookup delays? Do you have any control over who's connecting to you? If it's predictable, try a stub entry for the source IPs in /etc/hosts to see if that helps.
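
The entropy check in the second bullet can be scripted so it runs while sftp is actually slow. This is only a minimal sketch: the file path comes from the bullet above, but the function names and the 200-bit cutoff are illustrative assumptions, not kernel constants or recommended values.

```python
from pathlib import Path
import time

# Linux-only path, as named in the bullet above.
ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"


def read_entropy(path=ENTROPY_PATH):
    """Return the kernel's current entropy estimate, in bits."""
    return int(Path(path).read_text().strip())


def is_starved(bits, threshold=200):
    """True when the estimate is below `threshold` (an arbitrary, illustrative cutoff)."""
    return bits < threshold


if __name__ == "__main__":
    # Sample once per second for ten seconds during the slow period.
    for _ in range(10):
        bits = read_entropy()
        print(bits, "LOW" if is_starved(bits) else "ok")
        time.sleep(1)
```

If the numbers stay low for as long as sftp is stalled, entropy starvation is a plausible suspect; if they stay high, look elsewhere.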

So it turns out I had at least three different problems masking one another. Here's what I did to solve them:

  1. Prioritize ICMP and incoming/outgoing traffic on port 22 (as shown in my question above). This boosts sftp responsiveness (e.g., ls) and also transmission throughput during peak times.

  2. Solve the entropy shortage by installing the haveged package via Debian backports. This solves the "hang for several minutes at select()" issue. ewwhite++

  3. Add UseDNS no to /etc/ssh/sshd_config and restart sshd so it picks up the change. This solves the sftp delay at 5 second intervals during peak times. Sergey Vlasov++
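
For step 3, here is a small sanity check that the UseDNS setting in the config file is the one you expect. It is a rough sketch, not a full sshd_config parser: `effective_usedns` is a hypothetical helper, it treats everything after a `#` as a comment, and it ignores Match blocks and Include files. It relies on sshd using the first value it finds for each keyword.

```python
def effective_usedns(config_text):
    """Return the first UseDNS value found in sshd_config text (lowercased), or None."""
    for raw in config_text.splitlines():
        # Simplification: drop anything after '#', then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # Keyword and value may be separated by whitespace or a single '='.
        parts = line.replace("=", " ").split()
        if parts[0].lower() == "usedns" and len(parts) > 1:
            return parts[1].lower()
    return None


if __name__ == "__main__":
    with open("/etc/ssh/sshd_config") as fh:
        value = effective_usedns(fh.read())
    print("UseDNS is", value if value is not None else "not set (compiled-in default applies)")
```

Running `sshd -T` as root prints the configuration sshd actually resolved, which is the more authoritative check if it is available to you.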

Remaining mysteries:

  • My host initially configured /etc/resolv.conf for me, adding two of their nameservers as primaries. It's plausible that one or more of these nameservers are overloaded during peak times (i.e., during the day in the US), resulting in the 5 second interval delays I noticed on my sftp latency graphs. However, why does sftp perform a reverse DNS lookup every time I transfer a file? Were these simply cases where the reverse lookup timed out on the initial connect, and then on the first transfer the sftp subsystem tried again and again failed to reverse my IP? Does the system not try the secondary nameservers in this case? At any rate, I've now listed some well-known public nameservers ahead of my host's overloaded ones, so other applications running on this same server won't have problems with DNS during peak times.
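
One way to dig into the DNS mystery is to time reverse lookups from the server itself during peak hours. The sketch below uses the standard `socket.gethostbyaddr`; the `time_reverse_lookup` helper and the 203.0.113.10 address (a documentation-only range) are placeholders, so substitute your own client IP. For what it's worth, the glibc resolver's default per-nameserver timeout is 5 seconds, which would be consistent with delays arriving in 5 second steps, though this does not prove that is what happened here.

```python
import socket
import time


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (elapsed_seconds, hostname or None) for one reverse lookup of `ip`."""
    start = time.monotonic()
    try:
        host = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        host = None
    return time.monotonic() - start, host


if __name__ == "__main__":
    # Placeholder address; replace with the IP you connect from.
    for _ in range(5):
        elapsed, host = time_reverse_lookup("203.0.113.10")
        print(f"{elapsed:6.2f}s  {host}")
        time.sleep(1)
```

Lookups that cluster around 5 or 10 seconds would point at the first nameserver or two timing out before a later one answers.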

  • What is consuming entropy on my server? I couldn't find any evidence that stock nginx (serving static files) calls rand(), and yet that seems to be exactly what is happening. Is it the filesystem (ext3/4) or is another part of the kernel involved somehow?

Anyway, this is good enough for now. Thanks to this community, I was able to solve one of the most annoying and persistent problems I've encountered in over ten years of Unix web server administration.