How to debug problems with Unix domain sockets?

Ubuntu Server 10.04.2

$ uname -a
Linux my.local 2.6.32-30-generic-pae #59-Ubuntu SMP 
Tue Mar 1 23:01:33 UTC 2011 i686 GNU/Linux

It seems that my domain socket queue is overflowing, but I can't prove it.

I've got this stack nginx->[spawn-fcgi->multiwatch->]custom-fcgi-service

Nginx communicates with custom-fcgi-service over a Unix domain socket.

Today we got a slight increase in traffic, and suddenly my nginx error.log is full of eels:

2011/04/07 15:31:51 [error] 28187#0: *469350 connect() to unix:/tmp/my.socket 
failed (11: Resource temporarily unavailable) while connecting to upstream, 
client: [IP withheld], server: my.local, request: "GET /myurl HTTP/1.0", 
upstream: "fastcgi://unix:/tmp/my.socket:", host: "example.com"

Some requests make it through, but many return a 5xx error.

If I restart custom-fcgi-service, the error goes away but soon reappears. After inspecting the status of custom-fcgi-service, I'm reasonably sure that it works OK (though it may be too slow for this amount of traffic; that is only a hypothesis).

I've tried doing this:

echo 65535 > /proc/sys/net/unix/max_dgram_qlen

But it did not help much. (The time-to-error may have become longer, I'm not sure, but not enough to fix the problem.)

If I increase the number of worker forks of custom-fcgi-service, the error takes longer to appear, but so far I have not been able to raise the number of workers high enough to fix it for good. CPU, memory, and IO load on that machine are well within limits, so, again, I think that custom-fcgi-service is just being slow on some subsequent network call.

The question is: how do I debug this issue? And if it is indeed the socket queue length, how do I build a sensor that will warn us when we need to fork more custom-fcgi-service workers?


Solution 1:

It seems like you have a problem with connect, not with send. Try increasing the kernel receive backlog:

echo "2000" > /proc/sys/net/core/netdev_max_backlog

or

sysctl -w net.core.netdev_max_backlog=2000

Have you checked system logs (e.g. dmesg)?
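
As for the sensor: since the failure is in connect(), one rough way to detect it is to attempt the same connect() that nginx does and see whether it fails with error 11 (EAGAIN). The following is only a minimal sketch under assumptions, not a tested monitoring tool: the function name probe_unix_socket and the "ok"/"full"/"down" labels are made up for illustration, and it assumes the service listens on a stream socket at the path from your log. On Linux, a non-blocking connect() to a Unix stream socket fails with EAGAIN when the listener's accept queue is full.

```python
import errno
import socket
import sys


def probe_unix_socket(path):
    """Attempt one non-blocking connect() to a Unix stream socket.

    Returns "ok" if the connection was queued or accepted, "full" if the
    kernel answered EAGAIN (the error 11 seen in the nginx log), and
    "down" for any other failure (missing socket file, refused, ...).
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.setblocking(False)  # fail immediately instead of waiting for a slot
    try:
        sock.connect(path)
        return "ok"
    except OSError as exc:
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return "full"
        return "down"
    finally:
        sock.close()


if __name__ == "__main__":
    # The default path is taken from the log above; pass another one as argv[1].
    target = sys.argv[1] if len(sys.argv) > 1 else "/tmp/my.socket"
    print(probe_unix_socket(target))
```

Run it from cron or a monitoring agent and alert on "full". Keep in mind that every successful probe occupies one slot in the accept queue until a worker accepts it, and the worker then sees a connection that was closed without sending a request, so probe sparingly.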