Processes hanging indefinitely when reading from network connections
An update to the below:
I have experienced a similar problem on an unrelated script, on a Debian virtual machine in a different datacentre.
This looks suspiciously like the issue described here (and like the person asking that question, I do not have a proxy configured in front of the server).
The main difference to the description below is that when I attach to the hung process, I see a call to recvfrom rather than read:
$ strace -p 17527
Process 17527 attached - interrupt to quit
recvfrom(3,
However Python is not under any impression that it is being proxied:
>>> import os; print os.getenv("HTTP_PROXY"), os.getenv("http_proxy")
None None
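(For completeness, urllib's getproxies() reports every proxy setting it can find in the environment; on an unproxied machine it should come back empty. This is Python 2, as elsewhere in this post:)
>>> import urllib; print urllib.getproxies()
{}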
So I'm still stumped. Sadly, the linked question is without a final answer as well.
(I'm also wondering if this question is related, but it seems unlikely that S3 would be failing to honor Connection: close headers.)
I have several Debian (Wheezy, x86_64) servers all exhibiting the following behavior:
All of the servers have a set of cron jobs that, among other things, pull data from S3. These usually run fine, but occasionally ps aux reveals that some of the jobs started hours or days ago are still running, and have not finished cleanly.
Inspecting them with strace -p <pid> shows that, in all cases, the process is hung on a read call. For example, the output for a process I checked just now was:
$ strace -p 12089
Process 12089 attached - interrupt to quit
read(5,
And checking the open file descriptors gives me this:
$ sudo lsof -i | grep 12089
python 12089 user 5u IPv4 809917771 0t0 TCP my.server.net:35427->185-201.amazon.com:https (ESTABLISHED)
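One detail that may matter here: an ESTABLISHED entry in lsof doesn't prove the peer is still reachable, because Python doesn't enable TCP keepalives by default. If the remote end (or something in between) silently drops the connection, a blocked read can wait forever. A rough sketch of enabling keepalives on a socket follows; the interval values are illustrative, and the TCP_KEEP* constants are Linux-specific:
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific tuning: start probing after 60s of idle time, probe every 10s,
# and give up (erroring out the blocked read) after 5 failed probes.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)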
At first I assumed this was simply due to not setting a read timeout in the Python scripts, but that does not appear to be the case, for a couple of reasons:
- This doesn't happen when the same jobs are run on our OS X boxes (all 10.5, i386), using identical code.
- A variant of the script that does set a timeout (of 60 seconds, using socket.setdefaulttimeout; this is in Python 2.7, but the code base has to be 2.5 compatible) has been hung since yesterday. A sketch of how that timeout is applied follows this list.
- Another process that isn't Python seems to occasionally exhibit similar behaviour. In this case, a Python script is executing an svn up --non-interactive process (using subprocess.Popen, for what that's worth).
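For reference, the timeout variant does roughly the following (a minimal sketch, not the actual job; the URL is a placeholder). The detail that turned out to matter later is that setdefaulttimeout() only affects sockets created after it is called, and anything that sets it back to None re-enables indefinite blocking:
import socket
import urllib2

socket.setdefaulttimeout(60)  # applies to sockets created from this point on

try:
    response = urllib2.urlopen("https://s3.amazonaws.com/some-bucket/some-key")  # placeholder URL
    data = response.read()
except socket.timeout:
    # With the default timeout set, a stalled read raises here instead of hanging forever.
    # (Timeouts during the connect phase surface as urllib2.URLError instead.)
    print "request timed out"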
The situation with that SVN process is similar. Python is waiting for SVN:
$ strace -p 28034
Process 28034 attached - interrupt to quit
wait4(28127,
And SVN is waiting for a read call to complete:
$ strace -p 28127
Process 28127 attached - interrupt to quit
read(6,
And that read is pointing to another external host:
$ sudo lsof -i | grep 28127
svn 28127 user 3u IPv4 701186417 0t0 TCP my.server.net:49299->sparrow.telecommunity.com:svn (ESTABLISHED)
svn 28127 user 6u IPv4 701186439 0t0 TCP my.server.net:49309->sparrow.telecommunity.com:svn (ESTABLISHED)
(There appears to be an svn:externals property set to ez_setup svn://svn.eby-sarna.com/svnroot/ez_setup on the directory being updated; based on their website, I think this is redirecting to telecommunity.com.)
Additional possibly-relevant points:
- The Python environment on the Macs is 2.5. On the Debian boxes, it is 2.7.
- I am not well versed in SVN, and I have no idea if the reason it is hanging is fundamentally the same thing or not. Nor am I entirely sure what the implications of svn:externals are; this was set up before my time.
- The Python scripts themselves are retrieving large-ish (~10MB in some cases) chunks of data from Amazon S3, and this has a tendency to be slow (I'm seeing download times as long as three minutes, which seems long compared to how long it takes the servers, even in different data centers, to communicate with each other). Similarly, some of our SVN repositories are rather large. That is to say, some of these operations are long-running anyway; however, they also seem to get hung for hours or days in some cases.
- On one server, the OOM killer took out MySQL this morning. On closer inspection, memory usage was at 90%, and swap usage was at 100% (as reported by Monit); killing a large backlog of hung Python jobs reduced these stats to 60% and 40% respectively. This gives me the impression that at least some (if not all) of the data is being downloaded/read (and held in memory while the process hangs).
- These cron jobs are requesting a list of resources from S3, and updating a list of MySQL tables accordingly. Each job is started with the same list, so will try to request the same resources and update the same tables.
- I was able to capture some traffic from one of the hung processes; it's all a bit inscrutable to me, but I wonder if it indicates that the connection is active and working, just very, very slow? I've provided it as a gist, to avoid clutter (I should note that this is about two hours' worth of capture): https://gist.github.com/petronius/286484766ad8de4fe20b (This was a red herring, I think. There's activity on that port, but it's not the same connection as the one to S3; it's just other random server activity.)
- I've tried to re-create this issue on a box in another data center (a VM running the same version of Debian with the same system setup) with no luck (I'd thought that perhaps the issue was related to this one, but the boxes experiencing these issues aren't VMs, and there are no dropped packets according to ifconfig). I guess this indicates a network configuration issue, but I'm not sure where to start with that.
So I guess my questions are:
- Can I fix this at a system level, or is this something going wrong with each individual process?
- Is there something fundamentally different about how OS X and Linux handle read calls that I need to know to avoid infinitely-hanging processes?
Can I fix this at a system level, or is this something going wrong with each individual process?
It is difficult to say, because it's unknown what is happening at the protocol level. Basically, read(2) will block indefinitely provided that:
- The TCP connection remains open.
- You expect at least 1 byte of data to arrive.
- The sender is not ready to send you data.
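To make this concrete, here is a minimal, self-contained sketch (in Python, since that is what your scripts use): a local listener accepts a connection but never sends anything, so a recv() on the client side would block forever even though the connection is ESTABLISHED; with a timeout set, it raises socket.timeout instead:
import socket

# A local "peer" that accepts the connection but never sends a byte.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(listener.getsockname())
listener.accept()

# client.recv(1) would now block indefinitely: the connection is open,
# we expect at least one byte, and the sender never sends one.
client.settimeout(5)
try:
    client.recv(1)
except socket.timeout:
    print "read timed out instead of hanging"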
Now, it could be that something is wrong with the process: for example, the other end may be expecting a response from you before it sends more data, or a previous response from the other end may require SVN to do something else before requesting more data. Suppose, for example, an error response came back that should force the client to resend some information.
You cannot fix this gracefully, because it's impossible, from the information you have, to determine what the sender of this data is expecting you to do. However, there are a few possible ways to avoid the problem and report it.
- Rather than calling wait in a simple blocking mode, call wait and configure an alarm in the parent process (see the sketch after this list). When the child fails to complete within a fixed period of time, you can kill it and report that this happened. A cheap way to do this is to alter the subprocess.Popen call to run the command under timeout.
- Modify the read so that it sets a read timeout socket option. You can do this by altering the code, or by using an interposer to override the default socket system call and add a timeout to the receiver. Neither of these is trivial, and either may cause svn to behave in unexpected ways.
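A rough sketch of the first option in Python 2 (the command and the one-hour limit are illustrative; timeout here is the coreutils command):
import signal
import subprocess

class Timeout(Exception):
    pass

def on_alarm(signum, frame):
    raise Timeout()

signal.signal(signal.SIGALRM, on_alarm)
signal.alarm(3600)  # give the whole update an hour
proc = subprocess.Popen(["svn", "up", "--non-interactive"])
try:
    proc.wait()
    signal.alarm(0)  # finished in time; cancel the alarm
except Timeout:
    proc.kill()
    print "svn up did not finish within an hour; killed it"

# The cheap variant: prefix the command with coreutils' timeout(1) instead,
# which kills svn and exits with status 124 if it runs too long.
# subprocess.check_call(["timeout", "3600", "svn", "up", "--non-interactive"])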
Is there something fundamentally different about how OS X and Linux handle read calls that I need to know to avoid infinitely-hanging processes?
I do not know the answer to this; however, if both are behaving in a POSIX-correct manner, they should both behave the same way. If you attempt to read from a socket that is not yet prepared to send you data, blocking indefinitely is the expected behaviour.
Overall, I think your best plan of attack is to expect your svn command to complete within a specific time period. If it does not, kill it and report that you did so.
I think I have figured out the issue(s) described above, and most of the mystery stems from my misunderstanding of what was happening on the servers.
There were the following fundamental problems:
- Python scripts which should have had a timeout set (and which I assumed did) did not. Some of those hung indefinitely when connecting to S3, waiting forever for a read to complete. Combing through the code and making sure that global socket timeouts were set, and not getting unset, seems to have solved that part (a small guard for this is sketched after this list).
- Some of the old Python processes appeared to be hung, but on closer inspection (once the truly blocked processes were weeded out), they were simply listing large S3 buckets to check the status of keys in those buckets, and this operation was taking hours or days to complete.
- The SVN checkout command was (still is) hanging for long periods of time when updating very large projects with many files in very deep directory structures. The client is waiting for a read to complete, but this is entirely legitimate (it appears to take the repository server a long time to collect the data it needs to send back to the client).
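For the first point, a small guard at the start of each job (a sketch; the helper name is arbitrary) makes it easier to notice if the global timeout gets unset again:
import socket

socket.setdefaulttimeout(60)

def check_global_timeout():
    # The problem was that the default timeout was set but later unset (set back
    # to None) elsewhere in the code, silently re-enabling indefinite blocking.
    # Calling this before each S3 request catches that early.
    if socket.getdefaulttimeout() is None:
        raise RuntimeError("global socket timeout has been unset")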
I'm leaving this answer here to explain what was going on, but I'm going to accept Matthew's, because he was right about what the actual possible issues were.