Processes hanging indefinitely when reading from network connections
An update to the below:
I have experienced a similar problem on an unrelated script, on a Debian virtual machine in a different datacentre.
This looks suspiciously like the issue described here (and like the person asking that question, I do not have a proxy configured in front of the server).
The main difference to the description below is that when I attach to the hung process, I see a call to recvfrom rather than read:
$ strace -p 17527
Process 17527 attached - interrupt to quit
recvfrom(3,
However Python is not under any impression that it is being proxied:
>>> import os; print os.getenv("HTTP_PROXY"), os.getenv("http_proxy")
None None
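(For completeness, urllib's getproxies() reports every proxy setting it can find in the environment; on an unproxied machine it should come back empty. This is Python 2, as elsewhere in this post:)
>>> import urllib; print urllib.getproxies()
{}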
So I'm still stumped. Sadly, the linked question is without a final answer as well.
(I'm also wondering if this question is related, but it seems unlikely that S3 would be failing to honor Connection: close headers.)
I have several Debian (Wheezy, x86_64) servers all exhibiting the following behavior:
All of the servers have a set of cron jobs that, among other things, pull data from S3. These usually run fine, but occasionally ps aux reveals that some of the jobs started hours or days ago are still running, and have not finished cleanly.
Inspecting them with strace -p <pid> shows that, in all cases, the process is hung on a read call. For example, the output for a process I checked just now was:
$ strace -p 12089
Process 12089 attached - interrupt to quit
read(5,
And checking the open file descriptors gives me this:
$ sudo lsof -i | grep 12089
python 12089 user 5u IPv4 809917771 0t0 TCP my.server.net:35427->185-201.amazon.com:https (ESTABLISHED)
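One detail that may matter here: an ESTABLISHED entry in lsof doesn't prove the peer is still reachable, because Python doesn't enable TCP keepalives by default. If the remote end (or something in between) silently drops the connection, a blocked read can wait forever. A rough sketch of enabling keepalives on a socket follows; the interval values are illustrative, and the TCP_KEEP* constants are Linux-specific:
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific tuning: start probing after 60s of idle time, probe every 10s,
# and give up (erroring out the blocked read) after 5 failed probes.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)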
At first I assumed this was simply due to not setting a read timeout in the Python scripts, but that does not appear to be the case, for a couple of reasons:
- This doesn't happen when the same jobs are run on our OS X boxes (all 10.5, i386), using identical code.
- A variant of the script that does set a timeout (of 60 seconds, using socket.setdefaulttimeout; this is in Python 2.7, but the code base has to be 2.5 compatible) has been hung since yesterday. A sketch of how that timeout is applied follows this list.
- Another process that isn't Python seems to occasionally exhibit similar behaviour. In this case, a Python script is executing an svn up --non-interactive process (using subprocess.Popen, for what that's worth).
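For reference, the timeout variant does roughly the following (a minimal sketch, not the actual job; the URL is a placeholder). The detail that turned out to matter later is that setdefaulttimeout() only affects sockets created after it is called, and anything that sets it back to None re-enables indefinite blocking:
import socket
import urllib2

socket.setdefaulttimeout(60)  # applies to sockets created from this point on

try:
    response = urllib2.urlopen("https://s3.amazonaws.com/some-bucket/some-key")  # placeholder URL
    data = response.read()
except socket.timeout:
    # With the default timeout set, a stalled read raises here instead of hanging forever.
    # (Timeouts during the connect phase surface as urllib2.URLError instead.)
    print "request timed out"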
The situation with that SVN process is similar. Python is waiting for SVN:
$ strace -p 28034
Process 28034 attached - interrupt to quit
wait4(28127,
And SVN is waiting for a read call to complete:
$ strace -p 28127
Process 28127 attached - interrupt to quit
read(6,
And that read is pointing to another external host:
$ sudo lsof -i | grep 28127
svn 28127 user 3u IPv4 701186417 0t0 TCP my.server.net:49299->sparrow.telecommunity.com:svn (ESTABLISHED)
svn 28127 user 6u IPv4 701186439 0t0 TCP my.server.net:49309->sparrow.telecommunity.com:svn (ESTABLISHED)
(There appears to be an svn:externals property set to ez_setup svn://svn.eby-sarna.com/svnroot/ez_setup on the directory being updated; based on their website, I think this is redirecting to telecommunity.com.)
Additional possibly-relevant points:
- The Python environment on the Macs is 2.5. On the Debian boxes, it is 2.7.
- I am not well versed in SVN, and I have no idea if the reason it is hanging is fundamentally the same thing or not. Nor am I entirely sure what the implications of svn:externals are; this was set up before my time.
- The Python scripts themselves are retrieving large-ish (~10MB in some cases) chunks of data from Amazon S3, and this has a tendency to be slow (I'm seeing download times as long as three minutes, which seems long compared to how long it takes the servers, even in different data centers, to communicate with each other). Similarly, some of our SVN repositories are rather large. That is to say, some of these operations are long-running anyway; however, they also seem to get hung for hours or days in some cases.
- On one server, the OOM killer took out MySQL this morning. On closer inspection, memory usage was at 90%, and swap usage was at 100% (as reported by Monit); killing a large backlog of hung Python jobs reduced these stats to 60% and 40% respectively. This gives me the impression that at least some (if not all) of the data is being downloaded/read (and held in memory while the process hangs).
- These cron jobs are requesting a list of resources from S3, and updating a list of MySQL tables accordingly. Each job is started with the same list, so will try to request the same resources and update the same tables.
- I was able to capture some traffic from one of the hung processes; it's all a bit inscrutable to me, but I wonder if it indicates that the connection is active and working, just very, very slow? I've provided it as a gist, to avoid clutter (I should note that this is about two hours' worth of capture): https://gist.github.com/petronius/286484766ad8de4fe20b (This was a red herring, I think. There's activity on that port, but it's not the same connection as the one to S3; it's just other random server activity.)
- I've tried to re-create this issue on a box in another data center (a VM running the same version of Debian with the same system setup) with no luck (I'd thought that perhaps the issue was related to this one, but the boxes experiencing these issues aren't VMs, and there are no dropped packets according to ifconfig). I guess this indicates a network configuration issue, but I'm not sure where to start with that.
So I guess my questions are:
- Can I fix this at a system level, or is this something going wrong with each individual process?
- Is there something fundamentally different about how OS X and Linux handle read calls that I need to know to avoid infinitely-hanging processes?
Can I fix this at a system level, or is this something going wrong with each individual process?
It is difficult to say, because it's unknown what is happening at the protocol level. Basically, read(2) will block indefinitely provided that:
- The TCP connection remains open.
- You expect at least 1 byte of data to arrive.
- The sender is not ready to send you data.
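To make this concrete, here is a minimal, self-contained sketch (in Python, since that is what your scripts use): a local listener accepts a connection but never sends anything, so a recv() on the client side would block forever even though the connection is ESTABLISHED; with a timeout set, it raises socket.timeout instead:
import socket

# A local "peer" that accepts the connection but never sends a byte.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(listener.getsockname())
listener.accept()

# client.recv(1) would now block indefinitely: the connection is open,
# we expect at least one byte, and the sender never sends one.
client.settimeout(5)
try:
    client.recv(1)
except socket.timeout:
    print "read timed out instead of hanging"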
Now, it could be that something is wrong with the process: for example, the other end may be expecting a response from you before it sends more data, or a previous response from the other end may require SVN to do something else before requesting more data. Suppose, for example, an error response came back that should force the client to resend some information.
You cannot fix this gracefully, because it's impossible, from the information you have, to determine what the sender of this data is expecting you to do. However, there are a few possible ways to avoid the problem and report it.
- Rather than calling wait in a simple blocking mode, call wait and configure an alarm in the parent process (see the sketch after this list). When the child fails to complete within a fixed period of time, you can kill it and report that this happened. A cheap way to do this is to alter the subprocess.Popen call to run the command under timeout.
- Modify the read so that it sets a read timeout socket option. You can do this by altering the code, or by using an interposer to override the default socket system call and add a timeout to the receiver. Neither of these is trivial, and either may cause svn to behave in unexpected ways.
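A rough sketch of the first option in Python 2 (the command and the one-hour limit are illustrative; timeout here is the coreutils command):
import signal
import subprocess

class Timeout(Exception):
    pass

def on_alarm(signum, frame):
    raise Timeout()

signal.signal(signal.SIGALRM, on_alarm)
signal.alarm(3600)  # give the whole update an hour
proc = subprocess.Popen(["svn", "up", "--non-interactive"])
try:
    proc.wait()
    signal.alarm(0)  # finished in time; cancel the alarm
except Timeout:
    proc.kill()
    print "svn up did not finish within an hour; killed it"

# The cheap variant: prefix the command with coreutils' timeout(1) instead,
# which kills svn and exits with status 124 if it runs too long.
# subprocess.check_call(["timeout", "3600", "svn", "up", "--non-interactive"])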
Is there something fundamentally different about how OS X and Linux handle read calls that I need to know to avoid infinitely-hanging processes?
I do not know the answer to this; however, if both are behaving in a POSIX-correct manner, they should both behave the same way. If you attempt to read from a socket that is not yet prepared to send you data, blocking indefinitely is the expected behaviour.
Overall, I think your best plan of attack is to expect your svn command to complete within a specific time period. If it does not, kill it and report that you did so.
I think I have figured out the issue(s) described above, and most of the mystery stems from my misunderstanding of what was happening on the servers.
There were the following fundamental problems:
- Python scripts which should have had a timeout set (and which I assumed did) did not. Some of those hung indefinitely when connecting to S3, waiting forever for a read to complete. Combing through the code and making sure that global socket timeouts were set, and not getting unset, seems to have solved that part (a small guard for this is sketched after this list).
- Some of the old Python processes appeared to be hung, but on closer inspection (once the truly blocked processes were weeded out), they were simply listing large S3 buckets to check the status of keys in those buckets, and this operation was taking hours or days to complete.
- The SVN checkout command was (still is) hanging for long periods of time when updating very large projects with many files in very deep directory structures. The client is waiting for a read to complete, but this is entirely legitimate (it appears to take the repository server a long time to collect the data it needs to send back to the client).
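For the first point, a small guard at the start of each job (a sketch; the helper name is arbitrary) makes it easier to notice if the global timeout gets unset again:
import socket

socket.setdefaulttimeout(60)

def check_global_timeout():
    # The problem was that the default timeout was set but later unset (set back
    # to None) elsewhere in the code, silently re-enabling indefinite blocking.
    # Calling this before each S3 request catches that early.
    if socket.getdefaulttimeout() is None:
        raise RuntimeError("global socket timeout has been unset")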
I'm leaving this answer here to explain what was going on, but I'm going to accept Matthew's, because he was right about what the actual possible issues were.