Debugging connection timeouts using strace?

I'm trying to figure out why is Eclipse spinning, so I decided to fire up strace.

I found the eclipse process using:

$ ps ax | grep java
 5546 ?        Sl    19:04 /usr/bin/java ... [arguments omitted]

By running strace on this process, I see that it's waiting on another process:

$ sudo strace -p 5546
Process 5546 attached - interrupt to quit
futex(0x7f6c416679d0, FUTEX_WAIT, 5547, NULL^C <unfinished ...>
Process 5546 detached

Interestingly, process 5547 doesn't appear in ps (can anyone tell me why?), but I can strace it. It is repeatedly spitting out a lot of EAGAIN failures (with the occasional success)

read(16, 0x7f6c41664d10, 16)            = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(15, 0x7f6c3815f2e4, 4096, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {67410, 357843264}) = 0
poll([{fd=16, events=POLLIN}, {fd=15, events=POLLIN}, {fd=68, events=POLLIN}, {fd=128, events=POLLIN}, {fd=69, events=POLLIN}], 5, 0) = 0 (Timeout)
read(16, 0x7f6c41664cb0, 16)            = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(15, 0x7f6c3815f2e4, 4096, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
...

Judging from the output, it looks like it is polling the file descriptors 16, 15, 68, 128 and 69. Particularly, the EAGAIN errors are coming from fds 15 and 6, as seen from the read(2) and recvfrom(2) calls.

How do I find more information on those fds? I tried lsof -p 5547 but no output is printed. I suspect these are sockets that are open to some website, but why it is spinning in a tight loop with EAGAIN failures is puzzling...


Solution 1:

Some PIDs are not listed because they belong to threads. htop can show them if you press Shift+H (and optionally T for tree view), but lsof wants the PID of the main process. (All pthreads in a process share file descriptors, anyway.) You can also take a look in /proc/5546/fd/ and /proc/5546/task/.

EAGAIN is normal for non-blocking I/O; for example, it is returned by read() when there's no data to read. See "ERRORS" in read(2), write(2) and so on. Some of these fd's are likely connections to the X11 server – non-blocking I/O is used by the X11 client libraries.

Solution 2:

The third parameter to futex(2) is not necessarily a process ID, the manual says it's futex(uaddr, op, val, timeout, ...), and that if op is FUTEX_WAIT, it "atomically verifies that the futex address uaddr still contains the value val, and sleeps awaiting FUTEX_WAKE on this futex address. [When] timeout is NULL, the call blocks indefinitely."

In your case, it sounds like the process 5546 was sitting there waiting for something -- presumably its thread 5547 or something related, but we can't know exactly what for sure based only on this -- to write that number at the specified memory address.

Note also that you can see threads by using ps -eLf or similar.