_XReply() terminates app with _XIOError()

This is likely a known issue in libX11 regarding the handling of request numbers used for xcb_wait_for_reply.

At some point after libxcb v1.5, code was introduced to use 64-bit sequence numbers internally everywhere, and logic was added to widen sequence numbers on entry to those public APIs that still take 32-bit sequence numbers.

Here is a quote from the submitted libxcb bug report (actual email addresses removed):

We have an application that does a lot of XDrawString and XDrawLine. After several hours the application is exited by an XIOError.

The XIOError is called in libX11 in the file xcb_io.c, function _XReply. It didn't get a response from xcb_wait_for_reply.

libxcb 1.5 is fine, libxcb 1.8.1 is not. Bisecting libxcb points to this commit:

commit ed37b087519ecb9e74412e4df8f8a217ab6d12a9
Author: Jamey Sharp
Date: Sat Oct 9 17:13:45 2010 -0700

    xcb_in: Use 64-bit sequence numbers internally everywhere.

    Widen sequence numbers on entry to those public APIs that still take
    32-bit sequence numbers.

    Signed-off-by: Jamey Sharp <[email protected]>

Reverting it on top of 1.8.1 helps.

Adding traces to libxcb I found that the last request numbers used for xcb_wait_for_reply are these: 4294900463 and 4294965487 (two calls in the while loop of the _XReply function), half a second later: 63215 (then XIOError is called). The widen_request is also 63215, I would have expected 63215+2^32. Therefore it seems that the request is not correctly widened.

The commit above also changed the comparisons in poll_for_reply from XCB_SEQUENCE_COMPARE_32 to XCB_SEQUENCE_COMPARE. Maybe the widening never worked correctly, but it was never observed, because only the lower 32 bits were compared.
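
To make the widening and comparison argument concrete, here is a minimal sketch in C. It is not libxcb's actual code: the helper names widen_seq, seq_ge_32 and seq_ge_64 are hypothetical, and the sketch assumes the 32-bit value refers to a request at or before the newest known 64-bit sequence number and fewer than 2^32 requests behind it.

```c
#include <stdint.h>

/* Hypothetical helper (not libxcb's implementation): widen a 32-bit
 * sequence number to 64 bits relative to `latest`, the newest 64-bit
 * sequence number known to the connection. */
static uint64_t widen_seq(uint64_t latest, uint32_t seq)
{
    /* Borrow the upper 32 bits from the latest known sequence number. */
    uint64_t widened = (latest & UINT64_C(0xffffffff00000000)) | seq;

    /* If that lands in the future, the request belongs to the previous
     * block of 2^32 sequence numbers. */
    if (widened > latest)
        widened -= UINT64_C(1) << 32;
    return widened;
}

/* Wrap-tolerant "a >= b" that looks at the low 32 bits only. */
static int seq_ge_32(uint64_t a, uint64_t b)
{
    return (int32_t)((uint32_t)a - (uint32_t)b) >= 0;
}

/* "a >= b" on the full 64-bit values. */
static int seq_ge_64(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) >= 0;
}
```

With a completed sequence number of 2^32 + 63000, the 32-bit comparison reports that request 63215 is still outstanding whether or not it was widened. The 64-bit comparison reports the un-widened 63215 as long since completed, which fits the guess in the report that the narrower comparison had been hiding a widening problem.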

Reproducing the issue

Here's the original code snippet from the submitted bug report which was used to reproduce the issue:

  for (;;) {
      XDrawLine(dpy, w, gc, 10, 60, 180, 20);
      XFlush(dpy);
  }

Apparently the issue can be reproduced with even simpler code:

  for (;;) {
      XNoOp(dpy);
  }

According to the submitted libxcb bug report, the following conditions are needed to reproduce the issue (assuming the reproduction code is in xdraw.c):

libxcb >= 1.8 (i.e. includes the commit ed37b08)

compiled with 32bit: gcc -m32 -lX11 -o xdraw xdraw.c

the sequence counter wraps.

Proposed patch

The proposed patch, which can be applied on top of libxcb 1.8.1, is shown below (note that src/xcb_io.c is the libX11 file named in the bug report above):

diff --git a/src/xcb_io.c b/src/xcb_io.c
index 300ef57..8616dce 100644
--- a/src/xcb_io.c
+++ b/src/xcb_io.c
@@ -454,7 +454,7 @@ void _XSend(Display *dpy, const char *data, long size)
        static const xReq dummy_request;
        static char const pad[3];
        struct iovec vec[3];
-       uint64_t requests;
+       unsigned long requests;
        _XExtension *ext;
        xcb_connection_t *c = dpy->xcb->connection;
        if(dpy->flags & XlibDisplayIOError)
@@ -470,7 +470,7 @@ void _XSend(Display *dpy, const char *data, long size)
        if(dpy->xcb->event_owner != XlibOwnsEventQueue || dpy->async_handlers)
        {
                uint64_t sequence;
-               for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)
+               for(sequence = dpy->xcb->last_flushed + 1; (unsigned long) sequence <= dpy->request; ++sequence)
                        append_pending_request(dpy, sequence);
        }
        requests = dpy->request - dpy->xcb->last_flushed;
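
The effect of the two changed lines can be modelled in isolation. The sketch below is an illustration only, not libX11 code: uint32_t stands in for `unsigned long` on a 32-bit build (an assumption), and the function names are hypothetical.

```c
#include <stdint.h>

/* Request count as the patched code computes it: the 64-bit difference
 * is truncated to 32 bits on assignment, so a wrapped `request` still
 * yields a small count. */
static uint32_t patched_request_count(uint32_t request, uint64_t last_flushed)
{
    uint32_t requests = (uint32_t)(request - last_flushed);
    return requests;
}

/* Number of iterations of the patched loop; in xcb_io.c each iteration
 * would call append_pending_request(). */
static unsigned patched_loop_iterations(uint32_t request, uint64_t last_flushed)
{
    unsigned n = 0;
    uint64_t sequence;

    /* The cast drops the upper 32 bits of `sequence` before comparing. */
    for (sequence = last_flushed + 1; (uint32_t)sequence <= request; ++sequence)
        ++n;
    return n;
}
```

For request = 0 and last_flushed = 0xffffffff both functions return 1, which is the wrap case described in the bug report. In this model the cast comparison has limits of its own: the loop body is skipped whenever the truncated start value is greater than request (for example last_flushed = 0xfffffffe, request = 2), and the condition is always true when request is 0xffffffff.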

Detailed technical explanation

Please find below a detailed technical explanation by Jonas Petersen (also included in the aforementioned bug report):

Hi,

Here are two patches. The first one fixes a 32-bit sequence wrap bug. The second patch only adds a comment to another relevant statement.

The patches contain some details. Here is the whole story for anyone who might be interested:

Xlib (libx11) will crash an application with a "Fatal IO error 11 (Resource temporarily unavailable)" after 4 294 967 296 requests to the server. That is when the Xlib internal 32-bit sequence wraps.

Most applications will probably never reach this number, but if they do, they have a chance to die a mysterious death. For example, the application I'm working on always crashed after about 20 hours when I started to do some stress testing. It does some intensive drawing through Xlib using gtkmm2, pixmaps and gc drawing at 40 frames per second in full HD resolution (on Ubuntu). Some optimizations extended the grace period to about 35 hours, but it would still crash.

What then followed was some frustrating weeks of digging and debugging to realize that it's not in my application, nor in gtkmm, gtk or glib, but that it's this little bug in Xlib, which has apparently existed since 2006-10-06.

It took a while to turn out that the number 0x100000000 (2^32) has some relevance. (Much) later it turned out it can be reproduced with Xlib only, using this code for example:

  while (1) {
      XDrawPoint(display, drawable, gc, x, y);
      XFlush(display);
  }

It might take one or two hours, but when it reaches the 4294 million it will explode into a "Fatal IO error 11".

What I then learned is that even though Xlib uses internal 32bit sequence numbers they get (smartly) widened to 64bit in the process so that the 32bit sequence may wrap without any disruption in the widened 64bit sequence. Obviously there must be something wrong with that.

The Fatal IO error is issued in _XReply() when it's not getting a reply where there should be one, but the cause is earlier in _XSend() in the moment when the Xlib 32-bit sequence number wraps.

The problem is that when it wraps to 0, the value of 'last_flushed' will still be at the upper boundary (e.g. 0xffffffff). There are two locations in _XSend() (xcb_io.c) that fail in this state because they rely on those values being sequential all the time. The first location is:

requests = dpy->request - dpy->xcb->last_flushed;

In case of request = 0x0 and last_flushed = 0xffffffff it will assign 0xffffffff00000001 to 'requests' and then pass that to XCB as the number of requests. This is the main killer.
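
The arithmetic can be checked with a small standalone sketch. It assumes a 32-bit 'request' (modelled as uint32_t) and a 64-bit 'last_flushed', as described above; the function name is hypothetical.

```c
#include <stdint.h>

/* Models `requests = dpy->request - dpy->xcb->last_flushed;` with a
 * 64-bit `requests`. The 32-bit request is converted to 64 bits before
 * the subtraction, so a wrapped request gives a huge unsigned
 * difference instead of a small count. */
static uint64_t unpatched_request_count(uint32_t request, uint64_t last_flushed)
{
    return request - last_flushed;
}
```

Calling unpatched_request_count(0, 0xffffffff) gives 0xffffffff00000001, while a non-wrapped pair such as (13, 10) gives the expected 3.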

The second location is this:

for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)

In case of request = 0x0 (less than last_flushed) there is no chance to enter the loop ever and as a result some requests are ignored.
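
A minimal model of that loop shows the skipped iterations. As before, uint32_t stands in for the 32-bit 'request' and the function name is hypothetical.

```c
#include <stdint.h>

/* Counts how often the unpatched loop body would run. */
static unsigned unpatched_loop_iterations(uint32_t request, uint64_t last_flushed)
{
    unsigned n = 0;
    uint64_t sequence;

    /* With request = 0 and last_flushed = 0xffffffff the start value is
     * 0x100000000, which is already greater than request. */
    for (sequence = last_flushed + 1; sequence <= request; ++sequence)
        ++n;
    return n;
}
```

For (request, last_flushed) = (13, 10) the body runs three times; for (0, 0xffffffff) it never runs, although one request is pending.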

The solution is to "unwrap" dpy->request at these two locations and thus retain the sequence related to last_flushed.

uint64_t unwrapped_request = ((uint64_t)(dpy->request < dpy->xcb->last_flushed) << 32) + dpy->request;

It creates a temporary 64-bit request number which has bit 32 set (the result of the comparison shifted left by 32) if 'request' is less than 'last_flushed'. It is then used in the two locations instead of dpy->request.
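
The unwrapping statement can be tried out on its own. The sketch below wraps it in two hypothetical helper functions and assumes that 'last_flushed' itself fits in 32 bits, so that at most one wrap separates it from 'request'.

```c
#include <stdint.h>

/* The unwrapping statement from above, applied to plain values. */
static uint64_t unwrap_request(uint32_t request, uint64_t last_flushed)
{
    /* The comparison yields 0 or 1; shifted left by 32 it adds 2^32
     * exactly when request has wrapped past last_flushed. */
    return ((uint64_t)(request < last_flushed) << 32) + request;
}

/* Number of pending requests computed from the unwrapped value. */
static uint64_t pending_count(uint32_t request, uint64_t last_flushed)
{
    return unwrap_request(request, last_flushed) - last_flushed;
}
```

With request = 0 and last_flushed = 0xffffffff the unwrapped value is 0x100000000 and the pending count is 1; with request = 2 and last_flushed = 0xfffffffe the count is 4.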

I'm not sure if it might be more efficient to use that statement in place, instead of using a variable.

There is another line in require_socket() that worried me at first:

dpy->xcb->last_flushed = dpy->request = sent;

That's a 64-bit, 32-bit, 64-bit assignment. It will truncate 'sent' to 32-bit when assigning it to 'request' and then also assign the truncated value to the (64-bit) 'last_flushed'. But it seems intended. I have added a note explaining that for the next poor soul debugging sequence issues... :-)
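
The truncating behaviour of such a chained assignment can be reproduced with a small model. The struct and function names below are hypothetical; only the field widths (64-bit 'last_flushed', 32-bit 'request') follow the description above.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two fields involved. */
struct seq_state {
    uint64_t last_flushed;
    uint32_t request;
};

static void record_sent(struct seq_state *s, uint64_t sent)
{
    /* Assignment associates right to left: `request = sent` truncates
     * to 32 bits first, and the value of that expression (the truncated
     * value) is what last_flushed receives. */
    s->last_flushed = s->request = sent;
}
```

After record_sent(&s, 0x100000005), both s.request and s.last_flushed hold 5.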

Jonas

Jonas Petersen (2):
  xcb_io: Fix Xlib 32-bit request number wrapping
  xcb_io: Add comment explaining a mixed type double assignment

 src/xcb_io.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

--
1.7.10.4