[PATCH libX11 0/2] Xlib 32-bit sequence number wrap bug(fix)

Sat Nov 16 14:30:16 PST 2013

cc'ing the xcb mailing list, since more people there know about how the
Xlib/xcb interaction in xcb_io works.   For the people there who didn't
see the patches on xorg-devel, you can find them at:

http://patchwork.freedesktop.org/patch/15501/
http://patchwork.freedesktop.org/patch/15500/

On 11/16/13 01:37 PM, Jonas Petersen wrote:
> Hi,
>
> I'm now submitting this the third time. I think it's an important issue and
> it needs to be reviewed. Please someone take a look at it, there's basically
> just 3 lines of code involved.
>
> If anybody wants more information, code to reproduce the failure, or changes
> in the patch I submitted, please tell me.
>
> --
> Here's two patches. The first one fixes a 32-bit sequence wrap bug. The
> second patch only adds a comment to another relevant statement.
>
> The patches contain some details. Here is the whole story for who might be
> interested:
>
> Xlib (libx11) will crash an application with a "Fatal IO error 11 (Resource
> temporarily unavailable)" after 4 294 967 296 requests to the server. That is
> when the Xlib internal 32-bit sequence wraps.
>
> Most applications probably will hardly reach this number, but if they do,
> they have a chance to die a mysterious death. For example the application I'm
> working on did always crash after about 20 hours when I started to do some
> stress testing. It does some intensive drawing through Xlib using gktmm2,
> pixmaps and gc drawing at 40 frames per second in full hd resolution (on
> Ubuntu). Some optimizations did extend the grace to about 35 hours but it
> would still crash.
>
> What then followed was some frustrating weeks of digging and debugging to
> realize that it's not in my application, nor in gtkmm, gtk or glib but that
> it's this little bug in Xlib which exists since 2006-10-06 apparently.
>
> It took a while to turn out that the number 0x100000000 (2^32) has some
> relevance. (Much) later it turned out it can be reproduced with Xlib only,
> using this code for example:
>
>    while(1) {
>        XDrawPoint(display, drawable, gc, x, y);
>        XFlush(display);
>    }
>
> It might take one or two hours, but when it reaches the 4294 million it will
> explode into a "Fatal IO error 11".
>
> What I then learned is that even though Xlib uses internal 32bit sequence
> numbers they get (smartly) widened to 64bit in the process so that the 32bit
> sequence may wrap without any disruption in the widened 64bit sequence.
> Obviously there must be something wrong with that.
>
> The Fatal IO error is issued in _XReply() when it's not getting a reply where
> there should be one, but the cause is earlier in _XSend() in the moment when
> the Xlib 32-bit sequence number wraps.
>
> The problem is that when it wraps to 0, the value of 'last_flushed' will
> still be at the upper boundary (e.g. 0xffffffff). There is two locations in
> _XSend() (xcb_io.c) that fail in this state because they rely on those values
> being sequential all the time, the first location is:
>
>    requests = dpy->request - dpy->xcb->last_flushed;
>
> I case of request = 0x0 and last_flushed = 0xffffffff it will assign
> 0xffffffff00000001 to 'requests' and then to XCB as a number (amount) of
> requests. This is the main killer.
>
> The second location is this:
>
>    for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; \
>      ++sequence)
>
> I case of request = 0x0 (less than last_flushed) there is no chance to enter
> the loop ever and as a result some requests are ignored.
>
> The solution is to "unwrap" dpy->request at these two locations and thus
> retain the sequence related to last_flushed.
>
>    uint64_t unwrapped_request = ((uint64_t)(dpy->request < \
>      dpy->xcb->last_flushed) << 32) + dpy->request;
>
> It creates a temporary 64-bit request number which has bit 8 set if 'request'
> is less than 'last_flushed'. It is then used in the two locations instead of
> dpy->request.
>
> I'm not sure if it might be more efficient to use that statement inplace,
> instead of using a variable.
>
> There is another line in require_socket() that worried me at first:
>
>    dpy->xcb->last_flushed = dpy->request = sent;
>
> That's a 64-bit, 32-bit, 64-bit assignment. It will truncate 'sent' to 32-bit
> when assinging it to 'request' and then also assign the truncated value to
> the (64-bit) 'last_flushed'.
> But it seems inteded. I have added a note explaining that for the next poor
> soul debugging sequence issues... :-)
>
> - Jonas
>
> Jonas Petersen (2):
>    xcb_io: Fix Xlib 32-bit request number wrapping
>    xcb_io: Add comment explaining a mixed type double assignment
>
>   src/xcb_io.c |   14 +++++++++++---
>   1 file changed, 11 insertions(+), 3 deletions(-)
>

-- 
	-Alan Coopersmith-              alan.coopersmith at oracle.com
	 Oracle Solaris Engineering - http://blogs.oracle.com/alanc