weird Xwayland and compositor deadlock issue [WAS: [PATCH xserver v2] xwayland: handle EAGAIN and EINTR gracefully]

Tue Sep 13 11:34:36 UTC 2016

On Tue, 13 Sep 2016 06:13:16 -0400 (EDT)
Olivier Fourdan <ofourdan at redhat.com> wrote:

> Hi all
> 
> ----- Original Message -----
> > wl_display_flush() can fail with EAGAIN and Xwayland would make this a
> > fatal error.
> > 
> > Handle the usual EAGAIN and EINTR gracefully so that Xwayland doesn't
> > die for so little.  
> 
> Right, I am running out of ideas...
> 
> So the approach of using poll() to wait for the Wayland file descriptor to become writable again apparently leads straight to a deadlock...
> 
> Reason for this is the compositor (gnome-shell/mutter) is itself waiting for data on the X file descriptor:
> 
> Backtrace of gnome-shell while we hit the EAGAIN case on the Wayland fd on the Xwayland side:
> 
> #0  0x00007f86d1cd400d in poll () at /lib64/libc.so.6
> #1  0x00007f86d1537d10 in _xcb_conn_wait () at /lib64/libxcb.so.1
> #2  0x00007f86d1539aa9 in xcb_wait_for_event () at /lib64/libxcb.so.1
> #3  0x00007f86d21fe03b in _XReadEvents (dpy=dpy@entry=0x55f956633000) at xcb_io.c:401
> #4  0x00007f86d21e562e in XIfEvent (dpy=0x55f956633000, event=0x7ffe30c28eb0, predicate=<find_timestamp_predicate>, arg=0x55f956761100)
>     at IfEvent.c:68
> #5  0x00007f86d8031ddb in meta_display_get_current_time_roundtrip () at /lib64/libmutter.so.0
> #6  0x00007f86d805ac49 in handle_other_xevent () at /lib64/libmutter.so.0
> #7  0x00007f86d805b95b in xevent_filter () at /lib64/libmutter.so.0
> #8  0x00007f86d73b98f1 in gdk_event_apply_filters () at /lib64/libgdk-3.so.0
> #9  0x00007f86d73b9cf2 in _gdk_x11_display_queue_events () at /lib64/libgdk-3.so.0
> #10 0x00007f86d7380f19 in gdk_display_get_event () at /lib64/libgdk-3.so.0
> #11 0x00007f86d73b9962 in gdk_event_source_dispatch () at /lib64/libgdk-3.so.0
> #12 0x00007f86d37d0f22 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
> #13 0x00007f86d37d12a0 in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0
> #14 0x00007f86d37d15c2 in g_main_loop_run () at /lib64/libglib-2.0.so.0
> #15 0x00007f86d803c00c in meta_run () at /lib64/libmutter.so.0
> #16 0x000055f953220657 in main ()
> 
> i.e., gnome-shell is stuck in meta_display_get_current_time_roundtrip():
> 
>   https://git.gnome.org/browse/mutter/tree/src/core/display.c#n1300
> 
> While at the same time, Xwayland is trying to write to the Wayland file descriptor with wl_display_flush() and gets an EAGAIN in the block_handler():
> 
>   https://cgit.freedesktop.org/xorg/xserver/tree/hw/xwayland/xwayland.c?h=server-1.18-branch#n483
> 
> I tried to poll() the Wayland fd with a timeout before calling wl_display_flush(), so that wl_display_flush() only runs when the fd is writable, to see if that would help unblock mutter waiting for its PropertyNotify event. That did not work: the Wayland fd still returns EAGAIN forever, and gnome-shell/mutter remains stuck waiting for the PropertyNotify event...
> 
> I am a bit puzzled: why is gnome-shell/mutter/xcb waiting for the PropertyNotify, and where has that event gone?

Hi Olivier,

I don't have any solution for you. The interactions between the Wayland
compositor and Xwayland are known to deadlock very easily, IIRC. I
believe the only thing you can do is ensure no such case can ever
occur, which is very painful. That is, at least one side must never do
a blocking roundtrip.

Have the recent modifications caused a significant increase in Wayland
requests from Xwayland? If Xwayland needs to send more data than can be
buffered, *any* blocking roundtrip via X11 from the Wayland compositor
is prone to deadlock. The compositor will be waiting for a reply via
X11, while Xwayland is blocked on flushing, since the Wayland
compositor is not consuming requests.
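
To make that failure mode concrete, here is a minimal sketch using a
plain Unix socketpair in Python. It is not Xwayland or mutter code, and
all names in it (fill_until_eagain, try_send, drain, CHUNK) are made up
for illustration. One end plays the role of Xwayland writing requests,
the other the compositor that is not reading because it is blocked
elsewhere.

```python
import socket

CHUNK = b"\0" * 4096  # stand-in for a batch of Wayland requests


def fill_until_eagain(sock):
    """Send on a non-blocking socket until the kernel buffer is full.

    Returns the number of bytes accepted before the first EAGAIN.
    """
    sock.setblocking(False)
    total = 0
    while True:
        try:
            total += sock.send(CHUNK)
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK
            return total


def try_send(sock, data):
    """Return True if some of data was accepted, False on EAGAIN."""
    try:
        return sock.send(data) > 0
    except BlockingIOError:
        return False


def drain(sock, nbytes):
    """Read nbytes from the peer, as its main loop eventually would."""
    left = nbytes
    while left:
        data = sock.recv(min(left, 65536))
        if not data:  # peer closed; nothing more to read
            break
        left -= len(data)


if __name__ == "__main__":
    xwayland_end, compositor_end = socket.socketpair()
    sent = fill_until_eagain(xwayland_end)
    print("buffer full after", sent, "bytes")
    print("send while peer is stuck:", try_send(xwayland_end, CHUNK))
    drain(compositor_end, sent)  # the compositor returns to its loop
    print("send after peer reads:", try_send(xwayland_end, CHUNK))
```

As long as the reading side never gets back to its main loop, every
further send keeps failing with EAGAIN, no matter how long the writer
polls for writability.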

It can also trivially happen if both sides do a blocking roundtrip at
the same time, or even just block waiting for an event.

At least one of the servers needs to be able to return to its main loop
to process the protocol stream it is the server for. Preferably both, I
think.

You could check how Weston's XWM works. I highly suspect that after
Xwayland launch it avoids doing any blocking roundtrips via X11.

I'd assume Xwayland also tries to avoid blocking on Wayland events, but
if nothing else, I believe Mesa via GLAMOR may block on
wl_buffer.release events... or maybe not if GLAMOR is smart with its
throttling. Anyway, since your flush is hitting EAGAIN, that doesn't
seem to be the cause.

I wonder if making wl_display_flush() block immediately, as in your
patch, could be replaced by adding the wl_display fd to the main poll
loop, so that it would get flushed ASAP but X11 requests would still be
serviced in the meantime? It does run the risk of overflowing the
Wayland send buffer in Xwayland. Is there any way to prioritize the
Wayland compositor's X11 connection in Xwayland?
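
A rough sketch of that idea, again with plain sockets in Python rather
than the real xserver main loop or libwayland: unsent data stays in a
user-space queue, the fd is watched for writability only while
something is pending, and client requests are still handled in the same
iteration. DeferredFlusher, run_once and on_request are hypothetical
names, and the unbounded `pending` buffer is exactly the overflow risk
mentioned above.

```python
import selectors
import socket


class DeferredFlusher:
    """Queue outgoing bytes; write them only when the fd accepts them."""

    def __init__(self, sock):
        sock.setblocking(False)
        self.sock = sock
        self.pending = bytearray()  # unbounded here; a real server needs a cap

    def queue(self, data):
        self.pending += data

    def flush(self):
        """Write as much as possible. True once nothing is pending."""
        while self.pending:
            try:
                n = self.sock.send(self.pending)
            except BlockingIOError:  # EAGAIN: try again when writable
                return False
            del self.pending[:n]
        return True


def run_once(flusher, client, on_request, timeout=0.1):
    """One main-loop iteration: serve the client, flush if writable."""
    # A fresh selector per iteration keeps the sketch short; a real loop
    # would keep one selector and modify its registrations.
    sel = selectors.DefaultSelector()
    try:
        sel.register(client, selectors.EVENT_READ)
        if flusher.pending:
            sel.register(flusher.sock, selectors.EVENT_WRITE)
        for key, _ in sel.select(timeout):
            if key.fileobj is client:
                on_request(client.recv(4096))
            else:
                flusher.flush()
    finally:
        sel.close()
```

With this shape, a full send buffer no longer stops the loop: the
request from the other connection is still read and answered, which is
what would let the blocked peer make progress and start reading again.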

Thanks,
pq