Fence Sync patches

Fri Dec 3 13:08:17 PST 2010

On Friday 03 December 2010 11:16:43 am Owen Taylor wrote:
> On Fri, 2010-12-03 at 10:13 -0800, James Jones wrote:
> > I wrote a slide deck on synchronization and presentation ideas for X a
> > year ago or so before starting this work:
> > 
> > http://people.freedesktop.org/~aplattner/x-presentation-and-synchronization.pdf
> > 
> > Aaron presented it at XDevConf last year.  However, that doesn't really
> > cover the immediately useful function for GL composite managers:
> > XDamageSubtractAndTrigger().  I plan on putting some patches to compiz
> > together demonstrating the usage, but basically the flow is:
> > 
> > -Create a bunch of sync objects at startup time in your GL/GLES-based
> > compositor, and import them into your GL context as GL sync objects. 
> > I'll call those syncObjectsX[] and syncObjectsGL[] respectively.
> > 
> > -Rather than calling XDamageSubtract(), call
> > XDamageSubtractAndTrigger(syncObjectsX[current]).
> 
> So the basic flow here is:
> 
>  Client => X server     [render this]
>  X server => GPU        [render this]
>  X server => compositor [something was rendered]
>  compositor => xserver  [trigger the fence]
>  compositor => GPU      [render this after the fence]
>  xserver => GPU         [trigger the fence]

Roughly, but I think that ordering implies a very worst-case scenario.  In 
reality the last two steps will most likely occur simultaneously, and the last 
step might even be a no-op: If the X server already knows the rendering has 
completed it can simply mark the fence triggered immediately without going out 
to the GPU.  This is often the case in our driver, though I haven't 
implemented that particular optimization yet.

> In the normal case where there is a single damage event per frame, the
> fact that we have this round trip where the compositor has to go back to
> the X server, and the X server has to go back to the GPU bothers me.

I'd like to point out that it's not really a round trip, but rather two trips to 
the same destination in parallel.  A round trip would add more latency.

> It's perhaps especially problematic in the case of the open source
> drivers where the synchronization is already handled correctly without
> this extra work and the extra work would just be a complete waste of
> time. [*]

The default implementation assumes the Open Source driver behavior and marks 
the fence triggered as soon as the server receives the request, so the only 
added time will be a single asynchronous X request if the open source OpenGL-
side implementation is done efficiently.

> But it doesn't seem like a particularly efficient or low-latency way of
> handling things even in the case of a driver with no built in
> synchronization.
> 
> Can you go into the reasoning for this approach?

As I said, this definitely isn't the ideal approach; it's the best fully 
backwards-compatible approach we could come up with.  Things we considered:

-Add a GL call to wait on the GPU for the damage event sequence number.  We 
got bogged down here worrying about wrapping of 32-bit values, the lack of 
ability to do a "wait for >=" on 64-bit values on GPUs, and the discussion 
rat-holed.  This was discussed on IRC so long ago I don't even remember all 
the pros/cons.

-Have the server generate a sync object ID, trigger it, and send that with the 
damage event if clients opt in somehow.  This seemed very anti-X design 
(clients should create, or at least name, transient resources), and has the 
potential of generating tons and tons of objects if the client forgets to 
delete them or can't keep up with the damage events.  Also, importing X sync 
objects to GL is expensive, so it's desirable to make that a startup-time 
operation.

-Have the client provide the ringbuffer of objects to X and have it figure out 
which one to trigger on every damage event.  I don't think I ever discussed 
this with anyone.  I dismissed it as hiding too much magic in X.
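
To illustrate the wrapping concern in the first item above, here is a minimal
sketch of a wrap-safe "has serial a reached serial b" test on 32-bit values,
done on the CPU.  The function name serial_reached() is hypothetical, and the
assumption that the two values are never more than 2^31 apart is mine; this is
not the scheme that was discussed on IRC, and it says nothing about what a GPU
can evaluate.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when serial `a` is at or after serial
 * `b`, treating the 32-bit counter as circular.  Only meaningful while the
 * two values are less than 2^31 apart (an assumption of this sketch).
 */
static bool serial_reached(uint32_t a, uint32_t b)
{
    /* Unsigned subtraction wraps modulo 2^32, so the difference is small
     * when `a` is at or slightly ahead of `b`, even across a wrap. */
    return (uint32_t)(a - b) < UINT32_C(0x80000000);
}
```

A plain `a >= b` would report that serial 2 has not reached serial 0xFFFFFFFE,
even though the counter wrapped and 2 was issued later.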

> > -Prefix all the GL rendering that repairs the damage subtracted with a
> > sync wait: glWaitSync(syncObjectsGL[current++])
> > 
> > The GL rendering will then wait (on the GPU.  It won't block the
> > application unless it gets really backed up) until all rendering that
> > created the damage has finished on the GPU.  Managing the ring-buffer of
> > sync objects is a little more complicated than that in practice, but
> > that's the basic idea.
> 
> Can you be more specific about that? Do you need to do a
> glClientWaitSync() when you wrap around and reuse the first sync object
> pair?

Yeah, that's about it.
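
The wrap-around handling can be sketched roughly as below.  This is a
simulation only: it makes no X or GL calls, and every name in it (fence_ring,
ring_next(), slot_wait_and_reset(), RING_SIZE) is hypothetical.  I'm assuming
that in a real compositor the marked spot would be a glClientWaitSync() on the
GL sync object followed by resetting the X fence before it is triggered again.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4  /* hypothetical; a real compositor would tune this */

/* Stand-in for one X fence / imported GL sync object pair. */
struct fence_slot {
    bool used;  /* handed out on an earlier lap and not yet reset */
};

struct fence_ring {
    struct fence_slot slots[RING_SIZE];
    size_t current;         /* next slot to hand out */
    unsigned client_waits;  /* times reuse required a CPU-side wait */
};

/* Placeholder for the CPU-side wait plus fence reset (assumed to map to
 * glClientWaitSync() and an X fence reset in real code). */
static void slot_wait_and_reset(struct fence_ring *ring,
                                struct fence_slot *slot)
{
    slot->used = false;
    ring->client_waits++;
}

/* Returns the index of the pair to pass to the trigger request and to
 * glWaitSync() for the next batch of damage repair. */
static size_t ring_next(struct fence_ring *ring)
{
    size_t index = ring->current;
    struct fence_slot *slot = &ring->slots[index];

    /* On wrap-around the slot may still guard in-flight work, so wait
     * for it and reset it before triggering it again. */
    if (slot->used)
        slot_wait_and_reset(ring, slot);

    slot->used = true;
    ring->current = (index + 1) % RING_SIZE;
    return index;
}
```

With a zero-initialized ring, the first RING_SIZE calls to ring_next() hand out
fresh slots without waiting; only the call that wraps back to slot 0 goes
through the wait-and-reset path.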

> [...]
> 
> > I admit this isn't an ideal work-flow, and yes it is one more layer of
> > hard-to-test voodoo needed to write a robust TFP/EGLimage based
> > composite manager, but it's the best we can do without modifying client
> > applications.  However, fence sync objects provide a basis for all kinds
> > of cooler stuff once you start defining new ways that client
> > applications can use them to notify the composite manager when they've
> > finished rendering a frame explicitly.  Then the extra step of telling X
> > you want notification when some rendering you've already been notified
> > of has completed will go away.  The rendering notification (damage
> > event) and a pointer of some sort to the sync object that tracks it will
> > arrive together.  That's what I'll be working on after the initial
> > object support is wrapped up.
> 
> It worries me to see a pretty complex, somewhat expensive band-aid going
> in *without* knowing more about that long term picture. Obviously if the
> fence objects are useful for other things, then that reduces the
> complexity of the band-aid a bit.

While I don't have the code changes that rely on this change ready for the 
"cooler stuff" yet, one such future application, multi-buffering, is discussed 
in the second half of the PDF link I sent in the last response.  I understand 
there is some hesitance to reintroduce a multi-buffered approach to rendering 
in X when the previous multibuffer extension was mostly (completely?) unused 
and it can bloat memory footprints, but I do think layering it on top of 
composite and taking into account the efficiency gained by offloading the 
buffer swapping to composite managers makes it a lot more interesting.  Multi-
buffering also allows true tear-free rendering in X.  Right now, composite 
provides double buffering, but doesn't eliminate all tearing because 
applications can be asynchronously rendering to the composite backing buffer 
while the composite manager is texturing from it.  Applications eliminate most 
of that by doing their own double-buffering: They allocate a window-size 
pixmap, render to it, then blit it all to the window/composite backing buffer 
at once.  However, that blit is wasteful when the composite manager is just 
going to then blit the contents to the screen.  All that's really needed in 
most cases is to switch which backing pixmap the composite manager textures 
from, but that needs to be supported in the composite protocol.

Even if the application doesn't want multiple buffers, it could use sync 
objects to properly mutex accesses to a single backing buffer with the 
composite manager.

Fence syncs can also be used as a more powerful, API-agnostic version of 
glXWaitX()/glXWaitGL().  While glXWaitX() waits for X rendering on a particular 
display to complete before allowing GL rendering on a particular context to 
continue (and vice-versa for glXWaitGL()), fence sync objects can operate 
across any asynchronous rendering stream on an X screen.  A multi-threaded 
client with one display per thread, one for X, one for GL, could synchronize 
the two using fence sync objects.

In general I believe explicit back-end synchronization objects are a powerful 
tool.  I don't doubt there are more uses for them out there than I can 
enumerate at this time.

> - Owen
> 
> [*] If it was documented that a Damage event didn't imply the rendering
> had hit the GPU, then the X server could be changed not to flush
> rendering before sending damage events. In the normal case where the
> rendering is just a single glXSwapBuffers() or XCopyArea() that doesn't
> actually improve efficiency, but it does slightly reduce extra work from
> the fence. On the other hand, that would change this exercise from
> "fixing a corner case that misrenders on one driver" to "breaking every
> non-updated compositing manager".

It's been noted several times that X damage events only guarantee subsequent X 
rendering (and by extension, any rendering of extensions that are defined to 
occur in-band with X rendering, which GLX explicitly does not guarantee) will 
happen after the damage has landed, and my updates to the damage documentation 
explicitly document this.

Thanks,
-James