Fence Sync patches

Owen Taylor otaylor at redhat.com
Sun Dec 5 20:31:24 PST 2010

On Fri, 2010-12-03 at 13:08 -0800, James Jones wrote:
> On Friday 03 December 2010 11:16:43 am Owen Taylor wrote:
> > On Fri, 2010-12-03 at 10:13 -0800, James Jones wrote:
> > > I wrote a slide deck on synchronization and presentation ideas for X a
> > > year ago or so before starting this work:
> > > 
> > > http://people.freedesktop.org/~aplattner/x-presentation-and-synchronization.pdf
> > > 
> > > Aaron presented it at XDevConf last year.  However, that doesn't really
> > > cover the immediately useful function for GL composite managers:
> > > XDamageSubtractAndTrigger().  I plan on putting some patches to compiz
> > > together demonstrating the usage, but basically the flow is:
> > > 
> > > -Create a bunch of sync objects at startup time in your GL/GLES-based
> > > compositor, and import them into your GL context as GL sync objects. 
> > > I'll call those syncObjectsX[] and syncObjectsGL[] respectively.
> > > 
> > > -Rather than calling XDamageSubtract(), call
> > > XDamageSubtractAndTrigger(syncObjectsX[current]).
> > 
> > So the basic flow here is:
> > 
> >  Client => X server     [render this]
> >  X server => GPU        [render this]
> >  X server => compositor [something was rendered]
> >  compositor => xserver  [trigger the fence]
> >  compositor => GPU      [render this after the fence]
> >  xserver => GPU         [trigger the fence]
> Roughly, but I think that ordering implies a very worst-case scenario.  In 
> reality the last two steps will most likely occur simultaneously, and the last 
> step might even be a no-op: If the X server already knows the rendering has 
> completed it can simply mark the fence triggered immediately without going out 
> to the GPU.

Obviously exactly how fences work is going to be very hardware dependent.
But it doesn't seem to me that marking the fence triggered is ever a "no-op" -
until it is done, the GPU won't progress past the wait on the fence. So in
order to get the frame to the screen, the X server is going to have to be
scheduled again and process the compositor's request. If the X server
is busy doing something else, there might be measurable latency.
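To make the flow concrete, the compositor side might look roughly like this
(pseudocode only - XDamageSubtractAndTrigger() is from the proposed patches and
its exact signature is my guess, and I'm assuming the X-to-GL import works along
the lines of glImportSyncEXT() from EXT_x11_sync_object; the ring size N is
arbitrary):

```
at startup:
    for i in 0 .. N-1:
        syncObjectsX[i]  = XSyncCreateFence(dpy, root, False)
        syncObjectsGL[i] = glImportSyncEXT(GL_SYNC_X11_FENCE_EXT,
                                           syncObjectsX[i], 0)

on each damage event:
    XDamageSubtractAndTrigger(dpy, damage, repair, parts,
                              syncObjectsX[current])
    glWaitSync(syncObjectsGL[current], 0, GL_TIMEOUT_IGNORED)
        # GPU-side wait: queued GL commands stall until the server
        # triggers the fence; the compositor's CPU thread does not block
    ... texture from the window pixmaps, draw the frame ...
    current = (current + 1) mod N
```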

>   This is often the case in our driver, though I haven't 
> implemented that particular optimization yet.
> > In the normal case where there is a single damage event per frame, the
> > fact that we have this round trip where the compositor has to go back to
> > the X server, and the X server has to go back to the GPU bothers me.
> I like to point out that it's not really a round trip, but rather two trips to 
> the same destination in parallel.  A round trip would add more latency.

It's actually possible that a round-trip would add *less* latency, because in
the round trip case the compositor will yield its timeslice to the X server, while
simply writing stuff to the X socket won't do that. In the lightly-loaded multicore
case, you will, of course, get the parallelism you mentioned... the X server will
wake up and handle the async request as soon as it receives it.

> > It's perhaps especially problematic in the case of the open source
> > drivers where the synchronization is already handled correctly without
> > this extra work and the extra work would just be a complete waste of
> > time. [*]
> The default implementation assumes the Open Source driver behavior and marks 
> the fence triggered as soon as the server receives the request, so the only 
> added time will be a single asynchronous X request if the open source OpenGL-
> side implementation is done efficiently.

Well, the default implementation is at least going to have to flush buffers,
since the Open Source driver behavior only applies to buffers submitted
to the kernel. 

I'm also thinking that it's making stronger assumptions about the guarantees
the open source drivers provide than are assumed by the current
flush-after-damage behavior. It's assuming equivalence to full serialization
of submitted buffers - also for things like:

 process A: write to B1, fence F1
 process B: wait for F1, write to B1

which is considerably stronger than an implicit fence between writing to a buffer
and turning around and reading from it. (Maybe all the open source drivers actually
fit the stronger criteria ... haven't checked.)

> > But it doesn't seem like a particularly efficient or low-latency way of
> > handling things even in the case of a driver with no built in
> > synchronization.
> > 
> > Can you go into the reasoning for this approach?
> As I said, this definitely isn't the ideal approach, it's the best fully 
> backwards compatible approach we could come up with.  Things we considered:
> -Add a GL call to wait on the GPU for the damage event sequence number.  We 
> got bogged down here worrying about wrapping of 32-bit values, the lack of 
> ability to do a "wait for >=" on a 64-bit values on GPUs, and the discussion 
> rat-holed.  This was discussed on IRC so long ago I don't even remember all 
> the pros/cons.

Something like this was the first thing that came to mind (well, not the
sequence number, since those are per-client, but say associating an XSync
counter with the damage object.) I don't think it's very hard to work around
the 64-bit issue, at least for >=, but I can understand the reluctance to
spec out something new going beyond current GL synchronization primitives
and assume that it will be implementable on all hardware.
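For reference, the ">=" wrap problem for a 32-bit counter has a standard
workaround - compare the modular difference as a signed value. A sketch of that
idea only (the helper name is mine); it assumes counter and target never drift
more than 2^31 apart, which is exactly the kind of guarantee that's hard to make
across CPU and GPU, so it doesn't settle the 64-bit question:

```c
#include <stdint.h>

/* Wrap-safe "counter has reached target" test for 32-bit sequence
 * numbers: the subtraction is modulo 2^32, so interpreting the result
 * as signed gives the right answer as long as the two values are
 * less than 2^31 apart. */
static int seq_at_least(uint32_t counter, uint32_t target)
{
    return (int32_t)(counter - target) >= 0;
}
```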

[ This is the basic advantage that the implicit fence approach has - since
  the fences are being inserted at the driver level, they can be done however
  works out right for that hardware without needing a common abstraction ]

> -Have the server generate a sync object ID, trigger it, and send that with the 
> damage event if clients opt-in some how.  This seemed very anti-X design 
> (clients should create, or at least name, transient resources), and has the 
> potential of generating tons and tons of objects if the client forgets to 
> delete them or can't keep up with the damage events.  Also, importing X sync 
> objects to GL is expensive, so it's desirable to make that a startup-time 
> operation.
> -Have the client provide the ringbuffer of objects to X and have it figure out 
> which one to trigger on every damage event.  I don't think I ever discussed 
> this with anyone.  I dismissed it as hiding too much magic in X.

Yeah, I don't think these are workable. In addition to the issues you mention
there's the basic problem that the compositor would need to use
DamageReportRawRectangles since in the other damage modes you don't necessarily
get an event for the *last* damage. So that makes anything where a separate
sync object is used for each damage report - whether newly created or in a
ring buffer - much worse. (The need for DamageReportRawRectangles also applies
to the counter version.)

I don't really have a better suggestion for integrating explicit two-state glWaitSync()
style fence objects with damage. Essentially, it's going to require grouping
damage into coherent frames to be workable. The X server can't do the grouping
because there is no frame grouping in the protocol. Your proposal has the compositor
do the grouping. This requires the round-trip described above, and does increase
latency but doesn't require application modification. I think that's fine if we
actually have a plan for modifying apps - if we can fix up GTK+ and Qt and
OpenGL, then a little extra latency for Motif apps - who cares.

But I can't say that I'm at all happy with the idea that we'll have two sets of
drivers: one where flushing rendering enables an implicit fence for subsequent
rendering from that buffer, and one where it doesn't. If the implicit fence
approach really is less efficient - work being done pointlessly - then I say we
should just kill it everywhere and move to a consistently looser model.

> > [...]
> > 
> > > I admit this isn't an ideal work-flow, and yes it is one more layer of
> > > hard-to-test voodoo needed to write a robust TFP/EGLimage based
> > > composite manager, but it's the best we can do without modifying client
> > > applications.  However, fence sync objects provide a basis for all kinds
> > > of cooler stuff once you start defining new ways that client
> > > applications can use them to notify the composite manager when they've
> > > finished rendering a frame explicitly.  Then the extra step of telling X
> > > you want notification when some rendering you've already been notified
> > > of has completed will go away.  The rendering notification (damage
> > > event) and a pointer of some sort to the sync object that tracks it will
> > > arrive together.  That's what I'll be working on after the initial
> > > object support is wrapped up.
> > 
> > It worries me to see a pretty complex, somewhat expensive band-aid going
> > in *without* knowing more about that long term picture. Obviously if the
> > fence objects are useful for other things, then that reduces the
> > complexity of the band-aid a bit.
> While I don't have the code changes that rely on this change ready for the 
> "cooler stuff" yet, one such future application, multi-buffering is discussed 
> in the second half of the PDF link I sent in the last response.  I understand 
> there is some hesitance to reintroduce a multi-buffered approach to rendering 
> in X when the previous multibuffer extension was mostly (completely?) unused 
> and it can bloat memory footprints, but I do think layering it on top of 
> composite and taking into account the efficiency gained by offloading the 
> buffer swapping to composite managers makes it a lot more interesting.  

Note that the same behaviors that make fences unnecessary for Damage for the
open source drivers makes them also unnecessary for any sort of multibuffering.
Once the app is done rendering a frame, it flushes to the kernel before
"submitting" the buffer.

> Multi-buffering also allows true tear-free rendering in X.  Right now, composite 
> provides double buffering, but doesn't eliminate all tearing because 
> applications can be asynchronously rendering to the composite backing buffer 
> while the composite manager is texturing from it.

I suppose tearing is possible on hardware with multiple simultaneous
execution units. But it's not really the most critical problem. The most
critical problem is a) providing mechanisms for X apps to integrate with
the compositor redraw cycle b) hooking up the synchronization mechanisms
of OpenGL to the compositor redraw cycle instead of to the VBlank.

If the compositor only redraws when it has new frames, and apps only
draw new frames after the last frame has rendered, then you reduce a lot
of the possibility for bad interactions.

> Applications eliminate most 
> of that by doing their own double-buffering: They allocate a window-size 
> pixmap, render to it, then blit it all to the window/composite backing buffer 
> at once.

I don't know of any apps that actually do that. Apps almost always allocate a
pixmap the size of the update region. Doing partial updates is pretty essential
for power consumption, and I think whatever we do for frame-synced output
needs to respect that.


> Fence syncs can also be used as a more powerful, API-agnostic version of 
> glXWaitX()/glXWaitGL.  While glXWaitX() waits for X rendering on a particular 
> display to complete before allowing GL rendering on a particular context to 
> continue (and vice-versa for glXWaitGL()), fence sync objects can operate 
> across any asynchronous rendering stream on an X screen.  A multi-threaded 
> client with one display per thread, one for X, one for GL, could synchronize 
> the two using fence sync objects.

Is this a common need?

> In general I believe explicit back-end synchronization objects are a powerful 
> tool.  I don't doubt there are more uses for them out there than I can 
> enumerate at this time.

Not completely sold on the idea here ... I still think that it's being rushed
in without understanding how it fits in with more important stuff like frame
synchronization, but I'm just an outside observer. I can add the code to 
GNOME 3 as needed.

[ In terms of GNOME 3 and NVIDIA: If it's *that* slow to update clip lists for a
  GLX window, then just save the last one you got, and during ValidateTree
  memcmp() and if nothing changed, don't do anything. Can't be more than 20
  lines of code. Would make thousands of GNOME 3 users happy ]
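The suggestion above could be sketched like this (illustration only - BoxRec
mirrors the layout of the server's region boxes, but the window-private struct
and the function name are hypothetical stand-ins for wherever the driver hooks
ValidateTree):

```c
#include <stdlib.h>
#include <string.h>

/* Matches the layout of the server's region boxes. */
typedef struct { short x1, y1, x2, y2; } BoxRec;

/* Hypothetical per-GLX-window private holding the last clip list sent. */
typedef struct {
    BoxRec *saved_boxes;
    int     saved_nboxes;
} GlxWinPriv;

/* Called with the window's current clip boxes. Returns 0 (skip the
 * expensive driver update) when the clip list is byte-identical to the
 * one saved last time; otherwise caches the new list and returns 1. */
static int clip_list_changed(GlxWinPriv *priv, const BoxRec *boxes, int nboxes)
{
    if (priv->saved_boxes != NULL &&
        nboxes == priv->saved_nboxes &&
        memcmp(priv->saved_boxes, boxes, nboxes * sizeof(BoxRec)) == 0)
        return 0;

    free(priv->saved_boxes);
    priv->saved_boxes = malloc(nboxes * sizeof(BoxRec));
    memcpy(priv->saved_boxes, boxes, nboxes * sizeof(BoxRec));
    priv->saved_nboxes = nboxes;
    return 1;
}
```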

> > [*] If it was documented that a Damage event didn't imply the rendering
> > had hit the GPU, then the X server could be changed not to flush
> > rendering before sending damage events. In the normal case where the
> > rendering is just a single glXSwapBuffers() or XCopyArea() that doesn't
> > actually improve efficiency, but it does slightly reduce extra work from
> > the fence. On the other hand, that would change this exercise from
> > "fixing a corner case that misrenders on one driver" to "breaking every
> > non-updated compositing manager".
> It's been noted several times that X damage events only guarantee subsequent X 
> rendering (and by extension, any rendering of extensions that are defined to 
> occur in-band with X rendering, which GLX explicitly does not guarantee) will 
> happen after the damage has landed, and my updates to the damage documentation 
> explicitly document this.

Do interpretations of the spec really matter here? I'm not aware of
any compositing manager that called glXWaitX and even if they did, the
spec for glXWaitX says:

 X rendering calls made prior to glXWaitX are guaranteed to be executed
 before OpenGL rendering calls made after glXWaitX. While the same
 result can be achieved using XSync, glXWaitX does not require a round
 trip to the server, and may therefore be more efficient.

Implying that the X rendering calls are "executed before OpenGL
rendering calls" in only the weak form implied by XSync() or reception
of Damage events. It would be pointless to XSync() on the reception of a
Damage event, so it's pointless to glXWaitX()?

We can say that the combination of TFP and Damage was broken, and that there
was no way to write a correct compositing manager using TFP; but we can't say
that the specs were correct and yet there was no way to write a correct
compositing manager using TFP, since that's just silly. The only point of TFP
is to write compositing managers.

Anyways, that's just an argument against worrying too much about what
the specs say instead of how we want to move forward.

- Owen
