Fence Sync patches

James Jones jajones at nvidia.com
Tue Dec 7 16:54:12 PST 2010


On Sunday 05 December 2010 20:31:24 Owen Taylor wrote:
> On Fri, 2010-12-03 at 13:08 -0800, James Jones wrote:
> > On Friday 03 December 2010 11:16:43 am Owen Taylor wrote:
> > > On Fri, 2010-12-03 at 10:13 -0800, James Jones wrote:
> > > > I wrote a slide deck on synchronization and presentation ideas for X
> > > > a year ago or so before starting this work:
> > > > 
> > > > http://people.freedesktop.org/~aplattner/x-presentation-and-synchronization.pdf
> > > > 
> > > > Aaron presented it at XDevConf last year.  However, that doesn't
> > > > really cover the immediately useful function for GL composite
> > > > managers: XDamageSubtractAndTrigger().  I plan on putting some
> > > > patches to compiz together demonstrating the usage, but basically
> > > > the flow is:
> > > > 
> > > > -Create a bunch of sync objects at startup time in your GL/GLES-based
> > > > compositor, and import them into your GL context as GL sync objects.
> > > > I'll call those syncObjectsX[] and syncObjectsGL[] respectively.
> > > > 
> > > > -Rather than calling XDamageSubtract(), call
> > > > XDamageSubtractAndTrigger(syncObjectsX[current]).
> > > 
> > > So the basic flow here is:
> > >  Client => X server     [render this]
> > >  X server => GPU        [render this]
> > >  X server => compositor [something was rendered]
> > >  compositor => xserver  [trigger the fence]
> > >  compositor => GPU      [render this after the fence]
> > >  xserver => GPU         [trigger the fence]
> > 
> > Roughly, but I think that ordering implies a very worst-case scenario. 
> > In reality the last two steps will most likely occur simultaneously, and
> > the last step might even be a no-op: If the X server already knows the
> > rendering has completed it can simply mark the fence triggered
> > immediately without going out to the GPU.
> 
> Obviously exactly how fences work is going to be very hardware dependent.
> But it doesn't seem to me that marking the fence triggered is ever a
> "no-op" - until it is done, the GPU won't progress past the wait on the
> fence. So in order to get the frame to the screen, the X server is going
> to have to be scheduled again and process the compositor's request. If the
> X server is busy doing something else, there might be measurable latency.

Yes, the X server/driver needs to run.  Insert something here about even cell 
phones running dual-core processors, but I see your point.  It's not free.  
I'm just hoping it's close enough, because I don't like the alternatives.
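
For what it's worth, the flow above looks roughly like this in code.  This is 
an untested sketch: the XDamageSubtractAndTrigger() signature is my guess at 
what the patches will expose (XDamageSubtract() plus a fence argument), the GL 
import path assumes something like GL_EXT_x11_sync_object, and the fence 
reset/reuse bookkeeping and error handling are omitted.

  #include <X11/Xlib.h>
  #include <X11/extensions/sync.h>
  #include <X11/extensions/Xdamage.h>
  #include <GL/gl.h>
  #include <GL/glext.h> /* glImportSyncEXT/glWaitSync via glXGetProcAddress in real code */

  #define NUM_FENCES 4

  static XSyncFence syncObjectsX[NUM_FENCES];
  static GLsync     syncObjectsGL[NUM_FENCES];
  static int        current;

  void init_fences(Display *dpy, Window root)
  {
      int i;

      for (i = 0; i < NUM_FENCES; i++) {
          /* Created untriggered.  Importing is expensive, so do it once
           * at startup rather than per frame. */
          syncObjectsX[i]  = XSyncCreateFence(dpy, root, False);
          syncObjectsGL[i] = glImportSyncEXT(GL_SYNC_X11_FENCE_EXT,
                                             syncObjectsX[i], 0);
      }
  }

  void handle_damage(Display *dpy, Damage damage)
  {
      /* Guessed signature: XDamageSubtract() plus the fence to trigger
       * once the subtracted rendering has landed on the GPU. */
      XDamageSubtractAndTrigger(dpy, damage, None, None,
                                syncObjectsX[current]);

      /* The GPU, not the CPU, waits before texturing from the pixmap. */
      glWaitSync(syncObjectsGL[current], 0, GL_TIMEOUT_IGNORED);

      /* ... texture from the window pixmap and composite ... */

      /* Before reusing this slot, the fence has to go back to the
       * untriggered state (XSyncResetFence) once it's safe to do so. */
      current = (current + 1) % NUM_FENCES;
  }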

> > This is often the case in our driver, though I haven't implemented that
> > particular optimization yet.
> > 
> > > In the normal case where there is a single damage event per frame, the
> > > fact that we have this round trip where the compositor has to go back
> > > to the X server, and the X server has to go back to the GPU bothers
> > > me.
> > 
> > I like to point out that it's not really a round trip, but rather two
> > trips to the same destination in parallel.  A round trip would add more
> > latency.
> 
> It's actually possible that a round-trip would add *less* latency, because
> in the round trip case the compositor will yield its timeslice to the X
> server, while simply writing stuff to the X socket won't do that. In the
> lightly-loaded multicore case, you will, of course, get the parallelism
> you mentioned... the X server will wake up and handle the async request as
> soon as it receives it.

Yes, that could be true in some cases.  X clients can always XSync() if they 
find it delivers latency benefits.  Better to provide asynchronous requests and 
let smart applications decide what's best for them.

> > > It's perhaps especially problematic in the case of the open source
> > > drivers where the synchronization is already handled correctly without
> > > this extra work and the extra work would just be a complete waste of
> > > time. [*]
> > 
> > The default implementation assumes the Open Source driver behavior and
> > marks the fence triggered as soon as the server receives the request, so
> > the only added time will be a single asynchronous X request if the open
> > source OpenGL-side implementation is done efficiently.
> 
> Well, the default implementation is at least going to have to flush
> buffers, since the Open Source driver behavior only applies to buffers
> submitted to the kernel.
>
> I'm also thinking that it's making stronger assumptions about the
> guarantees that the open source drivers provide than are being assumed by the
> current flush-after-damage. It's assuming equivalence to full
> serialization of submitted buffers - also for things like:
> 
>  process A: write to B1, fence F1
>  process B: wait for F1, write to B1
> 
> which is considerably stronger than an implicit fence between writing to a
> buffer and turning around and reading from it. (Maybe all the open source
> drivers actually fit the stronger criteria ... haven't checked.)

My point was based on DamageSubtractAndTrigger().  That entry point could 
assume the trigger relates to a specific set of damage for which 
synchronization was already handled via implicit fences and the damage event 
flush callback (I'm speculating based on following xorg-devel discussions; I 
haven't verified this in the code).

For the general XSyncTriggerFence() case, yes, drivers will probably need to 
hook into the new driver API and at least flush.  As I say, I haven't read the 
actual driver source.

Going back to your original point about this adding overhead where 
synchronization is already handled internally, Keith suggested some ways to 
update the damage spec so that damage clients would know when extra 
synchronization is needed, so that shouldn't be a concern anymore.  I 
recognize maintaining two code paths is not ideal, but composite managers 
probably want to support older servers, at least for a while, so they would 
need two code paths for now anyway.
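
In practice the two-path decision can key off the SYNC version the server 
advertises.  A rough, untested sketch, assuming fence support ends up being 
reported as SYNC protocol version 3.1 or later:

  #include <X11/Xlib.h>
  #include <X11/extensions/sync.h>

  static Bool have_fence_sync(Display *dpy)
  {
      int event_base, error_base, major, minor;

      if (!XSyncQueryExtension(dpy, &event_base, &error_base) ||
          !XSyncInitialize(dpy, &major, &minor))
          return False;

      return major > 3 || (major == 3 && minor >= 1);
  }

  /* Old path: plain XDamageSubtract() plus whatever flushing the server
   * already does.  New path: XDamageSubtractAndTrigger() plus a GPU-side
   * wait, as in the earlier sketch. */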
 
> > > But it doesn't seem like a particularly efficient or low-latency way of
> > > handling things even in the case of a driver with no built in
> > > synchronization.
> > > 
> > > Can you go into the reasoning for this approach?
> > 
> > As I said, this definitely isn't the ideal approach; it's the best fully
> > backwards compatible approach we could come up with.  Things we
> > considered:
> > 
> > -Add a GL call to wait on the GPU for the damage event sequence number. 
> > We got bogged down here worrying about wrapping of 32-bit values, the
> > lack of ability to do a "wait for >=" on 64-bit values on GPUs, and
> > the discussion rat-holed.  This was discussed on IRC so long ago I don't
> > even remember all the pros/cons.
> 
> Something like this was the first thing that came to mind (well, not the
> sequence number, since those are per-client, but say associating an XSync
> counter with the damage object.) I don't think it's very hard to work
> around the 64-bit issue, at least for >=, but I can understand the
> reluctance to spec out something new going beyond current GL
> synchronization primitives and assume that it will be implementable on all
> hardware.
>
> [ This is the basic advantage that the implicit fence approach has - since
>   the fences are being inserted at the driver level, they can be done
> however works out right for that hardware without needing a common
> abstraction ]

On the contrary, I think this is a disadvantage.  With explicit, boolean 
fences, there's no need to worry about wrapping semantics and no need to use 
>= type logic.  The value waited for is True, the initial state is False.  The 
client specifies which fence to wait for and what conditions cause it to become 
Triggered, or True.

With the server issuing implicit fences, it has no way to know whether the 
client will ever bother to wait for the fence or for a particular fence value.  
Therefore, it either needs to continuously and asynchronously update a counter 
value, on the understanding that the client will wait for a particular value 
using range comparisons rather than equality comparisons, or it needs to 
generate an unbounded number of fences for the client to wait for.
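
To make the contrast concrete, the explicit model at the protocol level is 
just this (untested sketch; error handling omitted):

  #include <X11/Xlib.h>
  #include <X11/extensions/sync.h>

  void fence_lifecycle(Display *dpy, Drawable d)
  {
      /* Initial state is False (untriggered). */
      XSyncFence fence = XSyncCreateFence(dpy, d, False);

      /* ... issue the rendering the fence should track ... */

      /* Ask the server to set the fence to Triggered (True) once the
       * preceding rendering has completed. */
      XSyncTriggerFence(dpy, fence);

      /* Block further request processing for this client until the fence
       * is Triggered.  No counters, no wrapping, no >= comparisons: the
       * fence is either Triggered or it isn't. */
      XSyncAwaitFence(dpy, &fence, 1);

      /* Reset to False before reusing it for the next frame. */
      XSyncResetFence(dpy, fence);

      XSyncDestroyFence(dpy, fence);
      XFlush(dpy);
  }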
 
> > -Have the server generate a sync object ID, trigger it, and send that
> > with the damage event if clients opt in somehow.  This seemed very
> > anti-X design (clients should create, or at least name, transient
> > resources), and has the potential of generating tons and tons of objects
> > if the client forgets to delete them or can't keep up with the damage
> > events.  Also, importing X sync objects to GL is expensive, so it's
> > desirable to make that a startup-time operation.
> > 
> > -Have the client provide the ringbuffer of objects to X and have it
> > figure out which one to trigger on every damage event.  I don't think I
> > ever discussed this with anyone.  I dismissed it as hiding too much
> > magic in X.
> 
> Yeah, I don't think these are workable. In addition to the issues you
> mention there's the basic problem that the compositor would need to use
> DamageReportRawRectangles since in the other damage modes you don't
> necessarily get an event for the *last* damage. So that makes anything
> where a separate sync object is used for each damage report - whether
> newly created or in a ring buffer - much worse. (The need for
> DamageReportRawRectangles also applies to the counter version.)
> 
> I don't really have a better suggestion for integrating explicit two-state
> glWaitSync() style fence objects with damage. Essentially, it's going to
> require grouping damage into coherent frames to be workable. The X server
> can't do the grouping because there is no frame grouping in the protocol.
> Your proposal has the compositor do the grouping. This requires the
> round-trip described above, and does increase latency but doesn't require
> application modification. I think that's fine if we actually know what we
> are doing to modify apps - if we can fix up GTK+ and Qt and OpenGL, then a
> little extra latency for Motif apps - who cares.

Right.  The end goal is to make client apps do their part and explicitly 
notify X and composite managers when they're done rendering a region.  That, 
and making composite managers double as presentation managers (they perform 
the final buffer flip, region blit, or whatever else constitutes presentation, 
and can therefore handle any associated VSync/scanline waits, explicit 
fencing, etc.), are the keys to my X presentation proposals.

> But I can't say that I'm at all happy with the idea that we'll have two sets of
> drivers, one where flushing rendering enables an implicit fence for
> subsequent rendering from that buffer, and one where it doesn't. If the
> implicit fence approach is actually less efficient... work being
> pointlessly done, then I say we should just kill it everywhere and move to
> a consistently looser model.

Agreed.  I can't force other drivers to change, but I don't think buffer 
accesses should be implicitly fenced by drivers.  Two coordinated threads, for 
instance, should be able to simultaneously render to two regions of the same 
drawable/buffer.  OpenGL, and I believe X, allow for this.

> > > [...]
> > > 
> > > > I admit this isn't an ideal work-flow, and yes it is one more layer
> > > > of hard-to-test voodoo needed to write a robust TFP/EGLimage based
> > > > composite manager, but it's the best we can do without modifying
> > > > client applications.  However, fence sync objects provide a basis
> > > > for all kinds of cooler stuff once you start defining new ways that
> > > > client applications can use them to notify the composite manager
> > > > when they've finished rendering a frame explicitly.  Then the extra
> > > > step of telling X you want notification when some rendering you've
> > > > already been notified of has completed will go away.  The rendering
> > > > notification (damage event) and a pointer of some sort to the sync
> > > > object that tracks it will arrive together.  That's what I'll be
> > > > working on after the initial object support is wrapped up.
> > > 
> > > It worries me to see a pretty complex, somewhat expensive band-aid
> > > going in *without* knowing more about that long term picture.
> > > Obviously if the fence objects are useful for other things, then that
> > > reduces the complexity of the band-aid a bit.
> > 
> > While I don't have the code changes that rely on this change ready for
> > the "cooler stuff" yet, one such future application, multi-buffering is
> > discussed in the second half of the PDF link I sent in the last
> > response.  I understand there is some hesitance to reintroduce a
> > multi-buffered approach to rendering in X when the previous multibuffer
> > extension was mostly (completely?) unused and it can bloat memory
> > footprints, but I do think layering it on top of composite and taking
> > into account the efficiency gained by offloading the buffer swapping to
> > composite managers makes it a lot more interesting.
> 
> Note that the same behaviors that make fences unnecessary for Damage for
> the open source drivers makes them also unnecessary for any sort of
> multibuffering. Once the app is done rendering a frame, it flushes to the
> kernel before "submitting" the buffer.

This assumes submitted buffers execute atomically in the order they are 
submitted, rather than in parallel or in some order determined by an on-chip 
scheduler.  Also, I see specialized fences, much like the existing XSync 
system counters, being used for things like vblank, or for waiting until a 
particular timestamp or hsync line is reached.  The basic framework added here 
could be used for all types of boolean synchronization conditions.

> > Multi-buffering also allows true tear-free rendering in X.  Right now,
> > composite provides double buffering, but doesn't eliminate all tearing
> > because applications can be asynchronously rendering to the composite
> > backing buffer while the composite manager is texturing from it.
> 
> I suppose tearing is possible on hardware with multiple simultaneous
> execution units. But it's not really the most critical problem. The most
> critical problem is a) providing mechanisms for X apps to integrate with
> the compositor redraw cycle b) hooking up the synchronization mechanisms
> of OpenGL to the compositor redraw cycle instead of to the VBlank.
>
> If the compositor only redraws when it has new frames, and apps only
> draw new frames after the last frame has rendered, then you reduce a lot
> of the possibility for bad interactions.

I agree.  As I mentioned above, I'm working on both problems.
 
> > Applications eliminate most
> > of that by doing their own double-buffering: They allocate a window-size
> > pixmap, render to it, then blit it all to the window/composite backing
> > buffer at once.
> 
> I really don't know any apps that do that. Apps almost always generate a
> pixmap the size of the update region. Doing partial updates is pretty
> essential for power consumption, and I think whatever we do for
> frame-synced output needs to respect that.

Sorry, that's what I meant.  Partial updates are definitely important for most 
apps.  Some apps, such as video players, have both the need to update the 
entire window on each frame, and the need for carefully synchronized final 
presentation.  I think the same basic framework can solve the needs of both.

> [...]
> 
> > Fence syncs can also be used as a more powerful, API-agnostic version of
> > glXWaitX()/glXWaitGL.  While glXWaitX() waits for X rendering on a
> > particular display to complete before allowing GL rendering on a
> > particular context to continue (and vice-versa for glXWaitGL()), fence
> > sync objects can operate across any asynchronous rendering stream on an
> > X screen.  A multi-threaded client with one display per thread, one for
> > X, one for GL, could synchronize the two using fence sync objects.
> 
> Is this a common need?

I'm not sure it's common, but I don't see any reason not to enable it.
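
For the record, the usage I had in mind looks something like the untested 
sketch below, with one display connection per thread.  The GL-side import 
again assumes something like GL_EXT_x11_sync_object, and the function pointer 
setup is omitted.

  #include <X11/Xlib.h>
  #include <X11/extensions/sync.h>
  #include <GL/gl.h>
  #include <GL/glext.h>

  /* Thread A: the X rendering stream, on its own Display connection. */
  void x_thread_frame(Display *dpyX, Drawable src, Drawable dst, GC gc,
                      unsigned int w, unsigned int h, XSyncFence fence)
  {
      XCopyArea(dpyX, src, dst, gc, 0, 0, w, h, 0, 0);
      /* Trigger once the copy (and any earlier X rendering) completes. */
      XSyncTriggerFence(dpyX, fence);
      XFlush(dpyX);
  }

  /* Thread B: the GL rendering stream, on its own Display/context.
   * syncGL was imported from the same X fence at startup. */
  void gl_thread_frame(GLsync syncGL)
  {
      /* The GPU waits for the X rendering to land; no glXWaitX(), and no
       * blocking the CPU. */
      glWaitSync(syncGL, 0, GL_TIMEOUT_IGNORED);
      /* ... GL rendering that must be ordered after the X rendering ... */
  }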

> > In general I believe explicit back-end synchronization objects are a
> > powerful tool.  I don't doubt there are more uses for them out there
> > than I can enumerate at this time.
> 
> Not completely sold on the idea here ... I still think that it's being
> rushed in without understanding how it fits in with more important stuff
> like frame synchronization, but I'm just an outside observer. I can add
> the code to GNOME 3 as needed.

Thanks, I'll try to get sample code out once I've polished off all the X and 
driver code.

Note Aaron presented all of these concepts, including the new frame 
presentation ideas, over a year ago (at XDC2009), after a lot of internal 
discussion at NVIDIA.  We got very little feedback.  It's good to get some 
now, but I don't feel we rushed anything.

> [ In terms of GNOME 3 and NVIDIA: If it's *that* slow to update clip lists
> for a GLX window, then just save the last one you got, and during
> ValidateTree memcmp(), and if nothing changed, don't do anything. Can't be
> more than 20 lines of code. Would make thousands of GNOME 3 users happy ]

Can you point to a more specific use case (start app A, drag it over app B, 
etc.)?  We've got a huge backlog of work to do in this area, but specific worst-
case examples are always good.

> > > [*] If it was documented that a Damage event didn't imply the rendering
> > > had hit the GPU, then the X server could be changed not to flush
> > > rendering before sending damage events. In the normal case where the
> > > rendering is just a single glXSwapBuffers() or XCopyArea() that doesn't
> > > actually improve efficiency, but it does slightly reduce extra work
> > > from the fence. On the other hand, that would change this exercise
> > > from "fixing a corner case that misrenders on one driver" to "breaking
> > > every non-updated compositing manager".
> > 
> > It's been noted several times that X damage events only guarantee
> > subsequent X rendering (and by extension, any rendering of extensions
> > that are defined to occur in-band with X rendering, which GLX explicitly
> > does not guarantee) will happen after the damage has landed, and my
> > updates to the damage documentation explicitly document this.
> 
> Do interpretations of the spec really matter here? I'm not aware of
> any compositing manager that called glXWaitX and even if they did, the
> spec for glXWaitX says:
> 
>  X rendering calls made prior to glXWaitX are guaranteed to be executed
>  before OpenGL rendering calls made after glXWaitX. While the same
>  result can be achieved using XSync, glXWaitX does not require a round
>  trip to the server, and may therefore be more efficient.
> 
> Implying that the X rendering calls are "executed before OpenGL
> rendering calls" in only the weak form implied by XSync() or reception
> of Damage events. It would be pointless to XSync() on the reception of a
> Damage event, so it's pointless to glXWaitX()?

Yes, glXWaitX() isn't enough.  If it was, I probably wouldn't have made fence 
syncs.

> We can say that the combination of TFP and Damage was broken, and there
> was no way to write a correct compositing manager using TFP, but we
> can't say that they were correct but there was no way to write a correct
> compositing manager using TFP, since that's just silly. The only point
> of TFP is to write compositing managers.
>
> Anyways, that's just an argument against worrying too much about what
> the specs say instead of how we want to move forward.

Yeah.  When I was trying to finalize the TFP spec, I acknowledged that there 
was no way to make a "compliant" composite manager using it.  There's some 
language to that effect in the issues section.  A few people were actually very 
upset about this, but most accepted it since things generally work fine.  Users 
were excited and eager to start using GL-based compositing, so we put off 
addressing the largely theoretical synchronization issues it introduced.  
However, they're not completely theoretical anymore, and in the end, all I'm 
really interested in is making everything work, then making everything better.  
If that requires some spec fiddling or bending, so be it.

Thanks,
-James

> - Owen

