New glucose code

Keith Whitwell keith at
Thu Mar 29 04:11:55 PDT 2007

Daniel Stone wrote:
> On Thu, Mar 29, 2007 at 11:37:04AM +0100, Alan Hourihane wrote:
>> On Thu, 2007-03-29 at 06:18 -0400, Zack Rusin wrote:
>>> One thing I wasn't 100% convinced of was how well will it perform when going 
>>> through whole OpenGL stack when doing simple (and small) blits or the like. 
>>> It's not that I couldn't sleep because of it at night (I don't sleep for 
>>> other reasons) but I was contemplating using DRI directly in those cases.
>> Some of the traditional fills/blits can be really slow. But they are
>> just as slow in Xgl, so it's not glucose's fault. It's more a matter
>> of optimising the 3D drivers now, and possibly even of writing some
>> extensions that help utilize the 2D engine when one is actually
>> available :-). Since we're compositing, it's bound to be slower than
>> the traditional methods. But I guess as hardware performance improves,
>> so will this acceleration architecture.
> Yeah, but in some cases (say, a 10x10 blit/fill), it's not necessarily
> worth the overhead of setting up and tearing down for the simple op,
> particularly if you've got an active client and thus lock contention.
> So it needs some smarts as to when to just deal with it unaccelerated.

It all depends...

I'd argue that the whole locking scheme is broken, and that even current 
hardware can do a much better job of scheduling multiple contexts than 
the brain-dead, lowest-common-denominator approach we end up with under 
the hardware lock.

In fairness, the original DRI design attempted to do a lot of this, but 
it was all in software, didn't perform well, and got thrown out in the 
name of reasonable single-client performance.  It should be possible now 
to have the best of both worlds.

The big cost of going unaccelerated is waiting for the hardware to idle 
and flushing the render caches.  There may well be times when it is 
worth paying that cost to get direct screen access, but for a 10x10 
blit, potentially followed by a *real* hardware op, it seems like it 
wouldn't be.  This is analogous to the "pipeline stall" issue in CPU 
optimization, but with a much bigger pipeline.

Note that at the moment it may well be worth going to software, but 
only because the 3D stack is optimized for a single context doing 
big q3arena screenloads of rendering.  The hardware itself can do much 
better, through support for hardware context switches, multiple active 
hardware contexts (e.g. per-context ringbuffers), hardware scheduling, etc.

All of these mechanisms reduce the overhead of doing that 10x10 
blit in hardware, and thereby avoid the drain/flush penalty.  Better 
still, with sufficient care they can let you prioritize 
user-interface blits *above* pending rendering.  All of this hardware 
capability has been around since at least the i830, so it's not exactly 
new; we just have to take advantage of it.

