New glucose code
keith at tungstengraphics.com
Thu Mar 29 04:11:55 PDT 2007
Daniel Stone wrote:
> On Thu, Mar 29, 2007 at 11:37:04AM +0100, Alan Hourihane wrote:
>> On Thu, 2007-03-29 at 06:18 -0400, Zack Rusin wrote:
>>> One thing I wasn't 100% convinced of was how well it will perform when going
>>> through the whole OpenGL stack when doing simple (and small) blits or the like.
>>> It's not that I couldn't sleep because of it at night (I don't sleep for
>>> other reasons) but I was contemplating using DRI directly in those cases.
>> Some of the traditional fills/blits can be really slow. But they are
>> even slow in Xgl as well, so it's not glucose's fault. It's more to do
>> with optimisation of the 3D drivers now, and possibly even writing some
>> extensions that may help with utilizing the 2D engine when one is
>> actually available :-). But we're compositing, so it's bound to be
>> slower than the traditional methods. But I guess as hardware performance
>> improves so will this acceleration architecture.
> Yeah, but in some cases (say, a 10x10 blit/fill), it's not necessarily
> worth the overhead of setting up and tearing down for the simple op,
> particularly if you've got an active client and thus lock contention.
> So it needs some smarts as to when to just deal with it unaccelerated.
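The "smarts" being asked for could be as simple as an area threshold that raises the bar when a 3D client is active and the lock is contended. A hypothetical sketch (the function, threshold value, and scaling factor are all illustrative, not from any real driver):

```c
#include <stdbool.h>

/* Hypothetical heuristic: tiny ops go through the software path on the
 * assumption that hardware setup/teardown (lock acquisition, state
 * emission) dominates for them. The threshold is illustrative only. */
#define SMALL_OP_AREA_THRESHOLD 1024  /* pixels */

static bool should_use_hardware(int width, int height, bool client_active)
{
    int area = width * height;
    /* With an active 3D client the lock is contended, so raise the bar
     * before bothering the hardware with a small op. */
    int threshold = client_active ? SMALL_OP_AREA_THRESHOLD * 4
                                  : SMALL_OP_AREA_THRESHOLD;
    return area >= threshold;
}
```

A 10x10 fill would always fall below this bar; a 50x50 one would be worth accelerating only when no 3D client holds the lock.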
It all depends...
I'd argue that the whole locking scheme is broken, and that even current
hardware can do a much better job of scheduling multiple contexts than
the brain-dead, lowest-common-denominator approach we end up with under
the hardware lock.
In fairness, the original DRI design attempted to do a lot of this, but
it was all in software, it wasn't performant, and it all got thrown out
in the name of reasonable single-client performance. It should be
possible now to have the best of both worlds.
The big cost with going unaccelerated is the wait for hardware idle and
flush of render caches. There may well be times when it is worthwhile
to pay that cost and get direct screen access, but for a 10x10 blit,
potentially followed by a *real* hardware op, it seems like it wouldn't
be worth it. This is analogous to the "pipeline stall" issue in cpu
optimization, but with a much bigger pipeline.
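A back-of-envelope model makes the asymmetry concrete: the software path pays a fixed "drain the pipeline" cost (idle wait plus cache flush) regardless of how small the op is, while the hardware path pays only setup. All the numbers and type names below are made up for illustration:

```c
/* Illustrative cost model for the software-vs-hardware tradeoff.
 * The figures are invented; only the shape of the tradeoff matters. */
typedef struct {
    int idle_wait_us;    /* wait for hardware idle */
    int cache_flush_us;  /* flush render caches */
    int per_pixel_ns;    /* CPU cost per pixel */
} sw_cost;

typedef struct {
    int setup_us;        /* lock + state emission */
    int per_pixel_ns;    /* engine cost per pixel */
} hw_cost;

static int sw_us(sw_cost c, int pixels)
{
    /* Fixed stall cost dominates for small ops. */
    return c.idle_wait_us + c.cache_flush_us + pixels * c.per_pixel_ns / 1000;
}

static int hw_us(hw_cost c, int pixels)
{
    return c.setup_us + pixels * c.per_pixel_ns / 1000;
}
```

With, say, a 200us idle wait and 50us flush against a 20us hardware setup, the hardware path wins a 10x10 blit easily even if its per-pixel cost is higher, which is the point being made about the pipeline stall.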
Note that at the moment it might well be worth going to software, but
only because the 3d stack is optimized towards a single context doing
big q3arena screenloads of rendering. The hardware itself can do much
better - through support for hardware context switches, multiple active
hardware contexts (e.g. per-context ringbuffers), hardware scheduling, etc.
All of these mechanisms serve to reduce the overhead of doing that 10x10
blit in hardware, thereby avoiding the drain/flush penalty. Better
still, with sufficient care, they can allow you to prioritize
user-interface blits *above* pending rendering. All this stuff has been
around since at least the i830, so it's not exactly new - we just have
to take advantage of it.
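The prioritization idea can be sketched as a scheduler that drains a high-priority queue before the bulk rendering queue, assuming per-context submission queues as described above. The queue layout, command type, and priority levels here are hypothetical, not any existing driver's API:

```c
#include <stddef.h>

/* Hypothetical two-level submission scheme: UI blits jump ahead of
 * pending 3D rendering instead of queueing behind it. */
enum prio { PRIO_UI = 0, PRIO_RENDER = 1, NUM_PRIOS };

typedef struct cmd {
    const char *name;
    struct cmd *next;
} cmd;

static cmd *queues[NUM_PRIOS];  /* zero-initialized: both queues empty */

static void submit(enum prio p, cmd *c)
{
    cmd **tail = &queues[p];
    while (*tail)
        tail = &(*tail)->next;
    c->next = NULL;
    *tail = c;
}

/* Scheduler: always drain the higher-priority queue first, so a small
 * user-interface blit never waits behind a screenful of rendering. */
static cmd *next_cmd(void)
{
    for (int p = 0; p < NUM_PRIOS; p++) {
        if (queues[p]) {
            cmd *c = queues[p];
            queues[p] = c->next;
            return c;
        }
    }
    return NULL;
}
```

With per-context ringbuffers and hardware scheduling, this selection could happen in the hardware itself rather than in a software loop like this one.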