render improvements

Mon Apr 18 14:21:19 PDT 2005

Zack Rusin <zack at kde.org> writes:

> > While it's better to have a fast general case and no special cases
> > than a slow general case that gets hit and some ultra-fast special
> > cases, it's still better to have a fast general case and some
> > ultra-fast special cases.
> >
> > I don't think you did any testing of the xorg code rendering to
> > system memory? Once I get my patch merged, it might be interesting
> > to try your benchmarks against Xephyr and compare that to your
> > code.
> 
> Is that a challenge? ;) Either way that's not something I'm worried 
> about for two reasons:
> a) merging special cases is trivial so we can do it in a few minutes
> without any problems,

It is actually quite annoying because of the huge switch/case/if
construction in fbpict.c. It would be nice if we could get rid
of it somehow, perhaps by replacing it with a hashtable mapping from
(sfmt, mfmt, dfmt) to functions or something.
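
A rough sketch of what such a table could look like. The format
codes, the function names and the table size are all made up for
illustration; the real functions in fbpict.c take many more
arguments, and a real key may also need to include the operator.

```c
#include <stdint.h>

/* Hypothetical format codes; the server's real PICT_* values differ. */
enum { FMT_NONE = 0, FMT_A8 = 1, FMT_X8R8G8B8 = 2, FMT_A8R8G8B8 = 3 };

/* Simplified signature: a real compositing function takes pictures,
 * coordinates and sizes. Here each stub just returns its own name. */
typedef const char *(*composite_fn)(void);

static const char *composite_general(void)  { return "general"; }
static const char *over_8888_8888(void)     { return "over_8888_8888"; }
static const char *over_solid_a8_8888(void) { return "over_solid_a8_8888"; }

#define TABLE_SIZE 64   /* power of two, assumed much larger than the entry count */

typedef struct {
    uint32_t     key;   /* 0 marks an empty slot */
    composite_fn fn;
} entry_t;

static entry_t table[TABLE_SIZE];

/* Pack the three format codes (assumed < 256) into one key.
 * Bit 24 is always set so that a valid key is never 0. */
static uint32_t make_key(unsigned sfmt, unsigned mfmt, unsigned dfmt)
{
    return (1u << 24) | (sfmt << 16) | (mfmt << 8) | dfmt;
}

/* Multiplicative hash; the top 6 bits give an index in 0..63. */
static unsigned hash_key(uint32_t key)
{
    return (uint32_t)(key * 2654435761u) >> 26;
}

/* Open addressing with linear probing. Assumes the table never fills up. */
static void register_composite(unsigned s, unsigned m, unsigned d, composite_fn fn)
{
    uint32_t key = make_key(s, m, d);
    unsigned i = hash_key(key);

    while (table[i].key != 0 && table[i].key != key)
        i = (i + 1) % TABLE_SIZE;
    table[i].key = key;
    table[i].fn = fn;
}

/* Return the special case if one is registered, else the general path. */
static composite_fn lookup_composite(unsigned s, unsigned m, unsigned d)
{
    uint32_t key = make_key(s, m, d);
    unsigned i = hash_key(key);

    while (table[i].key != 0) {
        if (table[i].key == key)
            return table[i].fn;
        i = (i + 1) % TABLE_SIZE;
    }
    return composite_general;
}

static void init_composite_table(void)
{
    register_composite(FMT_A8R8G8B8, FMT_NONE, FMT_A8R8G8B8, over_8888_8888);
    register_composite(FMT_A8R8G8B8, FMT_A8,   FMT_A8R8G8B8, over_solid_a8_8888);
}
```

With something like this, merging a new special case would be one
register_composite() call instead of another branch in the switch.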

> b) operating on scanlines in general gives us more power to use MMX to 
> optimize the general case itself,

Right, plus it makes it easier to fool around with things like gamma
corrected compositing, something I think will be a big quality win.

> Right now the fact that Lars was sitting in front of the assembly dump 
> trying to figure out how to combine everything in a most efficient 
> manner helps quite a bit :) On a real server the combining methods are 
> hardly visible though. It's the fetch/store cycle that's killing us. If 
> we could optimize fetching we could easily get a huge improvement. I'd 
> like to look into what Alan suggested.

The timings that I did suggested that 64 bit MMX reads were the fastest
way to read from the framebuffer, but I didn't test with DMA.  In the
framebuffer case, it basically doesn't matter all that much how you
combine the pixels.

Doing 64 bit MMX reads gave me (if I remember correctly - I can't find
the data right now) 48 MB/s, which is not usable for dragging
translucent windows around, and barely acceptable for antialiased text
if you carefully optimize out reading the destination for fully opaque
and fully transparent pixels.  Going to 128 bit SSE reads did not make
a difference on the systems I tested on. I am attaching my framebuffer
read benchmark; to use it, boot with vga=0x317 to get a framebuffer
device.
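
To illustrate what I mean by optimizing out the destination read,
here is a minimal sketch for the antialiased text case: a solid
premultiplied ARGB32 color drawn through an 8 bit coverage mask.
This is not the code from fbpict.c or from the attached benchmark;
the function names are invented, and the returned read count exists
only to make the effect visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two 8 bit values where 255 means 1.0, with correct rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a * b + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Scale all four channels of an ARGB32 pixel by an 8 bit factor. */
static uint32_t scale_argb(uint32_t p, uint8_t f)
{
    return ((uint32_t)mul_un8((uint8_t)(p >> 24), f) << 24) |
           ((uint32_t)mul_un8((uint8_t)(p >> 16), f) << 16) |
           ((uint32_t)mul_un8((uint8_t)(p >> 8),  f) << 8)  |
            (uint32_t)mul_un8((uint8_t)p, f);
}

/* Composite a solid premultiplied color over one scanline through an
 * a8 mask. Returns how many destination pixels had to be read. */
static size_t solid_over_mask_scanline(uint32_t *dst, const uint8_t *mask,
                                       uint32_t src, size_t width)
{
    size_t dst_reads = 0;

    for (size_t x = 0; x < width; x++) {
        uint8_t m = mask[x];

        if (m == 0)
            continue;               /* fully transparent: touch nothing */

        uint32_t s = (m == 0xff) ? src : scale_argb(src, m);
        uint8_t sa = (uint8_t)(s >> 24);

        if (sa == 0xff) {
            dst[x] = s;             /* fully opaque: write only, no read */
            continue;
        }

        /* Partial coverage: this is the only case that reads dst.
         * dst = s + dst * (1 - alpha(s)); the sum cannot overflow a
         * channel as long as src is really premultiplied. */
        uint32_t d = dst[x];
        dst_reads++;
        dst[x] = s + scale_argb(d, (uint8_t)(255 - sa));
    }
    return dst_reads;
}
```

For typical glyphs most mask pixels are either 0 or 255, so only the
edge pixels pay for a slow framebuffer read.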

When source and destination are in system memory, things change. Memory
bandwidth is still important, but not nearly as much. It matters a lot
how you do the unpacking, the combining and the packing. MMX and
Altivec were created for things like that, so it isn't surprising that
you get big improvements by using them.
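
As a plain C illustration of the unpack/combine/pack steps, here is a
sketch of OVER for premultiplied ARGB32 that handles two channels per
32 bit multiply. It is the scalar cousin of what MMX does with four
channels in a 64 bit register. The function names are mine, and the
code assumes valid premultiplied input, so it does no saturation.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply the two 8 bit channels in bits 0-7 and 16-23 of x by a
 * (255 means 1.0), with rounding. Each channel gets a 16 bit lane,
 * so the two products cannot spill into each other. */
static uint32_t mul_rb(uint32_t x, uint8_t a)
{
    uint32_t t = (x & 0x00ff00ffu) * a + 0x00800080u;
    t = (t + ((t >> 8) & 0x00ff00ffu)) >> 8;
    return t & 0x00ff00ffu;
}

/* result = src + dst * (1 - alpha(src)), per channel. */
static uint32_t over_argb32(uint32_t src, uint32_t dst)
{
    uint8_t ia = (uint8_t)(255 - (src >> 24));

    uint32_t rb = mul_rb(dst, ia);        /* unpack R and B, combine */
    uint32_t ag = mul_rb(dst >> 8, ia);   /* unpack A and G, combine */

    return src + (rb | (ag << 8));        /* pack and add the source */
}

/* Apply OVER to a whole scanline. */
static void over_scanline(uint32_t *dst, const uint32_t *src, size_t width)
{
    for (size_t x = 0; x < width; x++)
        dst[x] = over_argb32(src[x], dst[x]);
}
```

For example, half-transparent red (0x80800000) over opaque blue
(0xff0000ff) comes out as 0xff80007f.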

I think it would be interesting to see if the combination of MMX and
entire scanlines is an improvement. Even if it turns out to be only a
modest slowdown, I think it is still worth doing because it is
certainly a huge speedup compared to the old fbComposeGeneral().

What do I need to run the benchmark? I tried with Qt4 beta 2, but
compilation fails with:

great-sage-equal-to-heaven:~/render_bench_ops% c++ -I/usr/include/Qt *.cpp
main.cpp: In function void main_loop():
main.cpp:174: error: class QPixmap has no member named x11PictureHandle
main.cpp:175: error: class QPixmap has no member named x11PictureHandle

Soeren