render improvements

Zack Rusin zack at
Fri Apr 15 15:13:32 PDT 2005

On Friday 15 April 2005 17:30, Owen Taylor wrote:
> Assuming that the temporary buffer fits into L1 cache, that isn't
> horribly bad, but it's only going to be as fast as an in-place
> algorithm if you don't get any pipelining between memory
> accesses and arithmetic in the in-place algorithm.
> Also, for compositing to video memory, you get a fairly big win
> by optimizing alpha=0, alpha=255 source pixels to not read from
> the destination, something that you can't do with your method.

Right. It's not exactly impossible because we still get the source first. 
It's just less practical because you would have to scan the source 
before fetching the destination. So you'd be forced to scan the source 
twice. At this point it might be worth it.
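
A minimal sketch in C of what scanning the source twice could look like. All names here (classify_source, composite_over_scanline, the dst_fetched flag) are made up for illustration and are not taken from the xorg code or from our implementation. It assumes premultiplied ARGB32 pixels and a caller-supplied temporary scanline buffer. The first pass only looks at source alpha; the destination is fetched only if some pixel is neither fully transparent nor fully opaque.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum src_class { SRC_ALL_TRANSPARENT, SRC_ALL_OPAQUE, SRC_MIXED };

/* First pass over the source: look at alpha only. */
static enum src_class classify_source(const uint32_t *src, size_t width)
{
    int all_zero = 1, all_full = 1;
    for (size_t i = 0; i < width; i++) {
        uint32_t a = src[i] >> 24;
        if (a != 0)   all_zero = 0;
        if (a != 255) all_full = 0;
        if (!all_zero && !all_full)
            return SRC_MIXED;
    }
    return all_zero ? SRC_ALL_TRANSPARENT : SRC_ALL_OPAQUE;
}

/* Rounded (a * b) / 255 for 8-bit values. */
static uint32_t mul_div_255(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

/* OVER for one premultiplied pixel: src + dst * (255 - src_alpha) / 255. */
static uint32_t over_pixel(uint32_t src, uint32_t dst)
{
    uint32_t inv = 255 - (src >> 24);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = ((src >> shift) & 0xff)
                   + mul_div_255((dst >> shift) & 0xff, inv);
        if (c > 255)
            c = 255;
        out |= c << shift;
    }
    return out;
}

/* Composite one scanline with OVER. *dst_fetched reports whether the
 * destination had to be read at all (illustration only). */
void composite_over_scanline(uint32_t *dst, const uint32_t *src,
                             uint32_t *tmp, size_t width, int *dst_fetched)
{
    *dst_fetched = 0;
    switch (classify_source(src, width)) {
    case SRC_ALL_TRANSPARENT:
        return;                                 /* nothing to do */
    case SRC_ALL_OPAQUE:
        memcpy(dst, src, width * sizeof *dst);  /* store only, no fetch */
        return;
    case SRC_MIXED:
        memcpy(tmp, dst, width * sizeof *tmp);  /* fetch destination */
        *dst_fetched = 1;
        for (size_t i = 0; i < width; i++)      /* second pass over source */
            tmp[i] = over_pixel(src[i], tmp[i]);
        memcpy(dst, tmp, width * sizeof *dst);  /* store result */
        return;
    }
}
```

This classifies a whole scanline at once, so one translucent pixel still forces a full fetch. Splitting the scanline into runs would be finer-grained but needs more bookkeeping.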

> While it's better to have a fast general case and no special cases
> than a slow general case that gets hit and some ultra-fast special
> cases, it's still better to have a fast general case and some
> ultra-fast special cases.
> I don't think you did any testing of the xorg code rendering to
> system memory? Once I get my patch merged, it might be interesting
> to try your benchmarks against Xephyr and compare that to your
> code.

Is that a challenge? ;) Either way that's not something I'm worried 
about for two reasons:
a) merging special cases is trivial, so we can do it in a few minutes 
without any problems,
b) operating on scanlines in general gives us more power to use MMX to 
optimize the general case itself.
Right now the fact that Lars was sitting in front of the assembly dump 
trying to figure out how to combine everything in the most efficient 
manner helps quite a bit :) On a real server the combining methods are 
hardly visible though. It's the fetch/store cycle that's killing us. If 
we could optimize fetching we could easily get a huge improvement. I'd 
like to look into what Alan suggested.

> > Also since now the combining methods operate on scanlines adding
> > code that would in a common way accelerate all operations by
> > combining a couple of pixels in one pass should be rather easy.
> You do most likely want to MMX optimize the pieces of your algorithm.
> All my experience is that MMX makes a large (> 2x) improvement for
> this kind of code.

I'm a PPC fan. Your MMX foo does nothing for me ;) 
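
For what it's worth, part of the same win is available in plain C on any architecture. The sketch below is an illustration only, not code from either implementation, and the function name is made up: it scales all four 8-bit channels of an ARGB32 pixel by an 8-bit alpha using two 32-bit multiplies instead of four, by packing two channels into one register.

```c
#include <stdint.h>

/* Multiply every 8-bit channel of p by a/255 (rounded), handling two
 * channels per 32-bit multiply. Each channel sits in its own 16-bit
 * lane, and 255 * 255 + 128 + 254 < 65536, so lanes never overflow
 * into each other. */
uint32_t mul_pixel_by_alpha(uint32_t p, uint8_t a)
{
    uint32_t rb = p & 0x00ff00ffu;          /* red and blue lanes */
    uint32_t ag = (p >> 8) & 0x00ff00ffu;   /* alpha and green lanes */

    rb = rb * a + 0x00800080u;              /* + 128 per lane for rounding */
    rb = ((rb + ((rb >> 8) & 0x00ff00ffu)) >> 8) & 0x00ff00ffu;

    ag = ag * a + 0x00800080u;
    ag = ((ag + ((ag >> 8) & 0x00ff00ffu)) >> 8) & 0x00ff00ffu;

    return rb | (ag << 8);
}
```

Real SIMD (MMX, or AltiVec on PPC) widens the same lane trick to several pixels per instruction.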

> > Before we do that, let's decide what to do about convolution
> > filters. The start of them is in the xserver but not in the xorg
> > or the specs. Glitz implements them already. We haven't implemented
> > them in our implementation. I wasn't sure whether I should bother
> > quite yet. This might be the right moment to figure out what to do
> > with them :)
> Hmm. I don't think that needs to block merging the rest. (it's mostly
> small bug fixes that got put into one bit of the xorg code or the
> other).

Personally I'd just like to know what's the official word on convolution 
filters.

> Do we want to link libxrender against libpixman and move the
> tessellator there? 

Ideally, yes!

> Do we think that XRenderCompositeDoublePoly() is 
> something people should be using at all?

To be honest my biggest worry is having tessellation code duplicated in 
a few places. Granted that right now it's only Arthur and Cairo but 
that's already two places where it should be shared. So having the 
tessellator in a library that we could share would be very nice.


Winners compare their achievements to their goals, 
losers compare theirs to that of others. 
