Optimising xserver (Xft text rendering improvements)

Richard Purdie richard at o-hand.com
Tue Mar 22 05:49:46 PST 2005


I've been looking into where the Kdrive xserver spends its time under
different loads. This was specifically on an ARM-based system, but most of
the results apply in general. I've detailed what I found below.

The first load used MIT-SHM blits. This put most of the stress directly
onto the system's memcpy, so the xserver itself couldn't really be improved.
So far so good.

The second and more interesting load was Xft text rendering. A quick look
at the code in /fb where the test spent its time confirmed a major problem:

    CARD8      op,
    PicturePtr pSrc,
    PicturePtr pMask,
    PicturePtr pDst,
    INT16      xSrc,
    INT16      ySrc,
    INT16      xMask,
    INT16      yMask,
    INT16      xDst,
    INT16      yDst,
    CARD16     width,
    CARD16     height

These twelve arguments get passed down through several levels of functions.
The problem is that ARM processors can only pass the first four arguments
efficiently. The rest get pushed on and off the stack for each function
call. I therefore changed the code to pass around a structure containing
this data and saw a significant improvement in speed. This wasn't without
its problems, as the data in the structure sometimes gets manipulated, but
that generally isn't the case and could be worked around. My patch to do
this is:

http://projects.o-hand.com/xpatches/oh-addstruct
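
As a rough illustration of the idea (this is not taken from the patch; the
CompositeParams name, the stand-in typedefs and both functions are made up
for the example), bundling the twelve values means each level only has to
forward a single pointer:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the X server types, not the real definitions. */
typedef uint8_t  CARD8;
typedef int16_t  INT16;
typedef uint16_t CARD16;
typedef struct Picture { int id; } *PicturePtr;

/* Hypothetical structure bundling the twelve composite arguments. */
typedef struct {
    CARD8      op;
    PicturePtr pSrc, pMask, pDst;
    INT16      xSrc, ySrc;
    INT16      xMask, yMask;
    INT16      xDst, yDst;
    CARD16     width, height;
} CompositeParams;

/* Innermost level: reads only the fields it needs through the pointer. */
static unsigned long composite_inner(const CompositeParams *p)
{
    return (unsigned long)p->width * p->height; /* pixels covered */
}

/* Intermediate level: forwards one pointer, which fits in a register,
 * instead of copying twelve values for the next call. */
static unsigned long composite_outer(const CompositeParams *p)
{
    return composite_inner(p);
}
```

A caller fills in one CompositeParams and passes its address down. Taking
the pointer as const only suits levels that don't modify the values; a level
that does would need its own copy, which is the complication mentioned above.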

On a Pentium 4, only six registers are available for arguments, so the
stack still gets used and this patch should therefore also improve
efficiency there and on similar processors.

I chased this structure up through various layers in the xserver with these
further patches:

http://projects.o-hand.com/xpatches/oh-addstruct1
http://projects.o-hand.com/xpatches/oh-addstruct2

increasing the performance at each step.

Passing 12 arguments to a function really is a performance killer and I'd
like to think this could be kept in mind when further developing xserver (or
any software in general!).

The most processor-intensive function under my benchmarks was
fbCompositeSolidMask_nx8x0565() (we're mainly interested in non-subpixel
AA text on a 16bpp fb), so I also looked into what could be done there. I
found several things:

* Changing types to 32-bit and unsigned where possible removed the sign
manipulation assembler and the shifts used to force registers to 16 or 8
bits.
* Certain operations could be moved outside the loop (dst = dstLine; src =
srcLine; can go outside and dst += dstStride; src += srcStride; can go at
the end of the loop).
* In the case of these loops, counting upwards and incrementing pointers
worked more efficiently than counting downwards.
* Turning cvt0565to0888 and cvt8888to0565 into functions worked out faster,
as it reduced register pressure within the loops.
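
To make the points above concrete, here is a simplified sketch of such a
loop. It is not the code from the patch: all the names are made up, the
solid colour is assumed to be opaque, and the blend is a plain per-channel
interpolation rather than the real fb compositing macros. It only shows
the shape: unsigned 32-bit locals, the 565/888 conversions as functions,
row pointers set up once and advanced by the stride at the end of each
row, and counters that run upwards.

```c
#include <stdint.h>

/* Expand r5g6b5 to x8r8g8b8, replicating the top bits into the low bits. */
static uint32_t rgb565_to_888(uint32_t s)
{
    return (((s << 3) & 0xf8)     | ((s >> 2) & 0x7)) |
           (((s << 5) & 0xfc00)   | ((s >> 1) & 0x300)) |
           (((s << 8) & 0xf80000) | ((s << 3) & 0x70000));
}

/* Pack x8r8g8b8 down to r5g6b5 by dropping the low bits of each channel. */
static uint32_t rgb888_to_565(uint32_t s)
{
    return ((s >> 3) & 0x001f) | ((s >> 5) & 0x07e0) | ((s >> 8) & 0xf800);
}

/* Per-channel interpolation between dst and src with 8-bit coverage m. */
static uint32_t blend888(uint32_t src, uint32_t dst, uint32_t m)
{
    uint32_t out = 0;
    for (uint32_t shift = 0; shift < 24; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = (dst >> shift) & 0xff;
        out |= ((s * m + d * (255u - m) + 127u) / 255u) << shift;
    }
    return out;
}

/* Hypothetical simplified loop: opaque solid colour through an 8-bit mask
 * onto a 16bpp destination. Strides are in elements, not bytes. */
static void solid_mask_over_565(uint32_t src888,
                                const uint8_t *maskLine, uint32_t maskStride,
                                uint16_t *dstLine, uint32_t dstStride,
                                uint32_t width, uint32_t height)
{
    const uint32_t src565 = rgb888_to_565(src888);
    const uint8_t *mask = maskLine;   /* set up once, outside the loop */
    uint16_t *dst = dstLine;

    for (uint32_t y = 0; y < height; y++) {
        for (uint32_t x = 0; x < width; x++) {
            uint32_t m = mask[x];
            if (m == 0xff) {
                dst[x] = (uint16_t)src565;
            } else if (m != 0) {
                uint32_t d = rgb565_to_888(dst[x]);
                dst[x] = (uint16_t)rgb888_to_565(blend888(src888, d, m));
            }
        }
        mask += maskStride;           /* advance at the end of each row */
        dst += dstStride;
    }
}
```

Whether any of this is actually faster depends on the compiler and target,
so it needs measuring in each case; the sketch makes no claim about that.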

The patch I ended up using for optimal speed is:

http://projects.o-hand.com/xpatches/oh-optimise

The patch also alters fbCompositeSrcAdd_8000x8000 in the same ways as listed
above, for the same reasons. It's likely these changes could also be
applied to other composite functions.

I'd like to hope some of these findings could be worked back into the
server. I'm posting them here in the hope it generates some discussion and
if any of the features are found to be acceptable I can create a patch
containing those features.

With these patches we have noticed a 10-15% speed improvement on text
blitting on ARM.

Regards,

Richard 