New fb-based DDX, performance regressions relative to xfree86

Tue Dec 15 08:48:30 PST 2009

On Mon, 2009-12-14 at 21:32 -0500, Timothy Normand Miller wrote:

> Here are the columns:
> 
> 1: x11perf.rhel5.24
> 2: x11perf.rhel3.24

RHEL3, for those playing along at home, is XFree86 4.3.0 plus about two
hundred patches, and RHEL5 is xserver 1.1.1 plus only about one hundred
patches.  Most of these patches, however, are not in the rendering path.

> For most things that get decomposed into spans, the performance is the same:
> 
>   3450.0     3610.0 (  1.05)   500-pixel wide circle
>   3330.0     3330.0 (  1.00)   (xor) 500-pixel wide circle
> 
> But for some things, the new DDX is worse:
> 
>   2800.0     4910.0 (  1.75)   100-pixel wide dashed circle
>    238.0      917.0 (  3.85)   (xor) 100-pixel wide dashed circle
>   2680.0     5290.0 (  1.97)   100-pixel wide double-dashed circle
>    245.0      908.0 (  3.71)   (xor) 100-pixel wide double-dashed circle
> 
> Between xfree86 that supported cfb and x.org that doesn't, have any
> major changes been made to mi?  I'm wondering, for instance, if mi is
> ordering spans differently?  Or have bugs been fixed that might have
> an impact here?  So far, I can't figure out if I have a mistake in my
> span rendering code or if I'm being sent something different to
> render.  My span rendering code (and hardware) is WAY faster if the
> spans are sorted in ascending order of Y coordinate.  They used to be
> (more or less).  Are they still?

The mi code itself changed very little between XFree86 4.3.0 and X.Org
6.7.0.  The biggest change appears to be the introduction of
miRegionEqual and the use of that consistently instead of open-coded
compares; this could certainly have some effect when sorting spans.
There's a modest amount of functional change in mi/ between 6.7.0 and
xserver 1.1.1, but none of it in the span code (mostly to do with the
extension enable/disable code and the fixes/damage/composite
integration).

The other possibility is that the compiler flags changed significantly
between RHEL3 and RHEL5.  RHEL3's miarc.c, for example:

gcc -m32 -c -O2  -pipe -march=i386 -mcpu=i686 -fno-strict-aliasing -pipe
-ansi -pedantic -Wall -Wpointer-arith -Wundef    -fno-merge-constants
-I. -I../include -I../../../include/fonts -I../render
-I../../../exports/include/X11 -I../../../include/fonts
-I../../../include/extensions -I../../../programs/Xserver/Xext
-I../../.. -I../../../exports/include   -Dlinux -D__i386__
-D_POSIX_C_SOURCE=199309L -D_POSIX_SOURCE -D_XOPEN_SOURCE -D_BSD_SOURCE
-D_SVID_SOURCE  -D_GNU_SOURCE  -DSHAPE -DXINPUT -DXKB -DLBX -DXAPPGROUP
-DXCSECURITY -DTOGCUP  -DXF86BIGFONT -DDPMSExtension  -DPIXPRIV
-DPANORAMIX  -DRENDER -DRANDR -DGCCUSESGAS -DAVOID_GLYPHBLT -DPIXPRIV
-DSINGLEDEPTH -DXFreeXDGA -DXvExtension -DXFree86LOADER  -DXFree86Server
-DXF86VIDMODE -DXvMCExtension  -DSMART_SCHEDULE  -DXResExtension
-DX_BYTE_ORDER=X_LITTLE_ENDIAN -DNDEBUG  -DFUNCPROTO=15 -DNARROWPROTO
miarc.c

but in RHEL5:

gcc -DHAVE_CONFIG_H -I. -I. -I../include -I../include -I../include
-I../include -I../include -I../include -I../include -I../include
-I../mfb -DHAVE_DIX_CONFIG_H -DXFree86Server -DXFree86LOADER -Wall
-Wpointer-arith -Wstrict-prototypes -Wmissing-prototypes
-Wmissing-declarations -Wnested-externs -fno-strict-aliasing
-D_BSD_SOURCE -DHAS_FCHOWN -DHAS_STICKY_DIR_BIT -I/usr/include/freetype2
-I../include -I../include -I../Xext -I../composite -I../damageext
-I../xfixes -I../Xi -I../mi -I../miext/shadow -I../miext/damage
-I../render -I../randr -I../fb -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic
-fasynchronous-unwind-tables -MT miarc.lo -MD -MP -MF .deps/miarc.Tpo -c
miarc.c  -fPIC -DPIC -o .libs/miarc.o

I initially suspected that -fstack-protector would make a difference on
function-call-heavy code paths, but having built Xvfb with and without
it, I can't measure any statistically significant difference for
-wdcircle100.
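
For anyone who wants to repeat that kind of with/without comparison, here is a
rough sketch of one way to decide whether two sets of x11perf rates differ by
more than run-to-run noise. It is not the method used above; the function names
and the choice of threshold are assumptions for illustration only. It compares
the squared difference of the means against the combined squared standard error
of the two means (a Welch-style statistic), which avoids needing sqrt().

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Unbiased sample variance around a precomputed mean m (needs n >= 2). */
static double sample_var(const double *v, size_t n, double m)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (v[i] - m) * (v[i] - m);
    return acc / (double)(n - 1);
}

/*
 * Hypothetical helper: returns 1 if the two sample means differ by more
 * than z standard errors, 0 otherwise.  Works on squared quantities, so
 * no libm is required.
 */
static int means_differ(const double *a, size_t na,
                        const double *b, size_t nb, double z)
{
    assert(na >= 2 && nb >= 2);
    double ma = mean(a, na), mb = mean(b, nb);
    double se2 = sample_var(a, na, ma) / (double)na
               + sample_var(b, nb, mb) / (double)nb;
    double diff = ma - mb;
    return diff * diff > z * z * se2;
}
```

Feed it the per-run rates from several repetitions of the same test on each
build. With only a handful of runs, a threshold of 2 is a rule of thumb rather
than a rigorous test.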

> Is there a qsort I can get at from within a DDX?  If I can't get at
> the glibc qsort, I'd like to try something else.

You have all of glibc at your disposal in RHEL5.
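
If you want to experiment with that, something along these lines would sort
spans into ascending Y with the libc qsort before handing them to the hardware.
This is only a sketch: the Span struct and the function names are made up for
illustration (the server hands you points and widths as two parallel arrays, so
you would have to pack them together first), and the already-sorted check is
just there to skip the sort in the common case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical span record: start point and width kept together so a
 * single qsort() call reorders both. */
typedef struct {
    short x, y;
    int width;
} Span;

/* Order by ascending y, then ascending x within a scanline. */
static int span_cmp(const void *a, const void *b)
{
    const Span *sa = a, *sb = b;
    if (sa->y != sb->y)
        return (sa->y > sb->y) - (sa->y < sb->y);
    return (sa->x > sb->x) - (sa->x < sb->x);
}

/* Returns 1 if the spans are already in ascending y order. */
static int spans_sorted_by_y(const Span *spans, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (spans[i].y < spans[i - 1].y)
            return 0;
    return 1;
}

/* Sort in place, skipping the qsort() call when nothing needs doing. */
static void sort_spans_by_y(Span *spans, size_t n)
{
    assert(spans != NULL || n == 0);
    if (!spans_sorted_by_y(spans, n))
        qsort(spans, n, sizeof(Span), span_cmp);
}
```

Counting how often spans_sorted_by_y() returns 0 on each server would also tell
you directly whether mi is handing you spans in a different order.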

> 177000.0   664000.0 (  3.75)   Copy 10x10 from pixmap to window
> 165000.0   494000.0 (  2.99)   (xor) Copy 10x10 from pixmap to window
>  29500.0    33700.0 (  1.14)   Copy 100x100 from pixmap to window
>  20400.0    22200.0 (  1.09)   (xor) Copy 100x100 from pixmap to window
> 176000.0   658000.0 (  3.74)   Copy 10x10 from window to pixmap
> 166000.0   498000.0 (  3.00)   (xor) Copy 10x10 from window to pixmap
> 176000.0   659000.0 (  3.74)   Copy 10x10 from window to window
> 165000.0   492000.0 (  2.98)   (xor) Copy 10x10 from window to window
> 
> We're on par for 100x100, but we're slow for 10x10.  This is all fully
> accelerated, copying from a pixmap in graphics memory to the screen.
> I ran into issues in the past where just CPU overhead in DIX and my
> DDX was dominating on small copies.  The overhead in the DDX should
> not have changed between versions.  And I optimized the heck out of
> this in copyarea, where I detect things like one-rect clipping, that
> the rect to copy is confined to the clipping region, etc., and I
> shortcut the hell out of it.  Anything relevant changed in DIX?

This would be consistent with changes in function call overhead, but
-fstack-protector doesn't seem to have much effect.
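
For readers following along, the one-rect shortcut described in the quoted text
amounts to a test like the one below. This is a guess at the shape of the
check, not the actual driver code: the Box struct only mirrors the layout of
the server's BoxRec (x2/y2 exclusive), and can_skip_clipping is a hypothetical
name.

```c
#include <assert.h>

/* Stand-in for the server's BoxRec: x1,y1 inclusive, x2,y2 exclusive. */
typedef struct {
    short x1, y1, x2, y2;
} Box;

/* Returns 1 if inner lies entirely within outer. */
static int box_contains(const Box *outer, const Box *inner)
{
    return inner->x1 >= outer->x1 && inner->y1 >= outer->y1 &&
           inner->x2 <= outer->x2 && inner->y2 <= outer->y2;
}

/*
 * Hypothetical fast-path test: if the clip region is a single rectangle
 * and the destination rectangle is wholly inside it, the copy can be
 * issued as-is with no per-rectangle clipping work.
 */
static int can_skip_clipping(const Box *clip_rects, int nrects, const Box *dst)
{
    assert(nrects >= 0);
    return nrects == 1 && box_contains(&clip_rects[0], dst);
}
```

The cost that matters for the 10x10 case is everything that runs before a check
like this is reached, which is why per-call overhead shows up there and not at
100x100.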

One other possibility is to just build cfb straight into your driver and
compare with that.  It's a bit of typing to implement, but it'd work...

- ajax