New fb-based DDX, performance regressions relative to xfree86
Timothy Normand Miller
theosib at gmail.com
Mon Dec 14 18:32:22 PST 2009
I'm working on the fb-based DDX I mentioned a while back. I'm
comparing performance on the same machine between RHEL3 with a
cfb-based DDX vs. RHEL5 with the new fb-based DDX. This is all 2D
accel, and our card is being used exclusively for 2D stuff, where we
accelerate 2D primitives that most modern cards wouldn't dream of
(which is why we can't use XAA, and we have an acceleration framework
of our own already). Although our design is radically different, you
might think of our GPU as being in spirit like an i128 series 2 on
major steroids.
I know that x11perf is a terrible thing to use to judge performance,
but it's not a terrible thing to use for regression testing, which is
what I'm using it for. I have x11perf numbers from the two DDX
modules, and I have a program to compare them. Also, some of our
customers ask for xmarks.
Here are the columns:
1: x11perf.rhel5.24
2: x11perf.rhel3.24
       1          2  (ratio 2/1)   Operation
--------   --------  -----------   -----------------
For most things that get decomposed into spans, the performance is the same:
3450.0 3610.0 ( 1.05) 500-pixel wide circle
3330.0 3330.0 ( 1.00) (xor) 500-pixel wide circle
But for some things, the new DDX is worse:
2800.0 4910.0 ( 1.75) 100-pixel wide dashed circle
238.0 917.0 ( 3.85) (xor) 100-pixel wide dashed circle
2680.0 5290.0 ( 1.97) 100-pixel wide double-dashed circle
245.0 908.0 ( 3.71) (xor) 100-pixel wide double-dashed circle
Between the XFree86 server that supported cfb and the X.Org server
that doesn't, have any major changes been made to mi? I'm wondering,
for instance, whether mi is ordering spans differently, or whether
bugs have been fixed that might have an impact here. So far, I can't
figure out whether I have a mistake in my span rendering code or
whether I'm being sent something different to render. My span
rendering code (and hardware) is WAY faster if the
render. My span rendering code (and hardware) is WAY faster if the
spans are sorted in ascending order of Y coordinate. They used to be
(more or less). Are they still?
Is there a qsort I can get at from within a DDX? If I can't get at
the glibc qsort, I'd like to try something else.
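If the plain libc qsort really is reachable from a module (I believe
it is with the dlloader-based server, but correct me if I'm wrong),
I'd do something like the sketch below. SpanRec, CompareSpanY, and
SortSpansByY are made-up names for illustration; the point is just
packing the two parallel arrays so each width stays paired with its
point while sorting by Y:

#include <stdlib.h>
#include "miscstruct.h"   /* DDXPointRec */

typedef struct {
    DDXPointRec pt;
    int         width;
} SpanRec;

static int
CompareSpanY(const void *a, const void *b)
{
    const SpanRec *sa = a;
    const SpanRec *sb = b;

    if (sa->pt.y != sb->pt.y)
        return sa->pt.y - sb->pt.y;
    return sa->pt.x - sb->pt.x;   /* tie-break on X within a scanline */
}

static void
SortSpansByY(DDXPointRec *ppt, int *pwidth, int n)
{
    SpanRec *tmp = malloc(n * sizeof(SpanRec));
    int i;

    if (!tmp)
        return;                   /* on OOM, just leave them unsorted */
    for (i = 0; i < n; i++) {
        tmp[i].pt = ppt[i];
        tmp[i].width = pwidth[i];
    }
    qsort(tmp, n, sizeof(SpanRec), CompareSpanY);
    for (i = 0; i < n; i++) {
        ppt[i] = tmp[i].pt;
        pwidth[i] = tmp[i].width;
    }
    free(tmp);
}

Presumably I'd only need to call this when fSorted is FALSE, since
sorted spans should already arrive in ascending Y.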
Interestingly, the new DDX is faster for some polygons (most are on par):
224000.0 185000.0 ( 0.83) Fill 10x10 stippled trapezoid
216000.0 180000.0 ( 0.83) (xor) Fill 10x10 stippled trapezoid
This is another weird one:
177000.0 664000.0 ( 3.75) Copy 10x10 from pixmap to window
165000.0 494000.0 ( 2.99) (xor) Copy 10x10 from pixmap to window
29500.0 33700.0 ( 1.14) Copy 100x100 from pixmap to window
20400.0 22200.0 ( 1.09) (xor) Copy 100x100 from pixmap to window
176000.0 658000.0 ( 3.74) Copy 10x10 from window to pixmap
166000.0 498000.0 ( 3.00) (xor) Copy 10x10 from window to pixmap
176000.0 659000.0 ( 3.74) Copy 10x10 from window to window
165000.0 492000.0 ( 2.98) (xor) Copy 10x10 from window to window
We're on par for 100x100, but we're slow for 10x10. This is all fully
accelerated, copying from a pixmap in graphics memory to the screen.
I ran into issues in the past where just CPU overhead in DIX and my
DDX was dominating on small copies. The overhead in the DDX should
not have changed between versions. And I optimized the heck out of
this in copyarea, where I detect things like one-rect clipping, that
the rect to copy is confined to the clipping region, etc., and I
shortcut the hell out of it. Anything relevant changed in DIX?
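For reference, the shortcut I'm talking about looks roughly like this
(a simplified sketch, not the actual driver code; TryFastCopy is a
made-up name and OurHWBlit stands in for our hardware blit):

#include "fb.h"          /* fbGetCompositeClip, region macros */

/* Hypothetical hardware blit entry point. */
extern void OurHWBlit(DrawablePtr pSrc, DrawablePtr pDst,
                      int srcx, int srcy, int dstx, int dsty,
                      int w, int h);

static Bool
TryFastCopy(DrawablePtr pSrc, DrawablePtr pDst, GCPtr pGC,
            int srcx, int srcy, int w, int h, int dstx, int dsty)
{
    RegionPtr pClip = fbGetCompositeClip(pGC);
    BoxRec dst;

    /* Destination box in the same (screen) coordinates as the clip. */
    dst.x1 = dstx + pDst->x;
    dst.y1 = dsty + pDst->y;
    dst.x2 = dst.x1 + w;
    dst.y2 = dst.y1 + h;

    /* Only take the fast path when the composite clip is one rect and
     * the destination box is entirely inside it. */
    if (REGION_NUM_RECTS(pClip) != 1 ||
        RECT_IN_REGION(pGC->pScreen, pClip, &dst) != rgnIN)
        return FALSE;

    OurHWBlit(pSrc, pDst, srcx + pSrc->x, srcy + pSrc->y,
              dst.x1, dst.y1, w, h);
    return TRUE;
}

The idea is to get from the CopyArea entry point to the hardware with
as few region operations as possible in the common unclipped case.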
For these lots-of-small operations, I'm wondering if I'm having a
protocol transport problem. How can I make sure that x11perf is using
the fastest transport?
This might point to that problem:
53000.0 73100.0 ( 1.38) PutImage 10x10 square
51400.0 70100.0 ( 1.36) (xor) PutImage 10x10 square
642.0 2010.0 ( 3.13) PutImage 100x100 square
643.0 2010.0 ( 3.13) (xor) PutImage 100x100 square
25.9 94.5 ( 3.65) PutImage 500x500 square
25.9 94.5 ( 3.65) (xor) PutImage 500x500 square
Actually, the 3x performance difference on PutImage might suggest a
PCI problem, namely that write-combining isn't happening. It's interesting
that the performance hit doesn't get much worse than 3. How can I
make sure that write-combining is being used? For 500x500 putimage,
performance drops from about 90 megs/sec to about 25 megs/sec. On
33MHz PCI, 25 megs/sec is what you'd expect if you're getting no
bursts when writing.
I'm actually using memcpy to do the putimage. I've heard that for
64-bit systems, glibc memcpy is really awful. But I think this is
being tested on a 32-bit system. Could this be the problem? I'll
have to compare this against fb's native putimage, which (I presume)
uses MMX. (I really wish I could get read-combining, but even using
SSE, I can't get more than 2-word bursts, and I sadly don't have DMA.)
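If memcpy turns out to be the culprit, one thing I could try is doing
the framebuffer writes myself with SSE2 non-temporal stores instead
of going through glibc. A rough sketch (StreamCopy is a made-up name;
it assumes a 16-byte-aligned destination and a length that's a
multiple of 16, so the real thing would need head/tail fix-up):

#include <stddef.h>
#include <emmintrin.h>    /* SSE2 intrinsics */

static void
StreamCopy(void *dst, const void *src, size_t len)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t         i;

    /* Streaming stores go straight at the destination without pulling
     * it through the cache. */
    for (i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

    /* Make sure the streaming stores are globally visible before the
     * engine is told to use the data. */
    _mm_sfence();
}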
This is how I'm mapping the graphics memory:
fb = xf86MapVidMem(pScreen->myNum, VIDMEM_FRAMEBUFFER, base, size);
And this is how I'm mapping the engine:
engine = xf86MapVidMem(pScreen->myNum, VIDMEM_MMIO, base, size);
Both of them could benefit from write-combining, really. I thought
that xf86MapVidMem was supposed to enable MTRR write combining
automatically. Am I missing something?
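One thing I can at least do is dump /proc/mtrr into the server log at
ScreenInit time and look for a write-combining entry covering the
framebuffer BAR (this assumes Linux with MTRR support in the kernel;
LogMtrrState is a made-up helper, and the exact line format may
differ):

#include <stdio.h>
#include "xf86.h"

static void
LogMtrrState(int scrnIndex)
{
    FILE *f = fopen("/proc/mtrr", "r");
    char line[256];

    if (!f) {
        xf86DrvMsg(scrnIndex, X_WARNING, "cannot read /proc/mtrr\n");
        return;
    }
    /* A healthy setup has a line something like
     *   reg01: base=0xd0000000 (...), size=64MB: write-combining
     * covering the framebuffer aperture. */
    while (fgets(line, sizeof(line), f))
        xf86DrvMsg(scrnIndex, X_INFO, "MTRR: %s", line);
    fclose(f);
}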
I'm looking through my span code to see what might be different.
Could I be doing something stupid with the clipping of spans? Below
is how I do it. Some of this is hold-over from cfb.
    pClip = fbGetCompositeClip(pGC);

    /* Worst case: each incoming span gets clipped into one piece per
     * rectangle in the busiest band of the clip region. */
    n = nInit * miFindMaxBand(pClip);
    i = n * sizeof(int);              /* room for the clipped widths */
    j = i + n * sizeof(DDXPointRec);  /* plus the clipped points */

    pwidthFree = GET_SCRATCH_BUF(j);
    /* check oom */
    pptFree = (DDXPointRec *)((char *)pwidthFree + i);
    pwidth = pwidthFree;
    ppt = pptFree;

    n = miClipSpans(pClip, pptInit, pwidthInit, nInit,
                    ppt, pwidth, fSorted);
Is it possible to profile a modular X.Org server?
What's bugging me is that some things are slower, but very little is
faster. So far, moving to the new X server has mostly been a
regression, so I'm trying to figure out what I might be doing that's
inappropriate.
Any suggestions?
Thanks!
--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project