X performance problems with Matrox G550
Martin Ebourne
lists at ebourne.me.uk
Tue Nov 2 15:18:27 PST 2004
Hi,
I've been trying to track down performance problems with X on my system.
I have an Athlon 64 with a Matrox G550 running MythTV (a TV PVR
program). I am running myth on the second head (CRTC2) using the TV out.
In this configuration myth is doing software decoding of an MPEG2
stream, scaling it to the TV resolution, and performing YUV to RGB32
colourspace conversion; X is simply copying the image to the video card
by use of the XShmPutImage call.
While this is all running, top reports that X is using about 75% of the
CPU while myth is only using about 20%. Given the relative workload I'm
quite sure this indicates something is wrong. As an added indication,
with mplayer doing something similar via directfb (instead of X) only
about 35% of the processor is in use.
I have now oprofiled the system to try and locate the cause. I'm using
xorg 6.8.1 (Fedora Core 3 xorg rpms recompiled with dlopen and debugging
to make oprofile work). The system is actually Fedora Core 2 and I was
originally seeing this problem with xorg 6.7.
oprofile records 77% in libfb.so, all of it in fbBlt. If I annotate that
function then the critical bit is below:
# opannotate -sa /usr/X11R6/lib64/modules/libfb.so | more
...
: while (n--)
8 8.7e-04 : 19247: lea 0xffffffffffffffb4(%rbp),%rax
5 5.4e-04 : 1924b: decl (%rax)
1441 0.1565 : 1924d: cmpl $0xffffffffffffffff,0xffffffffffffffb4(%rbp)
1329 0.1444 : 19251: jne 19255 <fbBlt+0x705>
: 19253: jmp 192cd <fbBlt+0x77d>
: *dst++ = FbDoDestInvarientMergeRop(*src++);
5005 0.5437 : 19255: mov 0xffffffffffffffd0(%rbp),%rax
18 0.0020 : 19259: mov %rax,%rcx
: 1925c: mov 0xffffffffffffffd8(%rbp),%rax
58696 6.3760 : 19260: mov %rax,%rdx
13 0.0014 : 19263: mov 0xffffffffffffff98(%rbp),%eax
: 19266: and (%rdx),%eax
843343 91.6101 : 19268: xor 0xffffffffffffff94(%rbp),%eax
1134 0.1232 : 1926b: mov %eax,(%rcx)
: 1926d: lea 0xffffffffffffffd8(%rbp),%rax
4698 0.5103 : 19271: addq $0x4,(%rax)
14 0.0015 : 19275: lea 0xffffffffffffffd0(%rbp),%rax
: 19279: addq $0x4,(%rax)
704 0.0765 : 1927d: jmp 19247 <fbBlt+0x6f7>
...
Here there's one huge hit: 91% of the function, which is about 70% of
the whole processor (0.916 x 77% is roughly 70.5%). It's listed against
the xor, but oprofile generally reports an instruction late, and it makes
much more sense as the 'and' at 19266.
This relates to line 174 of
xorg-x11-6.8.1/xc/programs/Xserver/fb/fbblt.c.
Excerpt of the containing block:
if (destInvarient)
{
#if 0
    /*
     * This provides some speedup on screen->screen blts
     * over the PCI bus, usually about 10%. But fb
     * isn't usually used for this operation...
     */
    if (_ca2 + 1 == 0 && _cx2 == 0)
    {
        FbBits t1, t2, t3, t4;
        while (n >= 4)
        {
            t1 = *src++;
            t2 = *src++;
            t3 = *src++;
            t4 = *src++;
            *dst++ = t1;
            *dst++ = t2;
            *dst++ = t3;
            *dst++ = t4;
            n -= 4;
        }
    }
#endif
    while (n--)
        *dst++ = FbDoDestInvarientMergeRop(*src++); // THIS IS THE SLOW LINE
}
Now the comment makes it plain that this code is accessing memory over
the PCI bus, which is what I was expecting: the copy isn't being executed
enough times to account for the CPU usage, so it must be being crippled
by slow accesses. Naturally the first thing I did was enable the #if'd
section, but it didn't make any difference. That makes sense, since I
don't believe it's doing screen->screen copies.
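To make clear how little work the slow line does per word, here is a minimal sketch of the loop. It assumes FbDoDestInvarientMergeRop(src) amounts to ((src & _ca2) ^ _cx2), which is what the and/xor pair at 19266/19268 in the annotated listing suggests but which has not been checked against fbrop.h here. The names bits_t, merge_rop, merge_rop_copy and is_plain_copy are made up for illustration and are not part of the fb code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FbBits. The listing uses 32-bit loads and
 * stores (and the pointers advance by 4), so a 32-bit word is assumed. */
typedef uint32_t bits_t;

/* Assumed shape of FbDoDestInvarientMergeRop: mask the source word,
 * then xor in a constant. */
static bits_t merge_rop(bits_t src, bits_t ca2, bits_t cx2)
{
    return (src & ca2) ^ cx2;
}

/* The inner loop from the excerpt: one load, one and, one xor and one
 * store per word. */
static void merge_rop_copy(bits_t *dst, const bits_t *src, size_t n,
                           bits_t ca2, bits_t cx2)
{
    assert(dst != NULL && src != NULL);
    while (n--)
        *dst++ = merge_rop(*src++, ca2, cx2);
}

/* Same test as the #if 0 block: an all-ones mask and a zero xor
 * constant turn the rop into a plain word-by-word copy. */
static int is_plain_copy(bits_t ca2, bits_t cx2)
{
    return (bits_t)(ca2 + 1) == 0 && cx2 == 0;
}
```

If the blit is a straight copy (all-ones mask, zero xor constant, which seems likely for XShmPutImage but is an assumption), the loop is nothing more than a word-by-word copy. The arithmetic itself costs next to nothing, which fits the reading that the time is going into the memory accesses.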
So now is the point where I'm getting a bit stuck. Can anyone explain
why these accesses are so slow? It's an AGP 4x card, and the Xorg log
reports it as such (the agp option is in the xorg.conf). I get the
impression X is using some kind of slow access method when a faster one
is available, but that's guesswork. Certainly directfb seems to have a
faster way.
If this does make sense to someone, then what can I do about it? This is
causing our TV to drop frames and produce very jerky motion, so I'm
rather keen to sort it out!
I should add that a friend is seeing the same high X usage with XFree86
4.3.0 on an Athlon XP with the G550.
Cheers,
Martin.