radeon planar textured video
sroland at tungstengraphics.com
Fri Feb 20 19:43:29 PST 2009
So currently the driver converts planar yuv data to packed when using
textured video. And I fought so hard to convince the overlay scaler to
accept planar data correctly ages ago :-).
Anyway, when trying to view HD clips on my rs690, I noticed that Xorg
indeed consumes quite a few cpu cycles, and some oprofile quickly
revealed RADEONCopyMungedData as the top cpu hog as expected.
So I figured what was a good idea for the overlay scaler should be a
good idea for textured video, just copy planar data and change the
However, in contrast to the overlay scaler there are some drawbacks
here, mostly that the gpu will have to work harder because it needs to sample 3
textures instead of 1, and the shader will be more complex. I don't
think it should be much of a problem, since full 1080p25 requires a
fillrate of "only" 50MT/s * 3, and even the slowest r300-based igp
should have around 600MT/s IIRC (rs690 has 1600MT/s).
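As a rough check of the fillrate estimate above, here is a minimal C sketch. It assumes one texel fetch per plane for every output pixel and output at the source resolution (1920x1080); the helper name texel_rate is made up for illustration and is not part of the driver or the patch.

```c
#include <stdint.h>

/* Hypothetical helper: texel fetches per second needed when every output
 * pixel samples `planes` textures once, for a width x height video at
 * `fps` frames per second. */
static uint64_t texel_rate(uint32_t width, uint32_t height,
                           uint32_t fps, uint32_t planes)
{
    return (uint64_t)width * height * fps * planes;
}
```

With these assumptions, texel_rate(1920, 1080, 25, 1) gives 51,840,000 fetches per second (the "only" 50MT/s figure), and with 3 planes about 155.5MT/s, still well below the roughly 600MT/s quoted for the slowest r300-based igp.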
Of course, it would also make it easy to change the coefficients used by
the yuv->rgb conversion (I think some sources are actually meant to
use a different spec here).
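To illustrate what swapping the coefficients means, here is a small CPU-side C sketch of the per-pixel math. It is not the shader from the patch: the struct and function names are hypothetical, and the constants are the commonly quoted rounded values for limited-range BT.601 and BT.709, used here as an assumption about which "different spec" is meant.

```c
#include <stdint.h>

/* Hypothetical coefficient set for limited-range (16..235) yuv -> rgb. */
struct yuv_coeffs {
    float y_gain; /* scale applied to (Y - 16)   */
    float rv;     /* V contribution to red       */
    float gu;     /* U contribution to green     */
    float gv;     /* V contribution to green     */
    float bu;     /* U contribution to blue      */
};

/* Commonly quoted rounded values (assumption, not taken from the patch). */
static const struct yuv_coeffs BT601 = { 1.164f, 1.596f, -0.392f, -0.813f, 2.017f };
static const struct yuv_coeffs BT709 = { 1.164f, 1.793f, -0.213f, -0.533f, 2.112f };

/* Round to nearest and clamp to the 0..255 range. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f)
        return 0;
    if (v > 255.0f)
        return 255;
    return (uint8_t)(v + 0.5f);
}

/* Convert one pixel; rgb[0..2] receive R, G, B. */
static void yuv_to_rgb(const struct yuv_coeffs *c,
                       uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    float yy = c->y_gain * ((float)y - 16.0f);
    float uu = (float)u - 128.0f;
    float vv = (float)v - 128.0f;

    rgb[0] = clamp_u8(yy + c->rv * vv);
    rgb[1] = clamp_u8(yy + c->gu * uu + c->gv * vv);
    rgb[2] = clamp_u8(yy + c->bu * uu);
}
```

Switching specs is then just a matter of passing &BT709 instead of &BT601, which in a shader would correspond to loading different constants rather than changing the program.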
The attached patch does exactly that (ok it was halfway copied from the
intel driver), a couple of comments:
- r300 only for now. r500 obviously doable, r200 should be possible too
(I think that hopefully even the 2-pipe igp chips might be fast enough,
with the added benefit that rv250 would get textured video too, as this
doesn't rely on hw yuv-rgb conversion which is broken on that asic).
Dunno about r100, it can't really run the necessary shader for such a
conversion, however it has a PLANAR_YUV_ENABLE bit, which I don't know
how it works and how the chip would need to be configured (but if it
works, it should be very efficient, on the upside). Also, this is currently
mutually exclusive with the bicubic filter (combining them could certainly be done).
- The xv attr for using the new code is rather for performance debugging
than anything else... Speaking of that, naturally the
RADEONCopyMungedData disappeared from the oprofile data, getting
replaced by more libc usage (for memcpy which was expected) and
initially way higher delay_tsc usage (which wasn't quite expected) with
performance numbers (using mplayer's benchmark mode) actually slightly
lower in some cases... However, some tests revealed it seemed to be due
to texture access latency or something along these lines (using textures
with both macro and micro tiling improved things, though I couldn't quite
make it work for now due to blitter misconfiguration). I ended up with
the manual texture cache configuration, which is now indeed always faster
(makes me wonder what performance gains you could get if you'd do that
for the 3d driver and, more interestingly, HOW you'd do that sensibly there).
- Overall I saw an increase of up to 10% using mplayer's benchmark mode
(this was with ffmpeg-mt and an X2 4850e). Considering it's just barely
faster than realtime, every bit helps...
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 22631 bytes
Desc: not available