Optimization idea: soft XvPutImage

Wed Sep 17 05:22:18 PDT 2008

	I want to suggest a way we could eliminate a substantial
amount of data copying when playing video on X servers that do not
provide hardware video windows, including servers that offer the X
shared memory extension.  In common situations, I suspect that this
could reduce memory bus utilization for playing video by more than a
factor of two.

	I do not know if I have time to implement this optimization
right now, but I think it is potentially a big enough benefit that I
really ought to describe it here in case someone else wants to
implement it or can relieve me of thinking about it by showing me why
it will not work.

	The copy operation that I want to eliminate occurs when the X
server reads data from XPutImage (usually via a shared memory area)
and copies it to the frame buffer.  The amount of data copied is
particularly large because the image is often "stretched" from its
native dimensions (720x480 for DVD, for example) to the dimensions of
the display area (for example, 1920x1200 for full screen video on a
24" panel).

	To eliminate this copy, I want the X server to receive the
unstretched YUV image through XvPutImage, provided by the Xvideo-2.2
("Xv") extension, as is done for display hardware that provides video
windows, which typically do YUV->RGB and stretch in the display
hardware.  In this proposed Xv driver, which I will refer to as "soft
XvPutImage", the YUV->RGB and stretch operations would have to be done
in software by the X server, just as they are currently done in
software by video playing programs.  The difference is that by
combining this operation with the X server receiving the image, a big
copy operation is eliminated that might plausibly account for more
than half of the memory bus utilization in some common video playing
scenarios.
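
	The fused operation described above can be sketched in a few
lines of Python.  This is only an illustration under stated
assumptions, not a proposed implementation: it assumes packed yuv422
input in YUY2 byte order (Y0 U Y1 V) with an even source width,
nearest-neighbor stretching, the common integer approximation of the
BT.601 video-range conversion, and a 32-bit BGRX frame buffer.  The
names clamp8, yuv_to_rgb and soft_xv_put_image are hypothetical.  The
point is that each destination pixel is computed from the small YUV
source and written once, with no full-size intermediate RGB image.

```python
def clamp8(v):
    """Clamp a value to the 0..255 range of one 8-bit channel."""
    return 0 if v < 0 else 255 if v > 255 else int(v)


def yuv_to_rgb(y, u, v):
    """Convert one BT.601 video-range YUV sample to 8-bit RGB."""
    c, d, e = y - 16, u - 128, v - 128
    r = clamp8((298 * c + 409 * e + 128) >> 8)
    g = clamp8((298 * c - 100 * d - 208 * e + 128) >> 8)
    b = clamp8((298 * c + 516 * d + 128) >> 8)
    return r, g, b


def soft_xv_put_image(src, src_w, src_h, dst, dst_w, dst_h):
    """Convert and stretch a YUY2 image straight into a BGRX buffer.

    src holds src_w*src_h*2 bytes of packed Y0 U Y1 V (src_w even);
    dst is a writable buffer of dst_w*dst_h*4 bytes standing in for
    the frame buffer.  Hypothetical helper, nearest-neighbor only.
    """
    for dy in range(dst_h):
        sy = dy * src_h // dst_h              # nearest source row
        for dx in range(dst_w):
            sx = dx * src_w // dst_w          # nearest source column
            pair = (sy * src_w + (sx & ~1)) * 2   # start of Y0 U Y1 V group
            y = src[pair + (sx & 1) * 2]      # Y0 for even sx, Y1 for odd
            u = src[pair + 1]                 # chroma is shared by the pair
            v = src[pair + 3]
            r, g, b = yuv_to_rgb(y, u, v)
            out = (dy * dst_w + dx) * 4
            dst[out:out + 4] = bytes((b, g, r, 0))   # single write per pixel
```

	A real driver would do this in C, most likely with better
filtering, but the memory access pattern would be the same: read the
unstretched YUV once, write the stretched RGB once.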

	I realize that most modern video hardware has YUV/stretch
video window capabilities or other hardware acceleration for this
operation (for example, in hardware 3D operations), but there are
common cases in practice where this optimization should be useful:

	1) Improving the capabilities of the weakest systems would
	   allow video to be used more ubiquitously (for example,
	   adding video-based tutorials to larger application suites
	   might become more common).

	2) Many open source drivers lack this YUV/stretch capability
	   even if the hardware has it, due to lack of public
	   documentation or slow development in comparison to the
	   life cycle of the hardware, even though efforts to address
	   these problems are definitely helping.

	3) The following scenarios may fall under #1 or #2, but are
	   worth separate mention:

		a) On systems with more processor cores (typically ones
		   which have YUV/stretch hardware but lack drivers),
		   memory bus utilization will be especially important.

		b) "Fake" X servers, such as for VNC or when running
		   on a virtualized computer, are less likely to have
		   access to acceleration hardware (although it is
		   possible).

		c) There are those who believe that 3D acceleration
		   hardware will be traded off for more CPU cores in
		   typical systems of the future.  So, at least for the
		   case of playing video through a 3D effect, this
		   optimization may help.  See, for example, the
		   "Twilight of the GPU" interview on slashdot yesterday
		   at http://tech.slashdot.org/tech/08/09/15/2116240.shtml .

	4) There are also a couple of cases of small benefit I will note
	   for completeness:

		a) For video with a slow frame rate playing on a monitor
		   with a high refresh rate where the frame buffer and
		   video window are part of system memory (i.e., no video
		   RAM), where pixels in the frame buffer under the video
		   window are still fetched for chroma key comparison,
		   soft XvPutImage might actually use less bandwidth than
		   a YUV/stretch video window.

		b) Not part of this proposal, but a similar idea for
		   systems that have Xv but lack XvMC would be SoftXvMC,
		   which would eliminate the verbatim copying of YUV data
		   into the server through Xv, although the bandwidth
		   savings would be more modest.

	To understand the possible bandwidth savings, here is a
calculation based on the scenario mentioned earlier: playing standard
DVD (720x480 yuv422) stretched to 1920x1200 (a popular full screen
resolution).

	To start, here is a list of data transfers that occur in the
early stages of video decoding, regardless of whether this soft
XvPutImage optimization is used.  (I believe yuv422 is 2 bytes per
pixel).  In the descriptions I refer to the media player as "mplayer"
and the video format as "MPEG", but the argument applies to user level
video players in general and most video formats, since the only thing
that is important about the video format is that it achieves such good
compression that copying the fully decoded video is what dominates
memory bus utilization.

Common transfers with and without this optimization:

    kernel: MPEG data DVD player -> buffer      .66 MiB/sec  ("1X DVD")
    kernel: MPEG data buffer -> CPU             .66 MiB/sec
    kernel: MPEG data CPU -> mplayer buf        .66 MiB/sec
    mplayer: MPEG data mplayer buf -> CPU       .66 MiB/sec
    mplayer: VLD + iDCT                          ?
    mplayer: interpolated frames + motion comp.  80 MiB/sec???
                                                ---------
                                                 82.64 MiB/sec + ?

	Having separated the operations common to both cases, we can
now more easily compare the memory bandwidth of XPutImage versus soft
XvPutImage (both using shared memory).

With XPutImage:

    Common transfers shown above                 82.64 MiB/sec + ?
    mplayer: CPU --> 720x480 yuv422 @ 60Hz      40 MiB/sec (720x480x2x60) *
    mplayer: 720x480 yuv422 -> CPU               40 MiB/sec
    mplayer: CPU -> 1920x1200 RGB(XPutImage)    527 MiB/sec (1920x1200x4x60)
    Xserver: 1920x1200 RGB -> CPU               527 MiB/sec
    Xserver: 1920x1200 RGB CPU->framebuffer     527 MiB/sec
                                                ---------
                                                1.70 GiB/sec + ?

With soft XvPutImage:

    Common transfers shown above                 82.64 MiB/sec + ?
    mplayer: CPU --> 720x480 yuv422 (XvPutImage) 40 MiB/sec (720x480x2x60) *
    Xserver: 720x480 yuv422 -> CPU               40 MiB/sec
    Xserver: 1920x1200 RGB CPU->Frame buffer    527 MiB/sec (1920x1200x4x60)
                                                ---------
                                                0.67 GiB/sec + ?

* yuv422 is 2 bytes per pixel.
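
	The arithmetic in the tables above can be checked with a short
script.  This is a sketch that simply reuses the figures from the
tables: 60 frames per second, 2 bytes per pixel for yuv422, 4 bytes
per pixel for RGB, and the 82.64 MiB/sec of common transfers (which
itself contains the uncertain 80 MiB/sec motion compensation estimate
and leaves out the unknown "?" term).  The helper name rate_mib is
hypothetical.

```python
MIB = 2 ** 20


def rate_mib(width, height, bytes_per_pixel, fps=60):
    """MiB/sec needed for one full pass over an image at the given rate."""
    return width * height * bytes_per_pixel * fps / MIB


# Common transfers from the first table; the 80 is the table's own guess.
common = 4 * 0.66 + 80
yuv = rate_mib(720, 480, 2)       # one pass over the unstretched yuv422 frame
rgb = rate_mib(1920, 1200, 4)     # one pass over the stretched RGB frame

# XPutImage: yuv written and read once, RGB written, read and written again.
xputimage = common + 2 * yuv + 3 * rgb
# Soft XvPutImage: yuv written and read once, RGB written once.
soft_xv = common + 2 * yuv + 1 * rgb

print(f"yuv pass: {yuv:.1f} MiB/sec, rgb pass: {rgb:.1f} MiB/sec")
print(f"XPutImage:       {xputimage / 1024:.2f} GiB/sec + ?")
print(f"soft XvPutImage: {soft_xv / 1024:.2f} GiB/sec + ?")
```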

	In the impossibly ideal case where no memory transfers are
generated by the variable length decoding, inverse discrete cosine
transformation, motion compensation and other activity, this
optimization would reduce memory bus utilization by a factor of about
2.5 for a screen resolution of 1920x1200.  As screens get bigger (or
if you make the unstretched video resolution smaller), the improvement
in memory bandwidth utilization approaches 3 asymptotically.
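
	To see how the factor behaves as the display grows, here is a
small sketch that evaluates the same ratio for several display sizes.
It assumes the transfer counts from the tables above (three RGB passes
versus one, two YUV passes in both cases) and, by default, the 82.64
MiB/sec of common transfers; the function name improvement and the
list of display sizes are only illustrative.

```python
MIB = 2 ** 20


def improvement(dst_w, dst_h, src_w=720, src_h=480, fps=60,
                common_mib=82.64):
    """Ratio of XPutImage to soft XvPutImage memory traffic.

    common_mib is the traffic shared by both paths, in MiB/sec.
    """
    common = common_mib * MIB
    yuv = src_w * src_h * 2 * fps     # one pass over the yuv422 source
    rgb = dst_w * dst_h * 4 * fps     # one pass over the 32-bit RGB output
    return (common + 2 * yuv + 3 * rgb) / (common + 2 * yuv + rgb)


for w, h in [(1280, 800), (1920, 1200), (2560, 1600), (7680, 4800)]:
    print(f"{w}x{h}: factor {improvement(w, h):.2f}")
```

	The RGB term appears three times in the numerator and once in
the denominator, so as it comes to dominate the other terms the ratio
tends toward 3 without ever reaching it.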

	In reality, there would be other sources of memory bandwidth
utilization, reducing the fraction that this optimization accounts
for, but I expect this optimization would still be very substantial.
Consider that these bandwidth predictions are a substantial fraction
of the total CPU to DRAM bandwidth available on a typical computer,
which is the relevant resource when you consider that the size of the
stretched video is typically too large to fit in the CPU caches.  For
example, a DDR2-1066 memory module has a maximum transfer rate of
7.94 GiB/sec, minus delays due to accessing different 16KiB DRAM
columns (double this for dual channel systems, halve it for DDR2-533
memory).
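
	The peak figure and the fraction of it that each path would
consume can be worked out as follows.  This sketch assumes the usual
64-bit (8 byte) data bus per memory channel and takes the 1.70 and
0.67 GiB/sec totals from the tables above; it ignores the DRAM access
delays mentioned in the text, so real headroom is smaller.  The
function name peak_gib_per_sec is hypothetical.

```python
GIB = 2 ** 30


def peak_gib_per_sec(mega_transfers, bus_bytes=8, channels=1):
    """Theoretical peak rate of a DDR memory interface in GiB/sec."""
    return mega_transfers * 1e6 * bus_bytes * channels / GIB


ddr2_1066 = peak_gib_per_sec(1066)
print(f"DDR2-1066, one channel: {ddr2_1066:.2f} GiB/sec")

# Totals from the tables above, excluding the unknown "?" term.
for name, demand in [("XPutImage", 1.70), ("soft XvPutImage", 0.67)]:
    share = 100 * demand / ddr2_1066
    print(f"{name}: {share:.0f}% of the single-channel peak")
```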

	By the way, the effect of the CPU cache is likely to increase
the benefit of this optimization toward that ideal factor of 3
improvement, because only the earlier stages of video decoding
potentially have data footprints small enough for the CPU
data caches to eliminate any memory bus transfers.

	The reduction in memory bus utilization would also grow (but
never beyond that factor of 3) as the ratio of the size of the
stretched video to the unstretched video increases, such as when
playing a 720x480 video on a newer 2560x1600 display.

	The potential benefits seem pretty substantial to me, and I
have more happy vaporware speculation: the implementation effort may
be quite small, because it should not be necessary to write the video
format conversion and stretch code, as that code can be taken from
existing free video players that do this on the client side,
particularly libswscale, which is already conveniently in its own
subdirectory of the ffmpeg library used by many video players.  Since
there are no formal releases of ffmpeg, and since ffmpeg already
maintains libswscale as a separate source control tree, it might make
sense to ask the ffmpeg people about evolving ffmpeg into a
freedesktop.org project with tar file releases, or merging it into
libpixman.  This would also at least require some header file movement
to allow libswscale to compile without ffmpeg.

	Finally, I would like to mention a couple of alternatives to
this approach, for completeness, because I do not want to suggest
that this is necessarily the end of the line for optimizing
software based video playing.

	Alternative #1: If you have 3D hardware that supports YUV
textures, with appropriate drivers properly configured and support in
your X server, but lack Xv support for some reason, 3D surely will
provide better performance than any software XvPutImage, and there is
apparently already support for this in at least one video player.

	Alternative #2: I do not know enough about DRI to be sure
about this, but I think that a slightly more optimal approach for
systems without any hardware video acceleration would be to allow
trusted (i.e., privileged) video players to map the frame buffer
directly and coordinate changes to clip regions and window
coordinates, which I think is already done for 3D using the
Direct Rendering Infrastructure (DRI).  As far as I know, DRI
theoretically does not actually require 3D, so video players with
appropriate permissions should be able to use DRI to map the frame
buffer and draw to it.  However, DRI currently requires a DRM driver
for each type of video card, and this would require writing new (but
relatively simple) video output drivers for any video player that would
use this approach, whereas soft XvPutImage should work with existing
video players that already support Xv.  There are also the relatively
minor drawbacks of this approach being limited to the very common case
of the player running on the same machine as the frame buffer, the
frame buffer being memory mapped in a sensible format, and the video
player running with enough privilege to use DRI.

	I have written this description partly so that I will no
longer feel like a roadblock to more ubiquitous video playing by not
mentioning it and not getting around to implementing it, and also to
find out if there are compelling reasons not to do it, such as an
existing implementation that I have missed or an already existing
optimization in software video playing that makes this optimization
idea useless.  So, further analysis, corrections, and, of course,
implementation, would be most welcome.

Adam Richter