Pushing image transport logic down the stack

Keith Packard keithp at keithp.com
Mon Sep 4 16:32:50 PDT 2006


On Mon, 2006-09-04 at 18:21 -0400, Owen Taylor wrote:

> It's certainly a seductive idea that we might be able to get the
> graphics card pointed directly at the application's data or a bit
> of the GTK+ mmap'ed icon cache and avoid troubling the processor's
> cache. A much better result. But I think that idea is really the
> enemy here: to get it going would require API changes at every layer,
> from the application to the graphics card.

GL has been moving this way for a long time now, with immediate mode
finally falling to the current 'everything is a buffer' model. We can
build immediate-mode operations on top of buffers, but it's hard to do
it the other way around.
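
To make the contrast concrete, here's a minimal sketch (plain OpenGL
1.5, not tied to any particular driver): the same quad submitted once
in immediate mode, with the CPU feeding every vertex through the API,
and once from a buffer object that the driver is free to DMA from.

#define GL_GLEXT_PROTOTYPES 1
#include <GL/gl.h>
#include <GL/glext.h>

/* One quad, two submission styles.  A sketch only; assumes an
 * OpenGL 1.5 implementation with buffer objects. */
static const GLfloat quad[] = {
    -1.f, -1.f,   1.f, -1.f,   1.f, 1.f,   -1.f, 1.f,
};

/* Immediate mode: the CPU pushes every vertex through the API,
 * one call at a time. */
static void draw_immediate(void)
{
    glBegin(GL_QUADS);
    glVertex2f(-1.f, -1.f);
    glVertex2f( 1.f, -1.f);
    glVertex2f( 1.f,  1.f);
    glVertex2f(-1.f,  1.f);
    glEnd();
}

/* Buffer objects: the data is handed over once; draws then reference
 * the buffer, which the driver can place wherever the hardware can
 * read it directly. */
static void draw_buffered(void)
{
    GLuint vbo;

    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(quad), quad, GL_STATIC_DRAW);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, (void *) 0);
    glDrawArrays(GL_QUADS, 0, 4);
    glDisableClientState(GL_VERTEX_ARRAY);

    glDeleteBuffers(1, &vbo);
}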

We've currently got three GL architectures:

 1) Indirect mode: all data is pumped through a socket to a separate
process.

 2) Direct mode: the X server is bypassed entirely, with DMA directly
from the application's address space.

 3) Managed mode: bulk data is placed in DMA buffers, and commands sent
to a separate process queue that data for execution in the engine.

Combine either of the latter two models with a programmable graphics
engine and we can start talking about high-level data streams encoded
in the DMA queue, rather than just image transfer. Eliminating the CPU
from the data-reformatting piece of the puzzle offers some interesting
performance opportunities, especially as we seek to avoid polluting the
CPU data cache with miles of boring pixmap data.

Xv is an interesting model: in the ideal case the CPU never touches the
data, and that's what we'd like to see for all image data.

> What interests me is getting copies down as far as possible without
> major changes to the model. There is a lot of complexity and
> inefficiency in what goes on right now that is just extraneous.

Distinguishing between copies and data flow management is important; we
can pass data through the X server from the client as long as we don't
copy it. It needn't ever exist in the X server's address space, which
seems like a useful optimization.
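
In kernel terms that's roughly the splice(2) shape of things: the
server moves pages from the client's connection toward the device
without ever mapping them itself. A toy sketch, with the device end as
a stand-in for driver support that doesn't exist yet:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Toy sketch: the server forwards a frame from the client connection
 * to some device fd without the pixels ever landing in its own address
 * space.  splice() needs a pipe on one side, and the device end would
 * need driver support that doesn't exist today. */
ssize_t forward_frame(int client_fd, int device_fd, size_t len)
{
    int p[2];
    size_t left = len;

    if (pipe(p) < 0)
        return -1;

    while (left > 0) {
        /* client socket -> pipe: pages are referenced, not copied */
        ssize_t in = splice(client_fd, NULL, p[1], NULL,
                            left, SPLICE_F_MOVE);
        if (in <= 0)
            break;
        /* pipe -> device: onward, still without mapping them here */
        ssize_t out = splice(p[0], NULL, device_fd, NULL,
                             in, SPLICE_F_MOVE);
        if (out <= 0)
            break;
        left -= out;
    }
    close(p[0]);
    close(p[1]);
    return len - left;
}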

One question here is how to notify the application that the data have
been read by the graphics card. A possibility not yet addressed is to
invalidate the PTEs for the data and block the process when it tries to
write additional data there, or perhaps even allocate new pages in place
of the old ones so that the application can paint the next frame in the
same logical location. I know this is currently inefficient, but there's
a strong interest in fixing these kinds of hardware issues in coming
generations.
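
As a rough sketch of those two alternatives, using nothing beyond
mmap/mprotect (submit_to_card() is a placeholder for whatever actually
queues the DMA, not an existing interface):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define FRAME_BYTES (1024 * 768 * 4)

/* Placeholder for whatever actually queues the DMA. */
extern void submit_to_card(void *pixels, size_t len);

/* Alternative 1: revoke write access, so the client faults (and can be
 * blocked) if it touches the frame before the card has read it.
 * Something would later have to restore PROT_WRITE. */
void submit_and_protect(void *pixels)
{
    submit_to_card(pixels, FRAME_BYTES);
    mprotect(pixels, FRAME_BYTES, PROT_READ);
}

/* Alternative 2: replace the pages outright, so the client can paint
 * the next frame at the same logical address immediately.  A real
 * version needs the kernel to keep the old physical pages alive until
 * the DMA completes, which is exactly the missing piece. */
void submit_and_replace(void *pixels)
{
    submit_to_card(pixels, FRAME_BYTES);
    mmap(pixels, FRAME_BYTES, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}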

> > Neither solves the network case where you really want the object
> > residing in the remote server.

We can push the data around on the net as long as we don't have to run
it through the remote CPU on the way to the graphics engine.

> Well, that's again hard, because you have to guess the intent of the
> application... and then worry about applications with good intentions
> that hurt the overall system.

As we're talking about graphics, we start making assumptions about the
overall use of the system being focused on a single user; concerns about
DOS from user-initiated applications have (so far) largely been ignored,
and I'd like to try to continue ignoring them...


> As I was out hiking today, it occurred to me that the above is pretty
> much nonsense: If the goal is to reduce copies of data during
> communication with the X server, we already know how to do it: use
> a SHM transport for the X protocol. There is no need to get the
> kernel involved with buffer management.

The problem is you trade data copies for inter-process synchronization
and complex storage management issues. If we could make SHM entirely
asynchronous through clever kernel APIs, it would be far more usable in
practice.
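
For reference, this is what the MIT-SHM round trip looks like today:
the copy is gone, but the client has to wait for the ShmCompletion
event before it can touch the segment again (error handling omitted):

#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Sketch of the MIT-SHM round trip; error handling omitted. */
void shm_put(Display *dpy, Window win, GC gc, int width, int height)
{
    int scr = DefaultScreen(dpy);
    XShmSegmentInfo shminfo;
    XImage *img = XShmCreateImage(dpy, DefaultVisual(dpy, scr),
                                  DefaultDepth(dpy, scr), ZPixmap,
                                  NULL, &shminfo, width, height);

    shminfo.shmid = shmget(IPC_PRIVATE, img->bytes_per_line * height,
                           IPC_CREAT | 0600);
    shminfo.shmaddr = img->data = shmat(shminfo.shmid, NULL, 0);
    shminfo.readOnly = True;
    XShmAttach(dpy, &shminfo);

    /* ... render into img->data ... */

    /* Ask for a completion event so we know when the segment is free. */
    XShmPutImage(dpy, win, gc, img, 0, 0, 0, 0, width, height, True);

    /* Here's the inter-process synchronization: block until the server
     * says it is done with our pages. */
    int done = XShmGetEventBase(dpy) + ShmCompletion;
    XEvent ev;
    do {
        XNextEvent(dpy, &ev);
    } while (ev.type != done);
}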

> You still have the question of what to do about images that exceed
> the size of your protocol buffer:

Providing some way to pass chunks of address space from application to
server without needing to pre-define them seems important here; it would
be nice to just hand the hardware a pointer to the image buffer and have
it able to access it directly. What kinds of kernel APIs would we need
for this, and what kinds of hardware changes would be required to make
that performant?
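
As one existing data point, Linux's vmsplice(2) already lets a process
hand the kernel references to its own pages instead of copying them
into a pipe; it says nothing about the graphics engine, but it suggests
the shape of API I mean:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hand the kernel references to the pages holding a frame instead of
 * copying them into the pipe.  SPLICE_F_GIFT donates the pages so a
 * consumer could in principle move them onward without another copy. */
ssize_t send_frame(int pipe_wr_fd, void *pixels, size_t len)
{
    struct iovec iov = { .iov_base = pixels, .iov_len = len };

    return vmsplice(pipe_wr_fd, &iov, 1, SPLICE_F_GIFT);
}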

>  - You could reference an external shared memory buffer; it's clear
>    that at some image size, allocating a new shared memory buffer is
>    better than copying data, but I have no idea what that point
>    is - is it 100k, 1M, 10M?

I'm interested in learning how we could create a shared object that
referred to existing address space. We already do that from user space
to kernel; why not between two processes?
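
Today the closest thing works the other way around: create the shared
object first, then map it in both processes; what we can't do is take
pages an application already owns and export them after the fact.
Roughly:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create the shared object first, then map it; the other process
 * shm_open()s the same name (without O_CREAT) and mmap()s it too.
 * There's no way to do this for pages a process already owns. */
void *create_shared_buffer(const char *name, size_t len)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    void *p;

    if (fd < 0)
        return NULL;
    ftruncate(fd, len);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);          /* the mapping stays valid after close */
    return p;
}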

-- 
keith.packard at intel.com