Dispatching and scheduling--basic questions

Wed Sep 17 09:05:03 PDT 2008

On Tue, 2008-09-16 at 22:56 +0300, Daniel Stone wrote:
> On Tue, Sep 16, 2008 at 03:01:55PM -0400, Adam Jackson wrote:
> > Well, okay, there's at least two tactics you could use here.  We could
> > either go to aggressive threading like in MTX, but that's not a small
> > project and I think the ping-pong latency from bouncing the locks around
> > will offset any speed win from parallelising rendering.  You can
> > mitigate some of that by trying to keep clients pinned to threads and
> > hope the kernel pins threads to cores, but atoms and root window
> > properties and cliplist manipulation will still knock all your locks
> > around... so you might improve fairness, but at the cost of best-case
> > latency.
> 
> I suspect that on current Intel/AMD hardware, the lock cost is virtually
> zero (unless you have two threads noisily contending, with the worst
> case being that the individual requests take twice as long to return as
> previously, with the overall runtime being mostly unchanged -- but that
> can be mitigated by being smart about your threading).  On ARM-type
> hardware, due to our puny cache, you'd have to be smart about keeping
> your locks on the same cacheline or two, since dcache misses come often
> and hurt a lot.

Lock cost on x86 is relatively cheap on one die.  If you go off the die,
just pack it up and go home.  Admittedly this is a scaling problem only
for things you do millions of per second, which is not true of
atoms/properties/misc...

Still, I don't like the complexity explosion.

> > Or, we keep some long-lived rendering threads in pixman, and chunk
> > rendering up at the last instant.  I still contend that software
> > rendering is the only part of the server's life that should legitimately
> > take significant time.  If we're going to thread to solve that problem,
> > then keep the complexity there, not up in dispatch.
> 
> Mm, then you have a server with a very good overall best case, but still
> a pretty terrible overall worst case.  What happens when an XGetImage
> requires a complete GPU sync (forget software rendering for a moment),
> which takes a while, then a copy? Bonus points if you have to stall to
> clean its MMU, too.  Then you memcpy it into SHM and get that out to
> the client, but in the meantime, all your other clients waiting for
> trivial requests are doing just that: waiting.

GetImage isn't ever going to win from parallelising the get itself,
since you'll certainly be bus-limited, so: start a GetImage (GI) thread and/or post
a download DMA, put the client to sleep until it finishes, complete it
in ProcessWorkQueue().  If you're doing memcpy for the GetImage, that
won't help _too_ much, since you'll still have to stall the GPU (or at
least partial-flush rendering to the surface in question).  But you're
positing a working GPU, so use DMA already.
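
Very roughly, the shape I have in mind is something like the toy below.
This is a sketch, not server code: ToyClient, ToyDownload, and the
toy_* functions are all made up for illustration, the "DMA engine" is
faked by a function that moves a few bytes per call, and the real
ProcessWorkQueue() is not called anywhere.  The only point is the
split: the request handler posts the download and sleeps the client,
and a work-queue callback finishes the reply and wakes it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { TOY_CHUNK = 4, TOY_MAX = 64 };

/* Hypothetical stand-in for a client record. */
typedef struct {
    bool asleep;                    /* dispatch would skip a sleeping client */
    unsigned char reply[TOY_MAX];
    size_t reply_len;
} ToyClient;

/* Hypothetical record for one in-flight GetImage download. */
typedef struct {
    ToyClient *client;
    const unsigned char *src;       /* pretend this is the surface */
    size_t len, copied;
    unsigned char staging[TOY_MAX];
} ToyDownload;

/* Request-handler side: post the download and put the client to sleep.
 * Returns 0 on success, -1 if the image does not fit the toy buffers. */
static int toy_get_image_post(ToyDownload *d, ToyClient *c,
                              const unsigned char *src, size_t len)
{
    if (len > TOY_MAX)
        return -1;
    d->client = c;
    d->src = src;
    d->len = len;
    d->copied = 0;
    c->asleep = true;
    return 0;
}

/* Fake DMA engine: moves at most TOY_CHUNK bytes per call.  It stands in
 * for hardware (or a worker thread) making progress on its own. */
static void toy_dma_tick(ToyDownload *d)
{
    size_t n = d->len - d->copied;
    if (n > TOY_CHUNK)
        n = TOY_CHUNK;
    memcpy(d->staging + d->copied, d->src + d->copied, n);
    d->copied += n;
}

/* Work-queue callback: returns false while the download is still in
 * flight, true once the reply is filled in and the client is awake. */
static bool toy_get_image_work(ToyDownload *d)
{
    if (d->copied < d->len)
        return false;
    memcpy(d->client->reply, d->staging, d->len);
    d->client->reply_len = d->len;
    d->client->asleep = false;
    return true;
}
```

The main loop keeps servicing everyone else between calls to
toy_get_image_work(); only the client that asked for the image waits.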

You do contend for memory bandwidth either way I suppose.  Not much to
be done about that.  Also there's an ordering issue: the GetImage has to
appear to complete atomically, which means if you're doing software GI,
you have to sleep any other client that touches that drawable until the
GI finishes.  You could do that way up at the dispatch layer, but wow,
you'd have just touched every operation.  Hmm.
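
To make the ordering problem concrete, here is a toy version of that
check.  Again, every name in it (ToyClient, ToyDrawable, the toy_*
functions, the fixed-size waiter list) is invented for illustration and
is not how the server actually tracks drawables.  The ugly part is that
toy_touch_drawable() would have to run in front of every request that
reads or writes a drawable.

```c
#include <stdbool.h>
#include <stddef.h>

enum { TOY_MAX_WAITERS = 8 };

/* Hypothetical stand-in for a client record. */
typedef struct {
    int id;
    bool asleep;
} ToyClient;

/* Hypothetical drawable with a "software GetImage in flight" flag and
 * a list of clients parked behind it. */
typedef struct {
    bool gi_pending;
    ToyClient *waiters[TOY_MAX_WAITERS];
    size_t nwaiters;
} ToyDrawable;

typedef enum { TOY_RUN, TOY_PARKED, TOY_NO_ROOM } ToyVerdict;

/* Mark the drawable as being read by a software GetImage. */
static void toy_gi_begin(ToyDrawable *d)
{
    d->gi_pending = true;
}

/* The check every drawable-touching request would need.  A parked
 * client is asleep, so it cannot issue another request and get queued
 * twice. */
static ToyVerdict toy_touch_drawable(ToyDrawable *d, ToyClient *c)
{
    if (!d->gi_pending)
        return TOY_RUN;
    if (d->nwaiters == TOY_MAX_WAITERS)
        return TOY_NO_ROOM;
    d->waiters[d->nwaiters++] = c;
    c->asleep = true;
    return TOY_PARKED;
}

/* GetImage done: wake everyone parked on the drawable so they can
 * retry their requests.  Returns how many clients were woken. */
static size_t toy_gi_finish(ToyDrawable *d)
{
    size_t woken = d->nwaiters;
    for (size_t i = 0; i < d->nwaiters; i++)
        d->waiters[i]->asleep = false;
    d->nwaiters = 0;
    d->gi_pending = false;
    return woken;
}
```

Clients that never touch the drawable are unaffected, which is the
fairness win; the cost is one more branch on every drawable operation.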

- ajax