[PATCH 6/6] RADEONPrepareAccess_CS: fallback to DFS when pixmap is in VRAM
Michel Dänzer
michel at daenzer.net
Wed Sep 1 04:26:09 PDT 2010
On Wed, 2010-09-01 at 10:37 +1200, Karl Tomlinson wrote:
>
> Michel Dänzer writes:
>
> > * Patch 4 introduces a big endian build break (typo, pSrc instead
> > of pDst in RADEONUploadToScreenCS) which is silently fixed in
> > patch 5.
>
> Sorry about that. I can touch that up, but, if you don't mind,
> I'll wait for feedback to see whether other changes are requested.
Sure.
> > * The r600 changes aren't split up properly,
>
> I'm not clear exactly what you had in mind for a proper split.
>
> The r600 code doesn't have the big-endian paths, so there's
> nothing to fix there until little-endian code also completes
> UTS/DFS even when a scratch BO is not necessary (part 5 in this
> series).
>
> I don't want to activate the unflushed CS bug, fixed in part 3, in
> little-endian code (until it is fixed).
>
> I guess part 4 (complete UTS and DFS even when scratch allocation
> fails) could be moved to after what is currently part 5, but there
> seems little point in completing UTS/DFS in this situation unless
> it ensures some reliability in the methods.
Okay, I'm convinced; thanks for elaborating.
> >> I wondered whether PrepareAccess could fail for the visible screen
> >> with mixed pixmaps as suggested here
> >> http://www.mentby.com/maarten-maathuis/exa-classic-problem-with-xv.html
> >> When I tried that, however, I ended up with pixels in the wrong
> >> places, a bit like what I would expect if the pitch were wrong.
> >
> > I thought I tried that before and it worked. If the problem isn't a
> > straightforward pitch bug, it sounds like it might be tiling related?
>
> Could be. I have color tiling enabled. I expect you know better
> than I whether this is likely.
>
> Things worked fine for a while, then suddenly, on some apparently
> random activity, the whole screen got upset. Colors looked
> reasonable AFAICT, though I wasn't sure which bits of the screen I
> was looking at. IIRC lines bordering regions weren't perfect
> diagonals but more like staircases, and it felt like there was a
> lot of repeating going on.
Have you tried whether it also happens with tiling disabled? Can you
capture the effect in a screenshot?
> > Basically, the reason why these paths were so far limited to big endian
> > (and even there excluded the screen pixmap, as that doesn't require
> > byte-swapping thanks to the surface registers) is that I thought other
> > people like Dave were opposed to using them everywhere based on the
> > belief that it was an overall loss compared to direct CPU access even in
> > VRAM.
>
> Would you be able to point me to any relevant discussion, please?
Not offhand I'm afraid, maybe we only discussed it on IRC.
> VRAM reads are not cached so I find it hard to imagine how reading
> one pixel at a time (4 times in the case of bi-linear filters) can
> be faster than a GPU blit.
>
> I am certain that switching to direct CPU VRAM reads was a VERY
> BAD regression on my system, symptoms similar to trying to use
> offscreen pixmaps in the old days with XAA.
There's no question that direct VRAM access is very slow if the
operation needs to read a large number of pixels. However, that isn't
always the case; consider, e.g., a 0-width line spanning a diagonal of a
large pixmap.
> > If that turns out to be wrong, or at least the worst case
> > behaviour is better, we should be able to make these paths work for all
> > pixmaps.
>
> Sounds good, but doable as an additional improvement, I assume.
Sure.
> > Have you made any measurements with e.g. gtkperf or x11perf to
> > see if there are any bad regressions?
>
> No. What I measured was a RepeatPad operator-over RGB24-source
> composite with a transform. The cost of the direct VRAM read was at
> least 50 times that of the GPU blit.
>
> Naturally, I've also been using my system with these patches for
> some time now.
That's fine; still, it would be good to make some additional
measurements to make sure there aren't any bad regressions.
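For reference, such a comparison could look something like the following (a sketch, not a prescribed procedure; the test names are from x11perf's standard set, chosen here as plausible candidates, and a running X server on the driver under test is required):

```shell
# Hedged sketch: collect a few x11perf numbers with the patches applied,
# then repeat on the unpatched driver and compare the rates.
# -getimage500 exercises screen reads, which these patches affect most.
x11perf -getimage500   -repeat 3 > patched-getimage.txt
x11perf -copywinwin500 -repeat 3 > patched-copywin.txt
x11perf -comppixwin500 -repeat 3 > patched-comppix.txt
```

gtkperf gives a rougher, higher-level view of the same thing and can be run alongside for a sanity check.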
> One possible (smaller) regression is when EXA first creates
> a system copy during a write-only operation, it does an unnecessary
> DFS:
> http://lists.x.org/archives/xorg-devel/2010-August/012298.html
>
> Patch here:
> http://lists.x.org/archives/xorg-devel/2010-August/012299.html
I saw that but haven't got around to reviewing it yet.
> Please do remember that improving the slowest paths is most often
> what matters most. It is the slow paths that get noticed. In this
> case, it would take gains in many operations to recover the
> 2 seconds lost trying to read from VRAM.
I agree, and I am basically in favour of your changes (I'm just
discussing some details); I only wanted to give some background on how
the code came to be in its current form.
--
Earthling Michel Dänzer | http://www.vmware.com
Libre software enthusiast | Debian, X and DRI developer