[Xorg-driver-geode] Plans for next releases

Mart Raudsepp leio at gentoo.org
Wed Apr 6 17:33:24 PDT 2011


Hello,

Here are my two (now Euro) cents then, all thrown into one message;
some of it also replies and comments to the rest of the thread so far.

On Fri, 2011-04-01 at 12:34 +0200, Christian Gmeiner wrote:
> Hi all,
> 
> I want to know what the plans for the next releases of the driver are.

I am not aware of any concrete plans - releases happen whenever
release-worthy code actually appears from some volunteer, I suppose.
That is, if you mean a new big release, as opposed to tiny bug fix
point releases.

Currently git master doesn't have anything release-worthy.

> Currently I am using the geode driver for an embedded system and I am not that happy
> with the speed of the driver with enabled hardware acceleration. At the moment
> the driver performs much better if I disable acceleration.

In my measurements, completely disabled acceleration didn't perform
better overall. That said, some very frequently hit cases do
subjectively feel a lot faster with NoAccel:
http://people.freedesktop.org/~leio/geode/perf/NoAccel-vs-HwAccel-2010-06-17-absolute.png

I think Option "EXANoComposite" "yes" was benchmarked to be somewhat
slower in many cases too. I don't have all the old results handy, but
I am quite sure I posted some to this mailing list's archives,
alongside other snippets of information.
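For reference, both are Device section options in xorg.conf. An
illustrative snippet - the Identifier and the rest depend on your
setup:

    Section "Device"
        Identifier "GeodeLX"
        Driver     "geode"
        # Disable acceleration entirely:
        Option "NoAccel" "true"
        # ...or keep acceleration but skip RENDER Composite in EXA:
        #Option "EXANoComposite" "yes"
    EndSection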

> I am using a Debian based system with X.Org X Server 1.7.7.

I strongly believe that any performance work should be conducted on an
up-to-date X.org stack, meaning at least X server 1.10. We can't
reasonably optimize for old versions - we might then find the driver
slow on modern versions instead, or miss opportunities to make it
faster more easily. That said, the biggest change in
acceleration-related behaviour seemed to happen slightly before the
1.7 series.

> Where are the current problems in the area of hardware acceleration?

The main problem is that some RENDER Porter/Duff Composite operations
are not hardware accelerated, causing what is known amongst the X.org
people as "EXA pixmap migration ping-pong". This shows itself as
memcpy sitting at the top of CPU profiles of X.org drawing (profiled
with e.g. sysprof, oprofile or the new kernel perf tools).

In the currently used "EXA classic mode" scheme, the EXA generic code
is responsible for managing both video and normal memory for pixmaps.
It assumes that only video memory can be used for the driver's
hardware-accelerated operations, and that only normal memory can be
used by software fallback rendering (the EXA code calls into the
pixman library when the driver's CheckComposite or PrepareComposite
hooks return FALSE).

This means that for many drawing operations that can't be hardware
accelerated, EXA needs to "download" pixmaps that have been "uploaded"
to video memory back to system memory (in our case this boils down to
a simple memcpy) in order to work with the latest contents of the
pixmap. This downloading or uploading of a pixmap is called a pixmap
migration. Then some other drawing operation is done on that pixmap
which can be hardware accelerated, so EXA uploads it back to video
memory and does the accelerated operation. You can see how this
results in a lot of back-and-forth copying if many commonly done
operations are not hardware accelerated. There are some heuristics
that try to reduce this (not sure how exactly, but for example perhaps
by forcing software rendering when it figures the next operations on
the pixmap will need to be done in software anyway, so it isn't put in
video memory for one tiny hwaccel operation, etc.), but these don't
work all that well, and we can do better than relying on them.
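To make the flow concrete, here is a minimal sketch of the driver side
of this - illustrative names and a simplified check, not the actual
geode code:

    #include "exa.h"          /* EXA driver API in the X server */
    #include "picturestr.h"   /* PicturePtr and PICT_* formats */

    /* EXA asks the driver whether it can accelerate a composite;
     * answering FALSE is what triggers the migration to system
     * memory and the pixman software fallback. */
    static Bool
    GeodeCheckComposite(int op, PicturePtr pSrc, PicturePtr pMask,
                        PicturePtr pDst)
    {
        /* e.g. one of the known unaccelerated cases: */
        if (pSrc->format == PICT_a8 || pDst->format == PICT_a8)
            return FALSE;   /* EXA "downloads" pixmaps (memcpy) and
                             * renders with pixman in system RAM */
        return TRUE;        /* pixmaps get "uploaded" to video
                             * memory; the GP does the drawing */
    }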

To the best of my knowledge, only the portion of RAM reserved as video
memory can be used for hardware acceleration (the result must be
placed there at least; I believe the same goes for the source, though
I'm not 100% sure on the latter).
However, as the video memory is just a reserved area of RAM, the CPU
ought to be able to access it quite nicely, and so we ought to be able
to do the software fallback straight in video memory, provided that
software and hardware access is serialized correctly (i.e.
gp_wait_until_idle() has been called - by ourselves if necessary, or
via the EXA WaitMarker hook if this is or can be handled in the
generic EXA code).
That is, we shouldn't need to download such a pixmap back to system
memory at all - we should do the software fallback that touches it
straight in video memory, where its latest copy lives, without
migrating it out first. (Conversely, if the pixmap needing the
software fallback is already in system memory, we shouldn't migrate it
to video memory first, etc.)
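As a side note on the serialization part, the EXA WaitMarker hook is
the standard place for it. A minimal sketch, assuming our Cimarron
gp_wait_until_idle() call:

    #include "exa.h"

    /* Block until the graphics processor has finished everything
     * queued, so the CPU can safely touch pixmap memory. */
    static void
    GeodeWaitMarker(ScreenPtr pScreen, int marker)
    {
        gp_wait_until_idle();
    }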

How to achieve that is what I looked closely into during the hackathon
that Martin-Éric referred to.

EXA does have driver hooks available for such things, namely
PrepareAccess and FinishAccess, but they don't work in the manner we
need: in classic mode they are basically only used for the main
framebuffer pixmap, if I remember the details right.
So the alternative is to implement EXA "driver mode" or "mixed mode".
The difference from classic mode is primarily that all or some of the
pixmap memory management lies on the shoulders of the driver code.

However, in general we should be quite fine with classic mode. As in,
I don't really want us to implement our own memory management - at
least not until we have KMS, anyway (unsure of the details there).
So during the hackathon I looked into just calling pixman directly
from our Composite hooks, but we don't have all the parameters
available to us in the form pixman needs. Later on, having caught
Michel Dänzer (one of the main developers of the EXA generic code) on
IRC, I learned this can't really work anyway; he suggested the
alternative described below.

So my plan was to look into and improve the EXA generic code - to
allow classic mode to work with software fallbacks in video memory,
e.g. through the PrepareAccess and FinishAccess hooks just returning
right away, or possibly with PrepareAccess ensuring a
gp_wait_until_idle(), and to make it all fit together and work
correctly.
Once such work is done, we shouldn't see memcpy dominating the slow
cases anymore, and things ought to be a lot faster.
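In sketch form, the end goal would look roughly like this - purely
hypothetical, since it depends on EXA generic code changes that do not
exist yet:

    #include "exa.h"

    /* Tell EXA the pixmap is CPU-accessible right where it is, so
     * pixman renders into video memory directly instead of forcing
     * a migration; we only need to serialize against the GP. */
    static Bool
    GeodePrepareAccess(PixmapPtr pPix, int index)
    {
        gp_wait_until_idle();   /* GP must be done with the pixmap */
        return TRUE;            /* no download needed */
    }

    static void
    GeodeFinishAccess(PixmapPtr pPix, int index)
    {
        /* nothing to undo - the pixmap never moved */
    }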

I do intend to continue this work, but I cannot promise when that will
happen - it might be in a few weeks, in months, next year or the year
after... So help is definitely very much welcome here.



You might also look into which operations commonly can't be hardware
accelerated for your use cases. This is what we added GEODE_TRACE_FALL
for - if you define it to 1 instead of 0 and recompile the driver, it
should report to Xorg.0.log what caused software fallbacks (a sketch
of roughly what this machinery looks like follows below). Then you can
look closer at what parameters those operations have, and think about
whether you can somehow implement hardware acceleration for those
cases. If the most common rendering operations you encounter get
hardware accelerated, you side-step the pixmap ping-pong issue
entirely.
I believe the EXA generic code (the libexa.so module, found in the
xserver tree's exa/ subdirectory) also has such defines to simplify
debugging. If you can recompile the xserver, or somehow just the EXA
part, you might get additional information out of there - I believe it
has some functions that print out more details about the parameters of
a fallback.
One common fallback seems to be "PICT_a8 as src or dst format is
unsupported", and per old source code comments this is quite
unexplored territory. Maybe something can be done in that area.
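Roughly, the tracing machinery looks like this - paraphrased from
memory, so check the driver headers for the exact form:

    #define GEODE_TRACE_FALL 1        /* flip 0 -> 1 and recompile */

    #if GEODE_TRACE_FALL
    #define GEODE_FALLBACK(x)                   \
        do {                                    \
            ErrorF("%s: FALLBACK: ", __func__); \
            ErrorF x;                           \
        } while (0)
    #else
    #define GEODE_FALLBACK(x) do { } while (0)
    #endif

    /* A rejected operation then logs its reason to Xorg.0.log:
     *   GEODE_FALLBACK(("PICT_a8 as src or dst is unsupported\n"));
     */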

You might also be able to employ some tricks where hardware
acceleration is otherwise impossible. E.g., if the "Masks can be only
done with a 8bpp or 4bpp depth" case usually involves a 1bpp mask
(PICT_a1), one could try temporarily converting it to PICT_a4 or
PICT_a8 and applying that in the hardware acceleration path instead.
Of course, such a conversion could cost more than it helps in some
situations, especially once we get software fallbacks to work in video
memory.
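As a sketch of that idea, using plain pixman calls (a hypothetical
helper; inside the driver you would work on the pixmap bits directly):

    #include <pixman.h>

    /* Expand a 1bpp alpha mask into a temporary 8bpp one that the
     * hardware path could consume. The caller must free the result
     * with pixman_image_unref() after compositing with it. */
    static pixman_image_t *
    expand_a1_mask_to_a8(pixman_image_t *a1_mask, int w, int h)
    {
        pixman_image_t *a8_mask =
            pixman_image_create_bits(PIXMAN_a8, w, h, NULL, 0);

        if (!a8_mask)
            return NULL;

        /* plain OP_SRC copy; pixman converts a1 -> a8 on the way */
        pixman_image_composite(PIXMAN_OP_SRC, a1_mask, NULL, a8_mask,
                               0, 0, 0, 0, 0, 0, w, h);
        return a8_mask;
    }

Whether the extra copy pays off would of course have to be measured.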


There are other, rather advanced areas people could look into in the
future, but those offer less gain for a lot more work (e.g., I hear
research into the Cimarron code, and possibly rewriting it, could be
beneficial - for example to batch more operations into one gfx command
upload).


> Is it planned to introduce a KMS driver?

Let's say we are not opposed to it, if someone contributes towards
that ;)

I do not believe that KMS by itself will give any performance benefits
for X.org usage. The related memory management work, forcing a
different kind of pixmap management (perhaps being forced into EXA
mixed or driver mode), might help, but that could probably be done
without KMS too. KMS of course has many other benefits.

I am not aware of any significant 2D acceleration present, or supposed
to be present, in KMS kernel code. Then again, I don't know much about
KMS. I am, however, quite sure that drivers using KMS still implement
most of their hardware acceleration themselves.

I do think going KMS is useful, and that we should definitely do it,
given enough manpower.
It would probably be quite useful even for non-X.org use cases,
replacing the lxfb framebuffer driver, and then we could continue work
on top of that to make the X driver work with it.
I'm not sure we could ditch the UMS code paths anytime soon after
that, though.

There are various places that allow hosting public git trees for the
initial kernel side of the work, so that others can see the initial
code, play with it and co-contribute. So why not just use one of those
for starters - there is no need to wait for a tree to be arranged on
git.kernel.org or elsewhere; it can always be pushed there later on.

> I hope to get some answers to all my questions.

Hopefully you got some answers from my overly long e-mail too; now
time for some hacking! ;)

Of course, don't be afraid to ask more questions.
Some of us also hang out on IRC, in the #geode channel on the FreeNode
network.


-- 
Regards,
Mart Raudsepp


