SIMD-less render optimizations
daniel at fooishbar.org
Wed Apr 18 02:48:20 PDT 2007
On Tue, Apr 17, 2007 at 10:45:57PM -0700, Daniel Amelang wrote:
> 1) Manually unroll the inner loop four times. This reduces the loop
> overhead (obviously). But also, by reading 4 pixels at a time, we
> could take advantage of the architecture's special sequential register
> load operations (like ldmia on ARM), which in this case can speed up
> reading 4 32-bit source pixels all in a row.
> 2) Once the loop is unrolled, reduce the number of reads/writes to
> memory by clustering every two 16-bit dest pixel reads and writes
> together into a single 32-bit read/write. This reduced the overall
> number of memory accesses by 33%.
To be fair, this is kind of a special case, because the N800's memory is
shockingly slow. :)
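
For anyone following along without the patch open, here is a rough sketch
of points 1 and 2. It is not the code from the patch: the function name,
the pack_565 helper and the PAIR_565 macro are all made up for
illustration, and it assumes a plain x8r8g8b8 to r5g6b5 conversion rather
than a full composite. It only shows the write side; a blending path would
pair the dest reads the same way. One plausible reading of the 33% figure
is that, per pair of pixels, 2 source reads + 2 dest reads + 2 dest writes
(6 accesses) become 2 + 1 + 1 (4 accesses).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical macro: combine two r5g6b5 pixels into one 32-bit word so
 * that pixel a lands at the lower address and pixel b at the higher one. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define PAIR_565(a, b) (((uint32_t)(a) << 16) | (uint32_t)(b))
#else
#define PAIR_565(a, b) ((uint32_t)(a) | ((uint32_t)(b) << 16))
#endif

/* Hypothetical helper: pack one x8r8g8b8 pixel into r5g6b5. */
static inline uint16_t pack_565(uint32_t s)
{
    return (uint16_t)(((s >> 8) & 0xf800) |   /* top 5 bits of red   */
                      ((s >> 5) & 0x07e0) |   /* top 6 bits of green */
                      ((s >> 3) & 0x001f));   /* top 5 bits of blue  */
}

/* Illustrative only: convert n pixels, four per loop iteration. */
void convert_8888_to_0565(uint16_t *dst, const uint32_t *src, size_t n)
{
    /* One scalar pixel, if needed, so the paired stores are 4-byte aligned. */
    if (n > 0 && ((uintptr_t)dst & 2u)) {
        *dst++ = pack_565(*src++);
        n--;
    }

    while (n >= 4) {
        /* Four sequential loads; on ARM a compiler may merge these into
         * a single ldmia. */
        uint32_t s0 = src[0], s1 = src[1], s2 = src[2], s3 = src[3];

        /* Two dest pixels per 32-bit word: two stores instead of four. */
        uint32_t d01 = PAIR_565(pack_565(s0), pack_565(s1));
        uint32_t d23 = PAIR_565(pack_565(s2), pack_565(s3));

        /* A fixed-size memcpy usually compiles to one 32-bit store and
         * sidesteps strict-aliasing problems. */
        memcpy(dst, &d01, sizeof d01);
        memcpy(dst + 2, &d23, sizeof d23);

        src += 4;
        dst += 4;
        n -= 4;
    }

    /* Leftover pixels, one at a time. */
    while (n > 0) {
        *dst++ = pack_565(*src++);
        n--;
    }
}
```

Whether the compiler really emits ldmia and single 32-bit stores depends on
the toolchain and flags, so the generated assembly is worth checking.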
> 4) Use a macro (or inline function) instead of a per-pixel function call.
Indeed, this is true for a lot of fb/.
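
To make the difference concrete, a minimal sketch of the two shapes (the
names below are invented for illustration and are not taken from fb/):
the first loop pays for an indirect call on every pixel, the second lets
the compiler fold the per-pixel work into the loop body.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel operation type. */
typedef uint32_t (*pixel_fn)(uint32_t);

/* Per-pixel call: one indirect function call for each pixel. */
void map_pixels_call(uint32_t *px, size_t n, pixel_fn fn)
{
    size_t i;
    for (i = 0; i < n; i++)
        px[i] = fn(px[i]);
}

/* Hypothetical operation: invert the colour channels, keep the top byte. */
static inline uint32_t invert_rgb(uint32_t p)
{
    return p ^ 0x00ffffffu;
}

/* Inline variant: the helper is visible at the call site, so the
 * compiler can expand it inside the loop. */
void invert_pixels_inline(uint32_t *px, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        px[i] = invert_rgb(px[i]);
}
```

Both loops compute the same result; only the call overhead differs.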
> So the patch is attached in case someone is interested. I've tried to
> update my code to work with the latest code in git, but things have
> changed a bit since 22.214.171.124, so beware. It could use a couple good
> code reviews.
I'll merge it into my N800 tree, and if no one objects, we should
merge it into mainline as well.
Thanks very much for doing all this!