SIMD-less render optimizations

Daniel Stone daniel at
Wed Apr 18 02:48:20 PDT 2007

Hi Daniel,

On Tue, Apr 17, 2007 at 10:45:57PM -0700, Daniel Amelang wrote:
> 1) Manually unroll the inner loop four times. This reduces the loop
> overhead (obviously). But also, by reading 4 pixels at a time, we
> could take advantage of the architecture's special sequential register
> load operations (like ldmia on ARM), which in this case can speed up
> reading 4 32-bit source pixels all in a row.
> 2) Once the loop is unrolled, reduce the number of reads and writes to
> memory by combining each pair of 16-bit dest pixel reads, and each pair
> of 16-bit dest pixel writes, into a single 32-bit access. This reduced
> the overall number of memory accesses by 33%.

To be fair, this is kind of a special case, because the N800's memory is
shockingly slow. :)
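
For readers without the patch to hand, here is a minimal sketch of points 1
and 2 quoted above. It is not the attached patch: the function names, the
choice of a premultiplied a8r8g8b8 source composited OVER an r5g6b5 dest, and
the little-endian pixel packing are all assumptions made for illustration.
Per group of four pixels it performs 4 source reads, 2 dest reads and 2 dest
writes (8 accesses) instead of 4 + 4 + 4 (12), which is one way to arrive at
the quoted 33%.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rounded division by 255, exact for x in 0..255*255. */
static inline uint32_t div255(uint32_t x)
{
    x += 128;
    return (x + (x >> 8)) >> 8;
}

static inline uint32_t clamp255(uint32_t v)
{
    return v > 255 ? 255 : v;
}

/* Hypothetical helper: premultiplied a8r8g8b8 source OVER one r5g6b5 pixel. */
static inline uint16_t over_8888_0565(uint32_t s, uint16_t d)
{
    uint32_t ia = 255 - (s >> 24);
    /* Expand the 565 dest channels to 8 bits by replicating the top bits. */
    uint32_t dr = ((d >> 8) & 0xf8) | ((d >> 13) & 0x07);
    uint32_t dg = ((d >> 3) & 0xfc) | ((d >> 9) & 0x03);
    uint32_t db = ((d << 3) & 0xf8) | ((d >> 2) & 0x07);
    uint32_t r = clamp255(((s >> 16) & 0xff) + div255(dr * ia));
    uint32_t g = clamp255(((s >> 8) & 0xff) + div255(dg * ia));
    uint32_t b = clamp255((s & 0xff) + div255(db * ia));
    return (uint16_t)(((r & 0xf8) << 8) | ((g & 0xfc) << 3) | (b >> 3));
}

/* Composite one row, four pixels per iteration, dest accessed 32 bits at a
 * time. Assumes a little-endian host: the pixel at the lower address sits
 * in the low 16 bits of each 32-bit word. */
void over_row_8888_0565(uint16_t *dst, const uint32_t *src, size_t width)
{
    size_t i = 0;

    /* Handle one leading pixel so the 32-bit accesses are 4-byte aligned. */
    if (width > 0 && ((uintptr_t)dst & 2)) {
        dst[0] = over_8888_0565(src[0], dst[0]);
        i = 1;
    }

    for (; i + 4 <= width; i += 4) {
        /* Four sequential source loads (a candidate for ldmia on ARM). */
        uint32_t s0 = src[i], s1 = src[i + 1];
        uint32_t s2 = src[i + 2], s3 = src[i + 3];
        uint32_t d01, d23;

        /* Two 32-bit dest reads instead of four 16-bit ones. memcpy keeps
         * this free of aliasing problems; compilers typically turn it into
         * a single load or store. */
        memcpy(&d01, &dst[i], sizeof d01);
        memcpy(&d23, &dst[i + 2], sizeof d23);

        d01 = (uint32_t)over_8888_0565(s0, (uint16_t)(d01 & 0xffff)) |
              ((uint32_t)over_8888_0565(s1, (uint16_t)(d01 >> 16)) << 16);
        d23 = (uint32_t)over_8888_0565(s2, (uint16_t)(d23 & 0xffff)) |
              ((uint32_t)over_8888_0565(s3, (uint16_t)(d23 >> 16)) << 16);

        /* Two 32-bit dest writes instead of four 16-bit ones. */
        memcpy(&dst[i], &d01, sizeof d01);
        memcpy(&dst[i + 2], &d23, sizeof d23);
    }

    /* Remaining 0..3 pixels, one at a time. */
    for (; i < width; i++)
        dst[i] = over_8888_0565(src[i], dst[i]);
}
```

How much this buys depends heavily on the memory system; the slower the
memory, the more the saved accesses matter.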

> 4) Use a macro (or inline function) instead of a per-pixel function call.

Indeed, this is true for a lot of fb/.
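
As a rough illustration of point 4 (not code taken from fb/; every name here
is hypothetical), the difference is between calling a conversion routine
through a pointer once per pixel and letting the compiler expand the same
arithmetic inside the loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Macro form: truncate x8r8g8b8 to r5g6b5. Note that the argument is
 * evaluated three times, so it must not have side effects. */
#define CVT_8888_TO_0565(s)                        \
    ((uint16_t)((((s) >> 8) & 0xf800) |            \
                (((s) >> 5) & 0x07e0) |            \
                (((s) >> 3) & 0x001f)))

/* Inline-function form: same arithmetic, but type-checked and with the
 * argument evaluated once. */
static inline uint16_t cvt_8888_to_0565(uint32_t s)
{
    return CVT_8888_TO_0565(s);
}

typedef uint16_t (*convert_fn)(uint32_t);

/* Per-pixel call through a pointer: one indirect call per pixel, which the
 * compiler generally cannot inline. */
void convert_row_indirect(uint16_t *dst, const uint32_t *src, size_t n,
                          convert_fn convert)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = convert(src[i]);
}

/* Same loop with the conversion expanded in place: no call overhead, and
 * the loop body is open to further optimisation such as unrolling. */
void convert_row_inline(uint16_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = cvt_8888_to_0565(src[i]);
}
```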

> So the patch is attached in case someone is interested. I've tried to
> update my code to work with the latest code in git, but things have
> changed a bit since, so beware. It could use a couple good
> code reviews.

I'll merge it into my N800 tree, and if no-one objects, we should
merge it into mainline as well.

Thanks very much for doing all this!
