SIMD-less render optimizations

Tue Apr 17 22:45:57 PDT 2007

Hello all,

I was recently asking myself "how fast can we make the render
operations without resorting to specialized SIMD code paths?" Around the
same time, some developers at Nokia were complaining about cairo being
slow on a certain render operation on the N800 (which doesn't have any
SIMD instructions that would help render operations). So, I decided to
kill two birds with one stone and see how fast I could make a single
render case using only hardware-independent C code.

Long story short, I got a 1.9x speedup for the cairo-perf test
paint_similar_rgba_over-512 using the xlib-rgb backend on the N800
(which uses a 16-bit color framebuffer and runs xserver 1.1.99.3) and
the patch is attached. Here's the long story:

The function I chose to concentrate on was fbCompositeSrc_8888x0565,
which composites a 32-bit source OVER a 16-bit destination (ignore the
'Src' part of the function name; it has nothing to do with the SOURCE
render operation). I chose it because 1) the N800 uses r5g6b5 as the
framebuffer format, 2) it was already a "specialized" function, meaning
that I wasn't cheating by choosing a render operation that currently
falls back to fbCompositeGeneral, which would be too easy to speed up,
and 3) it's a simple operation (no mask), so it would make my job easier.

Initially, I tried inlining the per-pixel functions like fbOver and
friends, but the speedup was minimal (~1.1x). Then I tried compiling
with -O3 -funroll-loops and other gcc flag combinations just to get a
feel for what would help, but didn't get anything useful either. So
then I decided to really dig in.

After extensive trial and error, I found that the following code
optimizations, when combined, get you a nice speedup (1.3x - 2.5x
depending on the input data, 1.9x typical):

1) Manually unroll the inner loop four times. This reduces the loop
overhead (obviously). But also, by reading 4 pixels at a time, we
could take advantage of the architecture's special sequential register
load operations (like ldmia on ARM), which in this case can speed up
reading 4 32-bit source pixels all in a row.

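2) Once the loop is unrolled, reduce the number of reads/writes to
memory by combining each pair of 16-bit dest pixel reads, and each pair
of dest pixel writes, into a single 32-bit read or write. In the blend
case this takes four pixels from 12 memory accesses (4 source reads, 4
dest reads, 4 dest writes) down to 8 (4 + 2 + 2), a 33% reduction.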

3) Look ahead 4 pixels at a time for full opacity or full
transparency. Many images are composed of large regions of full
opacity or full transparency. We can perform a quick check of all four
source pixels and either skip ahead four (in the transparency case) or
perform a fast SOURCE with all four (in the fully opaque case).

4) Use a macro (or inline function) instead of a per-pixel function call.

5) Simplify the per pixel unpack/packing. The current code will unpack
the 565 pixel to a 888 pixel, then it turns right around and extracts
each color component from the 888 pixel. You can save a number of
shifts by unpacking directly into the color components, skipping the
intermediate packing of the 888 pixel. Same thing with the packing
part that comes after the blend; you can go directly from the blend
result to the 565 pixel, skipping the packing of the intermediate 888
pixel.

6) Reduce the three adds, one subtract and one multiply that are
performed per color component into a single multiply-add sequence
(which usually translates to a single fast mac/mla instruction). Yes,
you may be 1 bit off (AFAIK) in the result here in some cases, but
we're going to chop off the last 2-3 bits anyway when we convert to
565, so the 1 bit difference (when present) rarely affects the overall
result in our case. Plus, most devices that use 16-bit color (embedded
devices, OLPC, etc) would gladly give up a little accuracy for speed.
Yes, errors can accumulate, but I haven't seen a popular application
yet that runs on these machines that blends repeatedly enough to
accumulate this error to the point that it is visible. But you guys
probably have a better feel for that than I do.
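
To make items 5 and 6 concrete, here is a minimal standalone sketch of
the per-pixel blend in plain C. The function name over_8888x0565 and the
stdint types are hypothetical and chosen only for illustration; the
arithmetic follows the FbOverU_8888x565 macro in the attached patch and
assumes a premultiplied source pixel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: blend one premultiplied a8r8g8b8 source pixel
 * OVER one r5g6b5 destination pixel and return the new r5g6b5 value. */
static uint16_t over_8888x0565(uint32_t s, uint16_t d)
{
    uint32_t inv_a = (~s) >> 24;          /* 255 - alpha */
    uint32_t a_fp  = (s >> 16) & 0xff00;  /* alpha as 8.8 fixed point */

    /* Source channels taken straight from the packed pixel as 8.8
     * fixed point; no intermediate 888 pixel is built. */
    uint32_t s_r = (s >> 8) & 0xff00;
    uint32_t s_g = s & 0xff00;
    uint32_t s_b = (s << 8) & 0xff00;

    /* Premultiplied input keeps every sum below within 16 bits. */
    assert(s_r <= a_fp && s_g <= a_fp && s_b <= a_fp);

    /* Destination channels widened to 8 bits; the top bits of each
     * channel are replicated into the low bits. */
    uint32_t d_r = (d >> 8) & 0xf8;
    uint32_t d_g = (d >> 3) & 0xfc;
    uint32_t d_b = ((uint32_t) d << 3) & 0xf8;
    d_r |= d_r >> 5;
    d_g |= d_g >> 6;
    d_b |= d_b >> 5;

    /* One multiply-add per channel: dest * (255 - alpha) + source. */
    d_r = d_r * inv_a + s_r;
    d_g = d_g * inv_a + s_g;
    d_b = d_b * inv_a + s_b;

    /* Pack the 8.8 results directly into r5g6b5. */
    return (uint16_t) ((d_r & 0xf800) | ((d_g & 0xfc00) >> 5) | (d_b >> 11));
}
```

With this arithmetic a fully opaque source reduces to a plain
conversion, e.g. over_8888x0565 (0xff112233, d) gives 0x1106 for any d.
The rounding loss mentioned in item 6 shows up for dark destination
values: a fully transparent source over 0x1234 yields 0x0a34, because
the red channel is scaled by 255/256 and then truncated to 5 bits.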

It should be noted that the MMX fast paths already incorporate a
couple of the above optimizations, so this isn't entirely new. Some of
them just need "backporting" to the plain C fast paths.
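
The look-ahead from item 3 and the paired destination write from item 2
can be sketched in plain C as follows. This is only an illustration
under stated assumptions: classify4, span_kind and store2_565 are
hypothetical names that do not appear in the patch. The sketch uses
memcpy for the paired write so that the first pixel always lands at the
lower address regardless of byte order or alignment; a compiler may or
may not turn that into a single 32-bit store.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical classification of a run of four source pixels. */
enum span_kind { SPAN_TRANSPARENT, SPAN_OPAQUE, SPAN_MIXED };

/* Classify four a8r8g8b8 pixels with two bitwise reductions instead of
 * four separate alpha tests. */
static enum span_kind classify4(const uint32_t s[4])
{
    uint32_t all = s[0] & s[1] & s[2] & s[3];
    uint32_t any = s[0] | s[1] | s[2] | s[3];

    /* The AND keeps alpha == 0xff only if every pixel is opaque. */
    if (all >= 0xff000000u)
        return SPAN_OPAQUE;

    /* The OR leaves alpha == 0 only if every pixel is transparent. */
    if (any <= 0x00ffffffu)
        return SPAN_TRANSPARENT;

    return SPAN_MIXED;
}

/* Write two adjacent r5g6b5 pixels as one 4-byte copy. */
static void store2_565(uint16_t *dst, uint16_t first, uint16_t second)
{
    uint16_t pair[2] = { first, second };
    memcpy(dst, pair, sizeof pair);
}
```

A caller would skip four destination pixels for SPAN_TRANSPARENT,
convert and store two pairs for SPAN_OPAQUE, and fall back to the
per-pixel blend for SPAN_MIXED.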

So the patch is attached in case someone is interested. I've tried to
update my code to work with the latest code in git, but things have
changed a bit since 1.1.99.3, so beware. It could use a couple good
code reviews.

But, the real point of this exercise was to find out what smarts I
should build into my dynamic code generator :)

Dan Amelang
-------------- next part --------------
From 54cc61bb0048276bf91f3f28e5a4c1b50572b5e4 Mon Sep 17 00:00:00 2001
From: Dan Amelang <dan at amelang.net>
Date: Tue, 17 Apr 2007 22:35:30 -0700
Subject: [PATCH] Incorporate various optimizations in fbCompositeSrc_8888x0565

I get a 1.9x speedup for the cairo-perf test "paint_similar_rgba_over-512"
using the xlib-rgb backend on the Nokia N800 (which uses a 16-bit color
framebuffer and runs xserver 1.1.99.3). See long explanation on the
xorg mailing list for details.
---
 fb/fbpict.c |  115 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 89 insertions(+), 26 deletions(-)

diff --git a/fb/fbpict.c b/fb/fbpict.c
index cd6cac2..1ad5c8b 100644
--- a/fb/fbpict.c
+++ b/fb/fbpict.c
@@ -542,6 +542,34 @@ fbCompositeSrc_8888x0888 (CARD8      op,
     fbFinishAccess (pDst->pDrawable);
 }
 
+#define FbOverU_8888x565(s, d) \
+        \
+        /* Extract alpha */ \
+        reverse_a = ~(s) >> 24; \
+        \
+        /* Extract r8g8b8 color channels (as 8.8 fixed point) */ \
+        s_r  = (((s) >> 8) & 0xff00); \
+        s_g  = (((s)     ) & 0xff00); \
+        s_b  = (((s) << 8)         ); \
+        \
+        /* Extract r5g6b5 color channels */ \
+        d_r = ((d) >> 8) & 0xf8; \
+        d_g = ((d) >> 3) & 0xfc; \
+        d_b = ((d) << 3) & 0xf8; \
+        \
+        /* Use higher bits of the r5 to fill out the bottom of the r8 */ \
+        d_r |= (d_r >> 5); \
+        d_g |= (d_g >> 6); \
+        d_b |= (d_b >> 5); \
+        \
+        /* Blend */ \
+        d_r = d_r * reverse_a + s_r; \
+        d_g = d_g * reverse_a + s_g; \
+        d_b = d_b * reverse_a + s_b; \
+        \
+        /* Pack result as r5g6b5 */ \
+        (d) = (d_r & 0xf800) | ((d_g & 0xfc00) >> 5) | (d_b >> 11)
+
 void
 fbCompositeSrc_8888x0565 (CARD8      op,
 			 PicturePtr pSrc,
@@ -557,40 +585,75 @@ fbCompositeSrc_8888x0565 (CARD8      op,
 			 CARD16     height)
 {
     CARD16	*dstLine, *dst;
-    CARD32	d;
-    CARD32	*srcLine, *src, s;
-    CARD8	a;
+    CARD32	*srcLine, *src;
     FbStride	dstStride, srcStride;
-    CARD16	w;
+    int         w;
 
     fbComposeGetStart (pSrc, xSrc, ySrc, CARD32, srcStride, srcLine, 1);
     fbComposeGetStart (pDst, xDst, yDst, CARD16, dstStride, dstLine, 1);
 
     while (height--)
     {
-	dst = dstLine;
-	dstLine += dstStride;
-	src = srcLine;
-	srcLine += srcStride;
-	w = width;
+        CARD32 s1, s2, s3, s4;
+        CARD16 d_r, d_g, d_b, s_r, s_g, s_b, reverse_a;
+        CARD32 *dst_2px_wide;
+
+        src = srcLine;
+        srcLine += srcStride;
+        dst_2px_wide = (CARD32 *) dstLine;
+        dstLine += dstStride;
+	w = width - 4;
+
+        while (w >= 0)
+        {
+            s1 = *src;
+            s2 = *(src + 1);
+            s3 = *(src + 2);
+            s4 = *(src + 3);
+
+            w -= 4;
+            src += 4;
+
+            /* Check if the next 4 pixels are opaque */
+            if ((s1 & s2 & s3 & s4) > 0xfeffffff)
+            {
+                /* In this case, we just perform a SOURCE for all 4 pixels */
+                *dst_2px_wide++ = (cvt8888to0565 (s1) << 16) | cvt8888to0565 (s2);
+                *dst_2px_wide++ = (cvt8888to0565 (s3) << 16) | cvt8888to0565 (s4);
+            }
+            /* Next, check if the next 4 pixels have any alpha in them at all */
+            else if ((s1 | s2 | s3 | s4) > 0x00ffffff)
+            {
+                /* In which case, we perform OVER on each one of them */
+                CARD32 d1, d2, d3, d4;
+
+                d1 = (*dst_2px_wide >> 16);
+                d2 = (*dst_2px_wide & 0xffff);
+                FbOverU_8888x565 (s1, d1);
+                FbOverU_8888x565 (s2, d2);
+                *dst_2px_wide++ = (d1 << 16) | d2;
+
+                d3 = (*dst_2px_wide >> 16);
+                d4 = (*dst_2px_wide & 0xffff);
+                FbOverU_8888x565 (s3, d3);
+                FbOverU_8888x565 (s4, d4);
+                *dst_2px_wide++ = (d3 << 16) | d4;
+            }
+            else
+            {
+                /* Do nothing, since the source pixels are all transparent */
+                dst_2px_wide += 2;
+            }
+        }
 
-	while (w--)
-	{
-	    s = READ(src++);
-	    a = s >> 24;
-	    if (a)
-	    {
-		if (a == 0xff)
-		    d = s;
-		else
-		{
-		    d = READ(dst);
-		    d = fbOver24 (s, cvt0565to8888(d));
-		}
-		WRITE(dst, cvt8888to0565(d));
-	    }
-	    dst++;
-	}
+        /* Deal with left over pixels */
+        for (dst = (CARD16 *) dst_2px_wide; w > -4; w--)
+        {
+            CARD32 d = *dst;
+            CARD32 s = *src++;
+            FbOverU_8888x565 (s, d);
+            *dst++ = d;
+        }
     }
 
     fbFinishAccess (pDst->pDrawable);
-- 
1.4.4.2