[cairo] [PATCH] SSE2 support to pixman

Thu Mar 13 13:26:50 PDT 2008

"André Tupinambá" <andrelrt at gmail.com> writes:

> I just finished the patch to the pixman library to add SSE2 support.
> The patch was made using Kumpera's files and my proof of concept.
> 
> I ran cairo's tests and perf, and everything seems to be ok.

Overall, this looks great. It's well-written, and GCC actually
generates decent code for the intrinsics. However, when I tested this
with cairo-perf it came out slower than the MMX code. Here are the
numbers I get:

Before:
c-24-61-65-93:~/cairo/perf% env CAIRO_TEST_TARGET=image ./cairo-perf -i 5000 paint_image_rgba_over 
[ # ]  backend-content                    test-size min(ticks)  min(ms) median(ms) stddev. iterations
[  0]    image-rgba       paint_image_rgba_over-256    2426056    0.810    0.830  0.93% 4841
[  1]    image-rgba       paint_image_rgba_over-512    9577644    3.199    3.242  0.52% 4708
[  0]    image-rgb        paint_image_rgba_over-256    3023492    1.010    1.030  0.70% 4780
[  1]    image-rgb        paint_image_rgba_over-512    9587464    3.202    3.242  0.49% 4747

After:
c-24-61-65-93:~/cairo/perf% env CAIRO_TEST_TARGET=image ./cairo-perf -i 5000 paint_image_rgba_over
[ # ]  backend-content                    test-size min(ticks)  min(ms) median(ms) stddev. iterations
[  0]    image-rgba       paint_image_rgba_over-256    3857756    1.288    1.297  0.35% 4207
[  1]    image-rgba       paint_image_rgba_over-512   15787128    5.270    5.287  0.12% 4140
[  0]    image-rgb        paint_image_rgba_over-256    4169408    1.392    1.409  0.56% 4019
[  1]    image-rgb        paint_image_rgba_over-512   15727612    5.250    5.267  0.12% 4339

c-24-61-65-93:~/cairo/perf% ./cairo-perf-diff old.perf new.perf 
old: old
new: new
Slowdowns
=========
image-rgba      paint_image_rgba_over-512    3.24 0.52% ->   5.29 0.12%:  1.65x slowdown
image-rgb       paint_image_rgba_over-512    3.24 0.49% ->   5.27 0.12%:  1.64x slowdown
image-rgba      paint_image_rgba_over-256    0.83 0.93% ->   1.30 0.35%:  1.59x slowdown
image-rgb       paint_image_rgba_over-256    1.03 0.70% ->   1.41 0.56%:  1.38x slowdown

Would you mind posting the numbers you got?

I suspect two things are going on:

(1) I don't think streaming writes are appropriate here. The problem
is that they force the cache line in question out of the cache hierarchy
altogether. For a function like this one, this means that basically
every destination read will be uncached due to the previous iteration
having used a streaming write.

So I'd suggest simply using save128Aligned() instead.
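
To make the difference concrete, here is a minimal sketch of the two
kinds of store. I'm assuming that save128Aligned() is a thin wrapper
around _mm_store_si128(); the helper names below are made up for
illustration and are not taken from the patch.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: ordinary aligned store. The written cache line
 * stays in the cache, so a later read of the same destination can hit.
 * dst must be 16-byte aligned. */
static inline void
store_cached (uint32_t *dst, __m128i v)
{
    _mm_store_si128 ((__m128i *) dst, v);
}

/* Hypothetical helper: non-temporal (streaming) store. The data is
 * written with a hint to bypass the cache, which helps for write-only
 * destinations but hurts when the destination is read again soon.
 * dst must be 16-byte aligned. */
static inline void
store_streaming (uint32_t *dst, __m128i v)
{
    _mm_stream_si128 ((__m128i *) dst, v);
}
```

Both helpers leave the same bytes in memory; only the cache behaviour
differs, which is why the choice shows up in cairo-perf and not in the
test suite.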

(2) The MMX version is careful to avoid reading from the destination
whenever the source pixels are fully opaque. My experience is that
this is enough of a win that it easily pays for the check.

SSE2 has pretty good support for this. We can use something like this
function:

    static inline int
    is_opaque (__m128i src)
    {
        __m128i alpha = _mm_and_si128 (src, Maskff000000);
        __m128i cmp = _mm_cmpeq_epi8 (alpha, Maskff000000);
        int x = _mm_movemask_epi8 (cmp);

        return x == 0xffff;
    }

where Maskff000000 is four copies of 0xff000000.
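
Roughly how this could be used in an OVER loop, as a sketch only. The
function names over_pixel_sketch() and over_8888_8888_sketch(), the
mask_ff000000() helper standing in for the Maskff000000 constant, and
the scalar fallback are all assumptions of mine for illustration; the
real patch would do the blend with SSE2 as well. The sketch assumes
premultiplied a8r8g8b8 pixels, a 16-byte aligned destination, and a
width that is a multiple of 4.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical stand-in for Maskff000000: four copies of 0xff000000. */
static inline __m128i
mask_ff000000 (void)
{
    return _mm_set1_epi32 ((int) 0xff000000u);
}

/* Returns nonzero when all four pixels in src have alpha == 0xff. */
static inline int
is_opaque (__m128i src)
{
    __m128i mask = mask_ff000000 ();
    __m128i alpha = _mm_and_si128 (src, mask);
    __m128i cmp = _mm_cmpeq_epi8 (alpha, mask);

    return _mm_movemask_epi8 (cmp) == 0xffff;
}

/* Scalar OVER for one premultiplied pixel: s + d * (255 - alpha(s)) / 255,
 * computed per 8-bit channel with rounding and saturation. */
static inline uint32_t
over_pixel_sketch (uint32_t s, uint32_t d)
{
    uint32_t ia = 255 - (s >> 24);
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t sc = (s >> shift) & 0xff;
        uint32_t dc = (d >> shift) & 0xff;
        uint32_t t = dc * ia + 0x80;
        uint32_t c;

        t = (t + (t >> 8)) >> 8;    /* rounded division by 255 */
        c = sc + t;
        if (c > 255)
            c = 255;
        result |= c << shift;
    }
    return result;
}

/* Composite width pixels of src OVER dst, four pixels at a time. */
static void
over_8888_8888_sketch (uint32_t *dst, const uint32_t *src, int width)
{
    int i, j;

    for (i = 0; i < width; i += 4)
    {
        __m128i s = _mm_loadu_si128 ((const __m128i *) (src + i));

        if (is_opaque (s))
        {
            /* Fully opaque: the result is just the source, so the
             * destination is written without ever being read. */
            _mm_store_si128 ((__m128i *) (dst + i), s);
        }
        else
        {
            for (j = 0; j < 4; j++)
                dst[i + j] = over_pixel_sketch (src[i + j], dst[i + j]);
        }
    }
}
```

The point is only the shape of the loop: one cheap test per four
source pixels, and the destination read happens only on the
non-opaque path.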

Thanks,
Soren