[cairo] [RFC] Pixman & compositing with overlapping source and destination pixel data

Mon Oct 19 16:33:44 PDT 2009

On Tuesday 20 October 2009, Soeren Sandmann wrote:
[...]
> > I'm not sure about pixman_gc_t since most of the needed operations are
> > just simple copies. What about starting with just introducing a variant
> > of 'pixman_blt' which is overlapping aware?
>
> The pixman_blt() interface is misdesigned for two reasons: (1) the
> strides are given in number-of-uint32_ts, which gratuitously limits
> the types of images that can be processed, and (2) it can fail if it
> doesn't like the input for some reason.
>
> At the same time, having the core primitives available on the client
> side is useful in some cases, and the software implementation of them
> can more easily be optimized with SIMD instructions in pixman.
>
> Moving core rendering into pixman solves both issues at the same time.

I don't have a strong opinion about API updates. In any case, a smooth
upgrade path needs to be taken care of, and users should be prevented
from using incompatible versions of client applications/libraries and
pixman. Introducing a new function may be the best way; it could also
solve some of the design issues.

> But that said, I am not opposed to extending pixman_blt() to support
> overlapping copies. That is certainly a simpler first step.

Yes, the functionality itself can be introduced first (without breaking
anything). Wrapping it into a better API can be done as the natural next
step.
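
To make the overlapping case concrete, here is a minimal sketch of an
overlap-aware copy. It is not the code from the branch: the function name
'overlapped_blt_32' is made up for illustration, it assumes 32 bpp pixels in a
single buffer with the stride given in uint32_t units, and it leans on
memmove() for the within-row overlap instead of a SIMD backwards copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not part of pixman): copy a width x height
 * rectangle of 32 bpp pixels within one buffer, where the source and
 * destination rectangles may overlap. 'stride' is in uint32_t units.
 * Assumes width <= stride, so distinct rows never overlap in memory.
 */
static void
overlapped_blt_32 (uint32_t *bits, int stride,
                   int src_x, int src_y,
                   int dst_x, int dst_y,
                   int width, int height)
{
    size_t row_bytes = (size_t) width * sizeof (uint32_t);
    int    y;

    if (dst_y <= src_y)
    {
        /* Moving up (or staying on the same rows): copy top to bottom so
         * that no source row is overwritten before it has been read. */
        for (y = 0; y < height; y++)
            memmove (bits + (dst_y + y) * stride + dst_x,
                     bits + (src_y + y) * stride + src_x,
                     row_bytes);
    }
    else
    {
        /* Moving down: copy bottom to top for the same reason. */
        for (y = height - 1; y >= 0; y--)
            memmove (bits + (dst_y + y) * stride + dst_x,
                     bits + (src_y + y) * stride + src_x,
                     row_bytes);
    }
}
```

Only the row order has to be chosen by hand here; memmove() takes care of the
horizontal direction, which is the part that would need a dedicated backwards
copy in a SIMD implementation.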

> > I created a work-in-progress branch with 'pixman_blt' function (generic C
> > implementation for now) extended to support overlapped source/destination
> > case. A simple test program is also included:
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt
> >
> > Making use of the already existing SIMD optimized pixel copy functions
> > should provide fast scrolling in all the directions except for from left
> > to right. This special case will require a SIMD optimized backwards copy.
> >
> > I wonder if it makes sense to drop delegates support for pixman_blt and
> > make call chain shorter when introducing SIMD optimized copies? It seems
> > to be a little bit overdesigned here.
>
> How would you support SSE2 and MMX in the same binary then?

The simplest way to do it, in my opinion, is the following.

First, introduce something like a 'pixman_init' function. Right now CPU type
detection is done on the first call to the function. It introduces some
minor overhead by having an extra pointer check on each function call.
Another problem is that we can't be completely sure that the CPU capabilities
detection is always fully reentrant. For example, some platforms may
try to set a signal handler and expect to catch SIGILL or something like
that.
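
A rough sketch of what such signal-based detection tends to look like, to show
where the reentrancy problem comes from. This is not pixman code: all names
are made up, and raise(SIGILL) stands in for executing an instruction the CPU
does not support.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Shared state: only one probe may be in flight at any time. */
static sigjmp_buf probe_env;

static void
on_sigill (int signo)
{
    (void) signo;
    siglongjmp (probe_env, 1);
}

/*
 * Hypothetical probe helper: runs 'probe' with a temporary SIGILL handler
 * installed. Returns 1 if 'probe' completed, 0 if it raised SIGILL or the
 * handler could not be installed.
 */
static int
cpu_feature_probe (void (*probe) (void))
{
    struct sigaction sa, old_sa;
    volatile int     ok = 0;

    memset (&sa, 0, sizeof (sa));
    sa.sa_handler = on_sigill;
    sigemptyset (&sa.sa_mask);

    /* The handler is process-wide, so this affects every thread. */
    if (sigaction (SIGILL, &sa, &old_sa) != 0)
        return 0;

    if (sigsetjmp (probe_env, 1) == 0)
    {
        probe ();
        ok = 1;
    }

    /* Put the previous handler back. */
    sigaction (SIGILL, &old_sa, NULL);
    return ok;
}

/* Stand-ins for "instruction is supported" and "instruction is not". */
static void probe_supported (void)   { }
static void probe_unsupported (void) { raise (SIGILL); }
```

Both the static jump buffer and the process-wide signal handler are shared, so
two threads hitting this on their first pixman call at the same time could
step on each other. Doing it once from an explicit initialization call avoids
that.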

This initialization function would just detect CPU capabilities and set some
function pointers. The whole CPU-specific implementation of 'pixman_blt'
could then be called directly by a client via this pointer. Or 'pixman_blt' can
be just a small thunk which makes a call via the function pointer, passes exactly
the same arguments to it and does nothing more. In this case the compiler
would really have no excuse for not using a tail call, see below.
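
A minimal sketch of the idea, with made-up names ('sketch_init',
'sketch_blt', 'blt_func_t') and a plain C copy standing in for the
CPU-specific implementations; the real detection logic is left out:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same argument list as pixman_blt, int used in place of pixman_bool_t. */
typedef int (*blt_func_t) (uint32_t *src_bits, uint32_t *dst_bits,
                           int src_stride, int dst_stride,
                           int src_bpp, int dst_bpp,
                           int src_x, int src_y,
                           int dst_x, int dst_y,
                           int width, int height);

/* Generic fallback: 32 bpp only, non-overlapping buffers. */
static int
generic_blt (uint32_t *src_bits, uint32_t *dst_bits,
             int src_stride, int dst_stride,
             int src_bpp, int dst_bpp,
             int src_x, int src_y,
             int dst_x, int dst_y,
             int width, int height)
{
    int y;

    if (src_bpp != 32 || dst_bpp != 32)
        return 0;

    for (y = 0; y < height; y++)
        memcpy (dst_bits + (dst_y + y) * dst_stride + dst_x,
                src_bits + (src_y + y) * src_stride + src_x,
                (size_t) width * sizeof (uint32_t));
    return 1;
}

/* Set once by sketch_init (), read-only afterwards. */
static blt_func_t blt_impl;

void
sketch_init (void)
{
    /* A real version would detect NEON/SSE2/MMX here and pick the
     * matching function. This sketch always uses the generic one. */
    blt_impl = generic_blt;
}

/* Thunk: forwards exactly the same arguments in the same order, so the
 * call can be compiled as a plain jump through the pointer. */
int
sketch_blt (uint32_t *src_bits, uint32_t *dst_bits,
            int src_stride, int dst_stride,
            int src_bpp, int dst_bpp,
            int src_x, int src_y,
            int dst_x, int dst_y,
            int width, int height)
{
    return blt_impl (src_bits, dst_bits, src_stride, dst_stride,
                     src_bpp, dst_bpp, src_x, src_y, dst_x, dst_y,
                     width, height);
}
```

The client calls 'sketch_init' once at startup. After that there is no
per-call pointer check, and since no extra argument is inserted in front of
the others, nothing needs to be shuffled on the stack or in registers.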

> Also, I really don't see much potential for saving here. For a NEON
> implementation of blt, the callchain would be:
>
>    pixman_blt() ->  _pixman_implementation_blt() -> neon_blt()
>
> and getting rid of delegates wouldn't really affect that at all. You
> could get rid of the _pixman_implementation_blt() call by making it a
> macro, but as I mentioned before, gcc turns it into a tail call that
> reused the arguments on the stack, so the overhead really is minimal.

On what kind of platform and with which version of gcc are you getting a
proper tail call here? I don't see it being used, and the overhead is
rather hefty, which is also confirmed by benchmarking and profiling.

Even if gcc can reuse some of the arguments which are already on the
stack in some cases, different platforms may have different ABIs and calling
conventions. For example, on ARM and x86-64 the first few arguments
are passed in registers and the rest on the stack. Relying on the compiler to
always identify tail call possibilities properly in all cases
may not be the best idea.

C:

PIXMAN_EXPORT pixman_bool_t
pixman_blt (uint32_t *src_bits,
            uint32_t *dst_bits,
            int       src_stride,
            int       dst_stride,
            int       src_bpp,
            int       dst_bpp,
            int       src_x,
            int       src_y,
            int       dst_x,
            int       dst_y,
            int       width,
            int       height)
{
    if (!imp)
        imp = _pixman_choose_implementation ();

    return _pixman_implementation_blt (imp, src_bits, dst_bits, src_stride,
                                       dst_stride,
                                       src_bpp, dst_bpp,
                                       src_x, src_y,
                                       dst_x, dst_y,
                                       width, height);
}

x86, gcc 4.3.2:

00000420 <pixman_blt>:
 420:   55                      push   %ebp
 421:   89 e5                   mov    %esp,%ebp
 423:   83 ec 38                sub    $0x38,%esp
 426:   8b 15 00 00 00 00       mov    0x0,%edx
 42c:   85 d2                   test   %edx,%edx
 42e:   74 68                   je     498 <pixman_blt+0x78>
 430:   8b 45 34                mov    0x34(%ebp),%eax
 433:   89 44 24 30             mov    %eax,0x30(%esp)
 437:   8b 45 30                mov    0x30(%ebp),%eax
 43a:   89 44 24 2c             mov    %eax,0x2c(%esp)
 43e:   8b 45 2c                mov    0x2c(%ebp),%eax
 441:   89 44 24 28             mov    %eax,0x28(%esp)
 445:   8b 45 28                mov    0x28(%ebp),%eax
 448:   89 44 24 24             mov    %eax,0x24(%esp)
 44c:   8b 45 24                mov    0x24(%ebp),%eax
 44f:   89 44 24 20             mov    %eax,0x20(%esp)
 453:   8b 45 20                mov    0x20(%ebp),%eax
 456:   89 44 24 1c             mov    %eax,0x1c(%esp)
 45a:   8b 45 1c                mov    0x1c(%ebp),%eax
 45d:   89 44 24 18             mov    %eax,0x18(%esp)
 461:   8b 45 18                mov    0x18(%ebp),%eax
 464:   89 44 24 14             mov    %eax,0x14(%esp)
 468:   8b 45 14                mov    0x14(%ebp),%eax
 46b:   89 44 24 10             mov    %eax,0x10(%esp)
 46f:   8b 45 10                mov    0x10(%ebp),%eax
 472:   89 44 24 0c             mov    %eax,0xc(%esp)
 476:   8b 45 0c                mov    0xc(%ebp),%eax
 479:   89 44 24 08             mov    %eax,0x8(%esp)
 47d:   8b 45 08                mov    0x8(%ebp),%eax
 480:   89 44 24 04             mov    %eax,0x4(%esp)
 484:   a1 00 00 00 00          mov    0x0,%eax
 489:   89 04 24                mov    %eax,(%esp)
 48c:   e8 fc ff ff ff          call   48d <pixman_blt+0x6d>
 491:   c9                      leave
 492:   c3                      ret
 493:   90                      nop
 494:   8d 74 26 00             lea    0x0(%esi),%esi
 498:   e8 fc ff ff ff          call   499 <pixman_blt+0x79>
 49d:   a3 00 00 00 00          mov    %eax,0x0
 4a2:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
 4a8:   eb 86                   jmp    430 <pixman_blt+0x10>
 4aa:   8d b6 00 00 00 00       lea    0x0(%esi),%esi

ARM, gcc 4.3.4:

000003a4 <pixman_blt>:
 3a4:   e92d41f0        push    {r4, r5, r6, r7, r8, lr}
 3a8:   e59f4088        ldr     r4, [pc, #136]  ; 438 <pixman_blt+0x94>
 3ac:   e1a08001        mov     r8, r1
 3b0:   e24dd028        sub     sp, sp, #40     ; 0x28
 3b4:   e5941000        ldr     r1, [r4]
 3b8:   e1a06000        mov     r6, r0
 3bc:   e3510000        cmp     r1, #0  ; 0x0
 3c0:   e1a07002        mov     r7, r2
 3c4:   e1a05003        mov     r5, r3
 3c8:   0a000017        beq     42c <pixman_blt+0x88>
 3cc:   e59dc040        ldr     ip, [sp, #64]
 3d0:   e5940000        ldr     r0, [r4]
 3d4:   e59de044        ldr     lr, [sp, #68]
 3d8:   e58dc004        str     ip, [sp, #4]
 3dc:   e59dc048        ldr     ip, [sp, #72]
 3e0:   e58de008        str     lr, [sp, #8]
 3e4:   e59de04c        ldr     lr, [sp, #76]
 3e8:   e58dc00c        str     ip, [sp, #12]
 3ec:   e59dc050        ldr     ip, [sp, #80]
 3f0:   e58de010        str     lr, [sp, #16]
 3f4:   e59d405c        ldr     r4, [sp, #92]
 3f8:   e58dc014        str     ip, [sp, #20]
 3fc:   e59de054        ldr     lr, [sp, #84]
 400:   e59dc058        ldr     ip, [sp, #88]
 404:   e1a01006        mov     r1, r6
 408:   e1a02008        mov     r2, r8
 40c:   e1a03007        mov     r3, r7
 410:   e58d5000        str     r5, [sp]
 414:   e58de018        str     lr, [sp, #24]
 418:   e58dc01c        str     ip, [sp, #28]
 41c:   e58d4020        str     r4, [sp, #32]
 420:   ebfffffe        bl      0 <_pixman_implementation_blt>
 424:   e28dd028        add     sp, sp, #40     ; 0x28
 428:   e8bd81f0        pop     {r4, r5, r6, r7, r8, pc}
 42c:   ebfffffe        bl      0 <_pixman_choose_implementation>
 430:   e5840000        str     r0, [r4]
 434:   eaffffe4        b       3cc <pixman_blt+0x28>
 438:   00000000        .word   0x00000000

-- 
Best regards,
Siarhei Siamashka