[cairo] [RFC] Pixman & compositing with overlapping source and destination pixel data
Siarhei Siamashka
siarhei.siamashka at gmail.com
Mon Oct 19 16:33:44 PDT 2009
On Tuesday 20 October 2009, Soeren Sandmann wrote:
[...]
> > I'm not sure about pixman_gc_t since most of the needed operations are
> > just simple copies. What about starting with just introducing a variant
> > of 'pixman_blt' which is overlapping aware?
>
> The pixman_blt() interface is misdesigned for two reasons: (1) the
> strides are given in number-of-uint32_ts, which gratuitously limits
> the types of images that can be processed, and (2) it can fail if it
> doesn't like the input for some reason.
>
> At the same time, having the core primitives available on the client
> side is useful in some cases, and the software implementation of them
> can more easily be optimized with SIMD instructions in pixman.
>
> Moving core rendering into pixman solves both issues at the same time.
I don't have any strong opinion about API updates. In any case, a smooth
upgrade path needs to be taken care of, and users should be prevented
from using incompatible versions of client applications/libraries and
pixman. Introducing a new function may be the best way; it can also
solve some of the design issues.
> But that said, I am not opposed to extending pixman_blt() to support
> overlapping copies. That is certainly a simpler first step.
Yes, the functionality itself can be introduced first (without breaking
anything). Wrapping it into a better API can be done as the natural next
step.
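To illustrate what "overlapping aware" means here, below is a minimal sketch, not the code from my branch. It assumes 32bpp pixels, a single buffer holding both rectangles, and rectangles that stay within a scanline. The name 'overlapped_blt_32' is hypothetical, and memmove just stands in for the forward and backward SIMD copies:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of pixman: copy a width x height rectangle
 * of 32-bit pixels inside one buffer, where the source and destination
 * rectangles may overlap. The stride is in uint32_t units, as in pixman_blt.
 * Assumes each rectangle fits within a scanline (x + width <= stride). */
static void
overlapped_blt_32 (uint32_t *bits, int stride,
                   int src_x, int src_y,
                   int dst_x, int dst_y,
                   int width, int height)
{
    int y;

    if (dst_y <= src_y)
    {
        /* Destination is above (or level with) the source: walk the rows
         * top to bottom so that no source row is overwritten before it
         * has been read. */
        for (y = 0; y < height; y++)
        {
            memmove (bits + (dst_y + y) * stride + dst_x,
                     bits + (src_y + y) * stride + src_x,
                     width * sizeof (uint32_t));
        }
    }
    else
    {
        /* Destination is below the source: walk the rows bottom to top. */
        for (y = height - 1; y >= 0; y--)
        {
            memmove (bits + (dst_y + y) * stride + dst_x,
                     bits + (src_y + y) * stride + src_x,
                     width * sizeof (uint32_t));
        }
    }
}
```

Within a single row, memmove picks the safe copy direction by itself. That same-row, left to right case is the one where a real SIMD implementation would need a dedicated backwards copy.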
> > I created a work-in-progress branch with 'pixman_blt' function (generic C
> > implementation for now) extended to support overlapped source/destination
> > case. A simple test program is also included:
> > http://cgit.freedesktop.org/~siamashka/pixman/log/?h=overlapped-blt
> >
> > Making use of the already existing SIMD optimized pixel copy functions
> > should provide fast scrolling in all the directions except for from left
> > to right. This special case will require a SIMD optimized backwards copy.
> >
> > I wonder if it makes sense to drop delegates support for pixman_blt and
> > make call chain shorter when introducing SIMD optimized copies? It seems
> > to be a little bit overdesigned here.
>
> How would you support SSE2 and MMX in the same binary then?
The simplest way to do it, in my opinion, is the following.
First, introduce something like a 'pixman_init' function. Right now CPU type
detection is done on the first call to the function. This introduces some
minor overhead from an extra pointer check on each function call.
Another problem is that we can't be completely sure that the CPU capabilities
detection is always fully reentrant. For example, some platforms may
try to set a signal handler and expect to catch SIGILL or something like
this.
This initialization function would just detect CPU capabilities and set some
function pointers. The whole CPU-specific implementation of 'pixman_blt'
could then be called directly by a client via this pointer. Or 'pixman_blt' can
be just a small thunk which does a call via the function pointer, passes exactly
the same arguments to it and does nothing more. In this case the compiler
will really have no excuse for not using a tail call, see below.
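A rough sketch of this idea, under a few assumptions: all the 'example_*' names, 'generic_blt', 'simd_blt' and 'example_cpu_has_simd' are hypothetical and do not exist in pixman, the CPU check is a placeholder, and the generic copy only handles non-overlapping 32bpp data:

```c
#include <stdint.h>

typedef int example_bool_t;

/* Same argument list as pixman_blt, so the thunk can forward it unchanged. */
typedef example_bool_t (*blt_func_t) (uint32_t *src_bits, uint32_t *dst_bits,
                                      int src_stride, int dst_stride,
                                      int src_bpp, int dst_bpp,
                                      int src_x, int src_y,
                                      int dst_x, int dst_y,
                                      int width, int height);

/* Plain C fallback: 32bpp only, no overlap handling. */
static example_bool_t
generic_blt (uint32_t *src_bits, uint32_t *dst_bits,
             int src_stride, int dst_stride,
             int src_bpp, int dst_bpp,
             int src_x, int src_y, int dst_x, int dst_y,
             int width, int height)
{
    int x, y;

    if (src_bpp != 32 || dst_bpp != 32)
        return 0;
    for (y = 0; y < height; y++)
        for (x = 0; x < width; x++)
            dst_bits[(dst_y + y) * dst_stride + dst_x + x] =
                src_bits[(src_y + y) * src_stride + src_x + x];
    return 1;
}

/* Stand-in for a CPU-specific version; a real one would use SIMD. */
static example_bool_t
simd_blt (uint32_t *src_bits, uint32_t *dst_bits,
          int src_stride, int dst_stride,
          int src_bpp, int dst_bpp,
          int src_x, int src_y, int dst_x, int dst_y,
          int width, int height)
{
    return generic_blt (src_bits, dst_bits, src_stride, dst_stride,
                        src_bpp, dst_bpp, src_x, src_y, dst_x, dst_y,
                        width, height);
}

/* Placeholder for real CPU feature detection. */
static int
example_cpu_has_simd (void)
{
    return 0;
}

static blt_func_t blt_impl = generic_blt;

/* Called once by the client before anything else: detection happens here,
 * not on the first blt call, so the hot path has no pointer check. */
void
example_init (void)
{
    blt_impl = example_cpu_has_simd () ? simd_blt : generic_blt;
}

/* The thunk: same arguments in, same arguments out, nothing else. */
example_bool_t
example_blt (uint32_t *src_bits, uint32_t *dst_bits,
             int src_stride, int dst_stride,
             int src_bpp, int dst_bpp,
             int src_x, int src_y, int dst_x, int dst_y,
             int width, int height)
{
    return blt_impl (src_bits, dst_bits, src_stride, dst_stride,
                     src_bpp, dst_bpp, src_x, src_y, dst_x, dst_y,
                     width, height);
}
```

Because the thunk and its target share one signature, the incoming arguments are already where the callee expects them, and the call can in principle be compiled to a single indirect jump. Whether a given compiler actually does that still has to be checked in the generated code.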
> Also, I really don't see much potential for saving here. For a NEON
> implementation of blt, the callchain would be:
>
> pixman_blt() -> _pixman_implementation_blt() -> neon_blt()
>
> and getting rid of delegates wouldn't really affect that at all. You
> could get rid of the _pixman_implementation_blt() call by making it a
> macro, but as I mentioned before, gcc turns it into a tail call that
> reused the arguments on the stack, so the overhead really is minimal.
On what kind of platform and with which version of gcc are you getting
a proper tail call here? I don't see it being used, and the overhead is
rather hefty, which is also confirmed by benchmarking and profiling.
Even if gcc can reuse some of the arguments which are already on the
stack in some cases, different platforms may have different ABIs and calling
conventions. For example, on ARM and x86-64, the first few arguments
are passed in registers and the rest on the stack. Relying on the compiler to
always identify tail call possibilities properly in all cases
may not be the best idea.
C:
PIXMAN_EXPORT pixman_bool_t
pixman_blt (uint32_t *src_bits,
            uint32_t *dst_bits,
            int       src_stride,
            int       dst_stride,
            int       src_bpp,
            int       dst_bpp,
            int       src_x,
            int       src_y,
            int       dst_x,
            int       dst_y,
            int       width,
            int       height)
{
    if (!imp)
        imp = _pixman_choose_implementation ();
    return _pixman_implementation_blt (imp, src_bits, dst_bits, src_stride,
                                       dst_stride,
                                       src_bpp, dst_bpp,
                                       src_x, src_y,
                                       dst_x, dst_y,
                                       width, height);
}
x86, gcc 4.3.2:
00000420 <pixman_blt>:
420: 55 push %ebp
421: 89 e5 mov %esp,%ebp
423: 83 ec 38 sub $0x38,%esp
426: 8b 15 00 00 00 00 mov 0x0,%edx
42c: 85 d2 test %edx,%edx
42e: 74 68 je 498 <pixman_blt+0x78>
430: 8b 45 34 mov 0x34(%ebp),%eax
433: 89 44 24 30 mov %eax,0x30(%esp)
437: 8b 45 30 mov 0x30(%ebp),%eax
43a: 89 44 24 2c mov %eax,0x2c(%esp)
43e: 8b 45 2c mov 0x2c(%ebp),%eax
441: 89 44 24 28 mov %eax,0x28(%esp)
445: 8b 45 28 mov 0x28(%ebp),%eax
448: 89 44 24 24 mov %eax,0x24(%esp)
44c: 8b 45 24 mov 0x24(%ebp),%eax
44f: 89 44 24 20 mov %eax,0x20(%esp)
453: 8b 45 20 mov 0x20(%ebp),%eax
456: 89 44 24 1c mov %eax,0x1c(%esp)
45a: 8b 45 1c mov 0x1c(%ebp),%eax
45d: 89 44 24 18 mov %eax,0x18(%esp)
461: 8b 45 18 mov 0x18(%ebp),%eax
464: 89 44 24 14 mov %eax,0x14(%esp)
468: 8b 45 14 mov 0x14(%ebp),%eax
46b: 89 44 24 10 mov %eax,0x10(%esp)
46f: 8b 45 10 mov 0x10(%ebp),%eax
472: 89 44 24 0c mov %eax,0xc(%esp)
476: 8b 45 0c mov 0xc(%ebp),%eax
479: 89 44 24 08 mov %eax,0x8(%esp)
47d: 8b 45 08 mov 0x8(%ebp),%eax
480: 89 44 24 04 mov %eax,0x4(%esp)
484: a1 00 00 00 00 mov 0x0,%eax
489: 89 04 24 mov %eax,(%esp)
48c: e8 fc ff ff ff call 48d <pixman_blt+0x6d>
491: c9 leave
492: c3 ret
493: 90 nop
494: 8d 74 26 00 lea 0x0(%esi),%esi
498: e8 fc ff ff ff call 499 <pixman_blt+0x79>
49d: a3 00 00 00 00 mov %eax,0x0
4a2: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
4a8: eb 86 jmp 430 <pixman_blt+0x10>
4aa: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
ARM, gcc 4.3.4:
000003a4 <pixman_blt>:
3a4: e92d41f0 push {r4, r5, r6, r7, r8, lr}
3a8: e59f4088 ldr r4, [pc, #136] ; 438 <pixman_blt+0x94>
3ac: e1a08001 mov r8, r1
3b0: e24dd028 sub sp, sp, #40 ; 0x28
3b4: e5941000 ldr r1, [r4]
3b8: e1a06000 mov r6, r0
3bc: e3510000 cmp r1, #0 ; 0x0
3c0: e1a07002 mov r7, r2
3c4: e1a05003 mov r5, r3
3c8: 0a000017 beq 42c <pixman_blt+0x88>
3cc: e59dc040 ldr ip, [sp, #64]
3d0: e5940000 ldr r0, [r4]
3d4: e59de044 ldr lr, [sp, #68]
3d8: e58dc004 str ip, [sp, #4]
3dc: e59dc048 ldr ip, [sp, #72]
3e0: e58de008 str lr, [sp, #8]
3e4: e59de04c ldr lr, [sp, #76]
3e8: e58dc00c str ip, [sp, #12]
3ec: e59dc050 ldr ip, [sp, #80]
3f0: e58de010 str lr, [sp, #16]
3f4: e59d405c ldr r4, [sp, #92]
3f8: e58dc014 str ip, [sp, #20]
3fc: e59de054 ldr lr, [sp, #84]
400: e59dc058 ldr ip, [sp, #88]
404: e1a01006 mov r1, r6
408: e1a02008 mov r2, r8
40c: e1a03007 mov r3, r7
410: e58d5000 str r5, [sp]
414: e58de018 str lr, [sp, #24]
418: e58dc01c str ip, [sp, #28]
41c: e58d4020 str r4, [sp, #32]
420: ebfffffe bl 0 <_pixman_implementation_blt>
424: e28dd028 add sp, sp, #40 ; 0x28
428: e8bd81f0 pop {r4, r5, r6, r7, r8, pc}
42c: ebfffffe bl 0 <_pixman_choose_implementation>
430: e5840000 str r0, [r4]
434: eaffffe4 b 3cc <pixman_blt+0x28>
438: 00000000 .word 0x00000000
--
Best regards,
Siarhei Siamashka