[cairo] New ARMv7-A (NEON) optimisations for Pixman
Jeff Muizelaar
jeff at infidigm.net
Mon May 11 12:34:47 PDT 2009
On Mon, May 11, 2009 at 09:07:59AM +0000, Jonathan Morton wrote:
> On Fri, 2009-05-08 at 14:10 -0400, Jeff Muizelaar wrote:
> > On Fri, May 08, 2009 at 11:26:13AM +0000, Jonathan Morton wrote:
> > > +#ifdef USE_GCC_INLINE_ASM
> > > + { PIXMAN_OP_SRC, PIXMAN_r5g6b5, PIXMAN_null, PIXMAN_r5g6b5, fbCompositeSrc_16x16neon, 0 },
> > > + { PIXMAN_OP_SRC, PIXMAN_b5g6r5, PIXMAN_null, PIXMAN_b5g6r5, fbCompositeSrc_16x16neon, 0 },
> > > + { PIXMAN_OP_OVER, PIXMAN_r5g6b5, PIXMAN_null, PIXMAN_r5g6b5, fbCompositeSrc_16x16neon, 0 },
> > > + { PIXMAN_OP_OVER, PIXMAN_b5g6r5, PIXMAN_null, PIXMAN_b5g6r5, fbCompositeSrc_16x16neon, 0 },
> > > + { PIXMAN_OP_SRC, PIXMAN_a8r8g8b8, PIXMAN_null, PIXMAN_r5g6b5, fbCompositeSrc_24x16neon, 0 },
> > > + { PIXMAN_OP_SRC, PIXMAN_a8b8g8r8, PIXMAN_null, PIXMAN_b5g6r5, fbCompositeSrc_24x16neon, 0 },
> > > + { PIXMAN_OP_SRC, PIXMAN_x8r8g8b8, PIXMAN_null, PIXMAN_r5g6b5, fbCompositeSrc_24x16neon, 0 },
> > > + { PIXMAN_OP_SRC, PIXMAN_x8b8g8r8, PIXMAN_null, PIXMAN_b5g6r5, fbCompositeSrc_24x16neon, 0 },
> >
> > Doesn't fbCompositeSrc_24x16neon implement the same operation as
> > fbCompositeSrc_x888x0565neon?
> >
> > How does the performance of those two implementations compare?
>
> I'd forgotten that was there in Ian's stuff. The earlier entry in the
> fastpath table would take precedence, right?
>
> I can't be very precise with the numbers, as I'm testing on customer
> hardware, but my code is "noticeably" faster than Ian's (meaning at
> least 10% better) for the large areas typical of whole-window transfers
> and pictures. This is true on both uncached and shadowed framebuffers,
> and is quite repeatable.
>
> I think this is mostly down to the cache-preloading of the source data
> that I do and Ian doesn't - we're operating quite close to the memory
> bandwidth here (assuming the destination is at least write-combined), so
> latency hiding is a Good Thing.
>
> The difference is also positive on small areas, such as 32x32, though
> the difference is small because overhead elsewhere dominates. I haven't
> measured on very very narrow images, but I would imagine that the same
> principle holds.
>
> Another valid point would be that Ian's code works on armcc, and mine
> doesn't. As such, it's admittedly not very helpful to have two totally
> different routines doing the same thing for armcc and gcc. But if
> somebody would like to write an intrinsics version of my code, perhaps
> that would resolve it. I'd do it, but I haven't got a copy of armcc.
As far as I know, the intrinsics should work on gcc; they just aren't
very fast. So if you add a version that uses intrinsics, choosing
yours over Ian's becomes pretty easy :)
-Jeff
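
For illustration only, here is a rough sketch of what an intrinsics-based
x8r8g8b8 -> r5g6b5 conversion might look like. It is not taken from either
patch discussed above: the names pack_0565 and convert_x888_to_0565 are
hypothetical, the NEON path assumes a little-endian target, and it does no
cache preloading, so it says nothing about the relative performance of the
two routines. On builds without NEON it falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_INTRINSICS 1
#endif

/* Scalar reference: pack one x8r8g8b8 pixel into r5g6b5. */
static uint16_t pack_0565(uint32_t p)
{
    return (uint16_t)(((p >> 8) & 0xf800) |   /* top 5 bits of red   */
                      ((p >> 5) & 0x07e0) |   /* top 6 bits of green */
                      ((p >> 3) & 0x001f));   /* top 5 bits of blue  */
}

/* Hypothetical helper (not a pixman function): convert n pixels. */
static void convert_x888_to_0565(uint16_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

#ifdef HAVE_NEON_INTRINSICS
    /* Eight pixels per iteration. Assumes little-endian byte order, so the
     * de-interleaving load yields val[0]=B, val[1]=G, val[2]=R, val[3]=X. */
    for (; i + 8 <= n; i += 8) {
        uint8x8x4_t px = vld4_u8((const uint8_t *)(src + i));
        uint16x8_t out = vshll_n_u8(px.val[2], 8);             /* R in bits 8-15 */
        out = vsriq_n_u16(out, vshll_n_u8(px.val[1], 8), 5);   /* insert G below top 5 bits */
        out = vsriq_n_u16(out, vshll_n_u8(px.val[0], 8), 11);  /* insert B below top 11 bits */
        vst1q_u16(dst + i, out);
    }
#endif

    /* Scalar tail (and the whole job when NEON is unavailable). */
    for (; i < n; i++)
        dst[i] = pack_0565(src[i]);
}
```

The scalar function doubles as a reference to check the vector path against,
which is handy when comparing an intrinsics version with inline assembly.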