[cairo] New ARMv7-A (NEON) optimisations for Pixman

Mon May 11 12:34:47 PDT 2009

On Mon, May 11, 2009 at 09:07:59AM +0000, Jonathan Morton wrote:
> On Fri, 2009-05-08 at 14:10 -0400, Jeff Muizelaar wrote:
> > On Fri, May 08, 2009 at 11:26:13AM +0000, Jonathan Morton wrote:
> > > +#ifdef USE_GCC_INLINE_ASM
> > > +    { PIXMAN_OP_SRC,  PIXMAN_r5g6b5,   PIXMAN_null,     PIXMAN_r5g6b5,   fbCompositeSrc_16x16neon,              0 },
> > > +    { PIXMAN_OP_SRC,  PIXMAN_b5g6r5,   PIXMAN_null,     PIXMAN_b5g6r5,   fbCompositeSrc_16x16neon,              0 },
> > > +    { PIXMAN_OP_OVER, PIXMAN_r5g6b5,   PIXMAN_null,     PIXMAN_r5g6b5,   fbCompositeSrc_16x16neon,              0 },
> > > +    { PIXMAN_OP_OVER, PIXMAN_b5g6r5,   PIXMAN_null,     PIXMAN_b5g6r5,   fbCompositeSrc_16x16neon,              0 },
> > > +    { PIXMAN_OP_SRC,  PIXMAN_a8r8g8b8, PIXMAN_null,     PIXMAN_r5g6b5,   fbCompositeSrc_24x16neon,              0 },
> > > +    { PIXMAN_OP_SRC,  PIXMAN_a8b8g8r8, PIXMAN_null,     PIXMAN_b5g6r5,   fbCompositeSrc_24x16neon,              0 },
> > > +    { PIXMAN_OP_SRC,  PIXMAN_x8r8g8b8, PIXMAN_null,     PIXMAN_r5g6b5,   fbCompositeSrc_24x16neon,              0 },
> > > +    { PIXMAN_OP_SRC,  PIXMAN_x8b8g8r8, PIXMAN_null,     PIXMAN_b5g6r5,   fbCompositeSrc_24x16neon,              0 },
> > 
> > Doesn't fbCompositeSrc_24x16neon implement the same operation as
> > fbCompositeSrc_x888x0565neon?
> > 
> > How does the performance of those two implementations compare?
> 
> I'd forgotten that was there in Ian's stuff.  The earlier entry in the
> fastpath table would take precedence, right?
> 
> I can't be very precise with the numbers, as I'm testing on customer
> hardware, but my code is "noticeably" faster than Ian's (meaning at
> least 10% better) for the large areas typical of whole-window transfers
> and pictures.  This is true on both uncached and shadowed framebuffers,
> and is quite repeatable.
> 
> I think this is mostly down to the cache-preloading of the source data
> that I do and Ian doesn't - we're operating quite close to the memory
> bandwidth here (assuming the destination is at least write-combined), so
> latency hiding is a Good Thing.
> 
> The difference is also positive on small areas, such as 32x32, though
> the difference is small because overhead elsewhere dominates.  I haven't
> measured on very very narrow images, but I would imagine that the same
> principle holds.
> 
> Another valid point would be that Ian's code works on armcc, and mine
> doesn't.  As such, it's admittedly not very helpful to have two totally
> different routines doing the same thing for armcc and gcc.  But if
> somebody would like to write an intrinsics version of my code, perhaps
> that would resolve it.  I'd do it, but I haven't got a copy of armcc.

As far as I know, the intrinsics should work on gcc; they just aren't
very fast. So if you add a version that uses intrinsics, choosing
yours over Ian's becomes pretty easy :)
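
For illustration only, here is a minimal sketch of what an intrinsics
version of the x8r8g8b8 to r5g6b5 conversion might look like. It is not
Jonathan's or Ian's code. The names (pack_0565, prefetch_src,
convert_row_8888_to_0565), the PREFETCH_AHEAD distance and the
little-endian byte order are all assumptions, and the prefetch relies on
GCC's __builtin_prefetch, so it is compiled out elsewhere. Without NEON
it falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_INTRINSICS 1
#endif

/* Assumed prefetch distance in pixels; a tuning guess, not a measurement. */
#define PREFETCH_AHEAD 64

/* Hypothetical helper (not a pixman function): pack one x8r8g8b8 pixel
 * into r5g6b5 by keeping the top 5/6/5 bits of each channel. */
static inline uint16_t pack_0565(uint32_t p)
{
    return (uint16_t)(((p >> 8) & 0xF800) |
                      ((p >> 5) & 0x07E0) |
                      ((p >> 3) & 0x001F));
}

/* Hint the cache about source pixels we will read soon.  Only issued
 * while the target is still inside the row; a no-op without GCC. */
static inline void prefetch_src(const uint32_t *src, size_t i, size_t width)
{
#if defined(__GNUC__)
    if (i + PREFETCH_AHEAD < width)
        __builtin_prefetch(src + i + PREFETCH_AHEAD);
#else
    (void)src; (void)i; (void)width;
#endif
}

void convert_row_8888_to_0565(const uint32_t *src, uint16_t *dst, size_t width)
{
    size_t i = 0;

    assert(src != NULL && dst != NULL);

#ifdef HAVE_NEON_INTRINSICS
    /* Eight pixels per iteration.  Assumes little-endian memory order,
     * so the de-interleaved lanes are val[0]=B, val[1]=G, val[2]=R. */
    for (; i + 8 <= width; i += 8) {
        uint8x8x4_t px;
        uint16x8_t r, g, b;

        prefetch_src(src, i, width);
        px = vld4_u8((const uint8_t *)(src + i));
        r = vshll_n_u8(px.val[2], 8);   /* each channel into bits 15..8 */
        g = vshll_n_u8(px.val[1], 8);
        b = vshll_n_u8(px.val[0], 8);
        r = vsriq_n_u16(r, g, 5);       /* keep top 5 bits of R, insert G */
        r = vsriq_n_u16(r, b, 11);      /* keep R and G, insert top 5 of B */
        vst1q_u16(dst + i, r);
    }
#endif

    /* Scalar loop: the whole row without NEON, otherwise the last 0..7
     * pixels.  Prefetch once every 16 pixels (64 bytes of source). */
    for (; i < width; i++) {
        if ((i & 15) == 0)
            prefetch_src(src, i, width);
        dst[i] = pack_0565(src[i]);
    }
}
```

Whether gcc turns the intrinsics into anything as tight as hand-written
inline asm is exactly the open question, so this would need measuring
against both existing routines before it replaced either.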

-Jeff