[PATCH] micro-optimize RADEONCopySwap in radeon_accel.c for powerpc

Matt Turner mattst88 at gmail.com
Mon Oct 31 06:01:24 UTC 2016


On Fri, Oct 28, 2016 at 1:28 AM, Jochen Rollwagen <joro-2013 at t-online.de> wrote:
> Hi there,
>
> gcc seems to create some sub-optimal code for the following code sequence in
> radeon_accel.c:
>
>  for (; nwords > 0; --nwords, ++d, ++s)
>             *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
>
> the body of the loop compiles to
>
>     lwz 9,40(31)
>     lwz 9,0(9)
>     rotlwi 10,9,16
>     lwz 9,36(31)
>     stw 10,0(9)
>     lwz 9,44(31)
>     addi 9,9,-1
>     stw 9,44(31)
>     lwz 9,36(31)
>     addi 9,9,4
>     stw 9,36(31)
>     lwz 9,40(31)
>     addi 9,9,4
>     stw 9,40(31)
>
> this patch adds some (hopefully optimal) assembler code, bringing it in line
> with the other cases in the switch:
>
> diff --git a/src/radeon_accel.c b/src/radeon_accel.c
> index 1def2a3..580fa33 100644
> --- a/src/radeon_accel.c
> +++ b/src/radeon_accel.c
> @@ -138,7 +138,16 @@ void RADEONCopySwap(uint8_t *dst, uint8_t *src,
> unsigned int size, int swap)
>             unsigned int nwords = size >> 2;
>
>             for (; nwords > 0; --nwords, ++d, ++s)
> -               *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
> +#ifdef __powerpc__
> +                       __asm__ volatile ("rlwinm %0,%1,%2,%3,%4\n\t"
> +                                                "rlwimi %0,%1,%5,%6,%7\n\t"
> +                                                 : "=&r" (*d)
> +                                                 : "r" (*s),"i" (16), "i"
> (16),"i" (31) ,"i" (16), "i" (0),"i" (15)
> +                                                 :);
> +
> +#else
> +                       *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
> +#endif

It looks like this code just swaps the two 16-bit shorts. That looks
like a rotate by 16. I don't know anything about PPC assembly, but if
this description [1] of the rlwinm is correct, you should be able to
do it in a single rlwinm instruction, right? E.g., __rlwinm(*s, 16, 0,
31).

Also, if it's possible it would be more readable to embed the
immediates in the assembly text string instead of "i" inputs.

[1] https://gist.github.com/rygorous/1440600


More information about the xorg-driver-ati mailing list