[PATCH] micro-optimize RADEONCopySwap in radeon_accel.c for powerpc
Matt Turner
mattst88 at gmail.com
Mon Oct 31 06:01:24 UTC 2016
On Fri, Oct 28, 2016 at 1:28 AM, Jochen Rollwagen <joro-2013 at t-online.de> wrote:
> Hi there,
>
> gcc seems to create some sub-optimal code for the following code sequence in
> radeon_accel.c:
>
> for (; nwords > 0; --nwords, ++d, ++s)
> *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
>
> the body of the loop compiles to
>
> lwz 9,40(31)
> lwz 9,0(9)
> rotlwi 10,9,16
> lwz 9,36(31)
> stw 10,0(9)
> lwz 9,44(31)
> addi 9,9,-1
> stw 9,44(31)
> lwz 9,36(31)
> addi 9,9,4
> stw 9,36(31)
> lwz 9,40(31)
> addi 9,9,4
> stw 9,40(31)
>
> this patch adds some (hopefully optimal) assembler code, bringing it in line
> with the other cases in the switch:
>
> diff --git a/src/radeon_accel.c b/src/radeon_accel.c
> index 1def2a3..580fa33 100644
> --- a/src/radeon_accel.c
> +++ b/src/radeon_accel.c
> @@ -138,7 +138,16 @@ void RADEONCopySwap(uint8_t *dst, uint8_t *src,
> unsigned int size, int swap)
> unsigned int nwords = size >> 2;
>
> for (; nwords > 0; --nwords, ++d, ++s)
> - *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
> +#ifdef __powerpc__
> + __asm__ volatile ("rlwinm %0,%1,%2,%3,%4\n\t"
> + "rlwimi %0,%1,%5,%6,%7\n\t"
> + : "=&r" (*d)
> + : "r" (*s),"i" (16), "i"
> (16),"i" (31) ,"i" (16), "i" (0),"i" (15)
> + :);
> +
> +#else
> + *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
> +#endif
It looks like this code just swaps the two 16-bit shorts. That looks
like a rotate by 16. I don't know anything about PPC assembly, but if
this description [1] of the rlwinm is correct, you should be able to
do it in a single rlwinm instruction, right? E.g., __rlwinm(*s, 16, 0,
31).
Also, if it's possible it would be more readable to embed the
immediates in the assembly text string instead of "i" inputs.
[1] https://gist.github.com/rygorous/1440600
More information about the xorg-driver-ati
mailing list