[PATCH] micro-optimize RADEONCopySwap in radeon_accel.c for powerpc

Mon Oct 31 18:30:45 UTC 2016

On Mon, Oct 31, 2016 at 11:16 AM, Jochen Rollwagen
<joro-2013 at t-online.de> wrote:
> Am 31.10.2016 um 07:01 schrieb Matt Turner:
>>
>> On Fri, Oct 28, 2016 at 1:28 AM, Jochen Rollwagen <joro-2013 at t-online.de>
>> wrote:
>>>
>>> Hi there,
>>>
>>> gcc seems to create some sub-optimal code for the following code sequence
>>> in
>>> radeon_accel.c:
>>>
>>>   for (; nwords > 0; --nwords, ++d, ++s)
>>>              *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
>>>
>>> the body of the loop compiles to
>>>
>>>      lwz 9,40(31)
>>>      lwz 9,0(9)
>>>      rotlwi 10,9,16
>>>      lwz 9,36(31)
>>>      stw 10,0(9)
>>>      lwz 9,44(31)
>>>      addi 9,9,-1
>>>      stw 9,44(31)
>>>      lwz 9,36(31)
>>>      addi 9,9,4
>>>      stw 9,36(31)
>>>      lwz 9,40(31)
>>>      addi 9,9,4
>>>      stw 9,40(31)
>>>
>>> this patch adds some (hopefully optimal) assembler code, bringing it in
>>> line
>>> with the other cases in the switch:
>>>
>>> diff --git a/src/radeon_accel.c b/src/radeon_accel.c
>>> index 1def2a3..580fa33 100644
>>> --- a/src/radeon_accel.c
>>> +++ b/src/radeon_accel.c
>>> @@ -138,7 +138,16 @@ void RADEONCopySwap(uint8_t *dst, uint8_t *src,
>>> unsigned int size, int swap)
>>>              unsigned int nwords = size >> 2;
>>>
>>>              for (; nwords > 0; --nwords, ++d, ++s)
>>> -               *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
>>> +#ifdef __powerpc__
>>> +                       __asm__ volatile ("rlwinm %0,%1,%2,%3,%4\n\t"
>>> +                                                "rlwimi
>>> %0,%1,%5,%6,%7\n\t"
>>> +                                                 : "=&r" (*d)
>>> +                                                 : "r" (*s),"i" (16),
>>> "i"
>>> (16),"i" (31) ,"i" (16), "i" (0),"i" (15)
>>> +                                                 :);
>>> +
>>> +#else
>>> +                       *d = ((*s & 0xffff) << 16) | ((*s >> 16) &
>>> 0xffff);
>>> +#endif
>>
>> It looks like this code just swaps the two 16-bit shorts. That looks
>> like a rotate by 16. I don't know anything about PPC assembly, but if
>> this description [1] of the rlwinm is correct, you should be able to
>> do it in a single rlwinm instruction, right? E.g., __rlwinm(*s, 16, 0,
>> 31).
>>
>> Also, if it's possible it would be more readable to embed the
>> immediates in the assembly text string instead of "i" inputs.
>>
>> [1] https://gist.github.com/rygorous/1440600
>
>
> Hi Matt,
>
> thanks for your review.
>
> You’re perfectly right, the one rlwinm does the trick –with half the
> instructions (one).
>
> Which very probably really makes this „the optimal“ code sequence.

Cool!

> As for the integers, putting them directly in the asm statement seems to
> make gcc’s inline asm interpret them as register names (r0, r16, r31).
> There’s probably a funky compiler switch for that, but i guess the safest,
> cleanest version is the one below.

You might try prefixing them with a # or $. Some assembly language's
syntax requires such a prefix for immediate values.

Also, if you want  you might be interested in replacing the other
#ifdef __powerpc__ cases later in that file with uses of
__builtin_bswap*.