[PATCH] micro-optimize RADEONCopySwap in radeon_accel.c for powerpc

Mon Oct 31 18:16:08 UTC 2016

Am 31.10.2016 um 07:01 schrieb Matt Turner:
> On Fri, Oct 28, 2016 at 1:28 AM, Jochen Rollwagen <joro-2013 at t-online.de> wrote:
>> Hi there,
>>
>> gcc seems to create some sub-optimal code for the following code sequence in
>> radeon_accel.c:
>>
>>   for (; nwords > 0; --nwords, ++d, ++s)
>>              *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
>>
>> the body of the loop compiles to
>>
>>      lwz 9,40(31)
>>      lwz 9,0(9)
>>      rotlwi 10,9,16
>>      lwz 9,36(31)
>>      stw 10,0(9)
>>      lwz 9,44(31)
>>      addi 9,9,-1
>>      stw 9,44(31)
>>      lwz 9,36(31)
>>      addi 9,9,4
>>      stw 9,36(31)
>>      lwz 9,40(31)
>>      addi 9,9,4
>>      stw 9,40(31)
>>
>> this patch adds some (hopefully optimal) assembler code, bringing it in line
>> with the other cases in the switch:
>>
>> diff --git a/src/radeon_accel.c b/src/radeon_accel.c
>> index 1def2a3..580fa33 100644
>> --- a/src/radeon_accel.c
>> +++ b/src/radeon_accel.c
>> @@ -138,7 +138,16 @@ void RADEONCopySwap(uint8_t *dst, uint8_t *src,
>> unsigned int size, int swap)
>>              unsigned int nwords = size >> 2;
>>
>>              for (; nwords > 0; --nwords, ++d, ++s)
>> -               *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
>> +#ifdef __powerpc__
>> +                       __asm__ volatile ("rlwinm %0,%1,%2,%3,%4\n\t"
>> +                                                "rlwimi %0,%1,%5,%6,%7\n\t"
>> +                                                 : "=&r" (*d)
>> +                                                 : "r" (*s),"i" (16), "i"
>> (16),"i" (31) ,"i" (16), "i" (0),"i" (15)
>> +                                                 :);
>> +
>> +#else
>> +                       *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
>> +#endif
> It looks like this code just swaps the two 16-bit shorts. That looks
> like a rotate by 16. I don't know anything about PPC assembly, but if
> this description [1] of the rlwinm is correct, you should be able to
> do it in a single rlwinm instruction, right? E.g., __rlwinm(*s, 16, 0,
> 31).
>
> Also, if it's possible it would be more readable to embed the
> immediates in the assembly text string instead of "i" inputs.
>
> [1] https://gist.github.com/rygorous/1440600

Hi Matt,

thanks for your review.

You’re perfectly right, the one rlwinm does the trick –with half the 
instructions (one).

Which very probably really makes this „the optimal“ code sequence.

As for the integers, putting them directly in the asm statement seems to 
make gcc’s inline asm interpret them as register names (r0, r16, r31). 
There’s probably a funky compiler switch for that, but i guess the 
safest, cleanest version is the one below.

  Cheers

  Jochen

diff --git a/src/radeon_accel.c b/src/radeon_accel.c
index 1def2a3..2de2b96 100644
--- a/src/radeon_accel.c
+++ b/src/radeon_accel.c
@@ -137,8 +137,16 @@ void RADEONCopySwap(uint8_t *dst, uint8_t *src, 
unsigned int size, int swap)
             unsigned int *s = (unsigned int *)src;
             unsigned int nwords = size >> 2;

-           for (; nwords > 0; --nwords, ++d, ++s)
-               *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
+           for (; nwords > 0; --nwords, ++d, ++s)
+#ifdef __powerpc__
+                       __asm__ volatile ("rlwinm %0,%1,%2,%3,%4\n\t"
+                                                 : "=&r" (*d)
+                                                 : "r" (*s),"i" (16), 
"i" (0),"i" (31)
+                                                         :);
+
+#else
+                       *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
+#endif
             return;
          }
      case RADEON_HOST_DATA_SWAP_32BIT: