[PATCH] micro-optimize RADEONCopySwap in radeon_accel.c for powerpc

Jochen Rollwagen joro-2013 at t-online.de
Tue Nov 1 17:39:57 UTC 2016


Am 31.10.2016 um 19:30 schrieb Matt Turner:
> On Mon, Oct 31, 2016 at 11:16 AM, Jochen Rollwagen
> <joro-2013 at t-online.de> wrote:
>> Am 31.10.2016 um 07:01 schrieb Matt Turner:
>>> On Fri, Oct 28, 2016 at 1:28 AM, Jochen Rollwagen <joro-2013 at t-online.de>
>>> wrote:
>>>> Hi there,
>>>>
>>>> gcc seems to create some sub-optimal code for the following code sequence
>>>> in
>>>> radeon_accel.c:
>>>>
>>>>    for (; nwords > 0; --nwords, ++d, ++s)
>>>>               *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
>>>>
>>>> the body of the loop compiles to
>>>>
>>>>       lwz 9,40(31)
>>>>       lwz 9,0(9)
>>>>       rotlwi 10,9,16
>>>>       lwz 9,36(31)
>>>>       stw 10,0(9)
>>>>       lwz 9,44(31)
>>>>       addi 9,9,-1
>>>>       stw 9,44(31)
>>>>       lwz 9,36(31)
>>>>       addi 9,9,4
>>>>       stw 9,36(31)
>>>>       lwz 9,40(31)
>>>>       addi 9,9,4
>>>>       stw 9,40(31)
>>>>
>>>> this patch adds some (hopefully optimal) assembler code, bringing it in
>>>> line
>>>> with the other cases in the switch:
>>>>
>>>> diff --git a/src/radeon_accel.c b/src/radeon_accel.c
>>>> index 1def2a3..580fa33 100644
>>>> --- a/src/radeon_accel.c
>>>> +++ b/src/radeon_accel.c
>>>> @@ -138,7 +138,16 @@ void RADEONCopySwap(uint8_t *dst, uint8_t *src,
>>>> unsigned int size, int swap)
>>>>               unsigned int nwords = size >> 2;
>>>>
>>>>               for (; nwords > 0; --nwords, ++d, ++s)
>>>> -               *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
>>>> +#ifdef __powerpc__
>>>> +                       __asm__ volatile ("rlwinm %0,%1,%2,%3,%4\n\t"
>>>> +                                                "rlwimi
>>>> %0,%1,%5,%6,%7\n\t"
>>>> +                                                 : "=&r" (*d)
>>>> +                                                 : "r" (*s),"i" (16),
>>>> "i"
>>>> (16),"i" (31) ,"i" (16), "i" (0),"i" (15)
>>>> +                                                 :);
>>>> +
>>>> +#else
>>>> +                       *d = ((*s & 0xffff) << 16) | ((*s >> 16) &
>>>> 0xffff);
>>>> +#endif
>>> It looks like this code just swaps the two 16-bit shorts. That looks
>>> like a rotate by 16. I don't know anything about PPC assembly, but if
>>> this description [1] of the rlwinm is correct, you should be able to
>>> do it in a single rlwinm instruction, right? E.g., __rlwinm(*s, 16, 0,
>>> 31).
>>>
>>> Also, if it's possible it would be more readable to embed the
>>> immediates in the assembly text string instead of "i" inputs.
>>>
>>> [1] https://gist.github.com/rygorous/1440600
>>
>> Hi Matt,
>>
>> thanks for your review.
>>
>> You’re perfectly right, the one rlwinm does the trick –with half the
>> instructions (one).
>>
>> Which very probably really makes this „the optimal“ code sequence.
> Cool!
>
>> As for the integers, putting them directly in the asm statement seems to
>> make gcc’s inline asm interpret them as register names (r0, r16, r31).
>> There’s probably a funky compiler switch for that, but i guess the safest,
>> cleanest version is the one below.
> You might try prefixing them with a # or $. Some assembly language's
> syntax requires such a prefix for immediate values.
>
> Also, if you want  you might be interested in replacing the other
> #ifdef __powerpc__ cases later in that file with uses of
> __builtin_bswap*.

Hi again,

thanks for your suggestions.

I didn’t find any code examples or gcc documentation for your 
suggestions („#“ or „$“ or whatever), so i guess it’s not possible to 
replace the "i" constraints with gcc’s inline asm. Gcc’s inline asm 
constraints are tricky enough, so i wouldn‘t want to experiment, and the 
code works the way it is. Feel free to experiment and change them (if 
you can find a powerpc machine to compile on, hehehe).

As for the gcc builtin’s:

1.they call (or are replaced by) the asm statements already in the code 
anyway on powerpc platforms. So that would be mainly cosmetics

2. replacing the #ifdef powerpc’s + the rest of the code ( the #else 
path) with __builtin_bswap’s *SHOULD* of course work on all platforms, 
but i can only test it on powerpc and wouldn’t want to break anything else.

3. Besides, i’d have to build the driver again, which 3.1. I’m frankly 
too lazy and 3.2. I don’t have the time for.

So as far as i’m concerned the last version of the patch i sent is the 
final one (i’m attaching it). The builtin_bswap stuff could of course be 
done as a separate patch by somebody able to build & test it on a more 
exotic platform, like, say, some x86-variant :-). You can „deal me in“ 
as tester for powerpc, in that case.

On my machine I’m seeing quite noticeable performance improvements with 
the patch. Maybe you could get it into mainline, I’d be interested 
whether other powerpc holdouts see them, too, or it’s just me (it’s my 
code, anyway, which might influence my perception – I’m sure it does !).

-------------- next part --------------
A non-text attachment was scrubbed...
Name: opt_RADEON_HOST_DATA_SWAP_HDW_copyswap_case_for_ppc.patch
Type: text/x-patch
Size: 766 bytes
Desc: not available
URL: <https://lists.x.org/archives/xorg-driver-ati/attachments/20161101/01483fc8/attachment.bin>


More information about the xorg-driver-ati mailing list