[Mesa-dev] [PATCH 03/23] i965/fs: Use MOV.nz instead of AND.nz to generate flag on GEN6+

Mon Apr 6 11:35:49 PDT 2015

On Fri, Mar 20, 2015 at 1:58 PM, Ian Romanick <idr at freedesktop.org> wrote:
> From: Ian Romanick <ian.d.romanick at intel.com>
>
> On SNB+, the Boolean result is always 0 or ~0, so MOV.nz produces the
> same effect as AND.nz.  However, later cmod propagation passes can
> handle the MOV.nz, but they cannot handle the AND.nz because the source
> is not generated by a CMP.
>
> It's worth noting that this commit was a lot more effective before
> commit bb22aa0 (i965/fs: Ignore type in cmod prop if scan_inst is CMP.).
> Without that commit, this commit improved ~2,500 shaders on each
> affected platform, including Sandy Bridge.
>
> Ivy Bridge (0x0166):
> total instructions in shared programs: 6291794 -> 6291668 (-0.00%)
> instructions in affected programs:     41207 -> 41081 (-0.31%)
> helped:                                154
> HURT:                                  28
>
> Haswell (0x0426):
> total instructions in shared programs: 5779180 -> 5779054 (-0.00%)
> instructions in affected programs:     37210 -> 37084 (-0.34%)
> helped:                                154
> HURT:                                  28
>
> Broadwell (0x162E):
> total instructions in shared programs: 6823014 -> 6822848 (-0.00%)
> instructions in affected programs:     40195 -> 40029 (-0.41%)
> helped:                                164
> HURT:                                  28
>
> No change on GM45, Iron Lake, Sandy Bridge, Ivy Bridge with NIR, or
> Haswell with NIR.
>
> Signed-off-by: Ian Romanick <ian.d.romanick at intel.com>
> ---
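
For anyone following along, the equivalence the commit message relies on can be checked with a small C sketch. The helper names are hypothetical and only model how a .nz conditional modifier sets the flag (result != 0); this is not Mesa code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of "and.nz.f0 null, src, 1D": flag = ((src & 1) != 0). */
static bool flag_from_and_nz(uint32_t src)
{
    return (src & 1u) != 0;
}

/* Hypothetical model of "mov.nz.f0 null, src": flag = (src != 0). */
static bool flag_from_mov_nz(uint32_t src)
{
    return src != 0;
}

/* The two models agree whenever src is a GEN6+ style Boolean (0 or ~0). */
static void check_boolean_equivalence(void)
{
    const uint32_t booleans[] = { 0u, ~0u };
    for (int i = 0; i < 2; i++)
        assert(flag_from_and_nz(booleans[i]) == flag_from_mov_nz(booleans[i]));
}
```

For an arbitrary value such as 2 the two models differ (the AND.nz model clears the flag, the MOV.nz model sets it), which is why the 0 or ~0 guarantee matters.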

I looked at some helped shaders. They seem to be doing this:

const vec4 ps_c0 = vec4(1.0, -1.0, 0.0, -0.0);
...
        t0_ps.x = (gl_FrontFacing ? ps_c0.x : ps_c0.y);
        t0_ps.y = (gl_FrontFacing ? ps_c0.w : ps_c0.y);
        t0_ps.x = ((-t0_ps.x >= 0.0) ? ps_c0.z : ps_c0.x);

so before this patch we hit the
fs_visitor::try_opt_frontfacing_ternary path for t0_ps.x and not for
t0_ps.y, generating:

asr(8)          g26<1>D         -g0<0,1,0>W     15D
or(8)           g36.1<2>W       g0<0,1,0>W      0x3f80UW
mov(1)          g25<1>F         [0F, 0F, 0F, 0F]VF
and.nz.f0(8)    null            g26<8,8,1>D     1D    <--- this gets removed with this patch
and(8)          g35<1>D         g36<8,8,1>D     0xbf800000UD
mov(8)          g38<1>F         -g25<0,1,0>F
mov(8)          g40<1>F         g25<0,1,0>F
(+f0) sel(8)    g37<1>F         g38<8,8,1>F     -1F
cmp.ge.f0(8)    null            -g35<8,8,1>F    g25<0,1,0>F
(+f0) sel(8)    g39<1>F         g40<8,8,1>F     1F

After this patch we generate:

asr.nz.f0(8)    null            -g0<0,1,0>W     15D
or(8)           g35.1<2>W       g0<0,1,0>W      0x3f80UW
mov(1)          g25<1>F         [0F, 0F, 0F, 0F]VF
and(8)          g34<1>D         g35<8,8,1>D     0xbf800000UD
mov(8)          g37<1>F         -g25<0,1,0>F
mov(8)          g39<1>F         g25<0,1,0>F
(+f0) sel(8)    g36<1>F         g37<8,8,1>F     -1F
cmp.ge.f0(8)    null            -g34<8,8,1>F    g25<0,1,0>F
(+f0) sel(8)    g38<1>F         g39<8,8,1>F     1F

10 instructions to 9. That's an annoying amount of assembly to digest,
but basically we're just benefiting because of the order of the uses of
the flag. If we could simply rearrange the flag writes and reads, we
would generate better code, and...

If we could recognize that there are multiple gl_FrontFacing ? ... :
... expressions, we probably would have just emitted asr.nz.f0 and a
couple of SELs.
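
To make that concrete, here is a rough C model of the GLSL above in which the front-facing condition is evaluated once and reused by both selects. It is only a sketch of the idea: the struct and function names are made up for illustration, and nothing here is actual i965 backend code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for t0_ps.xy. */
struct t0 { float x, y; };

static struct t0 eval_t0(bool front_facing)
{
    /* Mirrors ps_c0 = vec4(1.0, -1.0, 0.0, -0.0) from the shader. */
    const float c0_x = 1.0f, c0_y = -1.0f, c0_z = 0.0f, c0_w = -0.0f;
    struct t0 r;

    /* One flag write (think asr.nz.f0), reused by two selects. */
    const bool f0 = front_facing;
    r.x = f0 ? c0_x : c0_y;
    r.y = f0 ? c0_w : c0_y;

    /* The third statement needs its own compare (cmp.ge.f0 in the assembly). */
    r.x = (-r.x >= 0.0f) ? c0_z : c0_x;
    return r;
}

/* Small self-check of both facings; -0.0f compares equal to 0.0f. */
static void check_eval_t0(void)
{
    assert(eval_t0(true).x == 1.0f);
    assert(eval_t0(true).y == 0.0f);
    assert(eval_t0(false).x == 0.0f);
    assert(eval_t0(false).y == -1.0f);
}
```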

So I don't really think this patch is helping anything except by accident. :)