[Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

Fri Apr 3 07:28:28 PDT 2015

Jason Ekstrand <jason at jlekstrand.net> writes:

> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>>> Instead of the complicated and broken-by-design pile of heuristics we had
>>> before, we now have a straightforward lowering:
>>>
>>>  1) All header sources are copied directly using force_writemask_all and,
>>>     since they are guaranteed to be a single register, there are no
>>>     force_sechalf issues.
>>>
>>>  2) All non-header sources are copied using the exact same force_sechalf
>>>     and saturate modifiers as the LOAD_PAYLOAD operation itself.
>>>
>>>  3) In order to accommodate older gens that need interleaved colors,
>>>     lower_load_payload detects when the destination is a COMPR4 register
>>>     and automatically interleaves the non-header sources.  The
>>>     lower_load_payload pass does the right thing here regardless of whether
>>>     or not the hardware actually supports COMPR4.
>>
>> I had a quick glance at the series and it seems to be going in the right
>> direction.
>>
>> One thing I honestly don't like is the ad-hoc and IMHO premature
>> treatment of payload headers, it still feels like the LOAD_PAYLOAD
>> instruction has more complex semantics than necessary and the benefit is
>> unclear to me.
>>
>> I suppose that your motivation was to avoid setting force_writemask_all
>> in LOAD_PAYLOAD instructions with header.  The optimizer should be able
>> to cope with those flags and get rid of them from the resulting moves
>> where they are redundant, and if it's not able to it looks like
>> something that should be fixed anyway.  The explicit handling of headers
>> is responsible for much of the churn in this series and is likely to
>> complicate users of LOAD_PAYLOAD and optimization passes that have to
>> manipulate them.
>
> Avoiding force_writemask_all is only half of the motivation and the
> small half at that.  A header source, more properly defined, is a
> single physical register that, conceptually, applies to all channels.
> Effectively, a header source (I should have stated this clearly) has
> two properties:
>
> 1) It has force_writemask_all set
> 2) It is exactly one physical hardware register.
>
> This second property is the more important of the two.  Most of the
> disaster of the previous LOAD_PAYLOAD implementation was that we did a
> pile of guesswork and had a ill-conceved "effective width" think in
> order to figure out how big the register actually was.  Making the
> user specify which sources are header sources eliminates that
> guesswork.  It also has the nice side-effect that we can do the right
> force_writemask_all and we can properly handle COMPR4 for the the
> user.

Yeah, true, but this seems like the least orthogonal and most annoying
to use solution for this problem, it forces the caller to provide
redundant information, it takes into account the saturate flag on some
arguments and not others, it shuffles sources with respect to the
specified order when COMPR4 is set, but only for the first four
non-header sources.  I think any of the following solutions would be
better-behaved than the current approach:

1/ Use the source width to determine the size of each copy.  This would
   imply that the source width carries semantic information and hence
   would have to be left alone by e.g. copy propagation.

2/ Use the instruction exec size and flags to determine the properties
   of *all* copies.  This means that if a header is present the exec
   size would necessarily have to be 8 and the halves of a 16-wide
   register would have to be specified separately, which sounds annoying
   at first but in practice wouldn't necessarily be because it could be
   handled by the LOAD_PAYLOAD() helper based on the argument widths
   without running into problems with optimization passes changing the
   meaning of the instruction.  The semantics of the instruction itself
   would be as stupid as possible, but the implementation could still
   trivially recognise 16-wide and COMPR4 copies using the exact same
   mechanism you are using now.

3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead
   simple, say COLLECT and LOAD_HEADER.  "COLLECT dst, src0, ..., srcn"
   would be equivalent to the LOAD_PAYLOAD instruction described in 2,
   but it would only be used to load full-width non-header sources of
   the payload, so you would avoid the need to split 16-wide registers
   in half.  "LOAD_HEADER dst, header, payload" would handle the
   asymmetric requirements of prepending a header, like using 8-wide
   instead of exec_size-wide copies and setting force_writemask_all.
   You could use mlen to specify the size of payload as is usual for
   instructions taking a payload source.

> --Jason
>
>> Thanks for looking into this BTW.
>>
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_fs.cpp         | 154 ++++++++++++++-------------
>>>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |   3 -
>>>  2 files changed, 80 insertions(+), 77 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> index f8e26c0..bc45a38 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>> @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
>>>  {
>>>     bool progress = false;
>>>
>>> -   int vgrf_to_reg[alloc.count];
>>> -   int reg_count = 0;
>>> -   for (unsigned i = 0; i < alloc.count; ++i) {
>>> -      vgrf_to_reg[i] = reg_count;
>>> -      reg_count += alloc.sizes[i];
>>> -   }
>>> -
>>> -   struct {
>>> -      bool written:1; /* Whether this register has ever been written */
>>> -      bool force_writemask_all:1;
>>> -      bool force_sechalf:1;
>>> -   } metadata[reg_count];
>>> -   memset(metadata, 0, sizeof(metadata));
>>> -
>>>     foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
>>> -      if (inst->dst.file == GRF) {
>>> -         const int dst_reg = vgrf_to_reg[inst->dst.reg] + inst->dst.reg_offset;
>>> -         bool force_sechalf = inst->force_sechalf &&
>>> -                              !inst->force_writemask_all;
>>> -         bool toggle_sechalf = inst->dst.width == 16 &&
>>> -                               type_sz(inst->dst.type) == 4 &&
>>> -                               !inst->force_writemask_all;
>>> -         for (int i = 0; i < inst->regs_written; ++i) {
>>> -            metadata[dst_reg + i].written = true;
>>> -            metadata[dst_reg + i].force_sechalf = force_sechalf;
>>> -            metadata[dst_reg + i].force_writemask_all = inst->force_writemask_all;
>>> -            force_sechalf = (toggle_sechalf != force_sechalf);
>>> -         }
>>> -      }
>>> -
>>>        if (inst->opcode == SHADER_OPCODE_LOAD_PAYLOAD) {
>>>           assert(inst->dst.file == MRF || inst->dst.file == GRF);
>>> +
>>>           fs_reg dst = inst->dst;
>>>
>>> -         for (int i = 0; i < inst->sources; i++) {
>>> -            dst.width = inst->src[i].effective_width;
>>> -            dst.type = inst->src[i].type;
>>> -
>>> -            if (inst->src[i].file == BAD_FILE) {
>>> -               /* Do nothing but otherwise increment as normal */
>>> -            } else if (dst.file == MRF &&
>>> -                       dst.width == 8 &&
>>> -                       brw->has_compr4 &&
>>> -                       i + 4 < inst->sources &&
>>> -                       inst->src[i + 4].equals(horiz_offset(inst->src[i], 8))) {
>>> -               fs_reg compr4_dst = dst;
>>> -               compr4_dst.reg += BRW_MRF_COMPR4;
>>> -               compr4_dst.width = 16;
>>> -               fs_reg compr4_src = inst->src[i];
>>> -               compr4_src.width = 16;
>>> -               fs_inst *mov = MOV(compr4_dst, compr4_src);
>>> +         /* Get rid of COMPR4.  We'll add it back in if we need it */
>>> +         if (dst.file == MRF && dst.reg & BRW_MRF_COMPR4)
>>> +            dst.reg = dst.reg & ~BRW_MRF_COMPR4;
>>> +
>>> +         dst.width = 8;
>>> +         for (uint8_t i = 0; i < inst->header_size; i++) {
>>> +            if (inst->src[i].file != BAD_FILE) {
>>> +               fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD);
>>> +               fs_reg mov_src = retype(inst->src[i], BRW_REGISTER_TYPE_UD);
>>> +               mov_src.width = 8;
>>> +               fs_inst *mov = MOV(mov_dst, mov_src);
>>>                 mov->force_writemask_all = true;
>>>                 inst->insert_before(block, mov);
>>> -               /* Mark i+4 as BAD_FILE so we don't emit a MOV for it */
>>> -               inst->src[i + 4].file = BAD_FILE;
>>> -            } else {
>>> -               fs_inst *mov = MOV(dst, inst->src[i]);
>>> -               if (inst->src[i].file == GRF) {
>>> -                  int src_reg = vgrf_to_reg[inst->src[i].reg] +
>>> -                                inst->src[i].reg_offset;
>>> -                  mov->force_sechalf = metadata[src_reg].force_sechalf;
>>> -                  mov->force_writemask_all = metadata[src_reg].force_writemask_all;
>>> -               } else {
>>> -                  /* We don't have any useful metadata for immediates or
>>> -                   * uniforms.  Assume that any of the channels of the
>>> -                   * destination may be used.
>>> -                   */
>>> -                  assert(inst->src[i].file == IMM ||
>>> -                         inst->src[i].file == UNIFORM);
>>> -                  mov->force_writemask_all = true;
>>> -               }
>>> +            }
>>> +            dst = offset(dst, 1);
>>> +         }
>>>
>>> -               if (dst.file == GRF) {
>>> -                  const int dst_reg = vgrf_to_reg[dst.reg] + dst.reg_offset;
>>> -                  const bool force_writemask = mov->force_writemask_all;
>>> -                  metadata[dst_reg].force_writemask_all = force_writemask;
>>> -                  metadata[dst_reg].force_sechalf = mov->force_sechalf;
>>> -                  if (dst.width * type_sz(dst.type) > 32) {
>>> -                     assert(!mov->force_sechalf);
>>> -                     metadata[dst_reg + 1].force_writemask_all = force_writemask;
>>> -                     metadata[dst_reg + 1].force_sechalf = !force_writemask;
>>> +         dst.width = inst->exec_size;
>>> +         if (inst->dst.file == MRF && (inst->dst.reg & BRW_MRF_COMPR4) &&
>>> +             inst->exec_size > 8) {
>>> +            /* In this case, the payload portion of the LOAD_PAYLOAD isn't
>>> +             * a straightforward copy.  Instead, the result of the
>>> +             * LOAD_PAYLOAD is treated as interlaced and the first four
>>> +             * non-header sources are unpacked as:
>>> +             *
>>> +             * m + 0: r0
>>> +             * m + 1: g0
>>> +             * m + 2: b0
>>> +             * m + 3: a0
>>> +             * m + 4: r1
>>> +             * m + 5: g1
>>> +             * m + 6: b1
>>> +             * m + 7: a1
>>> +             *
>>> +             * This is used for gen <= 5 fb writes.
>>> +             */
>>> +            assert(inst->exec_size == 16);
>>> +            assert(inst->header_size + 4 <= inst->sources);
>>> +            for (uint8_t i = inst->header_size; i < inst->header_size + 4; i++) {
>>> +               if (inst->src[i].file != BAD_FILE) {
>>> +                  if (brw->has_compr4) {
>>> +                     fs_reg compr4_dst = retype(dst, inst->src[i].type);
>>> +                     compr4_dst.reg |= BRW_MRF_COMPR4;
>>> +
>>> +                     fs_inst *mov = MOV(compr4_dst, inst->src[i]);
>>> +                     mov->force_writemask_all = inst->force_writemask_all;
>>> +                     mov->saturate = inst->saturate;
>>> +                     inst->insert_before(block, mov);
>>> +                  } else {
>>> +                     /* Platform doesn't have COMPR4.  We have to fake it */
>>> +                     fs_reg mov_dst = retype(dst, inst->src[i].type);
>>> +                     mov_dst.width = 8;
>>> +
>>> +                     fs_inst *mov = MOV(mov_dst, half(inst->src[i], 0));
>>> +                     mov->force_writemask_all = inst->force_writemask_all;
>>> +                     mov->saturate = inst->saturate;
>>> +                     inst->insert_before(block, mov);
>>> +
>>> +                     mov = MOV(offset(mov_dst, 4), half(inst->src[i], 1));
>>> +                     mov->force_writemask_all = inst->force_writemask_all;
>>> +                     mov->saturate = inst->saturate;
>>> +                     mov->force_sechalf = true;
>>> +                     inst->insert_before(block, mov);
>>>                    }
>>>                 }
>>>
>>> -               inst->insert_before(block, mov);
>>> +               dst.reg++;
>>>              }
>>>
>>> +            dst.reg += 4;
>>> +
>>> +            /* The COMPR4 code took care of the first 4 sources.  We'll let
>>> +             * the regular path handle any remaining sources.  Yes, we are
>>> +             * modifying the instruction but we're about to delete it so
>>> +             * this really doesn't hurt anything.
>>> +             */
>>> +            inst->header_size += 4;
>>> +         }
>>> +
>>> +         for (uint8_t i = inst->header_size; i < inst->sources; i++) {
>>> +            if (inst->src[i].file != BAD_FILE) {
>>> +               fs_inst *mov = MOV(retype(dst, inst->src[i].type),
>>> +                                  inst->src[i]);
>>> +               mov->force_writemask_all = inst->force_writemask_all;
>>> +               mov->saturate = inst->saturate;
>>> +               inst->insert_before(block, mov);
>>> +            }
>>>              dst = offset(dst, 1);
>>>           }
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>> index b7eeb47..a0d8b798 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>> @@ -3471,9 +3471,6 @@ fs_visitor::setup_color_payload(fs_reg *dst, fs_reg color, unsigned components,
>>>                                color.type, color.width);
>>>              inst = emit(MOV(dst[len], offset(color, i)));
>>>              inst->saturate = key->clamp_fragment_color;
>>> -         } else if (color.width == 16) {
>>> -            /* We need two BAD_FILE slots for a 16-wide color */
>>> -            len++;
>>>           }
>>>           len++;
>>>        }
>>> --
>>> 2.3.4
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150403/3d98b614/attachment.sig>