[Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

Fri Apr 3 08:37:47 PDT 2015

Jason Ekstrand <jason at jlekstrand.net> writes:

> On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>>> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>>
>>>>> Instead of the complicated and broken-by-design pile of heuristics we had
>>>>> before, we now have a straightforward lowering:
>>>>>
>>>>>  1) All header sources are copied directly using force_writemask_all and,
>>>>>     since they are guaranteed to be a single register, there are no
>>>>>     force_sechalf issues.
>>>>>
>>>>>  2) All non-header sources are copied using the exact same force_sechalf
>>>>>     and saturate modifiers as the LOAD_PAYLOAD operation itself.
>>>>>
>>>>>  3) In order to accommodate older gens that need interleaved colors,
>>>>>     lower_load_payload detects when the destination is a COMPR4 register
>>>>>     and automatically interleaves the non-header sources.  The
>>>>>     lower_load_payload pass does the right thing here regardless of whether
>>>>>     or not the hardware actually supports COMPR4.
>>>>
>>>> I had a quick glance at the series and it seems to be going in the right
>>>> direction.
>>>>
>>>> One thing I honestly don't like is the ad-hoc and IMHO premature
>>>> treatment of payload headers, it still feels like the LOAD_PAYLOAD
>>>> instruction has more complex semantics than necessary and the benefit is
>>>> unclear to me.
>>>>
>>>> I suppose that your motivation was to avoid setting force_writemask_all
>>>> in LOAD_PAYLOAD instructions with header.  The optimizer should be able
>>>> to cope with those flags and get rid of them from the resulting moves
>>>> where they are redundant, and if it's not able to it looks like
>>>> something that should be fixed anyway.  The explicit handling of headers
>>>> is responsible for much of the churn in this series and is likely to
>>>> complicate users of LOAD_PAYLOAD and optimization passes that have to
>>>> manipulate them.
>>>
>>> Avoiding force_writemask_all is only half of the motivation and the
>>> small half at that.  A header source, more properly defined, is a
>>> single physical register that, conceptually, applies to all channels.
>>> Effectively, a header source (I should have stated this clearly) has
>>> two properties:
>>>
>>> 1) It has force_writemask_all set
>>> 2) It is exactly one physical hardware register.
>>>
>>> This second property is the more important of the two.  Most of the
>>> disaster of the previous LOAD_PAYLOAD implementation was that we did a
>>> pile of guesswork and had a ill-conceved "effective width" think in
>>> order to figure out how big the register actually was.  Making the
>>> user specify which sources are header sources eliminates that
>>> guesswork.  It also has the nice side-effect that we can do the right
>>> force_writemask_all and we can properly handle COMPR4 for the the
>>> user.
>>
>> Yeah, true, but this seems like the least orthogonal and most annoying
>> to use solution for this problem, it forces the caller to provide
>> redundant information, it takes into account the saturate flag on some
>> arguments and not others, it shuffles sources with respect to the
>> specified order when COMPR4 is set, but only for the first four
>> non-header sources.  I think any of the following solutions would be
>> better-behaved than the current approach:
>
> I don't know that saying which sources are headers is really
> redundant.  It's explicit which is what we want.  Yes, the COMPR4
> thing is a bit magical but we have to do COMPR4 in lower_load_payload
> so we have to have some way of doing it and this method puts the
> interleving code in one place instead of two.
>
Well, at least with the previous approach LOAD_PAYLOAD had consistent
(if broken) semantics across its arguments, and regardless of COMPR4
being used or not, which IMHO is preferable to the modest code saving.

>> 1/ Use the source width to determine the size of each copy.  This would
>>    imply that the source width carries semantic information and hence
>>    would have to be left alone by e.g. copy propagation.
>
> That's what do now and it's terrible.  The "effective width" field was
> basically a width that gets kept.
>
Couldn't you just prevent copy propagation from propagating a source of
different width into a LOAD_PAYLOAD?  You would still be able to get rid
of the metadata guess and effective_width, but you may have to re-run
copy propagate after lower_load_payload() for the case you missed any
optimization oportunities.

>> 2/ Use the instruction exec size and flags to determine the properties
>>    of *all* copies.  This means that if a header is present the exec
>>    size would necessarily have to be 8 and the halves of a 16-wide
>>    register would have to be specified separately, which sounds annoying
>>    at first but in practice wouldn't necessarily be because it could be
>>    handled by the LOAD_PAYLOAD() helper based on the argument widths
>>    without running into problems with optimization passes changing the
>>    meaning of the instruction.  The semantics of the instruction itself
>>    would be as stupid as possible, but the implementation could still
>>    trivially recognise 16-wide and COMPR4 copies using the exact same
>>    mechanism you are using now.
>
> Yes, that might work.  I'll try and take a swing at it today.  It will
> *hopefully* have less code churn than the solution in this series
> because the magic will still happen, just in a different place.  Of
> course, this solution also requires that we lower everything with
> force_writemask_all and *hopefully* we can get rid of those and set
> force_sechalf appropriately in optimization passes.
>
>> 3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead
>>    simple, say COLLECT and LOAD_HEADER.  "COLLECT dst, src0, ..., srcn"
>>    would be equivalent to the LOAD_PAYLOAD instruction described in 2,
>>    but it would only be used to load full-width non-header sources of
>>    the payload, so you would avoid the need to split 16-wide registers
>>    in half.  "LOAD_HEADER dst, header, payload" would handle the
>>    asymmetric requirements of prepending a header, like using 8-wide
>>    instead of exec_size-wide copies and setting force_writemask_all.
>>    You could use mlen to specify the size of payload as is usual for
>>    instructions taking a payload source.
>
> I thought about doing something like that but it seems more convoluted
> than would be at all worth it.
>
It's not so obvious to me that 3 would necessarily be more difficult
than getting 2 right, but yeah it would definitely be more effort than 1.

>>> --Jason
>>>
>>>> Thanks for looking into this BTW.
>>>>
>>>>> ---
>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp         | 154 ++++++++++++++-------------
>>>>>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |   3 -
>>>>>  2 files changed, 80 insertions(+), 77 deletions(-)
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> index f8e26c0..bc45a38 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
>>>>>  {
>>>>>     bool progress = false;
>>>>>
>>>>> -   int vgrf_to_reg[alloc.count];
>>>>> -   int reg_count = 0;
>>>>> -   for (unsigned i = 0; i < alloc.count; ++i) {
>>>>> -      vgrf_to_reg[i] = reg_count;
>>>>> -      reg_count += alloc.sizes[i];
>>>>> -   }
>>>>> -
>>>>> -   struct {
>>>>> -      bool written:1; /* Whether this register has ever been written */
>>>>> -      bool force_writemask_all:1;
>>>>> -      bool force_sechalf:1;
>>>>> -   } metadata[reg_count];
>>>>> -   memset(metadata, 0, sizeof(metadata));
>>>>> -
>>>>>     foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
>>>>> -      if (inst->dst.file == GRF) {
>>>>> -         const int dst_reg = vgrf_to_reg[inst->dst.reg] + inst->dst.reg_offset;
>>>>> -         bool force_sechalf = inst->force_sechalf &&
>>>>> -                              !inst->force_writemask_all;
>>>>> -         bool toggle_sechalf = inst->dst.width == 16 &&
>>>>> -                               type_sz(inst->dst.type) == 4 &&
>>>>> -                               !inst->force_writemask_all;
>>>>> -         for (int i = 0; i < inst->regs_written; ++i) {
>>>>> -            metadata[dst_reg + i].written = true;
>>>>> -            metadata[dst_reg + i].force_sechalf = force_sechalf;
>>>>> -            metadata[dst_reg + i].force_writemask_all = inst->force_writemask_all;
>>>>> -            force_sechalf = (toggle_sechalf != force_sechalf);
>>>>> -         }
>>>>> -      }
>>>>> -
>>>>>        if (inst->opcode == SHADER_OPCODE_LOAD_PAYLOAD) {
>>>>>           assert(inst->dst.file == MRF || inst->dst.file == GRF);
>>>>> +
>>>>>           fs_reg dst = inst->dst;
>>>>>
>>>>> -         for (int i = 0; i < inst->sources; i++) {
>>>>> -            dst.width = inst->src[i].effective_width;
>>>>> -            dst.type = inst->src[i].type;
>>>>> -
>>>>> -            if (inst->src[i].file == BAD_FILE) {
>>>>> -               /* Do nothing but otherwise increment as normal */
>>>>> -            } else if (dst.file == MRF &&
>>>>> -                       dst.width == 8 &&
>>>>> -                       brw->has_compr4 &&
>>>>> -                       i + 4 < inst->sources &&
>>>>> -                       inst->src[i + 4].equals(horiz_offset(inst->src[i], 8))) {
>>>>> -               fs_reg compr4_dst = dst;
>>>>> -               compr4_dst.reg += BRW_MRF_COMPR4;
>>>>> -               compr4_dst.width = 16;
>>>>> -               fs_reg compr4_src = inst->src[i];
>>>>> -               compr4_src.width = 16;
>>>>> -               fs_inst *mov = MOV(compr4_dst, compr4_src);
>>>>> +         /* Get rid of COMPR4.  We'll add it back in if we need it */
>>>>> +         if (dst.file == MRF && dst.reg & BRW_MRF_COMPR4)
>>>>> +            dst.reg = dst.reg & ~BRW_MRF_COMPR4;
>>>>> +
>>>>> +         dst.width = 8;
>>>>> +         for (uint8_t i = 0; i < inst->header_size; i++) {
>>>>> +            if (inst->src[i].file != BAD_FILE) {
>>>>> +               fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD);
>>>>> +               fs_reg mov_src = retype(inst->src[i], BRW_REGISTER_TYPE_UD);
>>>>> +               mov_src.width = 8;
>>>>> +               fs_inst *mov = MOV(mov_dst, mov_src);
>>>>>                 mov->force_writemask_all = true;
>>>>>                 inst->insert_before(block, mov);
>>>>> -               /* Mark i+4 as BAD_FILE so we don't emit a MOV for it */
>>>>> -               inst->src[i + 4].file = BAD_FILE;
>>>>> -            } else {
>>>>> -               fs_inst *mov = MOV(dst, inst->src[i]);
>>>>> -               if (inst->src[i].file == GRF) {
>>>>> -                  int src_reg = vgrf_to_reg[inst->src[i].reg] +
>>>>> -                                inst->src[i].reg_offset;
>>>>> -                  mov->force_sechalf = metadata[src_reg].force_sechalf;
>>>>> -                  mov->force_writemask_all = metadata[src_reg].force_writemask_all;
>>>>> -               } else {
>>>>> -                  /* We don't have any useful metadata for immediates or
>>>>> -                   * uniforms.  Assume that any of the channels of the
>>>>> -                   * destination may be used.
>>>>> -                   */
>>>>> -                  assert(inst->src[i].file == IMM ||
>>>>> -                         inst->src[i].file == UNIFORM);
>>>>> -                  mov->force_writemask_all = true;
>>>>> -               }
>>>>> +            }
>>>>> +            dst = offset(dst, 1);
>>>>> +         }
>>>>>
>>>>> -               if (dst.file == GRF) {
>>>>> -                  const int dst_reg = vgrf_to_reg[dst.reg] + dst.reg_offset;
>>>>> -                  const bool force_writemask = mov->force_writemask_all;
>>>>> -                  metadata[dst_reg].force_writemask_all = force_writemask;
>>>>> -                  metadata[dst_reg].force_sechalf = mov->force_sechalf;
>>>>> -                  if (dst.width * type_sz(dst.type) > 32) {
>>>>> -                     assert(!mov->force_sechalf);
>>>>> -                     metadata[dst_reg + 1].force_writemask_all = force_writemask;
>>>>> -                     metadata[dst_reg + 1].force_sechalf = !force_writemask;
>>>>> +         dst.width = inst->exec_size;
>>>>> +         if (inst->dst.file == MRF && (inst->dst.reg & BRW_MRF_COMPR4) &&
>>>>> +             inst->exec_size > 8) {
>>>>> +            /* In this case, the payload portion of the LOAD_PAYLOAD isn't
>>>>> +             * a straightforward copy.  Instead, the result of the
>>>>> +             * LOAD_PAYLOAD is treated as interlaced and the first four
>>>>> +             * non-header sources are unpacked as:
>>>>> +             *
>>>>> +             * m + 0: r0
>>>>> +             * m + 1: g0
>>>>> +             * m + 2: b0
>>>>> +             * m + 3: a0
>>>>> +             * m + 4: r1
>>>>> +             * m + 5: g1
>>>>> +             * m + 6: b1
>>>>> +             * m + 7: a1
>>>>> +             *
>>>>> +             * This is used for gen <= 5 fb writes.
>>>>> +             */
>>>>> +            assert(inst->exec_size == 16);
>>>>> +            assert(inst->header_size + 4 <= inst->sources);
>>>>> +            for (uint8_t i = inst->header_size; i < inst->header_size + 4; i++) {
>>>>> +               if (inst->src[i].file != BAD_FILE) {
>>>>> +                  if (brw->has_compr4) {
>>>>> +                     fs_reg compr4_dst = retype(dst, inst->src[i].type);
>>>>> +                     compr4_dst.reg |= BRW_MRF_COMPR4;
>>>>> +
>>>>> +                     fs_inst *mov = MOV(compr4_dst, inst->src[i]);
>>>>> +                     mov->force_writemask_all = inst->force_writemask_all;
>>>>> +                     mov->saturate = inst->saturate;
>>>>> +                     inst->insert_before(block, mov);
>>>>> +                  } else {
>>>>> +                     /* Platform doesn't have COMPR4.  We have to fake it */
>>>>> +                     fs_reg mov_dst = retype(dst, inst->src[i].type);
>>>>> +                     mov_dst.width = 8;
>>>>> +
>>>>> +                     fs_inst *mov = MOV(mov_dst, half(inst->src[i], 0));
>>>>> +                     mov->force_writemask_all = inst->force_writemask_all;
>>>>> +                     mov->saturate = inst->saturate;
>>>>> +                     inst->insert_before(block, mov);
>>>>> +
>>>>> +                     mov = MOV(offset(mov_dst, 4), half(inst->src[i], 1));
>>>>> +                     mov->force_writemask_all = inst->force_writemask_all;
>>>>> +                     mov->saturate = inst->saturate;
>>>>> +                     mov->force_sechalf = true;
>>>>> +                     inst->insert_before(block, mov);
>>>>>                    }
>>>>>                 }
>>>>>
>>>>> -               inst->insert_before(block, mov);
>>>>> +               dst.reg++;
>>>>>              }
>>>>>
>>>>> +            dst.reg += 4;
>>>>> +
>>>>> +            /* The COMPR4 code took care of the first 4 sources.  We'll let
>>>>> +             * the regular path handle any remaining sources.  Yes, we are
>>>>> +             * modifying the instruction but we're about to delete it so
>>>>> +             * this really doesn't hurt anything.
>>>>> +             */
>>>>> +            inst->header_size += 4;
>>>>> +         }
>>>>> +
>>>>> +         for (uint8_t i = inst->header_size; i < inst->sources; i++) {
>>>>> +            if (inst->src[i].file != BAD_FILE) {
>>>>> +               fs_inst *mov = MOV(retype(dst, inst->src[i].type),
>>>>> +                                  inst->src[i]);
>>>>> +               mov->force_writemask_all = inst->force_writemask_all;
>>>>> +               mov->saturate = inst->saturate;
>>>>> +               inst->insert_before(block, mov);
>>>>> +            }
>>>>>              dst = offset(dst, 1);
>>>>>           }
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>>> index b7eeb47..a0d8b798 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>>> @@ -3471,9 +3471,6 @@ fs_visitor::setup_color_payload(fs_reg *dst, fs_reg color, unsigned components,
>>>>>                                color.type, color.width);
>>>>>              inst = emit(MOV(dst[len], offset(color, i)));
>>>>>              inst->saturate = key->clamp_fragment_color;
>>>>> -         } else if (color.width == 16) {
>>>>> -            /* We need two BAD_FILE slots for a 16-wide color */
>>>>> -            len++;
>>>>>           }
>>>>>           len++;
>>>>>        }
>>>>> --
>>>>> 2.3.4
>>>>>
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150403/389c5820/attachment.sig>