[Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload
Francisco Jerez
currojerez at riseup.net
Fri Apr 3 08:37:47 PDT 2015
Jason Ekstrand <jason at jlekstrand.net> writes:
> On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>>> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>>
>>>>> Instead of the complicated and broken-by-design pile of heuristics we had
>>>>> before, we now have a straightforward lowering:
>>>>>
>>>>> 1) All header sources are copied directly using force_writemask_all and,
>>>>> since they are guaranteed to be a single register, there are no
>>>>> force_sechalf issues.
>>>>>
>>>>> 2) All non-header sources are copied using the exact same force_sechalf
>>>>> and saturate modifiers as the LOAD_PAYLOAD operation itself.
>>>>>
>>>>> 3) In order to accommodate older gens that need interleaved colors,
>>>>> lower_load_payload detects when the destination is a COMPR4 register
>>>>> and automatically interleaves the non-header sources. The
>>>>> lower_load_payload pass does the right thing here regardless of whether
>>>>> or not the hardware actually supports COMPR4.
>>>>
>>>> I had a quick glance at the series and it seems to be going in the right
>>>> direction.
>>>>
>>>> One thing I honestly don't like is the ad-hoc and IMHO premature
>>>> treatment of payload headers, it still feels like the LOAD_PAYLOAD
>>>> instruction has more complex semantics than necessary and the benefit is
>>>> unclear to me.
>>>>
>>>> I suppose that your motivation was to avoid setting force_writemask_all
>>>> in LOAD_PAYLOAD instructions with header. The optimizer should be able
>>>> to cope with those flags and get rid of them from the resulting moves
>>>> where they are redundant, and if it's not able to it looks like
>>>> something that should be fixed anyway. The explicit handling of headers
>>>> is responsible for much of the churn in this series and is likely to
>>>> complicate users of LOAD_PAYLOAD and optimization passes that have to
>>>> manipulate them.
>>>
>>> Avoiding force_writemask_all is only half of the motivation and the
>>> small half at that. A header source, more properly defined, is a
>>> single physical register that, conceptually, applies to all channels.
>>> Effectively, a header source (I should have stated this clearly) has
>>> two properties:
>>>
>>> 1) It has force_writemask_all set
>>> 2) It is exactly one physical hardware register.
>>>
>>> This second property is the more important of the two. Most of the
>>> disaster of the previous LOAD_PAYLOAD implementation was that we did a
>>> pile of guesswork and had a ill-conceved "effective width" think in
>>> order to figure out how big the register actually was. Making the
>>> user specify which sources are header sources eliminates that
>>> guesswork. It also has the nice side-effect that we can do the right
>>> force_writemask_all and we can properly handle COMPR4 for the the
>>> user.
>>
>> Yeah, true, but this seems like the least orthogonal and most annoying
>> to use solution for this problem, it forces the caller to provide
>> redundant information, it takes into account the saturate flag on some
>> arguments and not others, it shuffles sources with respect to the
>> specified order when COMPR4 is set, but only for the first four
>> non-header sources. I think any of the following solutions would be
>> better-behaved than the current approach:
>
> I don't know that saying which sources are headers is really
> redundant. It's explicit which is what we want. Yes, the COMPR4
> thing is a bit magical but we have to do COMPR4 in lower_load_payload
> so we have to have some way of doing it and this method puts the
> interleving code in one place instead of two.
>
Well, at least with the previous approach LOAD_PAYLOAD had consistent
(if broken) semantics across its arguments, and regardless of COMPR4
being used or not, which IMHO is preferable to the modest code saving.
>> 1/ Use the source width to determine the size of each copy. This would
>> imply that the source width carries semantic information and hence
>> would have to be left alone by e.g. copy propagation.
>
> That's what do now and it's terrible. The "effective width" field was
> basically a width that gets kept.
>
Couldn't you just prevent copy propagation from propagating a source of
different width into a LOAD_PAYLOAD? You would still be able to get rid
of the metadata guess and effective_width, but you may have to re-run
copy propagate after lower_load_payload() for the case you missed any
optimization oportunities.
>> 2/ Use the instruction exec size and flags to determine the properties
>> of *all* copies. This means that if a header is present the exec
>> size would necessarily have to be 8 and the halves of a 16-wide
>> register would have to be specified separately, which sounds annoying
>> at first but in practice wouldn't necessarily be because it could be
>> handled by the LOAD_PAYLOAD() helper based on the argument widths
>> without running into problems with optimization passes changing the
>> meaning of the instruction. The semantics of the instruction itself
>> would be as stupid as possible, but the implementation could still
>> trivially recognise 16-wide and COMPR4 copies using the exact same
>> mechanism you are using now.
>
> Yes, that might work. I'll try and take a swing at it today. It will
> *hopefully* have less code churn than the solution in this series
> because the magic will still happen, just in a different place. Of
> course, this solution also requires that we lower everything with
> force_writemask_all and *hopefully* we can get rid of those and set
> force_sechalf appropriately in optimization passes.
>
>> 3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead
>> simple, say COLLECT and LOAD_HEADER. "COLLECT dst, src0, ..., srcn"
>> would be equivalent to the LOAD_PAYLOAD instruction described in 2,
>> but it would only be used to load full-width non-header sources of
>> the payload, so you would avoid the need to split 16-wide registers
>> in half. "LOAD_HEADER dst, header, payload" would handle the
>> asymmetric requirements of prepending a header, like using 8-wide
>> instead of exec_size-wide copies and setting force_writemask_all.
>> You could use mlen to specify the size of payload as is usual for
>> instructions taking a payload source.
>
> I thought about doing something like that but it seems more convoluted
> than would be at all worth it.
>
It's not so obvious to me that 3 would necessarily be more difficult
than getting 2 right, but yeah it would definitely be more effort than 1.
>>> --Jason
>>>
>>>> Thanks for looking into this BTW.
>>>>
>>>>> ---
>>>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 154 ++++++++++++++-------------
>>>>> src/mesa/drivers/dri/i965/brw_fs_visitor.cpp | 3 -
>>>>> 2 files changed, 80 insertions(+), 77 deletions(-)
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> index f8e26c0..bc45a38 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
>>>>> {
>>>>> bool progress = false;
>>>>>
>>>>> - int vgrf_to_reg[alloc.count];
>>>>> - int reg_count = 0;
>>>>> - for (unsigned i = 0; i < alloc.count; ++i) {
>>>>> - vgrf_to_reg[i] = reg_count;
>>>>> - reg_count += alloc.sizes[i];
>>>>> - }
>>>>> -
>>>>> - struct {
>>>>> - bool written:1; /* Whether this register has ever been written */
>>>>> - bool force_writemask_all:1;
>>>>> - bool force_sechalf:1;
>>>>> - } metadata[reg_count];
>>>>> - memset(metadata, 0, sizeof(metadata));
>>>>> -
>>>>> foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
>>>>> - if (inst->dst.file == GRF) {
>>>>> - const int dst_reg = vgrf_to_reg[inst->dst.reg] + inst->dst.reg_offset;
>>>>> - bool force_sechalf = inst->force_sechalf &&
>>>>> - !inst->force_writemask_all;
>>>>> - bool toggle_sechalf = inst->dst.width == 16 &&
>>>>> - type_sz(inst->dst.type) == 4 &&
>>>>> - !inst->force_writemask_all;
>>>>> - for (int i = 0; i < inst->regs_written; ++i) {
>>>>> - metadata[dst_reg + i].written = true;
>>>>> - metadata[dst_reg + i].force_sechalf = force_sechalf;
>>>>> - metadata[dst_reg + i].force_writemask_all = inst->force_writemask_all;
>>>>> - force_sechalf = (toggle_sechalf != force_sechalf);
>>>>> - }
>>>>> - }
>>>>> -
>>>>> if (inst->opcode == SHADER_OPCODE_LOAD_PAYLOAD) {
>>>>> assert(inst->dst.file == MRF || inst->dst.file == GRF);
>>>>> +
>>>>> fs_reg dst = inst->dst;
>>>>>
>>>>> - for (int i = 0; i < inst->sources; i++) {
>>>>> - dst.width = inst->src[i].effective_width;
>>>>> - dst.type = inst->src[i].type;
>>>>> -
>>>>> - if (inst->src[i].file == BAD_FILE) {
>>>>> - /* Do nothing but otherwise increment as normal */
>>>>> - } else if (dst.file == MRF &&
>>>>> - dst.width == 8 &&
>>>>> - brw->has_compr4 &&
>>>>> - i + 4 < inst->sources &&
>>>>> - inst->src[i + 4].equals(horiz_offset(inst->src[i], 8))) {
>>>>> - fs_reg compr4_dst = dst;
>>>>> - compr4_dst.reg += BRW_MRF_COMPR4;
>>>>> - compr4_dst.width = 16;
>>>>> - fs_reg compr4_src = inst->src[i];
>>>>> - compr4_src.width = 16;
>>>>> - fs_inst *mov = MOV(compr4_dst, compr4_src);
>>>>> + /* Get rid of COMPR4. We'll add it back in if we need it */
>>>>> + if (dst.file == MRF && dst.reg & BRW_MRF_COMPR4)
>>>>> + dst.reg = dst.reg & ~BRW_MRF_COMPR4;
>>>>> +
>>>>> + dst.width = 8;
>>>>> + for (uint8_t i = 0; i < inst->header_size; i++) {
>>>>> + if (inst->src[i].file != BAD_FILE) {
>>>>> + fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD);
>>>>> + fs_reg mov_src = retype(inst->src[i], BRW_REGISTER_TYPE_UD);
>>>>> + mov_src.width = 8;
>>>>> + fs_inst *mov = MOV(mov_dst, mov_src);
>>>>> mov->force_writemask_all = true;
>>>>> inst->insert_before(block, mov);
>>>>> - /* Mark i+4 as BAD_FILE so we don't emit a MOV for it */
>>>>> - inst->src[i + 4].file = BAD_FILE;
>>>>> - } else {
>>>>> - fs_inst *mov = MOV(dst, inst->src[i]);
>>>>> - if (inst->src[i].file == GRF) {
>>>>> - int src_reg = vgrf_to_reg[inst->src[i].reg] +
>>>>> - inst->src[i].reg_offset;
>>>>> - mov->force_sechalf = metadata[src_reg].force_sechalf;
>>>>> - mov->force_writemask_all = metadata[src_reg].force_writemask_all;
>>>>> - } else {
>>>>> - /* We don't have any useful metadata for immediates or
>>>>> - * uniforms. Assume that any of the channels of the
>>>>> - * destination may be used.
>>>>> - */
>>>>> - assert(inst->src[i].file == IMM ||
>>>>> - inst->src[i].file == UNIFORM);
>>>>> - mov->force_writemask_all = true;
>>>>> - }
>>>>> + }
>>>>> + dst = offset(dst, 1);
>>>>> + }
>>>>>
>>>>> - if (dst.file == GRF) {
>>>>> - const int dst_reg = vgrf_to_reg[dst.reg] + dst.reg_offset;
>>>>> - const bool force_writemask = mov->force_writemask_all;
>>>>> - metadata[dst_reg].force_writemask_all = force_writemask;
>>>>> - metadata[dst_reg].force_sechalf = mov->force_sechalf;
>>>>> - if (dst.width * type_sz(dst.type) > 32) {
>>>>> - assert(!mov->force_sechalf);
>>>>> - metadata[dst_reg + 1].force_writemask_all = force_writemask;
>>>>> - metadata[dst_reg + 1].force_sechalf = !force_writemask;
>>>>> + dst.width = inst->exec_size;
>>>>> + if (inst->dst.file == MRF && (inst->dst.reg & BRW_MRF_COMPR4) &&
>>>>> + inst->exec_size > 8) {
>>>>> + /* In this case, the payload portion of the LOAD_PAYLOAD isn't
>>>>> + * a straightforward copy. Instead, the result of the
>>>>> + * LOAD_PAYLOAD is treated as interlaced and the first four
>>>>> + * non-header sources are unpacked as:
>>>>> + *
>>>>> + * m + 0: r0
>>>>> + * m + 1: g0
>>>>> + * m + 2: b0
>>>>> + * m + 3: a0
>>>>> + * m + 4: r1
>>>>> + * m + 5: g1
>>>>> + * m + 6: b1
>>>>> + * m + 7: a1
>>>>> + *
>>>>> + * This is used for gen <= 5 fb writes.
>>>>> + */
>>>>> + assert(inst->exec_size == 16);
>>>>> + assert(inst->header_size + 4 <= inst->sources);
>>>>> + for (uint8_t i = inst->header_size; i < inst->header_size + 4; i++) {
>>>>> + if (inst->src[i].file != BAD_FILE) {
>>>>> + if (brw->has_compr4) {
>>>>> + fs_reg compr4_dst = retype(dst, inst->src[i].type);
>>>>> + compr4_dst.reg |= BRW_MRF_COMPR4;
>>>>> +
>>>>> + fs_inst *mov = MOV(compr4_dst, inst->src[i]);
>>>>> + mov->force_writemask_all = inst->force_writemask_all;
>>>>> + mov->saturate = inst->saturate;
>>>>> + inst->insert_before(block, mov);
>>>>> + } else {
>>>>> + /* Platform doesn't have COMPR4. We have to fake it */
>>>>> + fs_reg mov_dst = retype(dst, inst->src[i].type);
>>>>> + mov_dst.width = 8;
>>>>> +
>>>>> + fs_inst *mov = MOV(mov_dst, half(inst->src[i], 0));
>>>>> + mov->force_writemask_all = inst->force_writemask_all;
>>>>> + mov->saturate = inst->saturate;
>>>>> + inst->insert_before(block, mov);
>>>>> +
>>>>> + mov = MOV(offset(mov_dst, 4), half(inst->src[i], 1));
>>>>> + mov->force_writemask_all = inst->force_writemask_all;
>>>>> + mov->saturate = inst->saturate;
>>>>> + mov->force_sechalf = true;
>>>>> + inst->insert_before(block, mov);
>>>>> }
>>>>> }
>>>>>
>>>>> - inst->insert_before(block, mov);
>>>>> + dst.reg++;
>>>>> }
>>>>>
>>>>> + dst.reg += 4;
>>>>> +
>>>>> + /* The COMPR4 code took care of the first 4 sources. We'll let
>>>>> + * the regular path handle any remaining sources. Yes, we are
>>>>> + * modifying the instruction but we're about to delete it so
>>>>> + * this really doesn't hurt anything.
>>>>> + */
>>>>> + inst->header_size += 4;
>>>>> + }
>>>>> +
>>>>> + for (uint8_t i = inst->header_size; i < inst->sources; i++) {
>>>>> + if (inst->src[i].file != BAD_FILE) {
>>>>> + fs_inst *mov = MOV(retype(dst, inst->src[i].type),
>>>>> + inst->src[i]);
>>>>> + mov->force_writemask_all = inst->force_writemask_all;
>>>>> + mov->saturate = inst->saturate;
>>>>> + inst->insert_before(block, mov);
>>>>> + }
>>>>> dst = offset(dst, 1);
>>>>> }
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>>> index b7eeb47..a0d8b798 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>>> @@ -3471,9 +3471,6 @@ fs_visitor::setup_color_payload(fs_reg *dst, fs_reg color, unsigned components,
>>>>> color.type, color.width);
>>>>> inst = emit(MOV(dst[len], offset(color, i)));
>>>>> inst->saturate = key->clamp_fragment_color;
>>>>> - } else if (color.width == 16) {
>>>>> - /* We need two BAD_FILE slots for a 16-wide color */
>>>>> - len++;
>>>>> }
>>>>> len++;
>>>>> }
>>>>> --
>>>>> 2.3.4
>>>>>
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150403/389c5820/attachment.sig>
More information about the mesa-dev
mailing list