[Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload
Jason Ekstrand
jason at jlekstrand.net
Fri Apr 3 08:10:02 PDT 2015
On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Jason Ekstrand <jason at jlekstrand.net> writes:
>
>> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>
>>>> Instead of the complicated and broken-by-design pile of heuristics we had
>>>> before, we now have a straightforward lowering:
>>>>
>>>> 1) All header sources are copied directly using force_writemask_all and,
>>>> since they are guaranteed to be a single register, there are no
>>>> force_sechalf issues.
>>>>
>>>> 2) All non-header sources are copied using the exact same force_sechalf
>>>> and saturate modifiers as the LOAD_PAYLOAD operation itself.
>>>>
>>>> 3) In order to accommodate older gens that need interleaved colors,
>>>> lower_load_payload detects when the destination is a COMPR4 register
>>>> and automatically interleaves the non-header sources. The
>>>> lower_load_payload pass does the right thing here regardless of whether
>>>> or not the hardware actually supports COMPR4.
>>>
>>> I had a quick glance at the series and it seems to be going in the right
>>> direction.
>>>
>>> One thing I honestly don't like is the ad-hoc and IMHO premature
>>> treatment of payload headers, it still feels like the LOAD_PAYLOAD
>>> instruction has more complex semantics than necessary and the benefit is
>>> unclear to me.
>>>
>>> I suppose that your motivation was to avoid setting force_writemask_all
>>> in LOAD_PAYLOAD instructions with header. The optimizer should be able
>>> to cope with those flags and get rid of them from the resulting moves
>>> where they are redundant, and if it's not able to it looks like
>>> something that should be fixed anyway. The explicit handling of headers
>>> is responsible for much of the churn in this series and is likely to
>>> complicate users of LOAD_PAYLOAD and optimization passes that have to
>>> manipulate them.
>>
>> Avoiding force_writemask_all is only half of the motivation and the
>> small half at that. A header source, more properly defined, is a
>> single physical register that, conceptually, applies to all channels.
>> Effectively, a header source (I should have stated this clearly) has
>> two properties:
>>
>> 1) It has force_writemask_all set
>> 2) It is exactly one physical hardware register.
>>
>> This second property is the more important of the two. Most of the
>> disaster of the previous LOAD_PAYLOAD implementation was that we did a
>> pile of guesswork and had a ill-conceved "effective width" think in
>> order to figure out how big the register actually was. Making the
>> user specify which sources are header sources eliminates that
>> guesswork. It also has the nice side-effect that we can do the right
>> force_writemask_all and we can properly handle COMPR4 for the the
>> user.
>
> Yeah, true, but this seems like the least orthogonal and most annoying
> to use solution for this problem, it forces the caller to provide
> redundant information, it takes into account the saturate flag on some
> arguments and not others, it shuffles sources with respect to the
> specified order when COMPR4 is set, but only for the first four
> non-header sources. I think any of the following solutions would be
> better-behaved than the current approach:
I don't know that saying which sources are headers is really
redundant. It's explicit which is what we want. Yes, the COMPR4
thing is a bit magical but we have to do COMPR4 in lower_load_payload
so we have to have some way of doing it and this method puts the
interleving code in one place instead of two.
> 1/ Use the source width to determine the size of each copy. This would
> imply that the source width carries semantic information and hence
> would have to be left alone by e.g. copy propagation.
That's what do now and it's terrible. The "effective width" field was
basically a width that gets kept.
> 2/ Use the instruction exec size and flags to determine the properties
> of *all* copies. This means that if a header is present the exec
> size would necessarily have to be 8 and the halves of a 16-wide
> register would have to be specified separately, which sounds annoying
> at first but in practice wouldn't necessarily be because it could be
> handled by the LOAD_PAYLOAD() helper based on the argument widths
> without running into problems with optimization passes changing the
> meaning of the instruction. The semantics of the instruction itself
> would be as stupid as possible, but the implementation could still
> trivially recognise 16-wide and COMPR4 copies using the exact same
> mechanism you are using now.
Yes, that might work. I'll try and take a swing at it today. It will
*hopefully* have less code churn than the solution in this series
because the magic will still happen, just in a different place. Of
course, this solution also requires that we lower everything with
force_writemask_all and *hopefully* we can get rid of those and set
force_sechalf appropriately in optimization passes.
> 3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead
> simple, say COLLECT and LOAD_HEADER. "COLLECT dst, src0, ..., srcn"
> would be equivalent to the LOAD_PAYLOAD instruction described in 2,
> but it would only be used to load full-width non-header sources of
> the payload, so you would avoid the need to split 16-wide registers
> in half. "LOAD_HEADER dst, header, payload" would handle the
> asymmetric requirements of prepending a header, like using 8-wide
> instead of exec_size-wide copies and setting force_writemask_all.
> You could use mlen to specify the size of payload as is usual for
> instructions taking a payload source.
I thought about doing something like that but it seems more convoluted
than would be at all worth it.
>> --Jason
>>
>>> Thanks for looking into this BTW.
>>>
>>>> ---
>>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 154 ++++++++++++++-------------
>>>> src/mesa/drivers/dri/i965/brw_fs_visitor.cpp | 3 -
>>>> 2 files changed, 80 insertions(+), 77 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> index f8e26c0..bc45a38 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
>>>> {
>>>> bool progress = false;
>>>>
>>>> - int vgrf_to_reg[alloc.count];
>>>> - int reg_count = 0;
>>>> - for (unsigned i = 0; i < alloc.count; ++i) {
>>>> - vgrf_to_reg[i] = reg_count;
>>>> - reg_count += alloc.sizes[i];
>>>> - }
>>>> -
>>>> - struct {
>>>> - bool written:1; /* Whether this register has ever been written */
>>>> - bool force_writemask_all:1;
>>>> - bool force_sechalf:1;
>>>> - } metadata[reg_count];
>>>> - memset(metadata, 0, sizeof(metadata));
>>>> -
>>>> foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
>>>> - if (inst->dst.file == GRF) {
>>>> - const int dst_reg = vgrf_to_reg[inst->dst.reg] + inst->dst.reg_offset;
>>>> - bool force_sechalf = inst->force_sechalf &&
>>>> - !inst->force_writemask_all;
>>>> - bool toggle_sechalf = inst->dst.width == 16 &&
>>>> - type_sz(inst->dst.type) == 4 &&
>>>> - !inst->force_writemask_all;
>>>> - for (int i = 0; i < inst->regs_written; ++i) {
>>>> - metadata[dst_reg + i].written = true;
>>>> - metadata[dst_reg + i].force_sechalf = force_sechalf;
>>>> - metadata[dst_reg + i].force_writemask_all = inst->force_writemask_all;
>>>> - force_sechalf = (toggle_sechalf != force_sechalf);
>>>> - }
>>>> - }
>>>> -
>>>> if (inst->opcode == SHADER_OPCODE_LOAD_PAYLOAD) {
>>>> assert(inst->dst.file == MRF || inst->dst.file == GRF);
>>>> +
>>>> fs_reg dst = inst->dst;
>>>>
>>>> - for (int i = 0; i < inst->sources; i++) {
>>>> - dst.width = inst->src[i].effective_width;
>>>> - dst.type = inst->src[i].type;
>>>> -
>>>> - if (inst->src[i].file == BAD_FILE) {
>>>> - /* Do nothing but otherwise increment as normal */
>>>> - } else if (dst.file == MRF &&
>>>> - dst.width == 8 &&
>>>> - brw->has_compr4 &&
>>>> - i + 4 < inst->sources &&
>>>> - inst->src[i + 4].equals(horiz_offset(inst->src[i], 8))) {
>>>> - fs_reg compr4_dst = dst;
>>>> - compr4_dst.reg += BRW_MRF_COMPR4;
>>>> - compr4_dst.width = 16;
>>>> - fs_reg compr4_src = inst->src[i];
>>>> - compr4_src.width = 16;
>>>> - fs_inst *mov = MOV(compr4_dst, compr4_src);
>>>> + /* Get rid of COMPR4. We'll add it back in if we need it */
>>>> + if (dst.file == MRF && dst.reg & BRW_MRF_COMPR4)
>>>> + dst.reg = dst.reg & ~BRW_MRF_COMPR4;
>>>> +
>>>> + dst.width = 8;
>>>> + for (uint8_t i = 0; i < inst->header_size; i++) {
>>>> + if (inst->src[i].file != BAD_FILE) {
>>>> + fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD);
>>>> + fs_reg mov_src = retype(inst->src[i], BRW_REGISTER_TYPE_UD);
>>>> + mov_src.width = 8;
>>>> + fs_inst *mov = MOV(mov_dst, mov_src);
>>>> mov->force_writemask_all = true;
>>>> inst->insert_before(block, mov);
>>>> - /* Mark i+4 as BAD_FILE so we don't emit a MOV for it */
>>>> - inst->src[i + 4].file = BAD_FILE;
>>>> - } else {
>>>> - fs_inst *mov = MOV(dst, inst->src[i]);
>>>> - if (inst->src[i].file == GRF) {
>>>> - int src_reg = vgrf_to_reg[inst->src[i].reg] +
>>>> - inst->src[i].reg_offset;
>>>> - mov->force_sechalf = metadata[src_reg].force_sechalf;
>>>> - mov->force_writemask_all = metadata[src_reg].force_writemask_all;
>>>> - } else {
>>>> - /* We don't have any useful metadata for immediates or
>>>> - * uniforms. Assume that any of the channels of the
>>>> - * destination may be used.
>>>> - */
>>>> - assert(inst->src[i].file == IMM ||
>>>> - inst->src[i].file == UNIFORM);
>>>> - mov->force_writemask_all = true;
>>>> - }
>>>> + }
>>>> + dst = offset(dst, 1);
>>>> + }
>>>>
>>>> - if (dst.file == GRF) {
>>>> - const int dst_reg = vgrf_to_reg[dst.reg] + dst.reg_offset;
>>>> - const bool force_writemask = mov->force_writemask_all;
>>>> - metadata[dst_reg].force_writemask_all = force_writemask;
>>>> - metadata[dst_reg].force_sechalf = mov->force_sechalf;
>>>> - if (dst.width * type_sz(dst.type) > 32) {
>>>> - assert(!mov->force_sechalf);
>>>> - metadata[dst_reg + 1].force_writemask_all = force_writemask;
>>>> - metadata[dst_reg + 1].force_sechalf = !force_writemask;
>>>> + dst.width = inst->exec_size;
>>>> + if (inst->dst.file == MRF && (inst->dst.reg & BRW_MRF_COMPR4) &&
>>>> + inst->exec_size > 8) {
>>>> + /* In this case, the payload portion of the LOAD_PAYLOAD isn't
>>>> + * a straightforward copy. Instead, the result of the
>>>> + * LOAD_PAYLOAD is treated as interlaced and the first four
>>>> + * non-header sources are unpacked as:
>>>> + *
>>>> + * m + 0: r0
>>>> + * m + 1: g0
>>>> + * m + 2: b0
>>>> + * m + 3: a0
>>>> + * m + 4: r1
>>>> + * m + 5: g1
>>>> + * m + 6: b1
>>>> + * m + 7: a1
>>>> + *
>>>> + * This is used for gen <= 5 fb writes.
>>>> + */
>>>> + assert(inst->exec_size == 16);
>>>> + assert(inst->header_size + 4 <= inst->sources);
>>>> + for (uint8_t i = inst->header_size; i < inst->header_size + 4; i++) {
>>>> + if (inst->src[i].file != BAD_FILE) {
>>>> + if (brw->has_compr4) {
>>>> + fs_reg compr4_dst = retype(dst, inst->src[i].type);
>>>> + compr4_dst.reg |= BRW_MRF_COMPR4;
>>>> +
>>>> + fs_inst *mov = MOV(compr4_dst, inst->src[i]);
>>>> + mov->force_writemask_all = inst->force_writemask_all;
>>>> + mov->saturate = inst->saturate;
>>>> + inst->insert_before(block, mov);
>>>> + } else {
>>>> + /* Platform doesn't have COMPR4. We have to fake it */
>>>> + fs_reg mov_dst = retype(dst, inst->src[i].type);
>>>> + mov_dst.width = 8;
>>>> +
>>>> + fs_inst *mov = MOV(mov_dst, half(inst->src[i], 0));
>>>> + mov->force_writemask_all = inst->force_writemask_all;
>>>> + mov->saturate = inst->saturate;
>>>> + inst->insert_before(block, mov);
>>>> +
>>>> + mov = MOV(offset(mov_dst, 4), half(inst->src[i], 1));
>>>> + mov->force_writemask_all = inst->force_writemask_all;
>>>> + mov->saturate = inst->saturate;
>>>> + mov->force_sechalf = true;
>>>> + inst->insert_before(block, mov);
>>>> }
>>>> }
>>>>
>>>> - inst->insert_before(block, mov);
>>>> + dst.reg++;
>>>> }
>>>>
>>>> + dst.reg += 4;
>>>> +
>>>> + /* The COMPR4 code took care of the first 4 sources. We'll let
>>>> + * the regular path handle any remaining sources. Yes, we are
>>>> + * modifying the instruction but we're about to delete it so
>>>> + * this really doesn't hurt anything.
>>>> + */
>>>> + inst->header_size += 4;
>>>> + }
>>>> +
>>>> + for (uint8_t i = inst->header_size; i < inst->sources; i++) {
>>>> + if (inst->src[i].file != BAD_FILE) {
>>>> + fs_inst *mov = MOV(retype(dst, inst->src[i].type),
>>>> + inst->src[i]);
>>>> + mov->force_writemask_all = inst->force_writemask_all;
>>>> + mov->saturate = inst->saturate;
>>>> + inst->insert_before(block, mov);
>>>> + }
>>>> dst = offset(dst, 1);
>>>> }
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>> index b7eeb47..a0d8b798 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>> @@ -3471,9 +3471,6 @@ fs_visitor::setup_color_payload(fs_reg *dst, fs_reg color, unsigned components,
>>>> color.type, color.width);
>>>> inst = emit(MOV(dst[len], offset(color, i)));
>>>> inst->saturate = key->clamp_fragment_color;
>>>> - } else if (color.width == 16) {
>>>> - /* We need two BAD_FILE slots for a 16-wide color */
>>>> - len++;
>>>> }
>>>> len++;
>>>> }
>>>> --
>>>> 2.3.4
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
More information about the mesa-dev
mailing list