[Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

Wed Apr 15 13:11:46 PDT 2015

On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Jason Ekstrand <jason at jlekstrand.net> writes:
>
>> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>
>>>> Instead of the complicated and broken-by-design pile of heuristics we had
>>>> before, we now have a straightforward lowering:
>>>>
>>>>  1) All header sources are copied directly using force_writemask_all and,
>>>>     since they are guaranteed to be a single register, there are no
>>>>     force_sechalf issues.
>>>>
>>>>  2) All non-header sources are copied using the exact same force_sechalf
>>>>     and saturate modifiers as the LOAD_PAYLOAD operation itself.
>>>>
>>>>  3) In order to accommodate older gens that need interleaved colors,
>>>>     lower_load_payload detects when the destination is a COMPR4 register
>>>>     and automatically interleaves the non-header sources.  The
>>>>     lower_load_payload pass does the right thing here regardless of whether
>>>>     or not the hardware actually supports COMPR4.
>>>
>>> I had a quick glance at the series and it seems to be going in the right
>>> direction.
>>>
>>> One thing I honestly don't like is the ad-hoc and IMHO premature
>>> treatment of payload headers, it still feels like the LOAD_PAYLOAD
>>> instruction has more complex semantics than necessary and the benefit is
>>> unclear to me.
>>>
>>> I suppose that your motivation was to avoid setting force_writemask_all
>>> in LOAD_PAYLOAD instructions with header.  The optimizer should be able
>>> to cope with those flags and get rid of them from the resulting moves
>>> where they are redundant, and if it's not able to it looks like
>>> something that should be fixed anyway.  The explicit handling of headers
>>> is responsible for much of the churn in this series and is likely to
>>> complicate users of LOAD_PAYLOAD and optimization passes that have to
>>> manipulate them.
>>
>> Avoiding force_writemask_all is only half of the motivation and the
>> small half at that.  A header source, more properly defined, is a
>> single physical register that, conceptually, applies to all channels.
>> Effectively, a header source (I should have stated this clearly) has
>> two properties:
>>
>> 1) It has force_writemask_all set
>> 2) It is exactly one physical hardware register.
>>
>> This second property is the more important of the two.  Most of the
>> disaster of the previous LOAD_PAYLOAD implementation was that we did a
>> pile of guesswork and had a ill-conceved "effective width" think in
>> order to figure out how big the register actually was.  Making the
>> user specify which sources are header sources eliminates that
>> guesswork.  It also has the nice side-effect that we can do the right
>> force_writemask_all and we can properly handle COMPR4 for the the
>> user.
>
> Yeah, true, but this seems like the least orthogonal and most annoying
> to use solution for this problem, it forces the caller to provide
> redundant information, it takes into account the saturate flag on some
> arguments and not others,

There's no way to get a saturated load_payload, as far as I can tell.

Let's just avoid the problem you mention all together and continue not
doing it. I agree that this patch adding support for that is not a
good idea.

> it shuffles sources with respect to the
> specified order when COMPR4 is set, but only for the first four
> non-header sources.

To confirm I understand -- the existing lower_load_payload takes 2x
8-channel color sources and recognizes that src[i] and src[i+4] can be
emitted as a compr4 MOV. The proposed patch changes that to take 1x
16-channel color source and just emits a compr4 MOV.

I don't really find that change particularly contentious. In fact, I
kind of wonder what case we were trying to handle by allowing separate
sources for the first 8-channels and the second 8-channels. Is there
some case where these are going to be produced by different
instructions that can't do compr4? Texturing or math instructions come
to mind, but they can't have MRF destinations anyway.

In another email, you mention that compr4 change in this patch means
that optimization passes need to know some extra information.

I'm not able to think of any places where we'd need that information
in optimization passes. The only passes that really operate on
load_payload are CSE, copy propagate, and register coalescing; and I
don't think any of these will touch things writing to MRFs. Am I
forgetting something?

I think I understand the concern in general -- the sources are copied
into different destination registers dependent on (dst.reg &
BRW_MRF_COMPR4) == 0. That's a little strange, but that strangeness is
in the hardware. I mean, if it's strange for load_payload to do this,
isn't it strange for other hardware instructions to do it? Or is it
just that it's a virtual opcode doing a hardware-specific thing?

I don't know. It seems like it's just hiding a little bit of
complexity ("give me 16-channels and I'll emit compr4 MOVs to put them
in the right hardware-specific places").

> I think any of the following solutions would be
> better-behaved than the current approach:
>
> 1/ Use the source width to determine the size of each copy.  This would
>    imply that the source width carries semantic information and hence
>    would have to be left alone by e.g. copy propagation.
>
> 2/ Use the instruction exec size and flags to determine the properties
>    of *all* copies.  This means that if a header is present the exec
>    size would necessarily have to be 8 and the halves of a 16-wide
>    register would have to be specified separately, which sounds annoying
>    at first but in practice wouldn't necessarily be because it could be
>    handled by the LOAD_PAYLOAD() helper based on the argument widths
>    without running into problems with optimization passes changing the
>    meaning of the instruction.  The semantics of the instruction itself
>    would be as stupid as possible, but the implementation could still
>    trivially recognise 16-wide and COMPR4 copies using the exact same
>    mechanism you are using now.
>
> 3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead
>    simple, say COLLECT and LOAD_HEADER.  "COLLECT dst, src0, ..., srcn"
>    would be equivalent to the LOAD_PAYLOAD instruction described in 2,
>    but it would only be used to load full-width non-header sources of
>    the payload, so you would avoid the need to split 16-wide registers
>    in half.  "LOAD_HEADER dst, header, payload" would handle the
>    asymmetric requirements of prepending a header, like using 8-wide
>    instead of exec_size-wide copies and setting force_writemask_all.
>    You could use mlen to specify the size of payload as is usual for
>    instructions taking a payload source.

This case seems potentially compelling, but I think it's likely that
it'll lead to cases where we have a header shared by two payloads,
resulting in the header having to be copied to both payloads which is
kind of silly when the header is produced by a MOV and an OR or
something. :)

I've definitely seen texture payloads that are identical except for
one component, and as a result we generate the three identical values
and then copy them twice.