[Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

Fri Apr 3 13:18:54 PDT 2015

Jason Ekstrand <jason at jlekstrand.net> writes:

> On Fri, Apr 3, 2015 at 8:37 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>
>>> On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>>
>>>>> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>>>>
>>>>>>> Instead of the complicated and broken-by-design pile of heuristics we had
>>>>>>> before, we now have a straightforward lowering:
>>>>>>>
>>>>>>>  1) All header sources are copied directly using force_writemask_all and,
>>>>>>>     since they are guaranteed to be a single register, there are no
>>>>>>>     force_sechalf issues.
>>>>>>>
>>>>>>>  2) All non-header sources are copied using the exact same force_sechalf
>>>>>>>     and saturate modifiers as the LOAD_PAYLOAD operation itself.
>>>>>>>
>>>>>>>  3) In order to accommodate older gens that need interleaved colors,
>>>>>>>     lower_load_payload detects when the destination is a COMPR4 register
>>>>>>>     and automatically interleaves the non-header sources.  The
>>>>>>>     lower_load_payload pass does the right thing here regardless of whether
>>>>>>>     or not the hardware actually supports COMPR4.
>>>>>>
>>>>>> I had a quick glance at the series and it seems to be going in the right
>>>>>> direction.
>>>>>>
>>>>>> One thing I honestly don't like is the ad-hoc and IMHO premature
>>>>>> treatment of payload headers, it still feels like the LOAD_PAYLOAD
>>>>>> instruction has more complex semantics than necessary and the benefit is
>>>>>> unclear to me.
>>>>>>
>>>>>> I suppose that your motivation was to avoid setting force_writemask_all
>>>>>> in LOAD_PAYLOAD instructions with header.  The optimizer should be able
>>>>>> to cope with those flags and get rid of them from the resulting moves
>>>>>> where they are redundant, and if it's not able to it looks like
>>>>>> something that should be fixed anyway.  The explicit handling of headers
>>>>>> is responsible for much of the churn in this series and is likely to
>>>>>> complicate users of LOAD_PAYLOAD and optimization passes that have to
>>>>>> manipulate them.
>>>>>
>>>>> Avoiding force_writemask_all is only half of the motivation and the
>>>>> small half at that.  A header source, more properly defined, is a
>>>>> single physical register that, conceptually, applies to all channels.
>>>>> Effectively, a header source (I should have stated this clearly) has
>>>>> two properties:
>>>>>
>>>>> 1) It has force_writemask_all set
>>>>> 2) It is exactly one physical hardware register.
>>>>>
>>>>> This second property is the more important of the two.  Most of the
>>>>> disaster of the previous LOAD_PAYLOAD implementation was that we did a
>>>>> pile of guesswork and had a ill-conceved "effective width" think in
>>>>> order to figure out how big the register actually was.  Making the
>>>>> user specify which sources are header sources eliminates that
>>>>> guesswork.  It also has the nice side-effect that we can do the right
>>>>> force_writemask_all and we can properly handle COMPR4 for the the
>>>>> user.
>
> Ok, Allow me to be a bit more explicit as to what all we need to keep track of:
>
>  1) How big is the source register for real.  Even immediates can end
> up being two registers in the copy.
>  2) Do we want force_writemask_all?
>  3) If not, do we want force_sechalf?
>  4) On g45 and gen5, we want to use COMPR4 for interlaced movs
>  5) When lowering, we want to use 16-wide moves when possible in SIMD16
>
> With the patch series I sent, all of this is explicit except for
> COMPR4 which is, admittedly, kind of magic.  "Which of these sources
> are headers?" is a reasonable question for the caller to answer.  It
> knows explicitly and it would take the LOAD_PAYLOAD helper some work
> to guess it correctly.  Another option would be to guess that based on
> exec sizes but then the caller has to know not to pass in the wrong
> register type or the guess will be wrong.  I like explicit.
>
I'm not advocating LOAD_PAYLOAD to perform any guesswork, I'm advocating
it to be more stupid -- Let it do as much as it can sensibly do with the
information it already has, and no more.

>>>> Yeah, true, but this seems like the least orthogonal and most annoying
>>>> to use solution for this problem, it forces the caller to provide
>>>> redundant information, it takes into account the saturate flag on some
>>>> arguments and not others, it shuffles sources with respect to the
>>>> specified order when COMPR4 is set, but only for the first four
>>>> non-header sources.  I think any of the following solutions would be
>>>> better-behaved than the current approach:
>>>
>>> I don't know that saying which sources are headers is really
>>> redundant.  It's explicit which is what we want.  Yes, the COMPR4
>>> thing is a bit magical but we have to do COMPR4 in lower_load_payload
>>> so we have to have some way of doing it and this method puts the
>>> interleving code in one place instead of two.
>>>
>> Well, at least with the previous approach LOAD_PAYLOAD had consistent
>> (if broken) semantics across its arguments, and regardless of COMPR4
>> being used or not, which IMHO is preferable to the modest code saving.
>
> To be clear, I don't really like the way I did COMPR4 either.  I just
> couldn't come up with anything better.
>
I don't see how the new handling of COMPR4 is more "explicit".  It
magically changes how the specified registers are laid out in the
payload and an optimization pass dealing with LOAD_PAYLOAD must be aware
of that.  Your previous approach even though not beautiful was
transparent in the sense that the rest of the compiler didn't have to
care about LOAD_PAYLOAD using COMPR4 or not, because it was nothing more
than an implementation detail with no effect on the semantics of the
instruction.

>>>> 1/ Use the source width to determine the size of each copy.  This would
>>>>    imply that the source width carries semantic information and hence
>>>>    would have to be left alone by e.g. copy propagation.
>>>
>>> That's what do now and it's terrible.  The "effective width" field was
>>> basically a width that gets kept.
>>>
>> Couldn't you just prevent copy propagation from propagating a source of
>> different width into a LOAD_PAYLOAD?  You would still be able to get rid
>> of the metadata guess and effective_width, but you may have to re-run
>> copy propagate after lower_load_payload() for the case you missed any
>> optimization oportunities.
>
> Yes, we could prevent copy propagation from messing up exec sizes.
> That solves 1 but 2-5 still require terrible heuristics.  Setting
> force_writemask_all sort-of solves 2 and 3.

Splitting the LOAD_PAYLOAD monster into two separate instructions
(proposal 3) solves this too, without setting force_writemask_all.

> COMPR4 would have to have a hueristic again which is too complicated.

I wouldn't call that a heuristic, the rule is completely deterministic:
If 'src[i+4] == offset(src[i], 1)' enable COMPR4 and skip emitting
src[i+4] explicitly.

> The 16-wide stuff is still a problem.
>
How so?  With solution 1 you emit a 16-wide copy iff the source is
16-wide.  With solution 3 you emit a 16-wide copy iff exec_size is 16.
With solution 2 you emit a 16-wide copy if exec_size is 16 or, as an
additional optimization without effect on the semantics of the
instruction, if exec_size is 8, force_writemask_all is set and
'src[i+1] == offset(src[i], 1)'.

>>>> 2/ Use the instruction exec size and flags to determine the properties
>>>>    of *all* copies.  This means that if a header is present the exec
>>>>    size would necessarily have to be 8 and the halves of a 16-wide
>>>>    register would have to be specified separately, which sounds annoying
>>>>    at first but in practice wouldn't necessarily be because it could be
>>>>    handled by the LOAD_PAYLOAD() helper based on the argument widths
>>>>    without running into problems with optimization passes changing the
>>>>    meaning of the instruction.  The semantics of the instruction itself
>>>>    would be as stupid as possible, but the implementation could still
>>>>    trivially recognise 16-wide and COMPR4 copies using the exact same
>>>>    mechanism you are using now.
>>>
>>> Yes, that might work.  I'll try and take a swing at it today.  It will
>>> *hopefully* have less code churn than the solution in this series
>>> because the magic will still happen, just in a different place.  Of
>>> course, this solution also requires that we lower everything with
>>> force_writemask_all and *hopefully* we can get rid of those and set
>>> force_sechalf appropriately in optimization passes.
>
> Thinking about this some more, I'm not really a fan.    As with your
> first suggestion, we can force_writemask_all and the heuristic for
> COMPR4 is pretty simple.  However, it still has the problem of
> properly handling SIMD16.
>
I think I've answered to that already in the last paragraph.

>>>> 3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead
>>>>    simple, say COLLECT and LOAD_HEADER.  "COLLECT dst, src0, ..., srcn"
>>>>    would be equivalent to the LOAD_PAYLOAD instruction described in 2,
>>>>    but it would only be used to load full-width non-header sources of
>>>>    the payload, so you would avoid the need to split 16-wide registers
>>>>    in half.  "LOAD_HEADER dst, header, payload" would handle the
>>>>    asymmetric requirements of prepending a header, like using 8-wide
>>>>    instead of exec_size-wide copies and setting force_writemask_all.
>>>>    You could use mlen to specify the size of payload as is usual for
>>>>    instructions taking a payload source.
>>>
>>> I thought about doing something like that but it seems more convoluted
>>> than would be at all worth it.
>>>
>> It's not so obvious to me that 3 would necessarily be more difficult
>> than getting 2 right, but yeah it would definitely be more effort than 1.
>
> The problem with this solution is that headers can be different sizes.
> Texturing instructions use 1-reg headers but fb writes a header that's
> between 0 and 4 registers.  Also, we don't want to have to look at the
> size of a vgrf to figure out what the instruction does so we have to
> have some way of encoding the number of registers read in both
> arguments into the instruction.  And why is overloading mlen better
> than renaming and overloading has_header?
>
I'm not against renaming header_present to header_size in principle, I
suggested using mlen because you'd likely need a way to specify both the
size of the header and the size of the payload.  There are alternatives
like using immediate sources but I don't have a very strong preference.

> Swinging back around to a more general discussion...
>
> 1) Any of the solutions currently on the table will take care of this.
> I'm not a big fan of making copy-prop be LOAD_PAYLOAD-aware, but my
> solutions (which I'll call 0) and 2 and 3 handle it ok.
>
> 2+3) My solution gives us this explicitly for free.  The other option,
> which you keep suggesting, is to force_writemask_all the universe.
> This works, but uses more than needed which, at the moment, is ok.
> However, if we want to clean it up, optimization passes will need to
> be aware of it and it's not clear to me how we would handle it.
>
It's simple: an instruction with force_writemask_all set reading from a
register assigned without it is equivalent to the same instruction
without force_writemask_all.  Also an instruction with
force_writemask_all set writing to a register which is never used as
send-from-GRF source or by other instructions with the flag set is
equivalent to the same instruction without force_writemask_all.  But at
this point this all seems like a premature optimization to me.

> 4) My solution is, admittedly, a hack but it's a predictable hack.
> Also, my hack does work for all of the messages we care about and I
> don't think we'll be adding new messages for gen <= 5 so I'm not
> terribly concerned about it.  Having a heuristic at lowering time
> would be nicer because the user can simply say what registers they
> want where and lowering does the right thing.  However, if we have to
> pick between COMPR4 being nice and other things being nice, I'd rather
> leave COMPR4 a hack.
>
I don't see why that's a choice we have to make.  You could just leave
the old-style handling of COMPR4 alone.

> 5) This one is tricky.  CSE can generate payload copies that don't get
> eaten up by register coalesce.  I don't remember exactly how this
> happens but I do know that I've seen it in the wild.  When this
> happens, you can end up with a copy payload that copies an odd number
> of registers.  If you use the simple heuristic of just use SIMD16
> moves until you can't and then use a SIMD8, you end up moving things
> that aren't pairs as pairs.

Yeah, of course, with the current approach and my proposal 1, CSE would
have to preserve the widths of the sources of the original LOAD_PAYLOAD
(or use force_writemask_all), otherwise you end up reading stuff across
channels.

> Sure, if you set force_writemask_all, the code is correct.  However,
> it's possible that register coalesce could have cleaned up part of the
> payload but now it can't because the moves are off-center.  This
> problem is *why* I wrote the nasty pile of heuristics that we have
> now.  None of your suggestions offer a solution to this 
>
Actually this seems less a problem with 2 or 3 because the copies are
homogeneous (i.e. they are exec_size-wide).  With 2 you'd just split up
the temporary in 8-wide sources and assume that the LOAD_PAYLOAD will be
lowered to the optimal sequence of SIMD16/SIMD8 copies.  With 3 you
could even use a LOAD_HEADER with an empty header.  In both cases
register coalesce would have to learn how to treat those as multiple
register copies.

> where as having the user specify the header size gives it to you for
> free.

If you consider "for free" checking if the instruction is a LOAD_PAYLOAD
and in that case mimicking its header/payload structure in the new
LOAD_PAYLOAD, sure.  But it doesn't seem conceptually different to the
solution you could have applied now.

> All in all, I'm now even more convinced that we want the user of
> LOAD_PAYLOAD to just tell us what sources are headers and what aren't.
> I think it's also conceptually cleaner because the header sources are
> different from the rest of the payload.
> --Jason
>
> P.S.  For whatever it's worth, it would also allow us to split
> LOAD_PAYLOAD instructions from SIMD16 to SIMD8 correctly by copying
> the headers directly and splitting the non-headers in half.

I doubt that's something we want to do at the IR level.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150403/8788e616/attachment-0001.sig>