[Mesa-dev] [PATCH v3 002/104] nir: Add a deref instruction type

Mon Apr 9 14:25:54 UTC 2018

On Mon, Apr 9, 2018 at 5:35 AM, Rob Clark <robdclark at gmail.com> wrote:

> On Mon, Apr 9, 2018 at 1:35 AM, Jason Ekstrand <jason at jlekstrand.net>
> wrote:
> > Rather lively discussion we've got going here...
> >
> > On Sun, Apr 8, 2018 at 3:23 PM, Rob Clark <robdclark at gmail.com> wrote:
> >>
> >> On Sun, Apr 8, 2018 at 5:54 PM, Bas Nieuwenhuizen
> >> <bas at basnieuwenhuizen.nl> wrote:
> >> > On Sun, Apr 8, 2018 at 11:40 PM, Rob Clark <robdclark at gmail.com>
> wrote:
> >> >> On Sun, Apr 8, 2018 at 5:20 PM, Bas Nieuwenhuizen
> >> >> <bas at basnieuwenhuizen.nl> wrote:
> >> >>>>>>>>>>> +
> >> >>>>>>>>>>> +   /** The mode of the underlying variable */
> >> >>>>>>>>>>> +   nir_variable_mode mode;
> >> >>>>>>>>>>
> >> >>>>>>>>>> In fact, it seems like deref->mode is unused outside of
> >> >>>>>>>>>> nir_print and
> >> >>>>>>>>>> nir_validate.. for logical addressing we can get the mode
> from
> >> >>>>>>>>>> the
> >> >>>>>>>>>> deref_var->var at the start of the chain, and deref->mode has
> >> >>>>>>>>>> no
> >> >>>>>>>>>> meaning for physical addressing (where the mode comes from
> the
> >> >>>>>>>>>> pointer).
> >> >>>>>>>>>>
> >> >>>>>>>>>> So maybe just drop deref->mode?
> >> >>>>>>>>>
> >> >>>>>>>>> Isn't it still useful with logical addressing in case a var is
> >> >>>>>>>>> not
> >> >>>>>>>>> immediately available? (think VK_KHR_variable_pointers)
> >
> >
> > Yes, the idea here is to basically be the NIR equivalent of storage
> classes.
> > For cases where you can chase it all the way back to the variable, this
> is
> > pointless.  For cases where you can't, this can be very useful.
> >
> >>
> >> >>>>>>>> not sure, maybe this should just also use fat-pointers like
> >> >>>>>>>> physical
> >> >>>>>>>> addressing does??
> >
> >
> > I *think* (without knowing the future with perfect accuracy) that I'd
> like
> > us to start moving away from fat pointers in SPIR-V -> NIR.  They've been
> > working ok thus far but we already have had complaints from the radv guys
> > that they don't want fat pointers for SLM and I'm not sure there that
> great
> > of a fit for other things.  The big reason is that it's actually fairly
> > painful to change from one fat pointer scheme to another if whatever the
> > spirv_to_nir authors picked doesn't really fit your hardware.
> >
> > I'm not 100% sure what to do but the idea I've got at the moment is
> > something like this:
> >
> >  1) For every mode (or storage class), the driver provides to
> spirv_to_nir a
> > glsl_type which is the type it wants the pointer represented as.
> >  2) Add a deref_to_pointer intrinsic to convert the deref to that type
> and
> > use casts to convert the fat pointer to a deref again
> >
> > Why the deref_to_pointer intrinsic?  Because right now some annoying
> details
> > about nir_intrinsic_info force us to have all drefs have a single
> component.
> > If we didn't, then nir_intrinsic_store_var would have two variable-size
> > sources which aren't the same size.  We could modify the rules and have a
> > concept of a "deref source" and to nir_intrinsic_info and say that a
> deref
> > source can be any size.  That would also work and may actually be a
> better
> > plan.  I'm open to other suggestions as well.
> >
> > The key point, however, is (1) where we just let the driver choose the
> > storage type of pointers and then they can have whatever lowering pass
> they
> > want.  If that's a "standard" fat-pointers pass in nir_lower_io, so be
> it.
> > If they want to translate directly into LLVM derefs, they can do that
> too, I
> > suppose.
> >
> >> >>>>>>>>> Also I could see this being useful in physical addressing too
> to
> >> >>>>>>>>> avoid
> >> >>>>>>>>> all passes working with derefs needing to do the constant
> >> >>>>>>>>> folding?
> >
> >
> > Constant folding is cheap, so I'm not too worried about that.  The bigger
> > thing that actual derefs gain us is the ability of the compiler to reason
> > about derefs.  Suppose, for instance, that you had the following:
> >
> > struct S {
> >    float a;
> >    float b;
> > };
> >
> > location(set = 0, binding = 0) Block {
> >    S arr[10];
> > } foo;
> >
> > void
> > my_func(int i)
> > {
> >    foo.arr[i].a = make_a_value();
> >    foo.arr[i].b = make_another_value();
> >    use_value(foo.arr[i].a);
> > }
> >
> > If everything is lowered to fat pointers, the compiler can't very easily
> > tell that the second line of my_func() didn't stomp foo.arr[i].a and so
> it
> > has to re-load the value in the third line.  With derefs, we can easily
> see
> > that the second line didn't stomp it and just re-use the re-use the
> result
> > of the make_a_value() call.  Yes, you may be able to tell the difference
> > between foo.arr[i].a and foo.arr[i].b with some fancy analysis and
> > arithmetic tracking but it'd be very painful to do.  I've given this
> > substantial thought for almost two years now and haven't actually come up
> > with a good way to do it without the information provided by derefs.
> >
> > The biggest thing that I think we will gain from deref instructions isn't
> > fancy syntax sugar and better nir_printability of pointers.  It also
> isn't
> > the removal of explicit of fat pointers in spirv_to_nir (thought that is
> > also nice).  The biggest thing I'm going for is improving NIR's ability
> to
> > optimize UBO, SSBO, and SLM access.  We're pretty good today for local
> > variables (though there are a couple of known places for improvements)
> but
> > we're utterly rubbish for anything that leaves the current shader
> > invocation.  If we want competent compute performance (which we do!) we
> need
> > to solve that problem.
> >
> >>
> >> >>>>>>>> The problem is that you don't necessarily know the type at
> >> >>>>>>>> compile
> >> >>>>>>>> time (and in the case where you do, you need to do constant
> >> >>>>>>>> folding to
> >> >>>>>>>> figure it out)
> >> >>>>>>>
> >> >>>>>>> So I have two considerations here
> >> >>>>>>>
> >> >>>>>>> 1) for vulkan you always know the mode, even when you don't know
> >> >>>>>>> the var.
> >> >>>>>>> 2)  In CL the mode can still get annotated in the source program
> >> >>>>>>> (CL C
> >> >>>>>>> non-generic pointers) in cases in which we cannot reasonably
> >> >>>>>>> figure it
> >> >>>>>>> out with just constant folding. In those cases the mode is extra
> >> >>>>>>> information that you really lose.
> >> >>>>>>
> >> >>>>>> so, even in cl 1.x, you could do things like 'somefxn(foo ?
> >> >>>>>> global_ptr
> >> >>>>>> : local_ptr)'.. depending on how much we inline all the things,
> >> >>>>>> that
> >> >>>>>> might not get CF'd away.
> >> >>>
> >> >>> How does this even work btw? somefxn has a definition, and the
> >> >>> definition specifies a mode for the argument right? (which is
> >> >>> implicitly __private if the app does not specify anything?)
> >> >>
> >> >> iirc, the cl spec has an example something along these lines..
> >> >>
> >> >> it doesn't require *physical* storage for anything where you don't
> >> >> know what the ptr type is, however.. so fat ptrs in ssa space works
> >> >> out
> >> >>
> >> >>>>>
> >> >>>>> But something like
> >> >>>>> __constant int *ptr_value = ...;
> >> >>>>> store ptr in complex data structure.
> >> >>>>> __constant int* ptr2 = load from complex data structure.
> >> >>>>>
> >> >>>>> Without explicitly annotating ptr2 it is unlikely that constant
> >> >>>>> folding would find that ptr2 is pointing to __constant address
> >> >>>>> space.
> >> >>>>> Hence removing the modes loses valuable information that you
> cannot
> >> >>>>> get back by constant folding. However, if you have a pointer with
> >> >>>>> unknown mode, we could have a special mode (or mode_all?) and you
> >> >>>>> can
> >> >>>>> use the uvec2 representation in that case?
> >
> >
> > I've thought about using a mode of 0 for this or maybe letting you OR all
> > the possible modes together since nir_variable_mode is, after all, a
> > bitfield.  I don't have any huge preference at the moment.
> >
> >>
> >> >>>> hmm, I'm not really getting how deref->mode could magically have
> >> >>>> information that fatptr.y doesn't have.. if the mode is known, vtn
> >> >>>> could stash it in fatptr.y and everyone is happy?  If vtn doesn't
> >> >>>> know
> >> >>>> this, then I don't see how deref->mode helps..
> >> >>>
> >> >>> You mean insert it into the fatptr every time deref_cast is called?
> >> >>>
> >> >>> Wouldn't that blow up the IR size significantly for very little
> >> >>> benefit?
> >> >>
> >> >> in an easy to clean up way, so meh?
> >> >
> >> > We can't clean it up if we want to keep the information. Also nir is
> >> > pretty slow to compile already, so I'd like not to add a significant
> >> > number of instruction for very little benefit.
> >
> >
> > Really?  I mean, I can believe it, but do you have any actual numbers to
> > back this up?  It's considerably faster than some other IRs we have in
> mesa
> > though they are known to be pretty big pigs if we're honest.  I'm very
> open
> > to any suggestions on how to improve compile times.  If NIR is
> fundamentally
> > a pig, we should fix that.
> >
> > I don't think compile time should be the deciding factor in whether or
> not
> > we use fat pointers as I doubt it will have a huge effect.  However, how
> > much work it takes to get at information is a reasonable argument.
> >
> >>
> >> I guess I'm failing to see what information you'd loose.. or how there
> >> would be a problem..
> >>
> >> I'm not strictly against having nir_var_unknown as a possible value
> >> for deref->mode, but I'm not really seeing how that helps anything,
> >> other than adding extra complexity to the IR..
> >>
> >>
> >> >>>>>>
> >> >>>>>> I think I'm leaning towards using fat ptrs for the vk case,
> since I
> >> >>>>>> guess that is a case where you could always expect
> >> >>>>>> nir_src_as_const_value() to work, to get the variable mode.  If
> for
> >> >>>>>> no
> >> >>>>>> other reason than I guess these deref's, if the var is not known,
> >> >>>>>> start w/ deref_cast, and it would be ugly for deref_cast to have
> to
> >> >>>>>> work differently for compute vs vk.  But maybe Jason already has
> >> >>>>>> some
> >> >>>>>> thoughts about it?
> >
> >
> > I don't see why they would be different.  Vulkan and CL are starting to
> > converge.  In vulkan, you already can't always chase the pointer back to
> the
> > variable VK_KHR_variable_pointers is used.  I don't think that CL is as
> > special as you think it is.  The only thing special about CL is the fact
> > that global and local are in the same address space.  Beyond that, all
> the
> > problems we have to solve are fundamentally the same.
> >
> >>
> >> >>>>> I'd like to avoid fat pointers alltogether on AMD since we would
> not
> >> >>>>> use it even for CL. a generic pointer is just a uint64_t for us,
> >> >>>>> with
> >> >>>>> no bitfield in there for the address space.
> >
> >
> > I tend to agree with this.  (See above discussion).
> >
> >>
> >> >>>>> I think we may need to think a bit more about representation
> >> >>>>> however,
> >> >>>>> as e.g. for AMD a pointer is typically 64-bits (but we can do e.g.
> >> >>>>> 32-bits for known workgroup pointers), the current deref
> >> >>>>> instructions
> >> >>>>> return 32-bit, and you want something like a uvec2 as pointer
> >> >>>>> representation?
> >> >>>>
> >> >>>> afaiu, newer AMD (and NV) hw can remap shared/private into a single
> >> >>>> global address space..  But I guess that is an easy subset of the
> >> >>>> harder case where drivers need to use different instructions.. so a
> >> >>>> pretty simple lowering pass run before lower_io could remap things
> >> >>>> that use fatptrs into something that ignores fatptr.y.  Then opt
> >> >>>> passes make fatptr.y go away.  So both AMD and hw that doesn't
> have a
> >> >>>> flat address space are happy.
> >> >>>
> >> >>> But then you run into other issues, like how are you going to stuff
> a
> >> >>> 64-bit fatptr.x + a ?-bit fatptr.y into a 64-bit value for
> Physical64
> >> >>> addressing? Also this means we have to track to the sources back to
> >> >>> the cast/var any time we want to do anything at all with any deref
> >> >>> which seems less efficient to me than just stuffing the deref in
> >> >>> there.
> >> >>
> >> >> so fat ptrs only have to exist in ssa space, not be stored to
> >> >> something with a physically defined size..
> >> >
> >> > how does storing __generic pointers work then? those still need the
> >> > fat bit for your hw right?
> >>
> >> for hw that can't map everything into a single flat address space, you
> >> *can't* store a fat pointer.. oh well, it doesn't mean that that hw
> >> can't implement lesser ocl versions (since this is defn not required
> >> for ocl 1.x and I don't *think* it is required for 2.x, but I haven't
> >> checked the spec.. I guess if intel supports 2.x then it isn't
> >> required..)
> >
> >
> > You can still carve up the address space and put local memory at some
> > absurdly high address and do
> >
> > if (ptr > 0xffffffffffff0000)
> >    store_local(ptr & 0xffff)
> > else
> >    store_global(ptr)
> >
>
> That might be an option if we need to physically store a fat pointer
> (but I'm not convinced we need to).  But as a general solution it is a
> horrible idea.  You either loose information (since when was TMI a
> problem for compilers?) or you get to teach constant folding that even
> though it doesn't know the value of 'a' it does know the value of 'a &
> 0xffffffffffff0000', which is insanity.
>
> And simply not optimizing away the global/local/private turns every
> load/store into something that looks like:
>
>    cmps.f.lt p0.x, c4.x, r0.z
>    ; delay 6 instructions so branch can see the value in p0.x
>    (rpt5)nop
>    br !p0.x, #3
>    ldg.u32 r0.w, r0.y, 1
>    jump #11
>    (jp)cmps.f.lt p0.x, c4.y, r0.z
>    (rpt5)nop   ; delay 6 instruction slots
>    br !p0.x, #8
>    ldl.u32 r0.w, r0.y, 1
>    jump #3
>    ; private is really just global with a driver provided buffer
>    ; and compiler computed offset
>    (jp)add.s r0.y, c10.x   ; add base address of buffer
>    (rpt3)nop               ; alu results not immediately avail and no
> other instr to fill the empty slots
>    add.s r0.y, ... offset calculated from local_id
>    (rpt5)nop               ; 6 slots for result from alu avail to mem
> instructions
>    ldg.u32 r0.w, r0.y, 1
>    (jp)(sy)... and now we have a result..
>
>
> which is simply not an option!  And the result is actually worse since
> we end up with 64b pointer math lowered to 32b instructions!
>
> tbh, I *really* don't see the problem with fat pointers, but if
> someone else doesn't want deref_cast to use fat pointers, then I must
> insist on a different intrinsic which does, and a compiler option for
> vtn to choose which to emit.
>

I'm sorry.  That comment was NOT intended as a suggestion for universal
alternative to fat pointers.  It was merely to demonstrate how an
architecture like Intel's which uses different messages for local vs.
global could implement physical 64-bit (uint64_t) generic pointers.
Someone (looks like Rob from the indentation) made a comment about
difficulties implementing OpenCL 2.x and I was just trying to show that it
could be done.  Of course, stuffing the extra information in a vector
component is easier.

This was a side comment to what I thought was a side comment in the
discussion.

> >>
> >> >>
> >> >> As far as tracking things to the head of the chain of deref
> >> >> instructions, that is a matter of a simple helper or two.  Not like
> >> >> the chain of deref's is going to be 1000's of instructions..
> >
> >
> > Yes and no (mostly no) once you start throwing in phis, selects, and
> loading
> > pointers from variables.  The first two you can explain away by just
> saying,
> > "follow the phi/select sources" but the moment you load a pointer out of
> a
> > variable, you have to do something.  Sure, you could just drop .y and
> stomp
> > it to the known storage class, but that hardly seems ideal.
> >
> >>
> >> >>> Also, what would the something which ignores fatptr.y be? I'd assume
> >> >>> that would be the normal deref based stuff, but requiring fatptr
> >> >>> contradicts that?
> >> >>
> >> >> if you have a flat address space, maybe a pass (or option for
> >> >> lower_io) to just convert everything to load/store_global (since
> >> >> essentially what these GPUs are doing is remapping shared/private
> into
> >> >> the global address space)
> >> >
> >> > but I'd like to only do that if we really don't know the mode
> >> > statically since it is somewhat slower/less flexible. (also radv is
> >> > unlikely to use nir_lower_io for a lot of this stuff since we can
> >> > lower derefs into LLVM GEPs directly)
> >>
> >> I mean, we control all the src code, we can add more options to
> lower_io.
> >>
> >> >
> >> > Hence if we want the cases where we know the mode statically we need
> >> > to not lower the fatptr, but then we have the whole fatptr mess.
> >> >
> >>
> >> well, fatptr mess is unavoidable.. I mostly just fail to see how a
> >> more general solution (ie. storing the variable mode in ssa) is worse
> >> than having to deal with both cases (in ssa or in instr).  The in ssa
> >> case is something that is easy to recover since anything that does
> >> pointer math has to split out the .y component and fuse it back in
> >> after the pointer math.  (And if you have a flat address space, I
> >> guess I'm failing to see why you care about the mode in the first
> >> place.)
> >
> >
> > Yes, fat pointers are inevitable for some hardware at some stage in the
> > compile.  That does not, however, mean that we want fat pointers straight
> > out of SPIR-V.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20180409/cc9e09ed/attachment-0001.html>