[Mesa-dev] [PATCH 4/4] RFC: nir: add lowering for idiv/udiv/umod

Wed Apr 1 12:21:23 PDT 2015

Am 01.04.2015 um 20:11 schrieb Matt Turner:
> On Tue, Mar 31, 2015 at 6:44 PM, Rob Clark <robdclark at gmail.com> wrote:
>> On Tue, Mar 31, 2015 at 9:03 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> So if this is not enough precision, maybe should state how large the
>>> error can be?
>>>
>>
>> tbh, if I knew what the error for this approach was, I would have
>> included it.  I'm not the original author, but this is based on
>> nouveau codegen code (as mentioned in the comment).  I guess it is
>> better than converting to float and dividing and converting back, but
>> worse than an iterative (ie. looping, ie. divergent flow control)
>> approach.  It is apparently enough to keep piglit happy.
>>
>> The original algo in nv50 lowering code is from
>> 322bc7ed68ed92233c97168c036d0aa50c11a20e (ie. 'nv50/ir: import nv50
>> target') which doesn't really give more clue about the origin..
>>
>> if anyone knows, I'm all ears and will add relevant links/info to comment..
> 
> FWIW: The DEC Alpha doesn't have an integer divide instruction, and
> its architecture reference manual [0] suggests in section A.4.2:

I think though this assumes that you don't have a fast float rcp. At
least if that rcp would be exact even a naive implementation of int div
with floats can't really have a large error - up to 2^24 (for both
nominator and denominator) things should always be exact. And the
largest error should be for something like (2^32 - 128) / 1, and thus
would be "only" 128. Doesn't seem improbably there are ways to fix that up.
In any case if there's hw specific helpers for this stuff it's probably
best to implement it in drivers. Doesn't mean though a generic way
wouldn't be useful.

Roland

> 
>> For the remaining cases, a table lookup on about a 1000-entry table and
>> a multiply can give a linear approximation to 1/divisor that is accurate to 16 bits.
>>
>> Using this approximation, a multiply and a back-multiply and a subtract can
>> generate one 16-bit quotient digit plus a 48-bit new partial dividend. Three
>> more such steps can generate the full quotient. Having prior knowledge of the
>> possible sizes of the divisor and dividend, normalizing away leading bytes of
>> zeros, and performing an early-out test can reduce the average number of
>> multiplies to about five (compared to a best case of one and a worst case of
>> nine).
> 
> I read on IRC that at least some (radeon?) hardware has an instruction
> like recip_int that might make this a compelling option, since you
> wouldn't even need the table. I can probably dig up an implementation
> if this sounds interesting.
> 
> Otherwise, I think it's a series of subtracts and shifts.
> 
> [0] https://urldefense.proofpoint.com/v2/url?u=http-3A__download.majix.org_dec_alpha-5Farch-5Fref.pdf&d=AwIBaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=Vjtt0vs_iqoI31UfJxBl7yv9I2FeiaeAYgMTLKRBc_I&m=V0XkcgELItL6ZswVFWb7LhVP-28gQnbYUBCXX9mMe1w&s=d4pA26pX2AcOXVwMF1dRLbIarppe8bVzZOzne3TlZdA&e= 
>