[Mesa-dev] [PATCH 3/3] i965/fs: Combine tex/fb_write operations (opt)

Sun Apr 12 07:57:00 PDT 2015

On Sun, Apr 12, 2015 at 10:02:03AM +0300, Pohjolainen, Topi wrote:
> On Fri, Apr 10, 2015 at 12:52:04PM -0700, Ben Widawsky wrote:
> > Certain platforms support the ability to sample from a texture, and write it out
> > to the file RT - thus saving a costly send instruction (note that this is a
> > potential win if one wanted to backport to a tag that didn't have the patch
> > from Topi which removed excess MOVs from LOAD_PAYLOAD - 97caf5fa04dbd2).
> > 
> > v2: Modify the algorithm. Instead of iterating in reverse through blocks and
> > insts, look only at the last block/inst, since it is the only thing which can
> > benefit. Rebased on top of Ken's patch modifying is_last_send.
> > 
> > v3: Rebased over almost 2 months, and incorporated feedback from Matt:
> > Some comment typo fixes and rewordings.
> > Whitespace
> > Move the optimization pass outside of the optimize loop
> > 
> > v4: Some cosmetic changes requested by Ken. These changes ensured that the
> > optimization function always returned true when an optimization occurred, and
> > false when one did not. This behavior did not exist with the original patch. As
> > a result, having the separate helper function which Matt did not like no longer
> > made sense, and so now I believe everyone should be happy.
> > 
> > Braswell data:
> > Benchmark (n=20)   %diff
> > *OglBatch5         -1.4
> > *OglBatch7         -1.79
> > OglFillTexMulti    5.57
> > OglFillTexSingle   1.16
> > OglShMapPcf        0.05
> > OglTexFilterAniso  3.01
> > OglTexFilterTri    1.94
> > 
> > SKL data:
> > NONE COLLECTED
> > 
> > No piglit regressions:
> > (http://otc-gfxtest-01.jf.intel.com:8080/view/dev/job/bwidawsk/112/)
> > 
> > [*] I believe my measurements are incorrect for Batch5-7. If I add this new
> > optimization, but never emit the new instruction, I see similar results.
> 
> I'm seeing ~7% (with 95% confidence) decrease in OglBatch6/7 when I'm
> launching resolve clears with the light-weight mechanism provided by blorp.
> This may be totally unrelated but let's see if I get any smarter.

I let OglBatch6 run for some time (160 rounds each), and I get:

x /mnt/before
+ /mnt/after
+------------------------------------------------------------------------------+
|                                               +           x                  |
|                                               +           x                  |
|                                               +           x  x               |
|                                           +   +    x  x   x  x               |
|                                           +   +  x x  x   x  x               |
|                                       +   +   ++ x xx x   xx x               |
|                                       + *++ * +* x*xx+xx  xxxx               |
|                             +  +      + **+ *x+*+x**x+x** xxxxx              |
|                             +  +    + ++*** *x+*+***x+x** xxxxx              |
|                             +  +++++++++***+**+*******x** xxxxxx             |
|                    +   +  +++ ++*++++*+****+**********x**xxxxxxx x++ x       |
|+          +       ++   ** *+***+*+*+**+******************x*x***x+*** * xx*+ x|
|                                |__________|AM_______A__|_____|               |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x 160       102.365       122.348       113.472     113.21107     3.6714446
+ 160       93.4825       121.597       110.289     110.03581     4.3771895
Difference at 95.0% confidence
        -3.17526 +/- 0.885251
        -2.80473% +/- 0.781947%
        (Student's t, pooled s = 4.03976)

I'm not sure one can really conclude much from this. I would almost claim
that my changes just introduce more fluctuation in the fps numbers and nothing
else.
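
For anyone who wants to check the "Difference at 95.0% confidence" lines above,
they can be reproduced from the N/Avg/Stddev rows alone. What follows is only a
minimal sketch, assuming a pooled-variance Student's t interval and the
large-sample critical value 1.960 (my assumption; it happens to reproduce the
printed numbers). The function name is made up for illustration and is not part
of ministat.

```python
import math

def pooled_t_difference(n1, mean1, sd1, n2, mean2, sd2, t_crit=1.960):
    """Difference of means (second minus first) with a pooled-variance interval.

    t_crit is an assumed critical value: 1.960 is the large-sample 95% value.
    """
    # Pooled standard deviation across both samples.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    # Standard error of the difference of the two means.
    std_err = pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    diff = mean2 - mean1
    half_width = t_crit * std_err
    return diff, half_width, pooled_sd

# Summary rows from the table above: "x" is before, "+" is after.
diff, half_width, pooled_sd = pooled_t_difference(
    160, 113.21107, 3.6714446,
    160, 110.03581, 4.3771895,
)
rel_diff = 100.0 * diff / 113.21107        # relative to the "before" mean
rel_half = 100.0 * half_width / 113.21107

print(f"difference: {diff:.5f} +/- {half_width:.6f}")
print(f"relative:   {rel_diff:.5f}% +/- {rel_half:.6f}%")
print(f"pooled s:   {pooled_sd:.5f}")
```

The interval is narrow mostly because N is large. The pooled standard deviation
(about 4 fps) is still bigger than the difference itself (about 3 fps), which
is why the two histograms overlap so heavily.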

I examined what callgrind tells me. Both master and meta-blorp rendered the
same number of frames, while the latter does a little less CPU work to achieve
this. The latter also submits slightly less work to the GPU, since clears are
executed without the vertex shader stage. Hence I can't really explain why it
should be any slower.

So if I were you, I probably wouldn't worry too much about your results.