[Mesa-dev] [PATCH 3/3] i965/fs: Combine tex/fb_write operations (opt)

Matt Turner mattst88 at gmail.com
Fri Apr 10 13:20:49 PDT 2015


On Fri, Apr 10, 2015 at 12:52 PM, Ben Widawsky
<benjamin.widawsky at intel.com> wrote:
> Certain platforms support the ability to sample from a texture, and write it out
> to the file RT - thus saving a costly send instructions (note that this is a
> potnential win if one wanted to backport to a tag that didn't have the patch
> from Topi which removed excess MOVs from LOAD_PAYLOAD - 97caf5fa04dbd2),
>
> v2: Modify the algorithm. Instead of iterating in reverse through blocks and
> insts, since the last block/inst is the only thing which can benefit. Rebased
> on top of Ken's patching modifying is_last_send
>
> v3: Rebased over almost 2 months, and Incorporated feedback from Matt:
> Some comment typo fixes and rewordings.
> Whitespace
> Move the optimization pass outside of the optimize loop
>
> v4: Some cosmetic changes requested from Ken. These changes ensured that the
> optimization function always returned true when an optimization occurred, and
> false when one did not. This behavior did not exist with the original patch. As
> a result, having the separate helper function which Matt did not like no longer
> made sense, and so now I believe everyone should be happy.
>
> Braswell data:
> Benchmark (n=20)   %diff
> *OglBatch5         -1.4
> *OglBatch7         -1.79
> OglFillTexMulti    5.57
> OglFillTexSingle   1.16
> OglShMapPcf        0.05
> OglTexFilterAniso  3.01
> OglTexFilterTri    1.94
>
> SKL data:
> NONE COLLECTED
>
> No piglit regressions:
> (http://otc-gfxtest-01.jf.intel.com:8080/view/dev/job/bwidawsk/112/)
>
> [*] I believe my measurements are incorrect for Batch5-7. If I add this new
> optimization, but never emit the new instruction I see similar results.
>
> Cc: Matt Turner <mattst88 at gmail.com>
> Cc: Kenneth Graunke <kenneth at whitecape.org>
> Cc: Jason Ekstrand <jason at jlekstrand.net>
> Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> ---
>  src/mesa/drivers/dri/i965/brw_fs.cpp           | 92 ++++++++++++++++++++++++++
>  src/mesa/drivers/dri/i965/brw_fs.h             |  3 +
>  src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 12 ++++
>  3 files changed, 107 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index 72000cf..d56d053 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -2557,6 +2557,96 @@ fs_visitor::opt_algebraic()
>     return progress;
>  }
>
> +/**
> + * Optimize sample messages which are followed by the final RT write.
> + *
> + * CHV, and GEN9+ can mark a texturing SEND instruction with EOT to have its
> + * results sent directly to the framebuffer, bypassing the EU.  Recognize the
> + * final texturing results copied to the framebuffer write payload and modify
> + * them to write to the framebuffer directly.
> + */
> +bool
> +fs_visitor::opt_sampler_eot()
> +{
> +   brw_wm_prog_key *key = (brw_wm_prog_key*) this->key;
> +
> +   if (brw->gen < 9 && !brw->is_cherryview)
> +      return false;
> +
> +   /* FINISHME: It should be possible to implement this optimization when there
> +    * are multiple drawbuffers.
> +    */
> +   if (key->nr_color_regions != 1)
> +      return false;
> +
> +   /* Look for a texturing instruction immediately before the final FB_WRITE. */
> +   fs_inst *fb_write = (fs_inst *) cfg->blocks[cfg->num_blocks - 1]->end();
> +   assert(fb_write->eot);
> +   assert(fb_write->opcode == FS_OPCODE_FB_WRITE);
> +
> +   if (unlikely(fb_write->is_head_sentinel()))
> +       return false;

This is impossible. The assertions above are already assuming (rightly
so) that it's an actual instruction. Remove these two lines.

> +
> +   fs_inst *tex_inst = (fs_inst *) fb_write->prev;
> +
> +   /* There wasn't one; nothing to do. */
> +   if (unlikely(tex_inst->is_head_sentinel()) || !tex_inst->is_tex())
> +      return false;

This looks good.

> +
> +   /* If there's no header present, we need to munge the LOAD_PAYLOAD as well.
> +    * It's very likely to be the previous instruction.
> +    */
> +   fs_inst *load_payload = (fs_inst *) tex_inst->prev;
> +   if (load_payload->is_head_sentinel() ||
> +       load_payload->opcode != SHADER_OPCODE_LOAD_PAYLOAD)
> +      return false;
> +
> +   assert(!tex_inst->eot); /* We can't get here twice */
> +   assert((tex_inst->offset & (0xff << 24)) == 0);
> +
> +   tex_inst->offset |= fb_write->target << 24;
> +   tex_inst->eot = true;
> +   fb_write->remove(cfg->blocks[cfg->num_blocks - 1]);
> +
> +   /* If a header is present, marking the eot is sufficient. Otherwise, we need
> +    * to create a new LOAD_PAYLOAD command with the same sources and a space
> +    * saved for the header. Using a new destination register not only makes sure
> +    * we have enough space, but it will make sure the dead code eliminator kills
> +    * the instruction that this will replace.
> +    */
> +   if (tex_inst->header_present)
> +      return true;
> +
> +   fs_reg send_header = vgrf(load_payload->sources + 1);
> +   fs_reg *new_sources =
> +      ralloc_array(mem_ctx, fs_reg, load_payload->sources + 1);
> +
> +   new_sources[0] = fs_reg();
> +   for (int i = 0; i < load_payload->sources; i++)
> +      new_sources[i+1] = load_payload->src[i];
> +
> +   /* The LOAD_PAYLOAD helper seems like the obvious choice here. However, it
> +    * requires a lot of information about the sources to appropriately figure
> +    * out the number of registers needed to be used. Given this stage in our
> +    * optimization, we may not have the appropriate GRFs required by
> +    * LOAD_PAYLOAD at this point (copy propagation). Therefore, we need to
> +    * manually emit the instruction.
> +    */
> +   fs_inst *new_load_payload = new(mem_ctx) fs_inst(SHADER_OPCODE_LOAD_PAYLOAD,
> +                                                    load_payload->exec_size,
> +                                                    send_header,
> +                                                    new_sources,
> +                                                    load_payload->sources + 1);

You don't have to change it if you don't want to, but I often try to
indent this like

fs_inst *new_load_payload =
   new(mem_ctx) fs_inst(SHADER_OPCODE_LOAD_PAYLOAD, ...

aligning parameters that don't fit on the new(mem_ctx) line with the
first parameter to fs_inst.

> +
> +   new_load_payload->regs_written = load_payload->regs_written + 1;
> +   tex_inst->mlen++;
> +   tex_inst->header_present = true;
> +   tex_inst->insert_before(cfg->blocks[cfg->num_blocks - 1], new_load_payload);
> +   tex_inst->src[0] = send_header;

Set the tex_inst's destination to reg_null_ud or something here so that...

> +
> +   return true;
> +}
> +
>  bool
>  fs_visitor::opt_register_renaming()
>  {
> @@ -3763,6 +3853,8 @@ fs_visitor::optimize()
>
>     pass_num = 0;
>
> +   OPT(opt_sampler_eot);
> +
>     if (OPT(lower_load_payload)) {
>        split_virtual_grfs();
>        OPT(register_coalesce);
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
> index 278a8ee..fcb4b63 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.h
> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
> @@ -231,6 +231,9 @@ public:
>     bool compute_to_mrf();
>     bool dead_code_eliminate();
>     bool remove_duplicate_mrf_writes();
> +
> +   void combine_send_tex(bblock_t *block, fs_inst *tex_inst, fs_inst *fb_write);
> +   bool opt_sampler_eot();
>     bool virtual_grf_interferes(int a, int b);
>     void schedule_instructions(instruction_scheduler_mode mode);
>     void insert_gen4_send_dependency_workarounds();
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> index 40e51aa..1eb468f 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> @@ -517,6 +517,7 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>     int rlen = 4;
>     uint32_t simd_mode;
>     uint32_t return_format;
> +   bool is_combined_send = inst->eot;
>
>     switch (dst.type) {
>     case BRW_REGISTER_TYPE_D:
> @@ -676,6 +677,11 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>        dst = vec16(dst);
>     }
>
> +   if (is_combined_send) {
> +      assert(brw->gen >= 9 || brw->is_cherryview);
> +      rlen = 0;
> +   }
> +
>     assert(brw->gen < 7 || !inst->header_present ||
>            src.file == BRW_GENERAL_REGISTER_FILE);
>
> @@ -781,6 +787,12 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>         * so has already done marking.
>         */
>     }
> +
> +   if (is_combined_send) {
> +      brw_inst_set_eot(brw, brw_last_inst, true);
> +      brw_set_dest(p, brw_last_inst, brw_null_reg());

... you don't need to set brw_set_dest here.

Also, I don't think you need to brw_inst_set_eot here. Since you're
setting tex_inst->eot it'll be set by brw_set_message_descriptor().

With those things fixed,

Reviewed-by: Matt Turner <mattst88 at gmail.com>

Thanks Ben. This is a pretty cool optimization.

> +      brw_inst_set_opcode(brw, brw_last_inst, BRW_OPCODE_SENDC);
> +   }
>  }
>
>
> --
> 2.3.5
>


More information about the mesa-dev mailing list