[Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
Eero Tamminen
eero.t.tamminen at intel.com
Tue May 29 16:05:28 UTC 2018
Hi,
On 29.05.2018 18:58, Eero Tamminen wrote:
> On 25.05.2018 00:55, Jason Ekstrand wrote:
>> This patch series adds back-end compiler support for SIMD32 fragment
>> shaders. Support is added and everything works but it's currently hidden
>> behind INTEL_DEBUG=do32. We know that it improves performance in some
>> cases but we do not yet have a good enough heuristic to start turning
>> it on by default. The objective of this series is just to get the compiler
>> infrastructure landed so that it stops bit-rotting in Curro's branch.
>
> Tested v3 on BXT & SKL. Everything seems to work otherwise fine.
s/otherwise//
- Eero
(regardless of how many times one reads a mail before sending, there
always seems to be some leftover one misses.)
> Tested-by: Eero Tamminen <eero.t.tamminen at intel.com>
>
>
>> Figuring out a good heuristic is left as an exercise to the reader. :-)
>
> A simple heuristic that just enables SIMD32 for every shader that isn't
> an MRT shader gives nice perf improvements on BXT J4205:
> * +30% GfxBench ALU2
> * +25% SynMark PSPom
> * +10% GpuTest Julia32
> * +9% GfxBench CarChase
> * +7% GfxBench Manhattan 3.0
> * +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
> * +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
> * -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
> VSInstancing & ZBuffer
> * -2-3% GLB 2.7 Fill
> * -4-5% MemBW Blend
>
> On SKL, perf differences are smaller.
>
> SIMD32 can cause write-bound tests to thrash, which is visible as a perf
> regression in the fully write-bound tests above (that's also the reason
> why SIMD32 is good to disable with MRT shaders).
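
For illustration, the simple heuristic described above could look roughly
like the sketch below. This is not code from the series: the function name,
the macro and the limit of 8 color targets are made up for the example, and
"MRT" is assumed to mean a shader writing more than one color target.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical upper bound on color targets, used only as a sanity check. */
#define SKETCH_MAX_COLOR_TARGETS 8

/*
 * Sketch of the "SIMD32 for everything that isn't an MRT shader" rule.
 * Returns true when the shader writes at most one color target.
 */
static bool
sketch_want_simd32(unsigned color_target_count)
{
   assert(color_target_count <= SKETCH_MAX_COLOR_TARGETS);

   /* More than one color target means MRT, so stay with narrower dispatch. */
   return color_target_count <= 1;
}
```

A real implementation would also have to honor the cases where the series
itself disables SIMD32 (e.g. shaders with discard).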
>
> As to reads, SIMD32 improves cache locality until it starts thrashing.
> In the above GfxBench tests, with the amount of texture sampling they do,
> this shows in HW counters as increased texture cache misses (thrashing)
> but fewer L3 misses (better locality). Along with (more important) better
> latency compensation, these explain why SIMD32 improves performance in
> them.
>
>
> More advanced heuristics that try to avoid the SIMD32 performance
> regressions unfortunately also get rid of a clear part of the above
> improvements. Such heuristics would need an improved instruction scheduler
> that provides feedback on which shaders have latency issues where SIMD32
> would help.
>
> (A potential run-time heuristic would be disabling SIMD32 when too
> large textures are bound for a draw.)
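
A minimal sketch of what such a run-time check might look like follows. The
struct, the function name and the size threshold parameter are all
hypothetical; no threshold was measured, so the caller has to supply one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a texture bound for the draw. */
struct sketch_texture {
   uint64_t size_bytes;
};

/*
 * Returns false if any bound texture is larger than max_bytes, i.e. when
 * this sketch would fall back from SIMD32 to a narrower dispatch width.
 */
static bool
sketch_draw_allows_simd32(const struct sketch_texture *textures,
                          size_t count, uint64_t max_bytes)
{
   assert(textures != NULL || count == 0);

   for (size_t i = 0; i < count; i++) {
      if (textures[i].size_bytes > max_bytes)
         return false;
   }
   return true;
}
```

Whether texture size alone is a good enough proxy for cache thrashing is an
open question; the sketch only shows the shape of the check.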
>
>
> - Eero
>
>> Francisco Jerez (34):
>> intel/eu: Remove brw_codegen::compressed_stack.
>> intel/fs: Rename a local variable so it doesn't shadow component()
>> intel/fs: Use the ATTR file for FS inputs
>> intel/fs: Replace the CINTERP opcode with a simple MOV
>> intel/fs: Add explicit last_rt flag to fb writes orthogonal to eot.
>> intel/fs: Fix Gen4-5 FB write AA data payload munging for non-EOT writes.
>> intel/eu: Return new instruction to caller from brw_fb_WRITE().
>> intel/fs: Fix fs_inst::flags_written() for Gen4-5 FB writes.
>> intel/fs: Fix implied_mrf_writes() for headerless FB writes.
>> intel/fs: Remove program key argument from generator.
>> intel/fs: Disable SIMD32 dispatch on Gen4-6 with control flow
>> intel/fs: Disable SIMD32 dispatch for fragment shaders with discard.
>> intel/eu: Fix pixel interpolator queries for SIMD32.
>> intel/fs: Fix codegen of FS_OPCODE_SET_SAMPLE_ID for SIMD32.
>> intel/fs: Don't enable dual source blend if no outputs are written
>> intel/fs: Fix FB write message control codegen for SIMD32.
>> intel/fs: Fix logical FB write lowering for SIMD32
>> intel/fs: Fix FB read header setup for SIMD32.
>> intel/fs: Rework INTERPOLATE_AT_PER_SLOT_OFFSET
>> intel/fs: Mark LINTERP opcode as writing accumulator implicitly on pre-Gen7.
>> intel/fs: Disable opt_sampler_eot() in 32-wide dispatch.
>> i965: Add plumbing for shader time in 32-wide FS dispatch mode.
>> intel/fs: Simplify fs_visitor::emit_samplepos_setup
>> intel/fs: Use fs_regs instead of brw_regs in the unlit centroid workaround
>> intel/fs: Wrap FS payload register look-up in a helper function.
>> intel/fs: Extend thread payload layout to SIMD32
>> intel/fs: Implement 32-wide FS payload setup on Gen6+
>> intel/fs: Fix Gen7 compressed source region alignment restriction for SIMD32
>> intel/fs: Fix sample id setup for SIMD32.
>> intel/fs: Generalize the unlit centroid workaround
>> intel/fs: Fix Gen6+ interpolation setup for SIMD32
>> intel/fs: Fix fs_builder::sample_mask_reg() for 32-wide FS dispatch.
>> intel/fs: Fix nir_intrinsic_load_helper_invocation for SIMD32.
>> intel/fs: Build 32-wide FS shaders.
>>
>> Jason Ekstrand (19):
>> intel/fs: Assert that the gen4-6 plane restrictions are followed
>> intel/fs: Use groups for SIMD16 LINTERP on gen11+
>> intel/fs: FS_OPCODE_REP_FB_WRITE has side effects
>> intel/fs: Properly track implied header regs read by FB writes
>> intel/fs: Pull FB write implied headers from src[0]
>> intel/fs: Set up FB write message headers in the visitor
>> i965: Re-arrange shader kernel setup in WM state
>> intel/compiler: Add and use helpers for working with KSP indices
>> intel/fs: Rework KSP data to be SIMD width-based
>> intel/fs: Split instructions low to high in lower_simd_width
>> intel/fs: Properly copy default flag reg for 3src instrucitons
>> intel/fs: Add the group to the flag subreg number on SNB and older
>> intel/fs: Emit LINE+MAC for LINTERP with unaligned coordinates
>> intel/fs: Emit MOV_DISPATCH_TO_FLAGS once for the centroid workaround
>> intel/fs: Get rid of MOV_DISPATCH_TO_FLAGS
>> intel/fs: Add fields to wm_prog_data for SIMD32 dispatch
>> intel/anv,blorp,i965: Implement the SKL 16x MSAA SIMD32 workaround
>> intel/fs: Remove support push constants in repclear shaders
>> intel/fs: Support SIMD32 repclear shaders
>>
>> src/intel/blorp/blorp.c | 2 +-
>> src/intel/blorp/blorp_genX_exec.h | 82 +++-
>> src/intel/compiler/brw_compiler.h | 98 +++-
>> src/intel/compiler/brw_eu.h | 21 +-
>> src/intel/compiler/brw_eu_defines.h | 2 -
>> src/intel/compiler/brw_eu_emit.c | 39 +-
>> src/intel/compiler/brw_fs.cpp | 666 ++++++++++++++++----------
>> src/intel/compiler/brw_fs.h | 53 +-
>> src/intel/compiler/brw_fs_builder.h | 6 +-
>> src/intel/compiler/brw_fs_cse.cpp | 1 -
>> src/intel/compiler/brw_fs_generator.cpp | 318 ++++++------
>> src/intel/compiler/brw_fs_nir.cpp | 57 ++-
>> src/intel/compiler/brw_fs_visitor.cpp | 193 ++++----
>> src/intel/compiler/brw_ir_fs.h | 1 +
>> src/intel/compiler/brw_shader.cpp | 12 +-
>> src/intel/compiler/brw_vec4.cpp | 2 +-
>> src/intel/compiler/brw_vec4_gs_visitor.cpp | 2 +-
>> src/intel/compiler/brw_vec4_tcs.cpp | 2 +-
>> src/intel/compiler/brw_wm_iz.cpp | 11 +-
>> src/intel/vulkan/anv_pipeline.c | 2 +-
>> src/intel/vulkan/genX_pipeline.c | 40 +-
>> src/mesa/drivers/dri/i965/brw_context.h | 1 +
>> src/mesa/drivers/dri/i965/brw_program.c | 6 +
>> src/mesa/drivers/dri/i965/brw_wm.c | 6 +-
>> src/mesa/drivers/dri/i965/gen4_blorp_exec.h | 17 +-
>> src/mesa/drivers/dri/i965/genX_state_upload.c | 144 ++++--
>> 26 files changed, 1101 insertions(+), 683 deletions(-)
>>
>