[Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

Wed May 30 14:30:10 UTC 2018

On May 30, 2018 06:45:29 Eero Tamminen <eero.t.tamminen at intel.com> wrote:

> Hi,
>
> On 29.05.2018 18:58, Eero Tamminen wrote:
>> On 25.05.2018 00:55, Jason Ekstrand wrote:
>>> This patch series adds back-end compiler support for SIMD32 fragment
>>> shaders.  Support is added and everything works but it's currently hidden
>>> behind INTEL_DEBUG=do32.  We know that it improves performance in some
>>> cases but we do not yet have a good enough heuristic to start turning it on
>>> by default.  The objective of this series is just to get the compiler
>>> infrastructure landed so that it stops bit-rotting in Curro's branch.
>>
>> Tested v3 on BXT & SKL.  Everything seems to work fine.
>
> Everything also works fine on GEN8 (BSW & BDW GT2), but half the tests
> trigger GPU hangs on GEN7 (BYT & HSW GT2).

That problem is known.  It's caused by using SIMD32 shaders for fast
clears.  The SIMD32 replicated clear shaders were added at the last
minute and didn't get enough testing before the series was sent out.
We can either drop those two patches and modify the last one to not do
SIMD32 when use_replicated_clear is set, or use another patch I have
which just disables SIMD32 for fast clears.
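
As a rough illustration only, the "no SIMD32 for fast clears" option could
be sketched as below.  The struct and function names (fs_dispatch_params,
allow_simd32) and the do32_requested field are hypothetical and are not
taken from the series; only use_replicated_clear and INTEL_DEBUG=do32 are
named in this thread.

```c
#include <stdbool.h>

/* Hypothetical inputs for the decision; not the actual Mesa structures. */
struct fs_dispatch_params {
   bool do32_requested;        /* assumption: set when INTEL_DEBUG=do32 is used */
   bool use_replicated_clear;  /* shader is a replicated (fast) clear shader */
};

/* Returns true if a SIMD32 variant of the fragment shader should be built. */
static bool
allow_simd32(const struct fs_dispatch_params *p)
{
   /* SIMD32 stays opt-in until a good heuristic exists. */
   if (!p->do32_requested)
      return false;

   /* Fast clears with SIMD32 are what hangs GEN7, so never use it there. */
   if (p->use_replicated_clear)
      return false;

   return true;
}
```

With this shape, an ordinary fragment shader still gets a SIMD32 variant
under do32, while a repclear shader never does.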

>
> One option would be to support SIMD32 just for GEN8+.
>
>
>> Tested-by: Eero Tamminen <eero.t.tamminen at intel.com>
>>
>>
>>> Figuring out a good heuristic is left as an exercise to the reader. :-)
>>
>> A simple heuristic that just enables SIMD32 for everything that isn't
>> an MRT shader gives nice perf improvements on BXT J4205:
>> * +30% GfxBench ALU2
>> * +25% SynMark PSPom
>> * +10% GpuTest Julia32
>> * +9% GfxBench CarChase
>> * +7% GfxBench Manhattan 3.0
>> * +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
>> * +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
>> * -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
>> VSInstancing & ZBuffer
>> * -2-3% GLB 2.7 Fill
>> * -4-5% MemBW Blend
>>
>> On SKL, perf differences are smaller.
>
> On GEN8, the improvements are smaller and regressions larger with
> the same heuristic.
>
> The main difference with the 12EU single-channel BSW is a 15% perf
> regression in SynMark FillTexMulti, i.e. sampling 8 textures and writing
> out their average value.  With single-channel memory, increased memory
> latency causes a lot more thrashing with SIMD32 when many textures are
> being sampled close together.
>
>
>> SIMD32 can cause write bound tests to thrash, which is visible as a perf
>> regression in the fully write bound tests above (that's also the reason
>> why SIMD32 is good to disable with MRT shaders).
>>
>> As to reads, SIMD32 improves cache locality until it starts thrashing.
>> In the above GfxBench tests, with the amount of texture sampling they do,
>> this shows in HW counters as increased texture cache misses (thrashing),
>> but fewer L3 misses (better locality).  Along with (more important) better
>> latency compensation, these explain why SIMD32 improves performance in
>> them.
>>
>>
>> More advanced heuristics that try to avoid the SIMD32 performance
>> regressions unfortunately also get rid of a clear part of the above
>> improvements.  Such heuristics would need an improved instruction scheduler
>
> Heuristics for things affecting texture fetch latencies would help, like
> how many fetches there are, to how many different textures and how close
> together they are vs. how large the caches are and how fast the RAM is.
>
>
> - Eero
>
>> that provides feedback on which shaders have latency issues where SIMD32
>> would help.
>>
>> (A potential run-time heuristic would be disabling SIMD32 when overly
>> large textures are bound for a draw.)
>>
>>
>>    - Eero
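
For reference, the simple heuristic described above (SIMD32 for everything
that isn't an MRT shader), combined with the GEN8+ restriction suggested
earlier, might be sketched as follows.  The fs_summary struct and the
want_simd32() helper are hypothetical names used only for illustration;
they are not part of the series or of Mesa.

```c
#include <stdbool.h>

/* Hypothetical per-shader summary; not an actual Mesa structure. */
struct fs_summary {
   int gen;                /* hardware generation, e.g. 7, 8 or 9 */
   int num_color_targets;  /* number of render targets the shader writes */
};

/* Simple compile-time heuristic: SIMD32 unless the shader is MRT. */
static bool
want_simd32(const struct fs_summary *fs)
{
   /* Assumption: skip GEN7 and older, following the GEN8+ suggestion. */
   if (fs->gen < 8)
      return false;

   /* MRT shaders are write bound and tend to thrash with SIMD32. */
   if (fs->num_color_targets > 1)
      return false;

   return true;
}
```

This does nothing about the texture-heavy regressions such as FillTexMulti;
those would need the fetch-count and cache-size inputs discussed above.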
>>
>>> Francisco Jerez (34):
>>>   intel/eu: Remove brw_codegen::compressed_stack.
>>>   intel/fs: Rename a local variable so it doesn't shadow component()
>>>   intel/fs: Use the ATTR file for FS inputs
>>>   intel/fs: Replace the CINTERP opcode with a simple MOV
>>>   intel/fs: Add explicit last_rt flag to fb writes orthogonal to eot.
>>>   intel/fs: Fix Gen4-5 FB write AA data payload munging for non-EOT
>>>     writes.
>>>   intel/eu: Return new instruction to caller from brw_fb_WRITE().
>>>   intel/fs: Fix fs_inst::flags_written() for Gen4-5 FB writes.
>>>   intel/fs: Fix implied_mrf_writes() for headerless FB writes.
>>>   intel/fs: Remove program key argument from generator.
>>>   intel/fs: Disable SIMD32 dispatch on Gen4-6 with control flow
>>>   intel/fs: Disable SIMD32 dispatch for fragment shaders with discard.
>>>   intel/eu: Fix pixel interpolator queries for SIMD32.
>>>   intel/fs: Fix codegen of FS_OPCODE_SET_SAMPLE_ID for SIMD32.
>>>   intel/fs: Don't enable dual source blend if no outputs are written
>>>   intel/fs: Fix FB write message control codegen for SIMD32.
>>>   intel/fs: Fix logical FB write lowering for SIMD32
>>>   intel/fs: Fix FB read header setup for SIMD32.
>>>   intel/fs: Rework INTERPOLATE_AT_PER_SLOT_OFFSET
>>>   intel/fs: Mark LINTERP opcode as writing accumulator implicitly on
>>>     pre-Gen7.
>>>   intel/fs: Disable opt_sampler_eot() in 32-wide dispatch.
>>>   i965: Add plumbing for shader time in 32-wide FS dispatch mode.
>>>   intel/fs: Simplify fs_visitor::emit_samplepos_setup
>>>   intel/fs: Use fs_regs instead of brw_regs in the unlit centroid
>>>     workaround
>>>   intel/fs: Wrap FS payload register look-up in a helper function.
>>>   intel/fs: Extend thread payload layout to SIMD32
>>>   intel/fs: Implement 32-wide FS payload setup on Gen6+
>>>   intel/fs: Fix Gen7 compressed source region alignment restriction for
>>>     SIMD32
>>>   intel/fs: Fix sample id setup for SIMD32.
>>>   intel/fs: Generalize the unlit centroid workaround
>>>   intel/fs: Fix Gen6+ interpolation setup for SIMD32
>>>   intel/fs: Fix fs_builder::sample_mask_reg() for 32-wide FS dispatch.
>>>   intel/fs: Fix nir_intrinsic_load_helper_invocation for SIMD32.
>>>   intel/fs: Build 32-wide FS shaders.
>>>
>>> Jason Ekstrand (19):
>>>   intel/fs: Assert that the gen4-6 plane restrictions are followed
>>>   intel/fs: Use groups for SIMD16 LINTERP on gen11+
>>>   intel/fs: FS_OPCODE_REP_FB_WRITE has side effects
>>>   intel/fs: Properly track implied header regs read by FB writes
>>>   intel/fs: Pull FB write implied headers from src[0]
>>>   intel/fs: Set up FB write message headers in the visitor
>>>   i965: Re-arrange shader kernel setup in WM state
>>>   intel/compiler: Add and use helpers for working with KSP indices
>>>   intel/fs: Rework KSP data to be SIMD width-based
>>>   intel/fs: Split instructions low to high in lower_simd_width
>>>   intel/fs: Properly copy default flag reg for 3src instructions
>>>   intel/fs: Add the group to the flag subreg number on SNB and older
>>>   intel/fs: Emit LINE+MAC for LINTERP with unaligned coordinates
>>>   intel/fs: Emit MOV_DISPATCH_TO_FLAGS once for the centroid workaround
>>>   intel/fs: Get rid of MOV_DISPATCH_TO_FLAGS
>>>   intel/fs: Add fields to wm_prog_data for SIMD32 dispatch
>>>   intel/anv,blorp,i965: Implement the SKL 16x MSAA SIMD32 workaround
>>>   intel/fs: Remove support push constants in repclear shaders
>>>   intel/fs: Support SIMD32 repclear shaders
>>>
>>>  src/intel/blorp/blorp.c                       |   2 +-
>>>  src/intel/blorp/blorp_genX_exec.h             |  82 +++-
>>>  src/intel/compiler/brw_compiler.h             |  98 +++-
>>>  src/intel/compiler/brw_eu.h                   |  21 +-
>>>  src/intel/compiler/brw_eu_defines.h           |   2 -
>>>  src/intel/compiler/brw_eu_emit.c              |  39 +-
>>>  src/intel/compiler/brw_fs.cpp                 | 666 ++++++++++++++++----------
>>>  src/intel/compiler/brw_fs.h                   |  53 +-
>>>  src/intel/compiler/brw_fs_builder.h           |   6 +-
>>>  src/intel/compiler/brw_fs_cse.cpp             |   1 -
>>>  src/intel/compiler/brw_fs_generator.cpp       | 318 ++++++------
>>>  src/intel/compiler/brw_fs_nir.cpp             |  57 ++-
>>>  src/intel/compiler/brw_fs_visitor.cpp         | 193 ++++----
>>>  src/intel/compiler/brw_ir_fs.h                |   1 +
>>>  src/intel/compiler/brw_shader.cpp             |  12 +-
>>>  src/intel/compiler/brw_vec4.cpp               |   2 +-
>>>  src/intel/compiler/brw_vec4_gs_visitor.cpp    |   2 +-
>>>  src/intel/compiler/brw_vec4_tcs.cpp           |   2 +-
>>>  src/intel/compiler/brw_wm_iz.cpp              |  11 +-
>>>  src/intel/vulkan/anv_pipeline.c               |   2 +-
>>>  src/intel/vulkan/genX_pipeline.c              |  40 +-
>>>  src/mesa/drivers/dri/i965/brw_context.h       |   1 +
>>>  src/mesa/drivers/dri/i965/brw_program.c       |   6 +
>>>  src/mesa/drivers/dri/i965/brw_wm.c            |   6 +-
>>>  src/mesa/drivers/dri/i965/gen4_blorp_exec.h   |  17 +-
>>>  src/mesa/drivers/dri/i965/genX_state_upload.c | 144 ++++--
>>>  26 files changed, 1101 insertions(+), 683 deletions(-)
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev