[Mesa-dev] [PATCH 09/13] i965/tiled_memcpy: inline movntdqa loads in tiled_to_linear

Scott D Phillips scott.d.phillips at intel.com
Mon Apr 30 23:34:08 UTC 2018


Matt Turner <mattst88 at gmail.com> writes:

> On Mon, Apr 30, 2018 at 10:25 AM, Scott D Phillips
> <scott.d.phillips at intel.com> wrote:
>> The reference for MOVNTDQA says:
>>
>>     For WC memory type, the nontemporal hint may be implemented by
>>     loading a temporary internal buffer with the equivalent of an
>>     aligned cache line without filling this data to the cache.
>>     [...] Subsequent MOVNTDQA reads to unread portions of the WC
>>     cache line will receive data from the temporary internal
>>     buffer if data is available.
>>
>> This hidden cache line sized temporary buffer can improve the
>> read performance from wc maps.
>>
>> v2: Add mfence at start of tiled_to_linear for streaming loads (Chris)
>>
>> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
>
> I think I understand the mechanics of this change. Let me tell you
> what I think is going on and you can confirm my understanding.
>
> We want to inline a movntdqa-using memcpy function into the tiling
> functions for performance reasons. Since movntdqa is an SSE4.1
> instruction, we must split out intel_tiled_memcpy.c into its own
> convenience library so it can be built with -msse4.1.
>
> Because the existing _mesa_streaming_load_memcpy is compiled into its
> own object file, we can't inline it directly (and we might not want to
> use it directly anyway, since it calls _mm_mfence(); we can instead do
> that just once in the calling function). So we implement a simple
> memcpy with movntdqa capable of handling only 16-byte or 64-byte
> copies.
>
> Somewhat oddly, we pass in _mesa_streaming_load_memcpy (which we
> cannot inline) and then actually call our simpler memcpy. It's safe to
> only protect the SSE4.1-using code with #if defined(USE_SSE41) because
> _mesa_streaming_load_memcpy is only passed if cpu_has_sse4_1 is true
> (in the next patch).
>
> Did I miss anything?

You've got it exactly.
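
For readers following the thread, here is a condensed sketch of the
arrangement Matt describes (names and signatures are simplified
stand-ins, not the actual driver code): the caller passes
_mesa_streaming_load_memcpy as an identity token, and the detiling
routine recognizes it and substitutes an always-inline variant so the
compiler can inline the streaming loads into the copy loop.

```c
/* Hedged sketch of the dispatch pattern described above; the function
 * names here are illustrative, not the driver's. The real streaming
 * memcpy bodies use _mm_stream_load_si128; plain memcpy stands in so
 * this sketch builds without -msse4.1. */
#include <stddef.h>
#include <string.h>

typedef void *(*mem_copy_fn)(void *dest, const void *src, size_t n);

/* Stand-in for _mesa_streaming_load_memcpy, which lives in its own
 * object file and therefore cannot be inlined here. */
static void *
streaming_load_memcpy(void *dest, const void *src, size_t n)
{
   return memcpy(dest, src, n);
}

#if defined(USE_SSE41)
/* Stand-in for the simple, inlinable movntdqa memcpy. */
static inline void *
inline_streaming_memcpy(void *dest, const void *src, size_t n)
{
   return memcpy(dest, src, n);
}
#endif

static void
tiled_to_linear_sketch(void *dest, const void *src, size_t n,
                       mem_copy_fn mem_copy)
{
#if defined(USE_SSE41)
   /* The passed-in pointer is only compared, never called on the fast
    * path; the inlinable variant is called instead. */
   if (mem_copy == streaming_load_memcpy) {
      inline_streaming_memcpy(dest, src, n);
      return;
   }
#endif
   mem_copy(dest, src, n);
}
```

This is why guarding only the SSE4.1 body with #if defined(USE_SSE41)
is safe: the comparison can never match unless the caller actually
passed the SSE4.1 function.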

>> ---
>>  src/mesa/drivers/dri/i965/Makefile.am          |  7 +++
>>  src/mesa/drivers/dri/i965/Makefile.sources     |  6 ++-
>>  src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 62 ++++++++++++++++++++++++++
>>  src/mesa/drivers/dri/i965/meson.build          | 18 ++++++--
>>  4 files changed, 88 insertions(+), 5 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/Makefile.am b/src/mesa/drivers/dri/i965/Makefile.am
>> index 889d4c68a2b..ff47add93f4 100644
>> --- a/src/mesa/drivers/dri/i965/Makefile.am
>> +++ b/src/mesa/drivers/dri/i965/Makefile.am
>> @@ -92,8 +92,14 @@ libi965_gen11_la_CFLAGS = $(AM_CFLAGS) -DGEN_VERSIONx10=110
>>
>>  noinst_LTLIBRARIES = \
>>         libi965_dri.la \
>> +       libintel_tiled_memcpy.la \
>>         $(I965_PERGEN_LIBS)
>>
>> +libintel_tiled_memcpy_la_SOURCES = \
>> +       $(intel_tiled_memcpy_FILES)
>> +libintel_tiled_memcpy_la_CFLAGS = \
>> +       $(AM_CFLAGS) $(SSE41_CFLAGS)
>> +
>>  libi965_dri_la_SOURCES = \
>>         $(i965_FILES) \
>>         $(i965_oa_GENERATED_FILES)
>> @@ -104,6 +110,7 @@ libi965_dri_la_LIBADD = \
>>         $(top_builddir)/src/intel/compiler/libintel_compiler.la \
>>         $(top_builddir)/src/intel/blorp/libblorp.la \
>>         $(I965_PERGEN_LIBS) \
>> +       libintel_tiled_memcpy.la \
>>         $(LIBDRM_LIBS)
>>
>>  BUILT_SOURCES = $(i965_oa_GENERATED_FILES)
>> diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
>> index 5e53d874d88..a15bba50eec 100644
>> --- a/src/mesa/drivers/dri/i965/Makefile.sources
>> +++ b/src/mesa/drivers/dri/i965/Makefile.sources
>> @@ -112,11 +112,13 @@ i965_FILES = \
>>         intel_tex_image.c \
>>         intel_tex_obj.h \
>>         intel_tex_validate.c \
>> -       intel_tiled_memcpy.c \
>> -       intel_tiled_memcpy.h \
>>         intel_upload.c \
>>         libdrm_macros.h
>>
>> +intel_tiled_memcpy_FILES = \
>> +       intel_tiled_memcpy.c \
>> +       intel_tiled_memcpy.h
>> +
>>  i965_gen4_FILES = \
>>         genX_blorp_exec.c \
>>         genX_state_upload.c
>> diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>> index 7c6bde990d6..fac5427d2ed 100644
>> --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>> +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>> @@ -36,6 +36,10 @@
>>  #include "brw_context.h"
>>  #include "intel_tiled_memcpy.h"
>>
>> +#if defined(USE_SSE41)
>> +#include "main/streaming-load-memcpy.h"
>> +#include <smmintrin.h>
>> +#endif
>>  #if defined(__SSSE3__)
>>  #include <tmmintrin.h>
>>  #elif defined(__SSE2__)
>> @@ -213,6 +217,31 @@ rgba8_copy_aligned_src(void *dst, const void *src, size_t bytes)
>>     return dst;
>>  }
>>
>> +#if defined(USE_SSE41)
>> +static ALWAYS_INLINE void *
>> +_memcpy_streaming_load(void *dest, const void *src, size_t count)
>> +{
>> +   if (count == 16) {
>
> After inlining and some optimization, the compiler will know for sure
> what count is, right? Curious if it will always be able to optimize
> away one of these branches.

Exactly, we go to some lengths to have the count be constant in the fast
paths, which will make it constant here when it is inlined. The
generated assembly for the inner loop winds up being 4 movntdqa
instructions in a row, just like we want.

>> +      __m128i val = _mm_stream_load_si128((__m128i *)src);
>> +      _mm_store_si128((__m128i *)dest, val);
>> +      return dest;
>> +   } else if (count == 64) {
>> +      __m128i val0 = _mm_stream_load_si128(((__m128i *)src) + 0);
>> +      __m128i val1 = _mm_stream_load_si128(((__m128i *)src) + 1);
>> +      __m128i val2 = _mm_stream_load_si128(((__m128i *)src) + 2);
>> +      __m128i val3 = _mm_stream_load_si128(((__m128i *)src) + 3);
>> +      _mm_store_si128(((__m128i *)dest) + 0, val0);
>> +      _mm_store_si128(((__m128i *)dest) + 1, val1);
>> +      _mm_store_si128(((__m128i *)dest) + 2, val2);
>> +      _mm_store_si128(((__m128i *)dest) + 3, val3);
>> +      return dest;
>> +   } else {
>> +      assert(count < 64); /* and (count < 16) for ytiled */
>> +      return memcpy(dest, src, count);
>
> I would just make this unreachable().

This is another bit of a wart. This case actually is reached: the
mem_copy_align16 functions in the tiling/detiling functions only
guarantee that the source and destination are 16-byte aligned; they
guarantee nothing about the length.

So the mem_copy_align16 function will get called for the remnant on the
right side of the surface, which is 16-byte aligned but not a multiple
of 16 bytes wide.

We could also handle this by doing a streaming load and then a
maskmovdqu masked store. I suppose we know that the extra bytes read by
the 16-byte-aligned load can't cross a page boundary. I just punted,
though, and used the system memcpy.
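
For the curious, a hedged sketch of that alternative (not the patch's
code; masked_copy_remnant is a hypothetical name): build a byte mask
whose high bits select the first count bytes, load the full 16-byte
line, and let maskmovdqu store only the selected bytes. It assumes, as
discussed, that src is 16-byte aligned so reading the full line cannot
fault; the streaming load is replaced by a plain SSE2 load so the
sketch builds without -msse4.1.

```c
/* Hypothetical sketch: masked store of a sub-16-byte remnant using the
 * SSE2 _mm_maskmoveu_si128 (maskmovdqu) intrinsic. Assumes src is
 * 16-byte aligned, so the over-read cannot cross a page boundary, and
 * count < 16. */
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static void *
masked_copy_remnant(void *dest, const void *src, size_t count)
{
   /* maskmovdqu stores a byte iff the corresponding mask byte has its
    * high bit set, leaving the rest of dest untouched. */
   uint8_t mask_bytes[16];
   for (size_t i = 0; i < 16; i++)
      mask_bytes[i] = (i < count) ? 0x80 : 0x00;

   __m128i val  = _mm_load_si128((const __m128i *)src);
   __m128i mask = _mm_loadu_si128((const __m128i *)mask_bytes);

   _mm_maskmoveu_si128(val, mask, (char *)dest);
   return dest;
}
```

Note that maskmovdqu's store carries a non-temporal hint, so on a real
path it would need the same fencing considerations as the other
streaming accesses.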

