The Effect of Architecture When Using SSE / AVX Intrinsics


Question

I wonder how a compiler treats intrinsics.

If one uses SSE2 intrinsics (via #include <emmintrin.h>) and compiles with the -mavx flag, what will the compiler generate? Will it generate AVX or SSE code?

If one uses AVX2 intrinsics (via #include <immintrin.h>) and compiles with the -msse2 flag, what will the compiler generate? Will it generate SSE-only code or AVX code?

How do compilers treat intrinsics?
If one uses intrinsics, does it help the compiler understand the dependencies in a loop for better vectorization?

For instance, what's going on here: https://godbolt.org/z/Y4J5OA (or https://godbolt.org/z/LZOJ2K)?
See all 3 panes.

I'm trying to build several versions of the same function for different CPU features (SSE4 and AVX2).
I'm writing the same function twice, once with SSE intrinsics and once with AVX intrinsics.
Let's say they are named MyFunSSE() and MyFunAVX(). Both are in the same file.

How can I make the compiler (the same method should work for MSVC, GCC and ICC) build each of them using only the respective instruction set?

Recommended answer

GCC and clang require that you enable all extensions you use. Otherwise it's a compile-time error, such as:

error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch

Using -march=haswell (or whichever CPU you target) is preferable to enabling specific extensions, because it also sets appropriate tuning options. It also means you won't forget useful extensions such as -mpopcnt, which lets std::bitset::count() inline a popcnt instruction, or BMI2, whose shlx / shrx make all variable-count shifts more efficient (1 uop vs. 3).
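To make the popcnt and shlx / shrx point concrete, here is a minimal sketch. The function names count_bits and shift_left are hypothetical, and the comments describe what GCC and clang typically emit rather than a guarantee. The results are identical with or without -march=haswell; only the generated instructions differ, which you can check on a site such as godbolt.org.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: counts the set bits in a 64-bit value.
// Built with -march=haswell (or -mpopcnt), GCC and clang can typically compile
// this to a single popcnt instruction. Without it they fall back to a
// bit-twiddling sequence or a library call.
std::size_t count_bits(std::uint64_t x) {
    return std::bitset<64>(x).count();
}

// Hypothetical helper: a variable-count shift.
// With BMI2 enabled (it is part of -march=haswell) the compiler may use shlx
// instead of the legacy shift-by-cl form.
std::uint64_t shift_left(std::uint64_t x, unsigned n) {
    return x << (n & 63);  // the mask keeps the shift count in range
}
```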

MSVC and ICC do not require this; they let you use intrinsics to emit instructions that they could not use when auto-vectorizing. Even so, you should definitely enable AVX if you use AVX intrinsics. I think I've read / seen that without it, MSVC won't always use vzeroupper where it should.

For compilers that support GNU extensions (GCC, clang, ICC), you can use something like __attribute__((target("avx"))) on specific functions in a compilation unit. Or better, use __attribute__((target("arch=haswell"))) to also set tuning options. (But that also enables AVX2 and FMA, which you might not want. I'm not sure whether target attributes can set -mtune=xx.)

See https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
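As a rough sketch of how this could look for the question's two functions in one file, assuming GCC or clang on x86: the bodies below (adding 1.0f to eight floats) and the dispatcher MyFun are made up for illustration, and the runtime check uses the GNU builtin __builtin_cpu_supports, which MSVC does not provide.

```cpp
#include <immintrin.h>  // SSE and AVX intrinsics

// SSE2 is baseline on x86-64, so the attribute only matters for 32-bit builds.
__attribute__((target("sse2")))
void MyFunSSE(float* dst, const float* src) {
    const __m128 one = _mm_set1_ps(1.0f);
    _mm_storeu_ps(dst,     _mm_add_ps(_mm_loadu_ps(src),     one));
    _mm_storeu_ps(dst + 4, _mm_add_ps(_mm_loadu_ps(src + 4), one));
}

// Only this function is compiled with AVX enabled; no -mavx is needed
// on the command line.
__attribute__((target("avx")))
void MyFunAVX(float* dst, const float* src) {
    const __m256 one = _mm256_set1_ps(1.0f);
    _mm256_storeu_ps(dst, _mm256_add_ps(_mm256_loadu_ps(src), one));
}

// Hypothetical manual dispatcher: pick a version at run time so the AVX
// code never executes on a CPU that lacks it.
void MyFun(float* dst, const float* src) {
    if (__builtin_cpu_supports("avx")) {
        MyFunAVX(dst, src);
    } else {
        MyFunSSE(dst, src);
    }
}
```

Callers use MyFun(dst, src) and never call the AVX version directly unless they have already checked CPU support.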

__attribute__((target())) will prevent a function from inlining into functions with other target options. So if a function is small enough that you want it inlined, be careful to use the attribute on the functions it will inline into.
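A small sketch of that inlining caveat, assuming GCC or clang on x86; add_one_block and increment_all are hypothetical names. The tiny helper and its caller carry the same target attribute, so the compiler is free to inline the helper into the loop.

```cpp
#include <immintrin.h>
#include <cstddef>

// Tiny helper that we would like to see inlined: adds 1 to eight ints.
__attribute__((target("avx2")))
static inline void add_one_block(int* p) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
    v = _mm256_add_epi32(v, _mm256_set1_epi32(1));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(p), v);
}

// The caller has the same target options as the helper, so inlining is
// allowed. If this attribute were missing (and -mavx2 were not passed on
// the command line), the helper could not be inlined here.
__attribute__((target("avx2")))
void increment_all(int* p, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) add_one_block(p + i);  // full blocks of 8
    for (; i < n; ++i) p[i] += 1;                     // scalar tail
}
```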

See also https://gcc.gnu.org/wiki/FunctionMultiVersioning for using different target options on multiple definitions of the same function name, for compiler-supported runtime dispatching. But I don't think there's a portable (to MSVC) way to do that.
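For illustration, function multiversioning might look roughly like this. It is a sketch that assumes g++ (or a recent clang++) on x86 Linux, where the compiler generates the resolver that picks a version when the program is loaded; sum8 is a hypothetical function that sums eight floats.

```cpp
#include <immintrin.h>

// Fallback version, used when no more specific version matches the CPU.
__attribute__((target("default")))
float sum8(const float* p) {
    float s = 0.0f;
    for (int i = 0; i < 8; ++i) s += p[i];
    return s;
}

// Same name, different target: chosen automatically on CPUs with AVX2.
__attribute__((target("avx2")))
float sum8(const float* p) {
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);       // lanes 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1);     // lanes 4..7
    __m128 s  = _mm_add_ps(lo, hi);              // 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));      // 2 partial sums in lanes 0..1
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  // lane 0 + lane 1
    return _mm_cvtss_f32(s);
}
```

Callers simply write sum8(ptr); no explicit CPU check appears in user code. The two versions add the elements in a different order, so results can differ in the last bits for general floating-point input.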

With MSVC you don't need anything, although as I said, I think it's normally a bad idea to use AVX intrinsics without -arch:AVX, so you might be better off putting those functions in a separate file. But for AVX vs. AVX2 + FMA, or SSE2 vs. SSE4.2, you're fine without anything.

Just #define AVX2_FUNCTION to the empty string instead of __attribute__((target("avx2,fma"))).

For example:

#include <immintrin.h>   // AVX intrinsics (__m256, _mm256_loadu_ps, ...)

#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
// apparently ICC doesn't support target attributes
#define TARGET_HASWELL __attribute__((target("arch=haswell")))
#else
#define TARGET_HASWELL   // empty
 // maybe warn if __AVX__ isn't defined for functions where this is used?
 // if you need to make sure MSVC uses vzeroupper everywhere needed.
#endif

TARGET_HASWELL
void foo_avx(float *__restrict dst, const float *__restrict src) {
    __m256 v = _mm256_loadu_ps(src);   // unaligned load of 8 floats
    ...
    ...
}

With GCC and clang, the macro expands to the __attribute__((target(...))) attribute; with MSVC and ICC it expands to nothing.

https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-optimization-parameter documents a pragma that you would put before AVX functions to make sure vzeroupper is used properly in functions that use _mm256 intrinsics:

#pragma intel optimization_parameter target_arch=AVX

For ICC, you could #define TARGET_AVX as this pragma and always use it on a line by itself before the function, a position where either an __attribute__ or a pragma can go. You might also want separate macros for defining vs. declaring functions, if ICC doesn't accept the pragma on declarations. And you might want a macro that ends a block of AVX functions, if non-AVX functions follow them. (For non-ICC compilers, that one would be empty.)
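One way this could be wired up is sketched below. A #define cannot contain a #pragma directive, so the sketch uses the standard _Pragma operator instead. Whether ICC honors the pragma when it is spelled this way is an assumption and untested here; scale8_avx is a hypothetical function.

```cpp
#include <immintrin.h>

#if defined(__INTEL_COMPILER)
  // ICC: the documented pragma, wrapped in _Pragma so a macro can emit it
  // (assumed to behave like the #pragma form).
  #define TARGET_AVX _Pragma("intel optimization_parameter target_arch=AVX")
#elif defined(__GNUC__)
  // GCC and clang: per-function target attribute.
  #define TARGET_AVX __attribute__((target("avx")))
#else
  // MSVC: nothing is required, but building the file with -arch:AVX is safer.
  #define TARGET_AVX
#endif

// The macro goes on a line by itself before the function definition.
TARGET_AVX
void scale8_avx(float* dst, const float* src, float k) {
    _mm256_storeu_ps(dst, _mm256_mul_ps(_mm256_loadu_ps(src), _mm256_set1_ps(k)));
}
```

Callers still need to confirm at run time that the CPU supports AVX before calling scale8_avx.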
