Fastest way to fill a vector (SSE2) with a certain value. Templates friendly

Question

I have this template class:

template<size_t D>
struct A{
    double v_sse __attribute__ ((vector_size (8*D)));
    A(double val){
        //what here?
    }
};

What's the best way to fill the v_sse field with copies of val? Since I use vectors, I can use gcc SSE2 intrinsics.

Answer

It would be nice if we could write code once, and compile it for wider vectors with just a small tweak, even in cases where auto-vectorization doesn't do the trick.

I got the same result as @hirschhornsalz: massive, inefficient code when instantiating this with vectors bigger than the HW-supported vector sizes. E.g. constructing A<8> without AVX512 produces a boatload of 64-bit mov and vmovsd instructions. It does one broadcast to a local on the stack, then reads back all of those values separately and writes them to the caller's struct-return buffer.
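One way to catch this early is a compile-time check against the widest vector the current -m flags enable. The sketch below is not from the original answer: the names kNativeVecBytes and fits_native_vector are hypothetical, and it assumes the compiler predefines __AVX512F__ and __AVX__ to match the target flags (gcc and clang do on x86), falling back to 16 bytes otherwise.

```cpp
// Widest vector register (in bytes) that the current target flags enable.
// Assumption: anything without AVX is treated as 128-bit (SSE2-class).
#if defined(__AVX512F__)
enum { kNativeVecBytes = 64 };
#elif defined(__AVX__)
enum { kNativeVecBytes = 32 };
#else
enum { kNativeVecBytes = 16 };
#endif

// Hypothetical trait: true when D doubles (8 bytes each) fit in one native vector.
template<int D>
struct fits_native_vector {
    static const bool value = (8 * D) <= kNativeVecBytes;
};
```

A static_assert on fits_native_vector<D>::value inside A's constructor would then reject, say, A<8> in a build without AVX512 instead of silently generating the slow code.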

For x86, we can get gcc to emit optimal broadcasts for a function that takes a double arg (in xmm0), and returns a vector (in x/y/zmm0), per standard calling conventions:

  • SSE2: unpcklpd xmm0, xmm0
  • SSE3: movddup xmm0, xmm0
  • AVX: vmovddup xmm0, xmm0 / vinsertf128 ymm0, ymm0, xmm0, 1
    (AVX1 only includes the vbroadcastsd ymm, m64 form, which would presumably get used if this were inlined at a call site with the data in memory.)
  • AVX2: vbroadcastsd ymm0, xmm0
  • AVX512: vbroadcastsd zmm0, xmm0. (Note that AVX512 can broadcast from memory on the fly:
    VADDPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
    {k1}{z} means it can use a mask register as a merge or zero mask into the result.
    m64bcst means a 64-bit memory operand to be broadcast.
    {er} means the MXCSR rounding mode can be overridden for this one instruction.
    IDK if gcc will use this broadcast addressing mode to fold broadcast-loads into memory operands.)
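Since the question mentions SSE2 intrinsics, here is a minimal sketch of the non-generic route for the 128-bit case. _mm_set1_pd and _mm_storeu_pd are real SSE2 intrinsics from <emmintrin.h>; the helper name broadcast2_store and the scalar fallback for non-SSE2 targets are assumptions added for illustration. Which instruction the compiler picks for the broadcast depends on the enabled instruction set.

```cpp
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_set1_pd, _mm_storeu_pd
#endif

// Hypothetical helper: store two copies of v into out[0] and out[1].
inline void broadcast2_store(double v, double out[2]) {
#if defined(__SSE2__)
    __m128d x = _mm_set1_pd(v);  // both lanes = v
    _mm_storeu_pd(out, x);       // unaligned store of both lanes
#else
    // Fallback for targets without SSE2 (assumption, not in the answer).
    out[0] = v;
    out[1] = v;
#endif
}
```

The drawback is that each vector width needs its own intrinsic (_mm256_set1_pd, _mm512_set1_pd), so this does not scale with a template parameter the way the vector-extension approach aims to.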

However, gcc also understands shuffles, and has __builtin_shuffle for arbitrary vector sizes. With a compile-time constant mask of all-zeros, the shuffle becomes a broadcast, which gcc does using the best instruction for the job.

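#include <stdint.h>  // int64_t

typedef int64_t v4di __attribute__ ((vector_size (32)));  // integer mask, same lane count as v4df
typedef double  v4df __attribute__ ((vector_size (32)));
v4df vecinit4(double v) {
    v4df v_sse;
    typeof (v_sse) v_low = {v};   // v in element 0, remaining elements zero
    v4di shufmask = {0};          // every result element selects source element 0
    v_sse = __builtin_shuffle (v_low, shufmask );
    return v_sse;
}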

In template functions, gcc 4.9.2 appears to have a problem recognizing that both vectors are the same width and number of elements, and that the mask is an int vector. It errors even without instantiating the template, so maybe that's why it has a problem with the types. Everything works perfectly if I copy the class and un-template it to a specific vector size.

template<int D> struct A{
    typedef double  dvec __attribute__ ((vector_size (8*D)));
    typedef int64_t ivec __attribute__ ((vector_size (8*D)));
    dvec v_sse;  // typeof(v_sse) is buggy without this typedef, in a template class
    A(double v) {
#ifdef SHUFFLE_BROADCAST  // broken on gcc 4.9.2
    typeof(v_sse)  v_low = {v};
    //int64_t __attribute__ ((vector_size (8*D))) shufmask = {0};
    ivec shufmask = {0, 0};
    v_sse = __builtin_shuffle (v_low, shufmask);  // no idea why this doesn't compile
#else
    typeof (v_sse) zero = {0, 0};
    v_sse = zero + v;  // doesn't optimize away without -ffast-math
#endif
    }
};

/*  doesn't work:
double vec2val  __attribute__ ((vector_size (16))) = {v, v};
double vec4val  __attribute__ ((vector_size (32))) = {v, v, v, v};
v_sse = __builtin_choose_expr (D == 2, vec2val, vec4val);
*/

I managed to get gcc to hit an internal compiler error when compiling with -O0. Vectors + templates appear to need some work. (At least, they did back in gcc 4.9.2, which Ubuntu is currently shipping. Upstream may have improved.)

The first idea I had, which I left in as a fallback because shuffle doesn't compile, is that gcc implicitly broadcasts when you use an operator with a vector and a scalar. So for example, adding a scalar to a vector of all-zeroes will do the trick.

The problem is that the actual add won't be optimized away unless you use -ffast-math. Unfortunately, -funsafe-math-optimizations is required, not just -fno-signaling-nans. I tried alternatives to + that can't cause FPU exceptions, such as ^ (xor) and | (or), but gcc won't do those on doubles. The comma operator (,) doesn't produce a vector result for scalar, vector.
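One likely reason the add has to stay (my reading; the answer does not spell it out) is signed zero: under default IEEE rounding, -0.0 + 0.0 is +0.0, so x + 0.0 is not an identity for every x. The scalar sketch below, with hypothetical function names, shows the difference between adding +0.0 and adding -0.0. It assumes a build without -ffast-math.

```cpp
#include <cmath>  // std::signbit

// Adding +0.0 changes -0.0 into +0.0, so the compiler may not drop the add.
inline double add_plus_zero(double x)  { return x + 0.0; }

// Adding -0.0 returns x unchanged for every non-NaN x, including both zeros.
inline double add_minus_zero(double x) { return x + (-0.0); }
```

So with a zero vector as the other operand, the add is only removable once the compiler is told to ignore the sign of zero, which is consistent with the flags found necessary above.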

This can be worked around by specializing the template with straightforward initializer lists. If you can't get a good generic constructor to work, I suggest leaving out the definition so you get a compile error when there isn't a specialization.

#ifndef NO_BROADCAST_SPECIALIZE
// specialized versions with initializer lists to work efficiently even without -ffast-math
// inline keyword prevents an actual definition from being emitted.
template<> inline A<2>::A (double v) {
    typeof (v_sse) val = {v, v};
    v_sse = val;
}
template<> inline A<4>::A (double v) {
    typeof (v_sse) val = {v, v, v, v};
    v_sse = val;
}
template<> inline A<8>::A (double v) {
    typeof (v_sse) val = {v, v, v, v, v, v, v, v};
    v_sse = val;
}
template<> inline A<16>::A (double v) { // AVX1024 or something may exist someday
    typeof (v_sse) val = {v, v, v, v, v, v, v, v, v, v, v, v, v, v, v, v};
    v_sse = val;
}
#endif

Now, to test the results:

// vecinit4 (from above) included in the asm output too.
// instantiate the templates
A<2> broadcast2(double val) { return A<2>(val); }
A<4> broadcast4(double val) { return A<4>(val); }
A<8> broadcast8(double val) { return A<8>(val); }

Compiler output (assembler directives stripped out):

g++ -DNO_BROADCAST_SPECIALIZE  -O3 -Wall -mavx512f -march=native vec-gcc.cc -S -masm=intel -o-

_Z8vecinit4d:
    vbroadcastsd    ymm0, xmm0
    ret
_Z10broadcast2d:
    vmovddup        xmm1, xmm0
    vxorpd  xmm0, xmm0, xmm0
    vaddpd  xmm0, xmm1, xmm0
    ret
_Z10broadcast4d:
    vbroadcastsd    ymm1, xmm0
    vxorpd  xmm0, xmm0, xmm0
    vaddpd  ymm0, ymm1, ymm0
    ret
_Z10broadcast8d:
    vbroadcastsd    zmm0, xmm0
    vpxorq  zmm1, zmm1, zmm1
    vaddpd  zmm0, zmm0, zmm1
    ret


g++ -O3 -Wall -mavx512f -march=native vec-gcc.cc -S -masm=intel -o-
# or   g++ -ffast-math -DNO_BROADCAST_SPECIALIZE blah blah.

_Z8vecinit4d:
    vbroadcastsd    ymm0, xmm0
    ret
_Z10broadcast2d:
    vmovddup        xmm0, xmm0
    ret
_Z10broadcast4d:
    vbroadcastsd    ymm0, xmm0
    ret
_Z10broadcast8d:
    vbroadcastsd    zmm0, xmm0
    ret

Note that the shuffle method should work fine if you don't template this, but instead only use one vector size in your code. So changing from SSE to AVX is as easy as changing 16 to 32 in one place. But then you'd need to compile the same file multiple times to generate an SSE version and an AVX version which you could dispatch to at runtime. (You might need that anyway, though, to have a 128bit SSE version that didn't use VEX instruction encoding.)
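A minimal sketch of that non-templated layout follows, with the vector width set in one place. The macro VEC_BYTES and the names vdf, vdi, and broadcast_all are hypothetical. __builtin_shuffle is gcc-specific, so the sketch falls back to a lane-by-lane copy on other compilers (an assumption added here, not part of the answer).

```cpp
#include <stdint.h>  // int64_t

#ifndef VEC_BYTES
#define VEC_BYTES 16  // change to 32 for AVX, 64 for AVX512
#endif

typedef double  vdf __attribute__ ((vector_size (VEC_BYTES)));
typedef int64_t vdi __attribute__ ((vector_size (VEC_BYTES)));  // shuffle mask type

static inline vdf broadcast_all(double v) {
    vdf low = {v};  // v in element 0, remaining elements zero
#if defined(__GNUC__) && !defined(__clang__)
    vdi mask = {0};  // all-zero mask: every lane selects element 0
    return __builtin_shuffle (low, mask);
#else
    // Fallback for compilers without __builtin_shuffle.
    for (unsigned i = 1; i < VEC_BYTES / sizeof(double); ++i)
        low[i] = low[0];
    return low;
#endif
}
```

Building the same file once with -DVEC_BYTES=16 and once with -DVEC_BYTES=32 -mavx would give the two versions to dispatch between at runtime.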
