Fastest way to fill a vector (SSE2) with a certain value. Templates friendly

Question

I have this template class:

template<size_t D>
struct A{
    double v_sse __attribute__ ((vector_size (8*D)));
    A(double val){
        //what here?
    }
};

What's the best way to fill the v_sse field with copies of val? Since I use vectors, I can use gcc SSE2 intrinsics.

Recommended answer

It would be nice if we could write code once, and compile it for wider vectors with just a small tweak, even in cases where auto-vectorization doesn't do the trick.

I got the same result as @hirschhornsalz: massive, inefficient code when instantiating this with vectors bigger than HW-supported vector sizes. e.g. constructing A<8> without AVX512 produces a boatload of 64bit mov and vmovsd instructions. It does one broadcast to a local on the stack, and then reads back all of those values separately, and writes them to the caller's struct-return buffer.

For x86, we can get gcc to emit optimal broadcasts for a function that takes a double arg (in xmm0), and returns a vector (in x/y/zmm0), per standard calling conventions:


  • SSE2: unpcklpd xmm0, xmm0
  • SSE3: movddup xmm0, xmm0
  • AVX: vmovddup xmm0, xmm0 / vinsertf128 ymm0, ymm0, xmm0, 1
    (AVX1 only includes the vbroadcastsd ymm, m64 form, which would presumably get used if this is inlined at a call site where the data is in memory.)
  • AVX2: vbroadcastsd ymm0, xmm0
  • AVX512: vbroadcastsd zmm0, xmm0. (Note that AVX512 can broadcast from memory on the fly:
    VADDPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
    {k1}{z} means it can use a mask register as a merge mask or zero mask for the result.
    m64bcst means a 64-bit memory operand to be broadcast.
    {er} means the MXCSR rounding mode can be overridden for this one instruction.
    I don't know whether gcc will use this broadcast addressing mode to fold broadcast-loads into memory operands.)

However, gcc also understands shuffles, and has __builtin_shuffle for arbitrary vector sizes. With a compile-time constant mask of all-zeros, the shuffle becomes a broadcast, which gcc does using the best instruction for the job.

#include <stdint.h>  // int64_t
typedef int64_t v4di __attribute__ ((vector_size (32)));
typedef double  v4df __attribute__ ((vector_size (32)));
v4df vecinit4(double v) {
    v4df v_sse;
    typeof (v_sse) v_low = {v};
    v4di shufmask = {0};
    v_sse = __builtin_shuffle (v_low, shufmask );
    return v_sse;
}

In template functions, gcc 4.9.2 appears to have a problem recognizing that both vectors are the same width and number of elements, and that the mask is an int vector. It errors even without instantiating the template, so maybe that's why it has a problem with the types. Everything works perfectly if I copy the class and un-template it to a specific vector size.

template<int D> struct A{
    typedef double  dvec __attribute__ ((vector_size (8*D)));
    typedef int64_t ivec __attribute__ ((vector_size (8*D)));
    dvec v_sse;  // typeof(v_sse) is buggy without this typedef, in a template class
    A(double v) {
#ifdef SHUFFLE_BROADCAST  // broken on gcc 4.9.2
    typeof(v_sse)  v_low = {v};
    //int64_t __attribute__ ((vector_size (8*D))) shufmask = {0};
    ivec shufmask = {0, 0};
    v_sse = __builtin_shuffle (v_low, shufmask);  // no idea why this doesn't compile
#else
    typeof (v_sse) zero = {0, 0};
    v_sse = zero + v;  // doesn't optimize away without -ffast-math
#endif
    }
};

/*  doesn't work:
double vec2val  __attribute__ ((vector_size (16))) = {v, v};
double vec4val  __attribute__ ((vector_size (32))) = {v, v, v, v};
v_sse = __builtin_choose_expr (D == 2, vec2val, vec4val);
*/



I managed to get gcc to internal-compiler-error when compiling with -O0. vectors + templates appears to need some work. (At least, it did back in gcc 4.9.2 which Ubuntu is currently shipping. Upstream may have improved.)

The first idea I had, which I left in as a fallback because shuffle doesn't compile, is that gcc implicitly broadcasts when you use an operator with a vector and a scalar. So for example, adding a scalar to a vector of all-zeroes will do the trick.

The problem is that the actual add won't be optimized away unless you use -ffast-math. -funsafe-math-optimizations is unfortunately required, not just -fno-signaling-nans. I tried alternatives to + that can't cause FPU exceptions, such as ^ (xor) and | (or), but gcc won't do those on doubles. The , operator doesn't produce a vector result for scalar , vector.
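One plausible reason the add survives (my reading, not something the answer states) is signed zero: under default IEEE 754 rounding, 0.0 + (-0.0) is +0.0, so adding to a zero vector is not an identity for every input and the compiler may only drop it once signed zeros can be ignored. A minimal sketch of the effect, where `broadcast_by_add` and `sign_preserved` are hypothetical helper names:

```cpp
#include <cmath>  // std::signbit

typedef double v2df __attribute__ ((vector_size (16)));

// Broadcast by adding a scalar to an all-zero vector, as in the fallback above.
v2df broadcast_by_add(double v) {
    v2df zero = {0.0, 0.0};
    return zero + v;   // the scalar is implicitly broadcast, then added
}

// True if element 0 of the result has the same sign bit as the input.
bool sign_preserved(double v) {
    v2df r = broadcast_by_add(v);
    return std::signbit(r[0]) == std::signbit(v);
}
```

Compiled without -ffast-math, `sign_preserved(-0.0)` should come out false, which is the case a plain broadcast would get right and the add does not.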

This can be worked around by specializing the template with straightforward initializer lists. If you can't get a good generic constructor to work, I suggest leaving out the definition so you get a compile error when there isn't a specialization.

#ifndef NO_BROADCAST_SPECIALIZE
// specialized versions with initializer lists to work efficiently even without -ffast-math
// inline keyword prevents an actual definition from being emitted.
template<> inline A<2>::A (double v) {
    typeof (v_sse) val = {v, v};
    v_sse = val;
}
template<> inline A<4>::A (double v) {
    typeof (v_sse) val = {v, v, v, v};
    v_sse = val;
}
template<> inline A<8>::A (double v) {
    typeof (v_sse) val = {v, v, v, v, v, v, v, v};
    v_sse = val;
}
template<> inline A<16>::A (double v) { // AVX1024 or something may exist someday
    typeof (v_sse) val = {v, v, v, v, v, v, v, v, v, v, v, v, v, v, v, v};
    v_sse = val;
}
#endif
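The suggestion to leave out the generic definition could look roughly like the sketch below. `B` and `lane` are hypothetical names used only for illustration (so as not to clash with `A` above). Note that with a constructor that is declared but never defined, using an unspecialized size such as `B<4>` fails at link time with an undefined reference rather than in the compiler proper.

```cpp
// Hypothetical struct B mirrors A from above, but the generic constructor is
// only declared, so unspecialized sizes cannot silently get slow code.
template<int D> struct B {
    typedef double dvec __attribute__ ((vector_size (8*D)));
    dvec v_sse;
    B(double v);   // no generic definition on purpose
};

// The only size supported in this sketch: a plain initializer list.
template<> inline B<2>::B(double v) {
    dvec val = {v, v};
    v_sse = val;
}

// Hypothetical helper that reads one element back for checking.
inline double lane(const B<2>& b, int i) { return b.v_sse[i]; }
```

Adding another width then means adding one more specialization, exactly as in the list above.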

Now, to test the results:

// vecinit4 (from above) included in the asm output too.
// instantiate the templates
A<2> broadcast2(double val) { return A<2>(val); }
A<4> broadcast4(double val) { return A<4>(val); }
A<8> broadcast8(double val) { return A<8>(val); }

Compiler output (assembler directives stripped out):

g++ -DNO_BROADCAST_SPECIALIZE  -O3 -Wall -mavx512f -march=native vec-gcc.cc -S -masm=intel -o-

_Z8vecinit4d:
    vbroadcastsd    ymm0, xmm0
    ret
_Z10broadcast2d:
    vmovddup        xmm1, xmm0
    vxorpd  xmm0, xmm0, xmm0
    vaddpd  xmm0, xmm1, xmm0
    ret
_Z10broadcast4d:
    vbroadcastsd    ymm1, xmm0
    vxorpd  xmm0, xmm0, xmm0
    vaddpd  ymm0, ymm1, ymm0
    ret
_Z10broadcast8d:
    vbroadcastsd    zmm0, xmm0
    vpxorq  zmm1, zmm1, zmm1
    vaddpd  zmm0, zmm0, zmm1
    ret


g++ -O3 -Wall -mavx512f -march=native vec-gcc.cc -S -masm=intel -o-
# or   g++ -ffast-math -DNO_BROADCAST_SPECIALIZE blah blah.

_Z8vecinit4d:
    vbroadcastsd    ymm0, xmm0
    ret
_Z10broadcast2d:
    vmovddup        xmm0, xmm0
    ret
_Z10broadcast4d:
    vbroadcastsd    ymm0, xmm0
    ret
_Z10broadcast8d:
    vbroadcastsd    zmm0, xmm0
    ret

Note that the shuffle method should work fine if you don't template this, but instead only use one vector size in your code. So changing from SSE to AVX is as easy as changing 16 to 32 in one place. But then you'd need to compile the same file multiple times to generate an SSE version and an AVX version which you could dispatch to at runtime. (You might need that anyway, though, to have a 128bit SSE version that didn't use VEX instruction encoding.)
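As a rough sketch of the "change it in one place" idea, the width can be a single macro. `VEC_BYTES`, `vecd`, `veci`, and `broadcast` are names made up for this example, and `__builtin_shuffle` is gcc-specific, so this assumes gcc rather than another compiler:

```cpp
#include <stdint.h>  // int64_t

// The one place to change: 16 for SSE2, 32 for AVX, 64 for AVX512.
// Could also be set per build, e.g. -DVEC_BYTES=32 together with -mavx.
#ifndef VEC_BYTES
#define VEC_BYTES 16
#endif

typedef double  vecd __attribute__ ((vector_size (VEC_BYTES)));
typedef int64_t veci __attribute__ ((vector_size (VEC_BYTES)));

// Non-template shuffle broadcast, same technique as vecinit4 above.
static inline vecd broadcast(double v) {
    vecd low  = {v};   // element 0 = v, remaining elements are zero
    veci mask = {0};   // all indices zero: every lane selects element 0
    return __builtin_shuffle(low, mask);
}
```

Building the same file once per `VEC_BYTES` value would give the separate SSE and AVX versions mentioned above; how to dispatch between them at runtime is left open here.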
