与内在和装配嵌入式广播 [英] Embedded broadcasts with intrinsics and assembly

查看:320
本文介绍了与内在和装配嵌入式广播的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在2.5.3节的英特尔架构指令集扩展编程参考比我们学习
AVX512(和骑士角)的

In section 2.5.3 "Broadcasts" of the Intel Architecture Instruction Set Extensions Programming Reference the we learn than AVX512 (and Knights Corner) has

位字段为某些负载运算指令广播连接code的数据,即指令
  从内存中加载数据,并进行一些计算
  或数据移动操作。

a bit-field to encode data broadcast for some load-op instructions, i.e. instructions that load data from memory and perform some computational or data movement operation.

例如采用Intel汇编语法我们可以播放在用 zmm2 RAX ,然后在地址标code>并把结果写入 zmm1

For example using Intel assembly syntax we can broadcast the scalar at the address stored in rax and then multiplying with the 16 floats in zmm2 and write the result to zmm1 like this

vmulps zmm1, zmm2, [rax] {1to16}

然而,没有内在其可以做到这一点。因此,内联函数,编译器应该能够折叠

However, there are no intrinsics which can do this. Therefore, with intrinsics the compiler should be able to fold

__m512 bb = _mm512_set1_ps(b);
__m512 ab = _mm512_mul_ps(a,bb);

单个指令

vmulps zmm1, zmm2, [rax] {1to16}

但我还没有看到GCC这样做。我发现这个一个 GCC bug报告。

but I have not observed GCC doing this. I found a GCC bug report about this.

我已经观察到FMA与GCC类似的东西。例如GCC 4.9不会崩溃 _mm256_add_ps(_mm256_mul_ps(areg0,breg0) -Ofast 一个FMA指令。然而,GCC 5.1确实现在坍塌到一个FMA,至少有内部函数与FMA如 _mm256_fmadd_ps 来做到这一点。但没有如 _mm512_mulbroad_ps(矢量,标量)内在。

I have observed something similar with FMA with GCC. e.g. GCC 4.9 will not collapse _mm256_add_ps(_mm256_mul_ps(areg0,breg0) to a single fma instruction with -Ofast. However, GCC 5.1 does collapse it to a single fma now. At least there are intrinsics to do this with FMA e.g. _mm256_fmadd_ps. But there is no e.g. _mm512_mulbroad_ps(vector,scalar) intrinsic.

GCC可能在某个时候解决这个问题但在那之前装配是唯一的解决方案。

GCC may fix this at some point but until then assembly is the only solution.

所以我的问题是如何在GCC内联汇编这样做呢?

So my question is how to do this with inline assembly in GCC?

我想我可能会拿出正确的语法(但我不知道)对上面的例子中GCC内联汇编。

I think I may have come up with the correct syntax (but I am not sure) for GCC inline assembly for the example above.

"vmulps        (%%rax)%{1to16}, %%zmm1, %%zmm2\n\t"

我真的在寻找这样

I am really looking for a function like this

static inline __m512 mul_broad(__m512 a, float b) {
    return a*b;
}

其中,如果 B 在记忆点到 RAX 它产生

vmulps        (%rax){1to16}, %zmm0, %zmm0
ret

如果 B 将xmm1 它产生

vbroadcastss    %xmm1, %zmm1
vmulps          %zmm1, %zmm0, %zmm0
ret


GCC将已经与内在做 vbroadcastss - 从登记的情况下,但如果 B 是在内存中,从这个内存编译为 vbroadcastss


GCC will already do the vbroadcastss-from-register case with intrinsics, but if b is in memory, compiles this to a vbroadcastss from memory.

__m512 mul_broad(__m512 a, float b) {       
    __m512 bb = _mm512_set1_ps(b);
    __m512 ab = _mm512_mul_ps(a,bb);
    return ab;
}

铛会使用,如果广播内存操作数 b 是在内存中。

clang will use a broadcast memory operand if b is in memory.

推荐答案

正如彼得科德斯指出GCC不会让您指定不同的约束替代品不同的模板。所以不是我解决方案汇编根据所选的操作数选择正确的指令。

As Peter Cordes notes GCC doesn't let you specify a different template for different constraint alternatives. So instead my solution has the assembler choose the correct instruction according to the operands chosen.

我没有一个版本支持ZMM寄存器GCC的,所以下面的例子中使用XMM寄存器和几个不存在的指令来演示如何可以实现你在找什么。

I don't have a version of GCC that supports the ZMM registers, so this following example uses XMM registers and a couple of nonexistent instructions to demonstrate how you can achieve what you're looking for.

typedef __attribute__((vector_size(16))) float v4sf;

v4sf
foo(v4sf a, float b) {
    v4sf ret;
    asm(".ifndef isxmm\n\t"
        ".altmacro\n\t"
        ".macro ifxmm operand, rnum\n\t"
        ".ifc \"\\operand\",\"%%xmm\\rnum\"\n\t"
        ".set isxmm, 1\n\t"
        ".endif\n\t"
        ".endm\n\t"
        ".endif\n\t"
        ".set isxmm, 0\n\t"
        ".set regnum, 0\n\t"
        ".rept 8\n\t"
        "ifxmm <%2>, %%regnum\n\t"
        ".set regnum, regnum + 1\n\t"
        ".endr\n\t"
        ".if isxmm\n\t"
        "alt-1 %1, %2, %0\n\t"
        ".else\n\t"
        "alt-2 %1, %2, %0\n\t"
        ".endif\n\t"
        : "=x,x" (ret)
        : "x,x" (a), "x,m" (b));
    return ret;
}


v4sf
bar(v4sf a, v4sf b) {
    return foo(a, b[0]);
}

这个例子应该的gcc -m32 -msse -O3 进行编译,应该产生类似以下两个汇编程序错误信息:

This example should be compiled with gcc -m32 -msse -O3 and should generate two assembler error messages similar to the following:

t103.c: Assembler messages:
t103.c:24: Error: no such instruction: `alt-2 %xmm0,4(%esp),%xmm0'
t103.c:22: Error: no such instruction: `alt-1 %xmm0,%xmm1,%xmm0'

这里的基本思路是汇编程序检查,看是否有第二个操作数(%2 )是XMM寄存器或别的东西,presumably的内存位置。由于GNU汇编器不支持多在字符串的操作方式,第二个操作数是在一个 .rept 循环时间相比,每一个可能的XMM寄存器之一。在 isxmm 宏用来粘贴%XMM 和寄存器数量在一起。

The basic idea here is the assembler checks to see whether the second operand (%2) is an XMM register or something else, presumably a memory location. Since the GNU assembler doesn't support much in the way of operations on strings, the second operand is compared to every possible XMM register one at a time in a .rept loop. The isxmm macro is used to paste %xmm and a register number together.

有关您的具体问题,你可能需要重写它是这样的:

For your specific problem you'd probably need to rewrite it something like this:

__m512
mul_broad(__m512 a, float b) {
    __m512 ret;
    __m512 dummy;
    asm(".ifndef isxmm\n\t"
        ".altmacro\n\t"
        ".macro ifxmm operand, rnum\n\t"
        ".ifc \"\\operand\",\"%%zmm\\rnum\"\n\t"
        ".set isxmm, 1\n\t"
        ".endif\n\t"
        ".endm\n\t"
        ".endif\n\t"
        ".set isxmm, 0\n\t"
        ".set regnum, 0\n\t"
        ".rept 32\n\t"
        "ifxmm <%[b]>, %%regnum\n\t"
        ".set regnum, regnum + 1\n\t"
        ".endr\n\t"
        ".if isxmm\n\t"
        "vbroadcastss %x[b], %[b]\n\t"
        "vmulps %[a], %[b], %[ret]\n\t"
        ".else\n\t"
        "vmulps %[b] %{1to16%}, %[a], %[ret]\n\t"
        "# dummy = %[dummy]\n\t"
        ".endif\n\t"
        : [ret] "=x,x" (ret), [dummy] "=xm,x" (dummy)
        : [a] "x,xm" (a), [b] "m,[dummy]" (b));
    return ret;
}

这篇关于与内在和装配嵌入式广播的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆