Choosing SSE instruction execution domains in mixed contexts

Question

I am playing with a bit of SSE assembly code in which I do not have enough xmm registers to keep all the temporary results and useful constants in registers at the same time.

As a workaround, for some constant vectors that have identical components, I "compress" several vectors into a single xmm register, xmm14 below. I use the pshufd instruction to decompress the constant vector I need. This instruction has a bit of latency, but since it takes a source and a destination register, it is otherwise very convenient:

…
Lfour_15_9:
    .long 4
    .long 1549556828
    .long 909522486
    .long 0
…
    movdqa  Lfour_15_9(%rip), %xmm14
…
    pshufd  $0, %xmm14, %xmm4
    paddd   %xmm4, %xmm3
…
    pshufd  $0b10101010, %xmm14, %xmm5
…
    pshufd  $0b10101010, %xmm14, %xmm5
…
    pshufd  $0b01010101, %xmm14, %xmm5
    xorps   %xmm5, %xmm2    
    movaps  %xmm5, 112(%rax)

The above code is in gas/AT&T syntax and I am targeting Intel processors from Core 2 to Westmere, that offer instructions up to SSSE3.
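
For readers who prefer intrinsics, the packing trick above can be sketched in C. This is only an illustration, not the asker's code: the function names are made up for this example, `_mm_shuffle_epi32` is the intrinsic that normally compiles to pshufd, and the compiler is free to pick different instructions.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set_epi32, _mm_shuffle_epi32 */
#include <stdint.h>

/* Packed constants, lowest lane first: 4, 1549556828, 909522486, 0
   (the values stored at Lfour_15_9 in the listing above).
   _mm_set_epi32 takes its arguments from the highest lane down. */
static __m128i load_packed_constants(void)
{
    return _mm_set_epi32(0, 909522486, 1549556828, 4);
}

/* Broadcast one lane to all four, mirroring
   pshufd $0, pshufd $0b01010101 and pshufd $0b10101010. */
static __m128i broadcast_lane0(__m128i packed) { return _mm_shuffle_epi32(packed, 0x00); }
static __m128i broadcast_lane1(__m128i packed) { return _mm_shuffle_epi32(packed, 0x55); }
static __m128i broadcast_lane2(__m128i packed) { return _mm_shuffle_epi32(packed, 0xAA); }

/* Returns 1 if all four 32-bit lanes of v equal expected. */
static int all_lanes_equal(__m128i v, uint32_t expected)
{
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);
    return lanes[0] == expected && lanes[1] == expected &&
           lanes[2] == expected && lanes[3] == expected;
}

/* Self-check (hypothetical helper): each broadcast yields the constant
   stored in the corresponding lane of the packed register. */
static void check_broadcasts(void)
{
    __m128i packed = load_packed_constants();
    assert(all_lanes_equal(broadcast_lane0(packed), 4u));
    assert(all_lanes_equal(broadcast_lane1(packed), 1549556828u));
    assert(all_lanes_equal(broadcast_lane2(packed), 909522486u));
}
```

Calling check_broadcasts() from a main function is enough to exercise it; one register holds three usable constants at the cost of one shuffle per use.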

One of Agner Fog's manuals points out that for some uses, it may be advantageous to use vector instructions that have the "wrong" type. For instance, it is advantageous to write memcpy with movaps instructions even if the data being moved is not floating-point, because movaps is shorter than movdqa, is available on more processors, and, since it does not compute with the data, none of the usual caveats about subnormals apply. The same advice is given for shuffling words around (sections 13.2 and 13.3 in the manual I linked to earlier).

My case is a bit special because, of the constant vectors I aim to reconstitute, some can, if necessary, be used with only single-precision "type" instructions: these will be involved only in movaps, shufps and xorps computations. Other constant vectors will have to participate in computations that can only be done with integer-type instructions such as paddd (and thus I can use movdqa, pshufd and pxor instructions as necessary to remain in the integer execution domain).

The general version of this question is: considering that I am targeting Intel processors between Core 2 and Westmere, what types of instructions should I use respectively to (re-)load xmm14 from memory, to uncompress it to a register that will only see single-precision computations, to uncompress it to a register that will see some computations that cannot be done with single-precision instructions, and for those operations that can be done with single-precision instructions in the latter case?

The part of the question below this point was answered by harold in a comment.

And a more specific sub-question that's included in the general question: does anyone have an explanation for why, when I randomly replace some integer execution domain instructions by floating-point instructions (e.g. movdqa instructions by movaps instructions), the function can compute wrong results? I expected the only consequence would be execution delays, not wrong results.

For instance, if in the above I change only the pshufd $0, %xmm14, %xmm4 instruction to a shufps one, the computations become completely wrong (xmm4 is the register that is involved in a paddd later). Changing other instructions instead of that one results in other kinds of errors.
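
A likely explanation for this specific symptom, sketched with intrinsics: pshufd reads all four result lanes from its source, whereas the two-operand SSE shufps takes the two low lanes from the old contents of the destination register and only the two high lanes from the source. The function names and the "stale" values below are made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE _mm_shuffle_ps */
#include <stdint.h>

/* pshufd $0, src, dst: every result lane is lane 0 of src. */
static __m128i model_pshufd0(__m128i src)
{
    return _mm_shuffle_epi32(src, 0);
}

/* shufps $0, src, dst: lanes 0-1 are lane 0 of the OLD dst,
   lanes 2-3 are lane 0 of src. The casts only reinterpret bits. */
static __m128i model_shufps0(__m128i old_dst, __m128i src)
{
    __m128 r = _mm_shuffle_ps(_mm_castsi128_ps(old_dst),
                              _mm_castsi128_ps(src), 0);
    return _mm_castps_si128(r);
}

/* Self-check (hypothetical helper) with a made-up stale destination. */
static void check_shuffle_difference(void)
{
    uint32_t r[4];
    __m128i src   = _mm_set_epi32(0, 909522486, 1549556828, 4);
    __m128i stale = _mm_set_epi32(44, 33, 22, 11);  /* whatever xmm4 held before */

    _mm_storeu_si128((__m128i *)r, model_pshufd0(src));
    assert(r[0] == 4 && r[1] == 4 && r[2] == 4 && r[3] == 4);

    _mm_storeu_si128((__m128i *)r, model_shufps0(stale, src));
    assert(r[0] == 11 && r[1] == 11 && r[2] == 4 && r[3] == 4);
}
```

So after the substitution, the low half of xmm4 would hold leftovers from earlier code rather than the constant 4, and the following paddd adds garbage in those lanes.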

Recommended answer

Prefer integer-domain instructions for things like xor. On Intel CPUs, only one execution port can handle FP-domain logicals (XORPS, etc.), but most of the execution units (on SnB to Haswell: p015, but not Haswell's port 6) can handle vector integer logical instructions (PAND/POR/PXOR).
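
As a small sanity check that the choice between the two forms is about ports and latency rather than results, here is a sketch in C intrinsics. `_mm_xor_si128` and `_mm_xor_ps` usually map to pxor and xorps, but a compiler may pick either instruction regardless of which intrinsic is written, so inspect the generated assembly if the domain matters. The function names are made up for this example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE _mm_xor_ps */

/* Integer-domain xor (normally pxor). */
static __m128i xor_int_domain(__m128i a, __m128i b)
{
    return _mm_xor_si128(a, b);
}

/* FP-domain xor (normally xorps); the casts only reinterpret bits. */
static __m128i xor_fp_domain(__m128i a, __m128i b)
{
    return _mm_castps_si128(
        _mm_xor_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

/* Returns 1 if all 128 bits of a and b are identical. */
static int same_bits(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

/* Self-check (hypothetical helper): both forms give the same bits. */
static void check_xor_domains(void)
{
    __m128i a = _mm_set_epi32(0, 909522486, 1549556828, 4);
    __m128i b = _mm_set1_epi32(1549556828);
    assert(same_bits(xor_int_domain(a, b), xor_fp_domain(a, b)));
}
```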

Sometimes it costs an extra 1 cycle of latency if the result of an FP-domain instruction is needed as an input to a vector-int-domain instruction, according to Agner Fog's testing (see his microarchitecture docs). This applies to both AMD and Intel. It only matters if the instruction is on the critical path (the longest dep chain in the loop).

Correctness isn't an issue, except as you found when the non-orthogonality of the instructions trips you up. shufps doesn't do the same thing as pshufd. vpermilps ymm, ymm, imm does do the same thing as pshufd, I think, and seems to only have been introduced so you can combine a shuffle with a load from memory. (Otherwise you could just use the AVX version of shufps with the same register as both sources, and get the same behaviour).
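
The parenthetical remark can be checked without AVX: when shufps is given the same value as both of its sources, it produces the same bits as pshufd with the same immediate. The sketch below uses SSE/SSE2 intrinsics only; the names and the choice of the lane-reversing immediate 0x1B are made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE _mm_shuffle_ps */

#define LANE_REVERSE 0x1B  /* result lanes 0..3 take source lanes 3, 2, 1, 0 */

/* Integer-domain shuffle (normally pshufd). */
static __m128i reverse_with_pshufd(__m128i x)
{
    return _mm_shuffle_epi32(x, LANE_REVERSE);
}

/* FP-domain shuffle (normally shufps) with x as both sources. */
static __m128i reverse_with_shufps(__m128i x)
{
    __m128 f = _mm_castsi128_ps(x);
    return _mm_castps_si128(_mm_shuffle_ps(f, f, LANE_REVERSE));
}

/* Returns 1 if all 128 bits of a and b are identical. */
static int same_bits(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

/* Self-check (hypothetical helper): both shuffles agree bit for bit. */
static void check_same_source_shuffle(void)
{
    __m128i x = _mm_set_epi32(0, 909522486, 1549556828, 4);
    assert(same_bits(reverse_with_pshufd(x), reverse_with_shufps(x)));
}
```

Without AVX, getting the same value into both shufps operands generally means copying it into the destination first (for example with movaps), which is one more instruction than a single pshufd.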

IDK if anyone's thoroughly tested all the cases where there's no extra latency for using the shorter-instruction-encoding ...ps versions of things. The uop cache in SnB and later Intel CPUs makes that less of an issue for inner loops, though. (Instruction decoding is only a bottleneck the first time through a loop.)

edit: except for uop-cacheline boundaries, which can be the bottleneck if your code could otherwise sustain the full 4 uops / cycle. IDK if there are any tools for helping align x86 instructions so the uop cache lines hold multiples of 4 uops.
