Choosing SSE instruction execution domains in mixed contexts

Views: 29. Published: 2021/8/27 19:46:28. Tags: assembly, vector, sse

Question

I am playing with a bit of SSE assembly code in which I do not have enough xmm registers to keep all the temporary results and useful constants in registers at the same time.

As a workaround, for some constant vectors that have identical components, I "compress" several vectors into a single xmm register, xmm14 below. I use the pshufd instruction to decompress the constant vector I need. This instruction has a bit of latency, but since it takes a source and a destination register, it is otherwise very convenient:

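…
Lfour_15_9:                             # several constants packed into one 16-byte vector
    .long 4                             # dword 0: the constant 4
    .long 1549556828                    # dword 1: 0x5C5C5C5C
    .long 909522486                     # dword 2: 0x36363636
    .long 0
…
    movdqa  Lfour_15_9(%rip), %xmm14    # load the packed constants
…
    pshufd  $0, %xmm14, %xmm4           # xmm4 = dword 0 in all four lanes
    paddd   %xmm4, %xmm3
…
    pshufd  $0b10101010, %xmm14, %xmm5  # xmm5 = dword 2 in all four lanes
…
    pshufd  $0b10101010, %xmm14, %xmm5
…
    pshufd  $0b01010101, %xmm14, %xmm5  # xmm5 = dword 1 in all four lanes
    xorps   %xmm5, %xmm2
    movaps  %xmm5, 112(%rax)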

The above code is in gas/AT&T syntax, and I am targeting Intel processors from Core 2 to Westmere, which offer instructions up to SSSE3.
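As a rough illustration of the packing scheme, here is a minimal sketch in C with SSE2 intrinsics. This is an assumption made for the sake of a testable example: the question itself uses hand-written assembly, and the helper names (lane, packed_constants, unpack_lane0 and so on) are hypothetical. _mm_shuffle_epi32 is the intrinsic that corresponds to pshufd.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: read one 32-bit lane of an integer vector. */
static uint32_t lane(__m128i v, int i) {
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}

/* Same four dwords as the Lfour_15_9 table in the question.
   _mm_set_epi32 lists lanes from highest (3) down to lowest (0). */
static __m128i packed_constants(void) {
    return _mm_set_epi32(0, 909522486, 1549556828, 4);
}

/* pshufd $0: every lane receives lane 0 of the packed register. */
static __m128i unpack_lane0(__m128i packed) {
    return _mm_shuffle_epi32(packed, 0x00);
}

/* pshufd $0b01010101: every lane receives lane 1. */
static __m128i unpack_lane1(__m128i packed) {
    return _mm_shuffle_epi32(packed, 0x55);
}

/* pshufd $0b10101010: every lane receives lane 2. */
static __m128i unpack_lane2(__m128i packed) {
    return _mm_shuffle_epi32(packed, 0xAA);
}

/* Illustrative self-check of the three unpacking shuffles. */
static void demo_unpack(void) {
    __m128i packed = packed_constants();
    for (int i = 0; i < 4; i++) {
        assert(lane(unpack_lane0(packed), i) == 4u);
        assert(lane(unpack_lane1(packed), i) == 0x5C5C5C5Cu);
        assert(lane(unpack_lane2(packed), i) == 0x36363636u);
    }
}
```

Calling demo_unpack() from a test harness checks that each shuffle broadcasts the expected dword. A compiler is free to pick its own instruction encodings for these intrinsics, so the sketch shows only the data movement, not the execution-domain behavior the question is about.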

One of Agner Fog's manuals points out that for some uses, it may be advantageous to use vector instructions that have the "wrong" type. For instance, it is advantageous to write memcpy with movaps instructions even if the data being moved is not floating-point, because movaps is shorter than movdqa and is available on more processors, and, since it does not compute with the data, none of the usual caveats about subnormals apply. The same advice is given for shuffling words around (sections 13.2 and 13.3 in the manual I linked to earlier).
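The point that a movaps-type move does not interpret the data can be sketched in C with SSE intrinsics. This is a hedged illustration, not code from the manual: copy_aligned_ps and demo_copy are hypothetical names, and the sketch assumes 16-byte-aligned buffers whose size is a multiple of 16.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps / _mm_store_ps (movaps-type moves) */

/* Copy n_bytes between two 16-byte-aligned buffers with float-typed
   aligned loads and stores. Assumes n_bytes is a multiple of 16. */
static void copy_aligned_ps(void *dst, const void *src, size_t n_bytes) {
    const float *s = (const float *)src;
    float *d = (float *)dst;
    for (size_t i = 0; i < n_bytes / 4; i += 4) {
        _mm_store_ps(d + i, _mm_load_ps(s + i));
    }
}

/* Illustrative check with integer data whose bit patterns would be
   NaN, subnormal, or negative zero if read as floats. */
static void demo_copy(void) {
    _Alignas(16) uint32_t src[8] = {0x7F800001u, 0x00000001u, 0xFFFFFFFFu, 0u,
                                    4u, 0x5C5C5C5Cu, 0x36363636u, 0x80000000u};
    _Alignas(16) uint32_t dst[8] = {0};
    copy_aligned_ps(dst, src, sizeof src);
    assert(memcmp(dst, src, sizeof src) == 0);  /* copied bit for bit */
}
```

The moves only transport 128 bits, so the copy is exact whatever the bit patterns would mean as floating-point values.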

My case is a bit special because, of the constant vectors I aim to reconstitute, some can, if necessary, be used with only single-precision "type" instructions: these will be involved only in movaps, shufps and xorps computations. Other constant vectors will have to participate in computations that can only be done with integer-type instructions, such as paddd (and thus I can use movdqa, pshufd and pxor instructions as necessary to remain in the integer execution domain).
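The interchangeability this relies on is that xorps and pxor compute the same 128 bits. A minimal sketch in C with SSE2 intrinsics, under the assumption that the cast intrinsics are acceptable stand-ins for simply using the "other" instruction on the same register (xor_via_ps, xor_via_int, same_bits and demo_xor are hypothetical names):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* xorps-style path: the casts reinterpret the bits, they convert nothing. */
static __m128i xor_via_ps(__m128i a, __m128i b) {
    __m128 r = _mm_xor_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b));
    return _mm_castps_si128(r);
}

/* pxor-style path. */
static __m128i xor_via_int(__m128i a, __m128i b) {
    return _mm_xor_si128(a, b);
}

/* Returns 1 when all 16 bytes of x and y are equal. */
static int same_bits(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}

/* Illustrative check, including lanes that look like NaN or subnormal floats. */
static void demo_xor(void) {
    __m128i a = _mm_set1_epi32(0x36363636);
    __m128i b = _mm_set_epi32(0x7F800001, 0x00000001, -1, 0);
    assert(same_bits(xor_via_ps(a, b), xor_via_int(a, b)));
}
```

This shows only that the results agree bit for bit. It says nothing about any delay from crossing execution domains, which is the part of the question that depends on the target microarchitecture, and a compiler may emit either encoding for either intrinsic.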

The general version of this question is: given that I am targeting Intel processors between Core 2 and Westmere, what types of instructions should I use, respectively, (1) to (re-)load xmm14 from memory, (2) to uncompress it to a register that will only see single-precision computations, (3) to uncompress it to a register that will see some computations that cannot be done with single-precision instructions, and (4) for those operations in the latter case that could be done with single-precision instructions?

The part of the question below this point was answered by harold in a comment.

And a more specific sub-question that is included in the general question: does anyone have an explanation for why, when I randomly replace some integer execution domain instructions with floating-point instructions (e.g. movdqa instructions with movaps instructions), the function can compute wrong results? I expected the only consequence would be execution delays, not wrong results.

For instance, if in the above I change only the pshufd $0, %xmm14, %xmm4 instruction to a shufps one, the computations become completely wrong (xmm4 is the register that is involved in a paddd later). Changing other instructions instead of that one results in other kinds of errors.
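The comment that answered this sub-question is not reproduced on this page. One property that is enough to produce wrong values, offered as a hedged sketch and not necessarily as the whole explanation, is that shufps is not a drop-in replacement for pshufd: pshufd overwrites its destination entirely from the source, while shufps takes its two low lanes from the old contents of the destination and only its two high lanes from the source. The C sketch below models this with SSE2 intrinsics; the helper names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: read one 32-bit lane of an integer vector. */
static uint32_t lane32(__m128i v, int i) {
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}

/* Models pshufd $0, src, dst: dst = {src[0], src[0], src[0], src[0]}. */
static __m128i bcast_pshufd(__m128i src) {
    return _mm_shuffle_epi32(src, 0x00);
}

/* Models shufps $0, src, dst: dst = {dst[0], dst[0], src[0], src[0]}.
   The low two lanes come from whatever dst held before. */
static __m128i bcast_shufps_naive(__m128i old_dst, __m128i src) {
    __m128 r = _mm_shuffle_ps(_mm_castsi128_ps(old_dst),
                              _mm_castsi128_ps(src), 0x00);
    return _mm_castps_si128(r);
}

/* Models movaps src, dst followed by shufps $0, dst, dst:
   both shufps operands now hold src, so all lanes are src[0]. */
static __m128i bcast_shufps_copy(__m128i src) {
    __m128 s = _mm_castsi128_ps(src);
    return _mm_castps_si128(_mm_shuffle_ps(s, s, 0x00));
}

/* Illustrative check; 99 stands for stale contents of the destination. */
static void demo_shuffle(void) {
    __m128i packed = _mm_set_epi32(0, 909522486, 1549556828, 4);
    __m128i stale = _mm_set1_epi32(99);
    __m128i naive = bcast_shufps_naive(stale, packed);
    assert(lane32(naive, 0) == 99u && lane32(naive, 1) == 99u);
    assert(lane32(naive, 2) == 4u && lane32(naive, 3) == 4u);
    for (int i = 0; i < 4; i++) {
        assert(lane32(bcast_pshufd(packed), i) == 4u);
        assert(lane32(bcast_shufps_copy(packed), i) == 4u);
    }
}
```

Under this reading, a shufps-based unpack costs an extra register copy before the shuffle, which is the convenience of pshufd's separate source and destination that the question mentions.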
