Fastest way to compute absolute value using SSE



I am aware of 3 methods, but as far as I know, only the first 2 are generally used:

  1. Mask off the sign bit using andps or andnotps.

    • Pros: One fast instruction if the mask is already in a register, which makes it perfect for doing this many times in a loop.
    • Cons: The mask may not be in a register or worse, not even in a cache, causing a very long memory fetch.
  2. Subtract the value from zero to negate, and then get the max of the original and negated.

    • Pros: Fixed cost because nothing is needed to fetch, like a mask.
    • Cons: Will always be slower than the mask method if conditions are ideal, and we have to wait for the subps to complete before using the maxps instruction.
  3. Similar to option 2, subtract the original value from zero to negate, but then "bitwise and" the result with the original using andps. I ran a test comparing this to method 2, and it seems to behave identically to method 2, aside from when dealing with NaNs, in which case the result will be a different NaN than method 2's result.

    • Pros: Should be slightly faster than method 2 because andps is usually faster than maxps.
    • Cons: Can this result in any unintended behavior when NaNs are involved? Maybe not, because a NaN is still a NaN, even if it's a different value of NaN, right?

Thoughts and opinions are welcome.

Solution

TL;DR: In almost all cases, use pcmpeq/shift to generate a mask, and andps to use it. It has the shortest critical path by far (tied with constant-from-memory), and can't cache-miss.

How to do that with intrinsics

Getting the compiler to emit pcmpeqd on an uninitialized register can be tricky. (godbolt). The best way for gcc / icc looks to be

#include <immintrin.h>   // SSE2 intrinsics (__m128, __m128i)

__m128 abs_mask(void){
  // with clang, this turns into a 16B load,
  // with every calling function getting its own copy of the mask
  __m128i minus1 = _mm_set1_epi32(-1);
  return _mm_castsi128_ps(_mm_srli_epi32(minus1, 1));
}
// MSVC is BAD when inlining this into loops
__m128 vecabs_and(__m128 v) {
  return _mm_and_ps(abs_mask(), v);
}


__m128 sumabs(const __m128 *a) { // quick and dirty no alignment checks
  __m128 sum = vecabs_and(*a);
  for (int i=1 ; i < 10000 ; i++) {
      // gcc, clang, and icc hoist the mask setup out of the loop after inlining
      // MSVC doesn't!
      sum = _mm_add_ps(sum, vecabs_and(a[i])); // one accumulator makes addps latency the bottleneck, not throughput
  }
  return sum;
}

clang 3.5 and later "optimizes" the set1 / shift into loading a constant from memory. It will use pcmpeqd to implement set1_epi32(-1), though. TODO: find a sequence of intrinsics that produces the desired machine code with clang. Loading a constant from memory isn't a performance disaster, but having every function use a different copy of the mask is pretty terrible.

MSVC: VS2013:

  • _mm_uninitialized_si128() is not defined.

  • _mm_cmpeq_epi32(self,self) on an uninitialized variable will emit a movdqa xmm, [ebp-10h] in this test case (i.e. load some uninitialized data from the stack). This has less risk of a cache miss than just loading the final constant from memory. However, Kumputer says MSVC didn't manage to hoist the pcmpeqd / psrld out of the loop (I assume when inlining vecabs), so this is unusable unless you manually inline and hoist the constant out of a loop yourself.

  • Using _mm_srli_epi32(_mm_set1_epi32(-1), 1) results in a movdqa to load a vector of all -1 (hoisted outside the loop), and a psrld inside the loop. So that's completely horrible. If you're going to load a 16B constant, it should be the final vector. Having integer instructions generating the mask every loop iteration is also horrible.

Suggestions for MSVC: Give up on generating the mask on the fly, and just write

const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(~(1<<31)));

Probably you'll just get the mask stored in memory as a 16B constant. Hopefully not duplicated for every function that uses it. Having the mask in a memory constant is more likely to be helpful in 32bit code, where you only have 8 XMM registers, so vecabs can just ANDPS with a memory source operand if it doesn't have a register free to keep a constant lying around.

TODO: find out how to avoid duplicating the constant everywhere it's inlined. Probably using a global constant, rather than an anonymous set1, would be good. But then you need to initialize it, but I'm not sure intrinsics work as initializers for global __m128 variables. You want it to go in the read-only data section, not to have a constructor that runs at program startup.
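
If the goal is a single read-only 16B copy with no startup constructor, one sketch (my own suggestion, not something verified against the compilers discussed here) is to keep intrinsics out of the initializer entirely and fill a union through its integer member; the names are made up:

#include <stdint.h>
#include <immintrin.h>

// The constant initializer makes this ordinary static data (.rodata), with no
// constructor running at startup.  Reading the __m128 member is type-punning,
// but gcc, clang, and MSVC all accept it for a constant like this.
static const union { uint32_t u32[4]; __m128 v; } absmask_global =
    { { 0x7FFFFFFFu, 0x7FFFFFFFu, 0x7FFFFFFFu, 0x7FFFFFFFu } };

static inline __m128 vecabs_global(__m128 x) {
    return _mm_and_ps(x, absmask_global.v);   // ANDPS with the shared constant
}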


Alternatively, use

__m128i minus1;  // undefined
#if _MSC_VER && !__INTEL_COMPILER
minus1 = _mm_setzero_si128();  // PXOR is cheaper than MSVC's silly load from the stack
#endif
minus1 = _mm_cmpeq_epi32(minus1, minus1);  // or use some other variable here, which will probably cost a mov insn without AVX, unless the variable is dead.
const __m128 absmask = _mm_castsi128_ps(_mm_srli_epi32(minus1, 1));

The extra PXOR is quite cheap, but it's still a uop and still 4 bytes on code size. If anyone has any better solution to overcome MSVC's reluctance to emit the code we want, leave a comment or edit. This is no good if inlined into a loop, though, because the pxor/pcmp/psrl will all be inside the loop.

Loading a 32bit constant with movd and broadcasting with shufps might be ok (again, you probably have to manually hoist this out of a loop, though). That's 3 instructions (mov-immediate to a GP reg, movd, shufps), and movd is slow on AMD where the vector unit is shared between two integer cores. (Their version of hyperthreading.)
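
For what it's worth, a possible intrinsics sketch of that movd + shufps idea (abs_mask_movd is my name; whether a given compiler really emits mov-immediate / movd / shufps is up to it):

#include <immintrin.h>

static inline __m128 abs_mask_movd(void) {
    // movd: put the 32bit constant in the low element (upper elements zeroed)
    __m128 m = _mm_castsi128_ps(_mm_cvtsi32_si128(0x7FFFFFFF));
    // shufps: broadcast element 0 to all four elements
    return _mm_shuffle_ps(m, m, 0);
}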


Choosing the best asm sequence

Ok, let's look at this for, let's say, Intel Sandybridge through Skylake, with a bit of mention of Nehalem. See Agner Fog's microarch guides and instruction timings for how I worked this out. I also used Skylake numbers someone linked in a post on the http://realwordtech.com/ forums.


Let's say the vector we want to abs() is in xmm0, and is part of a long dependency chain, as is typical for FP code.

So let's assume any operations that don't depend on xmm0 can begin several cycles before xmm0 is ready. I've tested, and instructions with memory operands don't add extra latency to a dependency chain, assuming the address of the memory operand isn't part of the dep chain (i.e. isn't part of the critical path).


I'm not totally clear on how early a memory operation can start when it's part of a micro-fused uop. As I understand it, the Re-Order Buffer (ROB) works with fused uops, and tracks uops from issue to retirement (168(SnB) to 224(SKL) entries). There's also a scheduler that works in the unfused domain, holding only uops that have their input operands ready but haven't yet executed. uops can issue into the ROB (fused) and scheduler (unfused) at the same time when they're decoded (or loaded from the uop cache). If I'm understanding this correctly, it's 54 to 64 entries in Sandybridge to Broadwell, and 97 in Skylake. There's some unfounded speculation about it not being a unified (ALU/load-store) scheduler anymore.

There's also talk of Skylake handling 6 uops per clock. As I understand it, Skylake will read whole uop-cache lines (up to 6 uops) per clock into a buffer between the uop cache and the ROB. Issue into the ROB/scheduler is still 4-wide. (Even nop is still 4 per clock). This buffer helps where code alignment / uop cache line boundaries cause bottlenecks for previous Sandybridge-microarch designs. I previously thought this "issue queue" was this buffer, but apparently it isn't.

However it works, the scheduler is large enough to get the data from cache ready in time, if the address isn't on the critical path.


1a: mask with a memory operand

ANDPS  xmm0, [mask]  # in the loop

  • bytes: 7 insn, 16 data. (AVX: 8 insn)
  • fused-domain uops: 1 * n
  • latency added to critical path: 1c (assuming L1 cache hit)
  • throughput: 1/c. (Skylake: 2/c) (limited by 2 loads / c)
  • "latency" if xmm0 was ready when this insn issued: ~4c on an L1 cache hit.

1b: mask from a register

movaps   xmm5, [mask]   # outside the loop

ANDPS    xmm0, xmm5     # in a loop
# or PAND   xmm0, xmm5    # higher latency, but more throughput on Nehalem to Broadwell

# or with an inverted mask, if set1_epi32(0x80000000) is useful for something else in your loop:
VANDNPS   xmm0, xmm5, xmm0   # It's the dest that's NOTted, so non-AVX would need an extra movaps

  • bytes: 10 insn + 16 data. (AVX: 12 insn bytes)
  • fused-domain uops: 1 + 1*n
  • latency added to a dep chain: 1c (with the same cache-miss caveat for early in the loop)
  • throughput: 1/c. (Skylake: 3/c)

PAND is throughput 3/c on Nehalem to Broadwell, but latency=3c (if used between two FP-domain operations, and even worse on Nehalem). I guess only port5 has the wiring to forward bitwise ops directly to the other FP execution units (pre Skylake). Pre-Nehalem, and on AMD, bitwise FP ops are treated identically to integer FP ops, so they can run on all ports, but have a forwarding delay.


1c: generate the mask on the fly:

# outside a loop
PCMPEQD  xmm5, xmm5  # set to 0xff...  Recognized as independent of the old value of xmm5, but still takes an execution port (p1/p5).
PSRLD    xmm5, 1     # 0x7fff...  # port0
# or PSLLD xmm5, 31  # 0x8000...  to set up for ANDNPS

ANDPS    xmm0, xmm5  # in the loop.  # port5

  • bytes: 12 (AVX: 13)
  • fused-domain uops: 2 + 1*n (no memory ops)
  • latency added to a dep chain: 1c
  • throughput: 1/c. (Skylake: 3/c)
  • throughput for all 3 uops: 1/c saturating all 3 vector ALU ports
  • "latency" if xmm0 was ready when this sequence issued (no loop): 3c (+1c possible bypass delay on SnB/IvB if ANDPS has to wait for integer data to be ready. Agner Fog says in some cases there's no extra delay for integer->FP-boolean on SnB/IvB.)

This version still takes less memory than versions with a 16B constant in memory. It's also ideal for an infrequently-called function, because there's no load to suffer a cache miss.

The "bypass delay" shouldn't be an issue. If xmm0 is part of a long dependency chain, the mask-generating instructions will execute well ahead of time, so the integer result in xmm5 will have time to reach ANDPS before xmm0 is ready, even if it takes the slow lane.

Haswell has no bypass delay for integer results -> FP boolean, according to Agner Fog's testing. His description for SnB/IvB says this is the case with the outputs of some integer instructions. So even in the "standing start" beginning-of-a-dep-chain case where xmm0 is ready when this instruction sequence issues, it's only 3c on *well, 4c on *Bridge. Latency probably doesn't matter if the execution units are clearing the backlog of uops as fast as they're being issued.

Either way, ANDPS's output will be in the FP domain, and have no bypass delay if used in MULPS or something.

On Nehalem, bypass delays are 2c. So at the start of a dep chain (e.g. after a branch mispredict or I$ miss) on Nehalem, "latency" if xmm0 was ready when this sequence issued is 5c. If you care a lot about Nehalem, and expect this code to be the first thing that runs after frequent branch mispredicts or similar pipeline stalls that leaves the OoOE machinery unable to get started on calculating the mask before xmm0 is ready, then this might not be the best choice for non-loop situations.


2a: AVX max(x, 0-x)

VXORPS  xmm5, xmm5, xmm5   # outside the loop

VSUBPS  xmm1, xmm5, xmm0   # inside the loop
VMAXPS  xmm0, xmm0, xmm1

  • bytes: AVX: 12
  • fused-domain uops: 1 + 2*n (no memory ops)
  • latency added to a dep chain: 6c (Skylake: 8c)
  • throughput: 1 per 2c (two port1 uops). (Skylake: 1/c, assuming MAXPS uses the same two ports as SUBPS.)

Skylake drops the separate vector-FP add unit, and does vector adds in the FMA units on ports 0 and 1. This doubles FP add throughput, at the cost of 1c more latency. The FMA latency is down to 4 (from 5 in *well). x87 FADD is still 3 cycle latency, so there's still a 3-cycle scalar 80bit-FP adder, but only on one port.

2b: same but without AVX:

# inside the loop
XORPS  xmm1, xmm1   # not on the critical path, and doesn't even take an execution unit on SnB and later
SUBPS  xmm1, xmm0
MAXPS  xmm0, xmm1

  • bytes: 9
  • fused-domain uops: 3*n (no memory ops)
  • latency added to a dep chain: 6c (Skylake: 8c)
  • throughput: 1 per 2c (two port1 uops). (Skylake: 1/c)
  • "latency" if xmm0 was ready when this sequence issued (no loop): same

Zeroing a register with a zeroing-idiom that the processor recognizes (like xorps same,same) is handled during register rename on Sandybridge-family microarchitectures, and has zero latency, and throughput of 4/c. (Same as reg->reg moves that IvyBridge and later can eliminate.)

It's not free, though: It still takes a uop in the fused domain, so if your code is only bottlenecked by the 4uop/cycle issue rate, this will slow you down. This is more likely with hyperthreading.
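
As a reference point, a minimal intrinsics version of 2a/2b (vecabs_max is my name; the compiler chooses the VEX or legacy-SSE encoding depending on how you build):

#include <immintrin.h>

static inline __m128 vecabs_max(__m128 v) {
    __m128 neg = _mm_sub_ps(_mm_setzero_ps(), v);   // 0 - x
    return _mm_max_ps(v, neg);                      // max(x, 0 - x)
}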


3: ANDPS(x, 0-x)

VXORPS  xmm5, xmm5, xmm5   # outside the loop.  Without AVX: zero xmm1 inside the loop

VSUBPS  xmm1, xmm5, xmm0   # inside the loop
VANDPS  xmm0, xmm0, xmm1

  • bytes: AVX: 12 non-AVX: 9
  • fused-domain uops: 1 + 2*n (no memory ops). (Without AVX: 3*n)
  • latency added to a dep chain: 4c (Skylake: 5c)
  • throughput: 1/c (saturate p1 and p5). Skylake: 3/2c: (3 vector uops/cycle) / (uop_p01 + uop_p015).
  • "latency" if xmm0 was ready when this sequence issued (no loop): same

This should work, but IDK what happens with NaN, either. Nice observation that ANDPS is lower latency and doesn't require the FPU add port.

This is the smallest size with non-AVX.
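
A minimal intrinsics sketch of method 3 (vecabs_and_neg is my name); for normal inputs, x and 0-x differ only in the sign bit, so the AND clears it and keeps the magnitude:

#include <immintrin.h>

static inline __m128 vecabs_and_neg(__m128 v) {
    __m128 neg = _mm_sub_ps(_mm_setzero_ps(), v);   // 0 - x
    return _mm_and_ps(v, neg);                      // clears the sign bit for non-NaN inputs
}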


4: shift left/right:

PSLLD  xmm0, 1
PSRLD  xmm0, 1

  • bytes: 10 (AVX: 10)
  • fused-domain uops: 2*n
  • latency added to a dep chain: 4c (2c + bypass delays)
  • throughput: 1/2c (saturate p0, also used by FP mul). (Skylake 1/c: doubled vector shift throughput)
  • "latency" if xmm0 was ready when this sequence issued (no loop): same

    This is the smallest (in bytes) with AVX.

    This has possibilities where you can't spare a register, and it isn't used in a loop. (In loop with no regs to spare, prob. use andps xmm0, [mask]).

I assume there's a 1c bypass delay from FP to integer-shift, and then another 1c on the way back, so this is as slow as SUBPS/ANDPS. It does save a no-execution-port uop, so it has advantages if fused-domain uop throughput is an issue, and you can't pull mask-generation out of a loop. (e.g. because this is in a function that's called in a loop, not inlined).
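
A minimal intrinsics sketch of version 4 (vecabs_shift is my name); the float<->integer casts are free, the cost is the two shift uops plus any bypass delays:

#include <immintrin.h>

static inline __m128 vecabs_shift(__m128 v) {
    __m128i t = _mm_slli_epi32(_mm_castps_si128(v), 1);   // PSLLD: shift the sign bit out
    return _mm_castsi128_ps(_mm_srli_epi32(t, 1));        // PSRLD: shift a zero back in
}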


When to use what: Loading the mask from memory makes the code simple, but has the risk of a cache miss. And takes up 16B of ro-data instead of 9 instruction bytes.

  • Needed in a loop: 1c: Generate the mask outside the loop (with pcmp/shift); use a single andps inside. If you can't spare the register, spill it to the stack and 1a: andps xmm0, [rsp + mask_local]. (Generating and storing is less likely to lead to a cache miss than a constant). Only adds 1 cycle to the critical path either way, with 1 single-uop instruction inside the loop. It's a port5 uop, so if your loop saturates the shuffle port and isn't latency-bound, PAND might be better. (SnB/IvB have shuffle units on p1/p5, but Haswell/Broadwell/Skylake can only shuffle on p5. Skylake did increase the throughput for (V)(P)BLENDV, but not other shuffle-port ops. If the AIDA numbers are right, non-AVX BLENDV is 1c lat ~3/c tput, but AVX BLENDV is 2c lat, 1/c tput (still a tput improvement over Haswell))

  • Needed once in a frequently-called non-looping function (so you can't amortize mask generation over multiple uses):

    1. If uop throughput is an issue: 1a: andps xmm0, [mask]. The occasional cache-miss should be amortized over the savings in uops, if that really was the bottleneck.
    2. If latency isn't an issue (the function is only used as part of short non-loop-carried dep chains, e.g. arr[i] = abs(2.0 + arr[i]);), and you want to avoid the constant in memory: 4, because it's only 2 uops. If abs comes at the start or end of a dep chain, there won't be a bypass delay from a load or to a store.
    3. If uop throughput isn't an issue: 1c: generate on the fly with integer pcmpeq / shift. No cache miss possible, and only adds 1c to the critical path.

  • Needed (outside any loops) in an infrequently-called function: Just optimize for size (neither small version uses a constant from memory). non-AVX: 3. AVX: 4. They're not bad, and can't cache-miss. 4 cycle latency is worse for the critical path than you'd get with version 1c, so if you don't think 3 instruction bytes is a big deal, pick 1c. Version 4 is interesting for register pressure situations when performance isn't important, and you'd like to avoid spilling anything.


  • AMD CPUs: There's a bypass delay to/from ANDPS (which by itself has 2c latency), but I think it's still the best choice. It still beats the 5-6 cycle latency of SUBPS. MAXPS is 2c latency. With the high latencies of FP ops on Bulldozer-family CPUs, out-of-order execution is even more likely to be able to generate your mask on the fly in time for it to be ready when the other operand to ANDPS is. I'm guessing Bulldozer through Steamroller don't have a separate FP add unit, and instead do vector adds and multiplies in the FMA unit. 3 will always be a bad choice on AMD Bulldozer-family CPUs. 2 looks better in that case, because of a shorter bypass delay from the fma domain to the fp domain and back. See Agner Fog's microarch guide, pg 182 (15.11 Data delay between different execution domains).

  • Silvermont: Similar latencies to SnB. Still go with 1c for loops, and prob. also for one-time use. Silvermont is out-of-order, so it can get the mask ready ahead of time to still only add 1 cycle to the critical path.
