为什么要打破“输出依赖"?LZCNT 的问题? [英] Why does breaking the "output dependency" of LZCNT matter?

查看:34
本文介绍了为什么要打破“输出依赖"?LZCNT 的问题?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在对某些东西进行基准测试时,我测得的吞吐量比我计算的要低得多,我将其范围缩小到 LZCNT 指令(TZCNT 也会发生这种情况),如下面的基准测试所示:

 xor ecx, ecx_benchloop:lzcnt eax, edx添加 ecx, 1jnz _benchloop

还有:

 xor ecx, ecx_benchloop:异或 eax, eax ;这不应该有帮助,但确实有帮助lzcnt eax, edx添加 ecx, 1jnz _benchloop

第二个版本要快得多.不应该.没有理由为什么 LZCNT 应该对其输出有输入依赖性.与 BSR/BSF 不同,xZCNT 指令总是覆盖它们的输出.

我在 4770K 上运行它,所以 LZCNT 和 TZCNT 不会作为 BSR/BSF 执行.

这是怎么回事?

解决方案

这只是您的 Intel Haswell CPU 和几个以前的1 CPU 的微架构中的一个限制.从 Skylake-S(客户端)开始,tzcntlzcnt 已修复该问题,但 popcnt 的问题仍然存在,直到在 Cannon 中修复湖.

在这些微架构上,tzcntlzcntpopcnt 的目标操作数被视为输入依赖项,即使在语义上它不是.现在我怀疑这真的是一个错误":如果这只是一个无意的行为/疏忽,我希望它会在自推出以来发布的几个新微架构之一中得到修复.

很可能是基于以下两个因素之一或两者的设计妥协:

  • popcntlzcnttzcnt 的硬件是 可能与现有的 bsfbsr 指令共享.现在 bsfbsr do 依赖于之前的目标值在实践中2 对于所有位为零输入的特殊情况,因为英特尔芯片在这种情况下未修改目标.因此,对于组合硬件的最简单设计完全有可能导致在同一单元上执行的其他类似指令继承相同的依赖关系.

  • 绝大多数 x86 两个操作数 ALU 指令都依赖于目标操作数,因为它也用作源操作数.三个受影响的指令有点独特,因为它们是一元运算符,但与使用单个操作数的现有一元运算符(如notneg)不同作为源和目标,它们具有不同的源和目标操作数,使它们表面上类似于大多数 2 输入指令.也许重命名器/调度器电路只是没有区分这些具有两个寄存器操作数的一元的特殊情况与绝大多数没有这种依赖性的普通共享源/目标 2 输入指令.

事实上,对于 popcnt 的情况,Intel 已经发布了各种勘误表来涵盖虚假依赖问题,例如 HSD146 用于 Haswell 桌面,SKL029 用于 Skylake,内容如下:

<块引用>

POPCNT 指令的执行时间可能比预期的要长

问题 32 位或 64 位操作数的 POPCNT 指令执行可能是延迟到之前的非依赖指令执行完毕.

含义使用 POPCNT 指令的软件可能会遇到低于预期的性能.

解决方法未确定

我总是发现这个勘误不寻常,因为它并没有真正识别任何类型的功能缺陷或不符合规范,而基本上所有其他勘误都是这种情况.英特尔并没有真正记录 OoO 执行引擎的特定性能模型,而且还有大量其他性能问题".这些年来出现和消失的问题(许多问题的影响比这个非常小的问题要大得多),但在勘误表中没有记录.尽管如此,这也许提供了一些证据,证明它可以被视为一个错误.奇怪的是,勘误从未扩展到包括 tzcntlzcnt,它们在引入时也有同样的问题.


1 那么tzcntlzcnt只出现在Haswell中,但是popcnt也存在这个问题是在 Nehalem 中引入的 - 但错误的依赖问题也许只存在桑迪Bridge 或更高版本.

2 在实践中,虽然在 ISA 文档中没有记录,因为在英特尔手册中未定义全零输入的结果.然而,在这种情况下,大多数或所有英特尔芯片都实现了保持目标寄存器不变的行为.
AMD确实文档和保证 bsfbsr 的行为.

(但不幸的是,这些指令比 tzcnt/lzcnt在 AMD 上(额外的 uops,请参阅 https://uops.info/),而不是利用该 bsf 行为,对于 AMD CPU 来说,使用 rep bsf 通常会更好,因此它会在知道该指令的 CPU 上解码为 tzcnt,并且 test/cmov 如果你有足够的空闲寄存器.但是即使对于非零输入,bsr 也会给 lzcnt 提供不同的结果,因此您可以考虑利用它.)

While benchmarking something I measured a much lower throughput than I had calculated, which I narrowed down to the LZCNT instruction (it also happens with TZCNT), as demonstrated in the following benchmarks:

  xor ecx, ecx
_benchloop:
  lzcnt eax, edx
  add ecx, 1
  jnz _benchloop

And:

  xor ecx, ecx
_benchloop:
  xor eax, eax  ; this shouldn't help, but it does
  lzcnt eax, edx
  add ecx, 1
  jnz _benchloop

The second version is much faster. It shouldn't be. There's no reason why LZCNT should have an input dependency on its output. Unlike BSR/BSF, the xZCNT instructions always overwrite their output.

I'm running this on a 4770K, so LZCNT and TZCNT aren't being executed as BSR/BSF.

What's going on here?

解决方案

This is simply a limitation in the micro-architecture of your Intel Haswell CPU and several previous1 CPUs. It has been fixed for tzcnt and lzcnt as of Skylake-S (client), but the issue remained for popcnt until it was fixed in Cannon Lake.

On those micro-architectures the destination operand for tzcnt, lzcnt and popcnt is treated as an input dependency even though, semantically, it is not. Now I doubt this is a really a "bug": if it was simply an unintended behavior/oversight, I expect it would have been fixed in one of the several new micro-architectures that have been released since it was introduced.

Most likely it is a design compromise based on one or both of the following two factors:

  • The hardware for popcnt, lzcnt and tzcnt is likely all shared with the existing bsf and bsr instructions. Now bsf and bsr do have a dependency on the previous destination value in practice2 for the special case of all-bits-zero input, since Intel chips leave the destination unmodified in that case. So it is entirely possible that the simplest design for the combined hardware resulted in the other similar instructions executing on the same unit inheriting the same dependency.

  • The vast majority of x86 two operand ALU instructions have an dependency on the destination operand, since it is used as a source as well. The three affected instructions are somewhat unique in that they are unary operators, but unlike existing unary operators like not and neg which have a single operand used as source and destination, they have distinct source and destination operands, making them superficially similar to most 2-input instructions. Perhaps the renamer/scheduler circuitry just doesn't distinguish the special case of these unary-with-two-register-operand versus the vast majority of plain shared source/destination 2-input instructions which don't have this dependency.

In fact, for the case of popcnt Intel has issued various errata covering the false dependency issue such as HSD146 for Haswell Desktop and SKL029 for Skylake, which reads:

POPCNT Instruction May Take Longer to Execute Than Expected

Problem POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non -dependent instructions have executed.

Implication Software using the POPCNT instruction may experience lower performance than expected.

Workaround None identified

I always found this erratum unusual since it isn't really identifying any type of functional defect or non-conformance to specification which is the case for essentially all the other errata. Intel doesn't really document a specific performance model for the OoO execution engine and there are a ton of other performance "gotchas" that have appeared and disappeared over the years (many with a much larger impact than this very minor issue) that don't get documented in errata. Still, this perhaps provides some evidence that it can be considered a bug. Oddly, the erratum was never extended to include tzcnt or lzcnt which had the same issue when they were introduced.


1 Well tzcnt and lzcnt only appeared in Haswell, but the problem exists for popcnt as well which was introduced in Nehalem - but the false dependency problem perhaps only exists for Sandy Bridge or later.

2 In practice, although not documented in the ISA docs, since the result for all-zero input was undefined in the Intel manuals. Most or all Intel chips implemented the behavior as leaving the destination register unchanged in this case, however.
AMD does document and guarantee that behaviour for bsf and bsr.

(But unfortunately those instructions are slower than tzcnt/lzcnt on AMD (extra uops, see https://uops.info/), so instead of taking advantage of that bsf behaviour, it would often be better for AMD CPUs to use rep bsf so it will decode as tzcnt on CPUs that know about that instruction, and test/cmov if you have enough free registers. But bsr gives different results to lzcnt even for non-zero input, so you might consider taking advantage of it.)

这篇关于为什么要打破“输出依赖"?LZCNT 的问题?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆