Does x64 support imply BMI1 support?

Question

Is it safe to assume that x64 builds can use TZCNT without checking for its support through CPU flags?

Recommended answer

No, certainly not! x86-64 was new in late 2003 (AMD K8), with only the legacy bsf and bsr bit-scan instructions, and none of the rest of BMI1.

The first Intel CPU to support BMI1 was Haswell in 2013. (Also introducing BMI2.)
The first AMD CPU to support BMI1 was Piledriver in 2012.
AMD ABM (Advanced Bit Manipulation) in K10 and later AMD CPUs only added popcnt and lzcnt, not tzcnt.

See Wikipedia, Bit Manipulation Instruction Sets: Supporting CPUs. Note that Celeron/Pentium-branded CPUs don't decode VEX prefixes, so they have AVX and BMI1/BMI2 disabled, because BMI1 and BMI2 each include some VEX-coded instructions such as andn and blsr. This sucks: BMI1/2 are most useful when compilers can use them everywhere throughout an executable, for more efficient variable-count shifts and for peepholes. Still selling new CPUs without BMI1/2 gets us no closer to treating them as a baseline, the way we treat P6 cmov in 32-bit mode.
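Since BMI1 can't be assumed on x86-64, code that needs real BMI1 instructions has to check at run time. Here is a minimal sketch, assuming GCC or clang targeting x86; `__builtin_cpu_supports("bmi")` is the compilers' built-in feature query, and `cpu_has_bmi1` is just a name made up for this example. On other compilers or architectures the sketch simply reports false.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the running CPU reports BMI1.
   Assumes GCC or clang on x86; anything else falls back to false. */
static bool cpu_has_bmi1(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();                       /* harmless if already initialised */
    return __builtin_cpu_supports("bmi") != 0;  /* "bmi" is the name for BMI1 */
#else
    return false;
#endif
}
```

You would typically call something like this once at startup and pick a code path (or a function pointer) based on the result, rather than testing it in a hot loop.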

Since you mention tzcnt specifically, its machine-code encoding is rep bsf so older CPUs will execute it as BSF. This produces the same result as tzcnt if the input is non-zero. i.e. tzcnt "works" on all x86 CPUs (since 386) when the input is non-zero.

But when it is zero, tzcnt would produce the operand-size (e.g. 64), but bsf leaves the destination register unmodified. tzcnt sets FLAGS based on the result, bsf based on the input. AMD documents the dst-unmodified behaviour in their ISA reference manual. Intel only documents it as "undefined value" but implements the same behaviour as AMD, at least in existing CPUs.
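To make the difference concrete, here is a small software model of the two behaviours for a 32-bit operand. It's only a sketch: `tzcnt32_model` and `bsf32_model` are names invented here, they model the result value only (not FLAGS), and the bsf model follows the destination-unmodified behaviour that AMD documents, which Intel officially leaves undefined.

```c
#include <stdint.h>

/* Model of tzcnt on a 32-bit operand: defined for every input,
   and yields the operand size (32) when the input is zero. */
static uint32_t tzcnt32_model(uint32_t src)
{
    uint32_t n = 0;
    if (src == 0)
        return 32;
    while ((src & 1u) == 0) {   /* shift until the lowest set bit reaches bit 0 */
        src >>= 1;
        n++;
    }
    return n;
}

/* Model of bsf as AMD documents it: a zero input leaves the destination
   unchanged. dst_old is whatever the destination register held before. */
static uint32_t bsf32_model(uint32_t dst_old, uint32_t src)
{
    if (src == 0)
        return dst_old;
    return tzcnt32_model(src);  /* identical result for any non-zero input */
}
```

For any non-zero input the two models agree, which is why `rep bsf` running as plain bsf on an old CPU still gives the right answer in that case.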

(This is why bsf / bsr have an output dependency on all CPUs, like add. Unfortunately tzcnt / lzcnt also have a false dependency on Intel Sandybridge-family before Skylake; see "Why does breaking the "output dependency" of LZCNT matter?". popcnt has the same false dependency on SnB-family before Cannon Lake / Ice Lake, because it shares the same execution unit.)

tzcnt is significantly faster on AMD, so compilers tuning for "generic" or AMD CPUs will often use tzcnt instead of bsf without checking for CPU features.

e.g. for GNU C __builtin_ctz. That intrinsic has undefined behaviour for input=0 so it's allowed to just use bsf without checking for 0. And thus also allowed to use tzcnt because the result in that case is not guaranteed by anything.
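If you need a defined result for zero regardless of whether the compiler picks bsf or tzcnt, you have to handle that case yourself. A minimal sketch, assuming GCC or clang; `ctz32_or_32` is a made-up name, and returning 32 is a choice made here to mirror what tzcnt produces for a 32-bit operand.

```c
#include <stdint.h>

/* Count trailing zeros with a defined result for zero input.
   __builtin_ctz(0) is undefined, so zero is handled explicitly. */
static inline unsigned ctz32_or_32(uint32_t x)
{
    return x ? (unsigned)__builtin_ctz(x) : 32u;
}
```

When the compiler is allowed to assume BMI1 (for example with `-mbmi`), it may be able to fold the zero check into a single tzcnt, but that's up to the optimizer.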

Related: Why does TZCNT work for my Sandy Bridge processor?

No such backward / forward compat exists for lzcnt. Having it decode as rep bsr with the meaningless rep prefix ignored would give you 31 - lzcnt(x), the bit-index. https://fgiesen.wordpress.com/2013/10/18/bit-scanning-equivalencies/
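The relationship between the two results can be written out in GNU C. This is a sketch with invented names (`bsr32_index`, `lzcnt32`), using `__builtin_clz` rather than the instructions themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit: what bsr computes for a non-zero input. */
static inline unsigned bsr32_index(uint32_t x)
{
    assert(x != 0);   /* __builtin_clz(0) is undefined */
    return 31u - (unsigned)__builtin_clz(x);
}

/* Leading-zero count: what lzcnt computes, including 32 for zero. */
static inline unsigned lzcnt32(uint32_t x)
{
    return x ? (unsigned)__builtin_clz(x) : 32u;
}
```

For every non-zero x, `bsr32_index(x) + lzcnt32(x)` is 31, so a CPU that silently runs lzcnt as bsr returns the bit index where the leading-zero count was expected.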

One handy trick is ctz( x | 0x80000000 ), because OR is cheap (see footnote 1) and guarantees there's always a non-zero bit for bsf to find. It doesn't change the result for any non-zero x, because the bit it sets is the last one bsf will look at. The same trick works for __builtin_clz(x|1) / bsr, where it's even better because or reg, imm8 is shorter than or reg, imm32.

Footnote 1: or reg, imm32 works for a 32-bit constant. For a 64-bit input, x|(1ULL<<63) can be implemented with bts reg, 63, which is less cheap on some CPUs.
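The OR trick looks like this in GNU C. It's a sketch assuming GCC or clang, and `ctz32_capped` / `clz32_capped` are names made up here. Note the trade-off: a zero input gives 31, which is indistinguishable from an input with only the forced bit set.

```c
#include <stdint.h>

/* ctz with bit 31 forced on: never sees a zero input.
   Non-zero x gives the usual result; x == 0 gives 31. */
static inline unsigned ctz32_capped(uint32_t x)
{
    return (unsigned)__builtin_ctz(x | 0x80000000u);
}

/* clz with bit 0 forced on: never sees a zero input.
   Non-zero x gives the usual result; x == 0 gives 31. */
static inline unsigned clz32_capped(uint32_t x)
{
    return (unsigned)__builtin_clz(x | 1u);
}
```

This is only appropriate when 31 is an acceptable answer for zero; if you need 32, you're back to an explicit check or real tzcnt / lzcnt.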
