Why does TZCNT work for my Sandy Bridge processor?
Question
I'm running a Core i7-3930K, which is based on the Sandy Bridge microarchitecture. When I execute the following code (compiled with MSVC 19 / VS2015), the results surprise me (see the comments):
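#include <bitset>
#include <cstdint>
#include <iostream>
#include <intrin.h>     // __cpuidex (MSVC)
#include <immintrin.h>  // _tzcnt_u64

using namespace std;

int wmain(int argc, wchar_t* argv[])
{
    uint64_t r = 0b1110'0000'0000'0000ULL;
    uint64_t tzcnt = _tzcnt_u64(r);
    cout << tzcnt << endl;            // prints 13
    int info[4]{};
    __cpuidex(info, 7, 0);            // leaf 7, subleaf 0: structured extended feature flags
    int ebx = info[1];
    cout << bitset<32>(ebx) << endl;  // prints 32 zeros (including the BMI1 bit, bit 3)
    return 0;
}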
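For reference, the flag the question is looking at can be decoded as follows. This is a minimal sketch with hypothetical helper names (cpu_has_bmi1, cpu_has_lzcnt); it only interprets register values that were already read, for example with __cpuidex as above, and it assumes the documented CPUID layout: leaf 7 is valid only if leaf 0 reports a maximum basic leaf of at least 7, BMI1 (which covers TZCNT) is EBX bit 3 of leaf 7 subleaf 0, and LZCNT has its own flag in ECX bit 5 of leaf 80000001h.

```cpp
#include <cstdint>

// Hypothetical helpers: decode CPUID register values the caller has already read.

// CPUID leaf 0 returns the highest supported basic leaf in EAX; leaf 7 is only
// meaningful when that value is at least 7.
// CPUID.(EAX=7, ECX=0):EBX bit 3 is the BMI1 flag, which includes TZCNT.
constexpr bool cpu_has_bmi1(std::uint32_t max_basic_leaf, std::uint32_t leaf7_ebx) {
    return max_basic_leaf >= 7 && ((leaf7_ebx >> 3) & 1u) != 0;
}

// LZCNT is reported separately: CPUID.(EAX=80000001h):ECX bit 5
// (AMD documents this bit as ABM). The extended leaf must be supported,
// i.e. CPUID.(EAX=80000000h):EAX must be at least 80000001h.
constexpr bool cpu_has_lzcnt(std::uint32_t max_ext_leaf, std::uint32_t ext1_ecx) {
    return max_ext_leaf >= 0x80000001u && ((ext1_ecx >> 5) & 1u) != 0;
}
```

With the all-zero EBX value the question prints, cpu_has_bmi1 returns false whatever the maximum leaf is, so the CPU is indeed reporting no TZCNT support.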
Disassembly shows that the tzcnt instruction is indeed emitted from the intrinsic:
uint64_t r = 0b1110'0000'0000'0000ULL;
00007FF64B44877F 48 C7 45 08 00 E0 00 00 mov qword ptr [r],0E000h
uint64_t tzcnt = _tzcnt_u64(r);
00007FF64B448787 F3 48 0F BC 45 08 tzcnt rax,qword ptr [r]
00007FF64B44878D 48 89 45 28 mov qword ptr [tzcnt],rax
How come I'm not getting a #UD (invalid opcode) exception? The instruction functions correctly, even though the CPU reports that it does not support it.
Could this be some weird microcode revision that contains an implementation of the instruction but doesn't report support for it (or for the other instructions included in BMI1)?
I haven't checked the rest of the BMI1 instructions, but I'm wondering how common this phenomenon is.
Answer
The reason that Sandy Bridge (and earlier) processors seem to support lzcnt and tzcnt is that both instructions have a backward-compatible encoding:
lzcnt eax,eax = rep bsr eax,eax
tzcnt eax,eax = rep bsf eax,eax
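On older processors the rep prefix is simply ignored, so the tzcnt in the question's disassembly (F3 48 0F BC 45 08) executes as a plain bsf (48 0F BC 45 08).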
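The prefix relationship can be made concrete with a small byte-level sketch. The enum and the classify and as_executed_without_bmi functions are hypothetical illustrations, not a real disassembler: they assume 64-bit mode (so 40h to 4Fh are REX prefixes), recognise only an optional F3 prefix, an optional REX prefix and the opcodes 0F BC / 0F BD, and ignore ModRM and every other prefix.

```cpp
#include <cstddef>
#include <cstdint>

enum class BitScan { None, Bsf, Bsr, Tzcnt, Lzcnt };

// Hypothetical helper: classify [F3] [REX] 0F BC/BD byte sequences (64-bit mode only).
constexpr BitScan classify(const std::uint8_t* p, std::size_t n) {
    std::size_t i = 0;
    bool rep = false;
    if (i < n && p[i] == 0xF3) { rep = true; ++i; }   // REP/REPE prefix
    if (i < n && (p[i] & 0xF0) == 0x40) ++i;          // REX prefix, e.g. 48h = REX.W
    if (i + 1 >= n || p[i] != 0x0F) return BitScan::None;
    switch (p[i + 1]) {
    case 0xBC: return rep ? BitScan::Tzcnt : BitScan::Bsf;
    case 0xBD: return rep ? BitScan::Lzcnt : BitScan::Bsr;
    default:   return BitScan::None;
    }
}

// Models a CPU without BMI1/LZCNT: the F3 prefix is ignored, so TZCNT runs
// as BSF and LZCNT runs as BSR instead of raising #UD.
constexpr BitScan as_executed_without_bmi(BitScan b) {
    return b == BitScan::Tzcnt ? BitScan::Bsf
         : b == BitScan::Lzcnt ? BitScan::Bsr
         : b;
}
```

Feeding it the bytes F3 48 0F BC 45 08 from the question yields Tzcnt, and dropping the leading F3 yields Bsf, which is what Sandy Bridge ends up executing.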
So much for the good news. The bad news is that the two versions have different semantics:
lzcnt eax, zero       => eax = 32,        CF = 1, ZF = 0
bsr   eax, zero       => eax = undefined,         ZF = 1
lzcnt eax, 0xFFFFFFFF => eax = 0,         CF = 0, ZF = 1   // dest = number of leading zeros (counted from the MSB)
bsr   eax, 0xFFFFFFFF => eax = 31,                ZF = 0   // dest = bit index of highest set bit

tzcnt eax, zero       => eax = 32,        CF = 1, ZF = 0
bsf   eax, zero       => eax = undefined,         ZF = 1
tzcnt eax, 0xFFFFFFFF => eax = 0,         CF = 0, ZF = 1   // dest = number of trailing zeros (counted from the LSB)
bsf   eax, 0xFFFFFFFF => eax = 0,                 ZF = 0   // dest = bit index of lowest set bit
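The table above can be turned into a small reference model for 32-bit operands. This is a sketch with hypothetical names (model_tzcnt, model_lzcnt, model_bsf, model_bsr), written as plain loops to mirror the documented results; it is not how the hardware works internally. BSF/BSR leave CF undefined, so their model only tracks the value, whether it is defined, and ZF.

```cpp
#include <cstdint>

// LZCNT/TZCNT: result always defined; CF = (source == 0), ZF = (result == 0).
struct CountResult { std::uint32_t value; bool cf; bool zf; };
// BSF/BSR: ZF = (source == 0); destination undefined for a zero source.
struct ScanResult { std::uint32_t value; bool defined; bool zf; };

constexpr CountResult model_tzcnt(std::uint32_t x) {
    std::uint32_t n = 0;
    while (n < 32 && ((x >> n) & 1u) == 0) ++n;         // zeros from bit 0 upward
    return {n, x == 0, n == 0};
}

constexpr CountResult model_lzcnt(std::uint32_t x) {
    std::uint32_t n = 0;
    while (n < 32 && ((x >> (31 - n)) & 1u) == 0) ++n;  // zeros from bit 31 downward
    return {n, x == 0, n == 0};
}

constexpr ScanResult model_bsf(std::uint32_t x) {
    if (x == 0) return {0, false, true};                 // value is only a placeholder
    return {model_tzcnt(x).value, true, false};          // index of lowest set bit
}

constexpr ScanResult model_bsr(std::uint32_t x) {
    if (x == 0) return {0, false, true};                 // value is only a placeholder
    return {31 - model_lzcnt(x).value, true, false};     // index of highest set bit
}
```

For the question's value 0xE000, model_bsf and model_tzcnt both give 13, which is why the program prints the expected answer on Sandy Bridge; the two only diverge for a zero source (and in the flags).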
At least bsf and tzcnt generate the same output when the source is non-zero. bsr and lzcnt do not: for a non-zero 32-bit source, lzcnt returns 31 minus the bsr result.
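Because a bsr index is always in the range 0..31, converting it to an lzcnt count is a single subtraction, which can also be written as an XOR with 31. This is a sketch with hypothetical helpers; ref_bsr is a plain loop standing in for the instruction and requires a non-zero input. For 64-bit operands, as in the question, the constant would be 63 instead of 31.

```cpp
#include <cstdint>

// Hypothetical reference BSR: index of the highest set bit.
// Precondition: x != 0 (BSR's destination is undefined for zero).
constexpr std::uint32_t ref_bsr(std::uint32_t x) {
    std::uint32_t i = 31;
    while (((x >> i) & 1u) == 0) --i;
    return i;
}

// For i in 0..31, 31 - i == 31 ^ i, so the leading-zero count follows directly.
constexpr std::uint32_t bsr_index_to_lzcnt(std::uint32_t bsr_index) {
    return 31u ^ bsr_index;  // only meaningful when the BSR source was non-zero
}
```

The zero case still needs separate handling, since lzcnt returns 32 there while bsr gives no usable index.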
Also, lzcnt and tzcnt can execute considerably faster than bsr/bsf, notably on AMD processors.
It totally sucks that bsf and tzcnt cannot agree on flag usage (or on the result for a zero source). This needless inconsistency means that I cannot use tzcnt as a drop-in replacement for bsf unless I can be sure its source is non-zero.
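One way to sidestep the problem in C++ is to ask for the tzcnt semantics explicitly instead of emitting the instruction unconditionally. The sketch below assumes a C++20 standard library, where std::countr_zero from <bit> is defined for zero (it returns the bit width), and falls back to a slow loop otherwise; tzcnt_portable is a hypothetical name. Which instruction the compiler actually emits (tzcnt when targeting BMI1, or bsf plus a zero check otherwise) depends on the compiler and target flags.

```cpp
#include <cstdint>
#if defined(__has_include)
#  if __has_include(<bit>)
#    include <bit>
#  endif
#endif

// Hypothetical zero-safe helper with TZCNT semantics: returns 32 for x == 0,
// on any CPU, so it never depends on BSF's undefined zero-source result.
constexpr std::uint32_t tzcnt_portable(std::uint32_t x) {
#if defined(__cpp_lib_bitops)
    // C++20: std::countr_zero is well-defined for 0.
    return static_cast<std::uint32_t>(std::countr_zero(x));
#else
    // Pre-C++20 fallback: correct but not fast.
    std::uint32_t n = 0;
    while (n < 32 && ((x >> n) & 1u) == 0) ++n;
    return n;
#endif
}
```

Unlike calling _tzcnt_u32/_tzcnt_u64 directly, this gives the same answer for a zero input on Sandy Bridge and on BMI1-capable CPUs.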