Toggle a Specific Bit
Question
I have seen questions like "Toggle a bit at ith position" and "How do you set, clear, and toggle a single bit?", but I was wondering if there is a good way to toggle a bit at the ith position in x86-64 assembly.
I tried writing it in C and looking through the assembly, and I don't quite understand why some of the instructions are there.
C:
unsigned long toggle(unsigned long num, unsigned long bit)
{
    num ^= 1 << bit;
    return num;
}

int main()
{
    printf("%ld\n", toggle(100, 60));
    return 0;
}
Toggle function assembly from GDB:
<toggle>
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-0x8],rdi
mov QWORD PTR [rbp-0x10],rsi
mov rax, QWORD PTR [rbp-0x10]
mov edx, 0x1
mov ecx, eax
shl edx, cl
mov eax, edx
cdqe
xor QWORD PTR [rbp-0x8],rax
mov rax, QWORD PTR [rbp-0x8]
pop rbp
ret
Can someone walk me through what's going on at the assembly level so I can better understand this and write my own toggle function in x86-64?
Recommended answer
"I was wondering if there was a good way to toggle a bit in the ith position in x86-64 assembly?"
Yes, x86's BTC (Bit Test and Complement) instruction does exactly that (as well as setting CF to the old value of the bit), and runs efficiently on all modern CPUs.
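- Intel SnB-family: 1 uop, 1c latency, 2 per clock throughput. (Nehalem and earlier: 1 per clock)
- Silvermont/KNL: 1 uop, 1c latency, 1 per clock throughput.
- AMD Ryzen: 2 uops, 2c latency, 2 per clock throughput.
- AMD Bulldozer-family/Jaguar: 2 uops, 2c latency, 1 per clock throughput.
- AMD K8/K10: 2 uops, 2c latency, 1 per clock throughput.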
Source: Agner Fog's instruction tables and x86 optimization guide. See also other performance links in the x86 tag wiki.
toggle:
mov rax, rdi
btc rax, rsi
ret
(This is what you would get if you'd written toggle correctly in C.)
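As a rough C model of what the register form of btc does, the sketch below flips one bit and returns its previous value (the value the instruction leaves in CF). The helper name btc_model is made up for illustration and is not part of the original answer.

```c
#include <stdint.h>

/* Hypothetical helper modelling `btc r64, r64`: flips one bit of *value and
   returns the bit's previous state, which the instruction would leave in CF. */
static inline unsigned btc_model(uint64_t *value, uint64_t index)
{
    uint64_t mask = UINT64_C(1) << (index & 63); /* register form: index is taken mod 64 */
    unsigned old_bit = (*value & mask) != 0;     /* old value of the selected bit */
    *value ^= mask;                              /* complement that bit */
    return old_bit;
}
```

Calling btc_model on a variable holding 100 with index 60 sets bit 60 and returns 0; calling it again with the same index restores 100 and returns 1.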
Don't use btc with a memory operand: the bit-string instructions have insane CISC semantics where the bit index isn't limited to the dword selected by the addressing mode. (So btc m,r is 10 uops, with a throughput of one per 5c on Skylake.) With a register operand, however, the bit index is masked exactly like the count of a variable-count shift.
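To make the memory-operand semantics concrete, here is a minimal C sketch of how btc with a dword memory destination picks its target. It assumes a non-negative bit index (the real instruction treats the index as signed and can also reach below the base address), and the name btc_mem_model is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of `btc dword [base], reg` for non-negative indices:
   the index is NOT reduced mod 32, so it can select a dword far from base. */
static inline unsigned btc_mem_model(uint32_t *base, uint64_t bit_index)
{
    uint32_t *word = base + (bit_index >> 5);        /* which dword: index / 32 */
    uint32_t mask = UINT32_C(1) << (bit_index & 31); /* which bit:   index % 32 */
    unsigned old_bit = (*word & mask) != 0;          /* would end up in CF */
    *word ^= mask;
    return old_bit;
}
```

With a four-element array a, btc_mem_model(a, 70) leaves a[0] untouched and flips bit 6 of a[2], which is the extra address arithmetic that makes the memory form expensive.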
Unfortunately gcc and clang miss this peephole optimization, even with -march=haswell or -mtune=intel. It's worth using even on AMD, but it's even more efficient on Intel.
On AMD CPUs where btc is slower than xor, it's worth generating the mask in a register and using xor. Even on Intel CPUs, this is worth doing to toggle a bit in memory. (Memory-destination xor is much better than memory-destination btc.)
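In C, the mask-then-xor approach for a value in memory might look like the sketch below. The function name toggle_in_memory is an assumption for illustration, and whether a given compiler actually emits a memory-destination xor for it is not guaranteed.

```c
#include <stdint.h>

/* Hypothetical helper: toggle one bit of the qword stored at *p. */
static inline void toggle_in_memory(uint64_t *p, unsigned bit)
{
    uint64_t mask = UINT64_C(1) << (bit & 63); /* build the mask in a register */
    *p ^= mask;                                /* a compiler may emit xor [mem], reg here */
}
```

Masking the index with & 63 keeps the shift well defined in C for any input and mirrors what the register form of btc does.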
For multiple elements in an array, use SSE2 pxor. You can generate the mask with:
pcmpeqd xmm0, xmm0 ; -1 all bits set
psrlq xmm0, 63 ; 1 just a single bit set
movd xmm1, esi
psllq xmm0, xmm1 ; 1<<bit
; then inside a loop, with data in xmm1
pxor xmm1, xmm0 ; flip bit in each qword element
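The same sequence can be written with SSE2 intrinsics in C. This is only a sketch: the function name toggle_bit_in_each, the unaligned loads, the scalar tail loop, and the scalar fallback for non-SSE2 targets are all assumptions added for illustration. Unlike the raw psllq above, it reduces the bit index mod 64 first.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1
#endif

/* Hypothetical helper: flip bit `bit` in every 64-bit element of data[0..n-1]. */
static void toggle_bit_in_each(uint64_t *data, size_t n, unsigned bit)
{
    size_t i = 0;
    bit &= 63; /* assumption: wrap the index like a scalar shift would */
#ifdef HAVE_SSE2_SKETCH
    __m128i zero  = _mm_setzero_si128();
    __m128i mask  = _mm_cmpeq_epi32(zero, zero);  /* pcmpeqd: all bits set */
    __m128i count = _mm_cvtsi32_si128((int)bit);  /* movd: shift count in the low qword */
    mask = _mm_srli_epi64(mask, 63);              /* psrlq: 1 in each qword */
    mask = _mm_sll_epi64(mask, count);            /* psllq: 1<<bit in each qword */
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        v = _mm_xor_si128(v, mask);               /* pxor: flip the bit in both qwords */
        _mm_storeu_si128((__m128i *)(data + i), v);
    }
#endif
    for (; i < n; ++i)                            /* scalar tail (or full fallback) */
        data[i] ^= UINT64_C(1) << bit;
}
```

Calling it twice with the same bit restores the original array, which is a convenient sanity check.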
"don't quite understand exactly why there are some things that are there."
All that crap is there because you compiled without optimization, and because you used a signed int constant.
It's not even worth looking at all the spill/reload to memory from the -O0 code. Compile with -O3 -march=native if you want code that doesn't suck.
See also How to remove "noise" from GCC/clang assembly output?, and Matt Godbolt's CppCon2017 talk: "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid" for a good intro to looking at compiler-generated asm.
Using the signed int constant 1 in 1 << bit explains why gcc did a 32-bit shift followed by cdqe. num ^= 1 << bit; is equivalent to
int mask = 1;
mask <<= bit; // still signed int
num ^= mask; // mask is sign-extended to 64-bit here.
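To see what that means numerically, the sketch below emulates the instruction sequence from the question (32-bit shl, then cdqe, then xor) using only well-defined unsigned arithmetic. In C itself, shifting an int by 32 or more, or shifting a 1 into its sign bit, is undefined behaviour, so this models what the x86 instructions do rather than anything the language promises. The name toggle_as_compiled is hypothetical.

```c
#include <stdint.h>

/* Hypothetical emulation of: mov edx,1 / shl edx,cl / cdqe / xor */
static uint64_t toggle_as_compiled(uint64_t num, uint64_t bit)
{
    uint32_t mask32 = UINT32_C(1) << (bit & 31);  /* shl edx, cl: count taken mod 32 */
    uint64_t mask64 = mask32;
    if (mask32 & UINT32_C(0x80000000))            /* cdqe: copy bit 31 into bits 32..63 */
        mask64 |= UINT64_C(0xFFFFFFFF00000000);
    return num ^ mask64;
}
```

Under this model toggle_as_compiled(100, 60) flips bit 28 (60 mod 32) instead of bit 60, giving 268435556, and a bit index of 31 flips the entire upper half of the result as well.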
In gcc -O3 output, we get
mov edx, 1
sal edx, cl # 1<<bit (32-bit)
movsx rax, edx # sign-extend, like cdqe does for eax->rax
xor rax, rdi
If we write toggle correctly:
uint64_t toggle64(uint64_t num, uint32_t bit) {
    num ^= 1ULL << bit;
    return num;
}
(source+asm on the Godbolt compiler explorer)
gcc and clang still miss using btc, but it's not horrible. Interestingly, MSVC does spot the btc peephole, but wastes a MOV instruction:
toggle64 PROC
mov eax, edx
btc rcx, rax
mov rax, rcx
ret 0
Using uint64_t for bit avoids the extra MOV (the mov eax, edx that zero-extends the index). That MOV is unnecessary anyway, because btc with a register destination masks the index with & 63. High garbage in the index register is not a problem, but MSVC doesn't know this.
gcc and clang emit the code you'd expect, although gcc wastes a MOV instruction by generating 1ULL << bit in rdx and then having to copy it to rax.
; clang output.
mov eax, 1
mov ecx, esi
shl rax, cl
xor rax, rdi
ret