Toggle a Specific Bit

Problem description

So I have seen questions like "toggle a bit at ith position" and "How do you set, clear, and toggle a single bit?", but I was wondering if there was a good way to toggle a bit in the ith position in x86-64 assembly?

I tried writing it in C and looking through the assembly and don't quite understand exactly why there are some things that are there.

C:

#include <stdio.h>

unsigned long toggle(unsigned long num, unsigned long bit)
{
  num ^= 1 << bit;
  return num;
}

int main()
{
  printf("%lu\n", toggle(100, 60));
  return 0;
}

Toggle function assembly from GDB:

<toggle>
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-0x8],rdi
mov QWORD PTR [rbp-0x10],rsi
mov rax, QWORD PTR [rbp-0x10]
mov edx, 0x1
mov ecx, eax
shl edx, cl
mov eax, edx
cdqe
xor QWORD PTR [rbp-0x8],rax
mov rax, QWORD PTR [rbp-0x8]
pop rbp
ret

Can someone walk me through what's going on at the assembly level so I can better understand this and write my own toggle function in x86-64?

Recommended answer

I was wondering if there was a good way to toggle a bit in the ith position in x86-64 assembly?

Yes, x86's BTC (Bit Test and Complement) instruction does exactly that (as well as setting CF to the old value of the bit), and runs efficiently on all modern CPUs.

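  • Intel SnB-family: 1 uop, 1c latency, 2 per clock throughput. (Nehalem and earlier: 1 per clock)
  • Silvermont/KNL: 1 uop, 1c latency, 1 per clock throughput.
  • AMD Ryzen: 2 uops, 2c latency, 2 per clock throughput.
  • AMD Bulldozer-family / Jaguar: 2 uops, 2c latency, 1 per clock throughput.
  • AMD K8/K10: 2 uops, 2c latency, 1 per clock throughput.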

Source: Agner Fog's instruction tables and x86 optimization guide. See also other performance links in the x86 tag wiki.

toggle:
    mov  rax, rdi
    btc  rax, rsi
    ret

(That is what you'd hope a compiler would emit if you'd written toggle correctly in C.)

Don't use btc with a memory operand: the bit-string instructions have insane CISC semantics where the bit-index isn't limited to within the dword selected by the addressing mode. (So btc m,r is 10 uops with one per 5c throughput on Skylake). But with a register operand, the shift-count is masked exactly like variable-count shifts.
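For writing your own toggle without a separate .asm file, one option is GNU C inline assembly. The following is only a sketch: the name toggle_btc is hypothetical, it assumes gcc or clang targeting x86-64 with the default AT&T syntax, and it falls back to plain C elsewhere. The fallback applies & 63 by hand to mirror the index masking that btc does with a register destination.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original answer): toggle one bit
   with btc, register destination. */
uint64_t toggle_btc(uint64_t num, uint64_t bit)
{
#if defined(__GNUC__) && defined(__x86_64__)
    /* AT&T operand order: bit index first, destination last.
       btc also writes CF, so the flags are listed as clobbered. */
    __asm__("btcq %1, %0" : "+r"(num) : "r"(bit) : "cc");
    return num;
#else
    /* Portable fallback: mask the index the way btc reg,reg does. */
    return num ^ ((uint64_t)1 << (bit & 63));
#endif
}
```

With this sketch, toggle_btc(100, 60) flips bit 60, and an index of 67 behaves like an index of 3 because only the low 6 bits of the index are used.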

Unfortunately gcc and clang miss this peephole optimization, even with -march=haswell or -mtune=intel. It's worth using even on AMD, but it's even more efficient on Intel.

On AMD CPUs where btc is slower than xor, it's worth generating the mask in a register and using xor. Or even on Intel CPUs, this is worth it to toggle a bit in memory. (memory-destination xor is much better than memory-destination btc).
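As a rough illustration of the mask-plus-xor approach for a bit in memory, here is a C sketch. The function name and the layout (a bit array stored as 64-bit words) are assumptions for the example, not something from the original answer. The word index is computed explicitly, so the bit index stays inside one qword and a compiler is free to use a memory-destination xor.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: toggle bit `bit_index` of a bit array stored
   as 64-bit words, using a register mask and xor instead of btc m,r. */
void toggle_bit_in_array(uint64_t *words, size_t bit_index)
{
    size_t   word = bit_index / 64;                   /* which qword holds the bit */
    uint64_t mask = (uint64_t)1 << (bit_index % 64);  /* bit position inside that qword */
    words[word] ^= mask;                              /* read-modify-write with xor */
}
```

Calling it twice with the same index restores the original contents, since xor with the same mask is its own inverse.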

For multiple elements in an array, use SSE2 pxor. You can generate the mask with:

pcmpeqd  xmm0, xmm0        ; -1 all bits set
psrlq    xmm0, 63          ;  1 just a single bit set

movd     xmm1, esi
psllq    xmm0, xmm1        ; 1<<bit


; then inside a loop, with data in xmm1
pxor     xmm1, xmm0        ; flip bit in each qword element
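The same sequence can be sketched in C with SSE2 intrinsics. This is an illustrative version under some assumptions: the function name toggle_bit_all is made up, bit is expected to be in the range 0..63, and a scalar loop handles the odd tail element (and everything, when SSE2 is not available at compile time).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: flip bit `bit` (0..63) in each 64-bit element. */
void toggle_bit_all(uint64_t *data, size_t n, unsigned bit)
{
    size_t i = 0;
#if defined(__SSE2__)
    __m128i zero  = _mm_setzero_si128();
    __m128i ones  = _mm_cmpeq_epi32(zero, zero);   /* pcmpeqd: all bits set */
    __m128i one   = _mm_srli_epi64(ones, 63);      /* psrlq: 1 in each qword */
    __m128i count = _mm_cvtsi32_si128((int)bit);   /* movd: shift count */
    __m128i mask  = _mm_sll_epi64(one, count);     /* psllq: 1<<bit per qword */
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        v = _mm_xor_si128(v, mask);                /* pxor: flip the bit */
        _mm_storeu_si128((__m128i *)(data + i), v);
    }
#endif
    /* Scalar tail, or the whole array without SSE2. */
    for (; i < n; i++)
        data[i] ^= (uint64_t)1 << (bit & 63);
}
```

The mask is built once outside the loop, so the loop body is just load, pxor, store.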


don't quite understand exactly why there are some things that are there.

All that crap is there because you compiled without optimization, and because you used a signed int constant.

It's not even worth looking at all the spill/reload to memory from the -O0 code. Compile with -O3 -march=native if you want code that doesn't suck.

See also How to remove "noise" from GCC/clang assembly output?, and Matt Godbolt's CppCon2017 talk: "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid" for a good intro to looking at compiler-generated asm.

Using the signed int constant 1 << bit explains why gcc did a 32-bit shift and then cdqe. num ^= 1 << bit; is equivalent to

int mask = 1;
mask <<= bit;   // still signed int
num ^= mask;    // mask is sign-extended to 64-bit here.
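To see why the sign extension matters, here is a small C sketch with two hypothetical helpers. It assumes a 32-bit int and uses INT_MIN to stand in for a mask whose bit 31 is set, because shifting 1 into the sign bit of an int (or shifting by 32 or more) is undefined behavior in ISO C and so can't be demonstrated directly.

```c
#include <limits.h>
#include <stdint.h>

/* Widening a signed 32-bit mask, as happens in `num ^= mask` when
   mask is an int: the sign bit is copied into the upper 32 bits
   (this is what cdqe / movsx do). */
uint64_t widen_signed(int mask)
{
    return (uint64_t)(int64_t)mask;
}

/* Widening an unsigned 32-bit mask zero-extends instead. */
uint64_t widen_unsigned(unsigned mask)
{
    return mask;
}
```

With a 32-bit int, widen_signed(INT_MIN) is 0xFFFFFFFF80000000 while widen_unsigned(0x80000000u) is 0x0000000080000000, so a signed mask with bit 31 set would flip the upper 32 bits of num as well.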

In gcc -O3 output, we get

    mov     edx, 1
    sal     edx, cl           # 1<<bit   (32-bit)
    movsx   rax, edx          # sign-extend, like cdqe does for eax->rax
    xor     rax, rdi


If we write toggle correctly:

#include <stdint.h>

uint64_t toggle64(uint64_t num, uint32_t bit) {
  num ^= 1ULL << bit;
  return num;
}

(source+asm on the Godbolt compiler explorer)

gcc and clang still miss using btc, but it's not horrible. Interestingly, MSVC does spot the btc peephole, but wastes a MOV instruction:

toggle64 PROC
    mov      eax, edx
    btc      rcx, rax
    mov      rax, rcx
    ret      0

Using uint64_t bit avoids the extra MOV. It's unnecessary because btc with a register destination masks the index with & 63. High garbage is not a problem, but MSVC doesn't know this.

gcc and clang emit code like you'd expect, but with gcc wasting a MOV instruction by generating 1ULL <<bit in rdx and having to copy to rax.

 ; clang output.
    mov     eax, 1
    mov     ecx, esi
    shl     rax, cl
    xor     rax, rdi
    ret
