在一个位置或更低位置计算设置位的有效方法是什么? [英] What is the efficient way to count set bits at a position or lower?

查看:24
本文介绍了在一个位置或更低位置计算设置位的有效方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定 std::bitset<64>bits 具有任意数量的位设置和位位置 X (0-63)

Given std::bitset<64> bits with any number of bits set and a bit position X (0-63)

在 X 位置或更低位置计数或在 X 位置未设置时返回 0 的最有效方法是什么

What is the most efficient way to count bits at position X or lower or return 0 if the bit at X is not set

注意:如果设置了该位,则返回值始终至少为 1

蛮力方式很慢:

int countupto(std::bitset<64> bits, int X)
{
  if (!bits[X]) return 0;
  int total=1;
  for (int i=0; i < X; ++i)
  {
    total+=bits[i];
  }
  return total;
}

bitsetcount() 方法会给你所有比特的 popcount,但是 bitset不支持范围

The count() methof of bitset will give you the popcount of all the bits, but bitset does not support ranges

注意:这不是 如何计算 32 位整数中设置的位数? 因为它询问了不是范围 0 到 X 的所有位

Note: This is not a dup of How to count the number of set bits in a 32-bit integer? as that asks about all bits not the range 0 through X

推荐答案

这个 C++ 让 g++ 发出 非常好的 x86 ASM(godbolt 编译器资源管理器).我希望它也能在其他 64 位架构上有效地编译(如果有用于 std::bitset::count 的 HW popcount,否则这将始终是缓慢的部分;例如,一定要使用g++ -march=nehalem 或更高,或者 -mpopcnt 如果您不想启用任何其他功能,如果您可以将代码限制为仅在支持该功能的 CPU 上运行x86 指令):

This C++ gets g++ to emit very good x86 ASM (godbolt compiler explorer). I expect it will compile efficiently on other 64bit architectures, too (if there's a HW popcount for std::bitset::count to use, otherwise that'll always be the slow part; e.g. sure to use g++ -march=nehalem or higher, or -mpopcnt if you don't want to enable anything else, if you can limit your code to only running on CPUs that support that x86 instruction):

#include <bitset>

int popcount_subset(std::bitset<64> A, int pos) {
  int high_bits_to_eliminate = 63 - pos;
  A <<= (high_bits_to_eliminate & 63);  // puts A[pos] at A[63].

  return (A[63]? ~0ULL : 0) & A.count();  // most efficient way: great code with gcc and clang
  // see the godbolt link for some #ifdefs with other ways to do the check, like
    // return A[BSET_SIZE-1] ? A.count() : 0;
}

这在 32 位架构上可能不是最佳选择,因此如果您需要进行 32 位构建,请比较其他替代方案.

This probably isn't optimal on 32bit architectures, so compare other alternatives if you need to make a 32bit build.

这将适用于其他大小的位集,只要您对硬编码的 63 进行一些处理,并更改 &63 掩码用于将移位计数转换为更一般的范围检查.为了使用奇怪大小的位集获得最佳性能,请制作一个模板函数,专门针对目标机器的 size <= register width.在这种情况下,将 bitset 提取为适当宽度的 unsigned 类型,并移到寄存器的顶部而不是 bitset 的顶部.

This will work for other sizes of bitset, as long as you do something about the hard-coded 63s, and change the & 63 mask for the shift count into a more general range-check. For optimal performance with strange size bitsets, make a template function with a specialization for size <= register width of the target machine. In that case, extract the bitset to an unsigned type of the appropriate width, and shift to the top of the register instead of the top of the bitset.

您希望这也为 bitset<32> 生成理想的代码,但事实并非如此.gcc/clang 在 x86-64 上仍然使用 64 位寄存器.

You'd expect this to also generate ideal code for bitset<32>, but it doesn't quite. gcc/clang still use 64bit registers on x86-64.

对于大的bitsets,移动整个事物会比仅仅popcounting包含pos的词下面的词慢,然后在那个词上使用它.(如果您可以假设 SSSE3 而不是 popcnt insn 硬件支持,或者对于 32 位目标,那么矢量化 popcount 真正在 x86 上大放异彩.AVX2 256 位 pshufb 是最快的方式进行批量 popcounts,但没有 AVX2 我认为 64 位 popcnt 非常接近 128 位 pshufb 实现.更多讨论请参阅评论.)

For large bitsets, shifting the whole thing will be slower than just popcounting the words below the one containing pos, and using this on that word. (This is where a vectorized popcount really shines on x86 if you can assume SSSE3 but not the popcnt insn hardware support, or for 32bit targets. AVX2 256bit pshufb is the fastest way to do bulk popcounts, but without AVX2 I think 64bit popcnt is pretty close to a 128-bit pshufb implementation. See the comments for more discussion.)

如果您有一个 64 位元素的数组,并且想要分别计算每个元素中某个位置以下的位数,那么您绝对应该使用 SIMD.该算法的移位部分矢量化,而不仅仅是 popcnt 部分.在基于 pshufb 的 popcnt 之后,使用 psadbw 对全零寄存器对 64 位块中的字节进行水平求和,该 popcnt 分别为每个字节中的位生成计数.SSE/AVX 没有 64 位算术右移,但您可以使用不同的技术来混合每个元素的高位.

If you have an array of 64-bit elements, and want to count bits below a certain position in each one separately, then you definitely should use SIMD. The shift parts of this algorithm vectorize, not just the popcnt part. Use psadbw against an all-zero register to horizontal-sum bytes in 64-bit chunks after a pshufb-based popcnt that produces counts for the bits in each byte separately. SSE/AVX doesn't have 64-bit arithmetic right shift, but you can use a different technique to blend on the high bit of each element.

你想让编译器输出的 asm 指令将:

The asm instructions you want to get the compiler to output will:

  1. 从 64 位值中删除不需要的位
  2. 测试所需位的最高位.
  3. 算了吧.
  4. 返回 0 或 popcount,具体取决于测试结果.(无分支或分支实现都有优点.如果分支是可预测的,无分支实现往往会更慢.)

<小时>

1 的明显方法是生成掩码 ((1<<(pos+1)) -1) 和 &它.一种更有效的方法是左移 63-pos,将您想要打包的位留在寄存器的顶部.


The obvious way to do 1 is to generate a mask ((1<<(pos+1)) -1) and & it. A more efficient way is to left-shift by 63-pos, leaving the bits you want packed at the top of a register.

这也有一个有趣的副作用,就是把你想测试的位作为寄存器的最高位.测试符号位,而不是任何其他任意位,需要的指令稍微少一些.算术右移可以将符号位广播到寄存器的其余部分,从而实现比通常更高效的无分支代码.

This also has the interesting side-effect of putting the bit you want to test as the top bit in the register. Testing the sign bit, rather than any other arbitrary bit, takes slightly fewer instructions. An arithmetic right shift can broadcast the sign bit to the rest of the register, allowing for more-efficient-than-usual branchless code.

执行popcount 是一个讨论较多的问题,但实际上是难题中更棘手的部分.在 x86 上,有非常高效的硬件支持,但仅限于最新的硬件.在 Intel CPU 上,popcnt 指令仅适用于 Nehalem 和更新版本.我忘记了 AMD 何时增加了支持.

Doing the popcount is a much-discussed problem, but is actually the trickier part of the puzzle. On x86, there is extremely efficient hardware support for it, but only on recent-enough hardware. On Intel CPUs, the popcnt instruction is only available on Nehalem and newer. I forget when AMD added support.

因此,为了安全地使用它,您需要使用不使用 popcnt 的回退进行 CPU 调度.或者,制作依赖/不依赖某些 CPU 功能的单独二进制文件.

So to use it safely, you need to either do CPU dispatching with a fallback that doesn't use popcnt. Or, make separate binaries that do/don't depend on some CPU features.

popcount 可以通过几种方式完成.一种使用 SSSE3 pshufb 来实现 4 位 LUT.不过,这在用于整个阵列时最有效,而不是一次使用单个 64b.标量比特黑客在这里可能是最好的,并且不需要 SSSE3(因此将与具有 64 位但不具有 pshufb 的古老 AMD CPU 兼容.)

popcount without the popcnt instruction can be done a few ways. One uses SSSE3 pshufb to implement a 4-bit LUT. This is most effective when used on a whole array, rather than a single 64b at a time, though. Scalar bithacks might be best here, and wouldn't require SSSE3 (and so would be compatible with ancient AMD CPUs that have 64bit but not pshufb.)

(A[63]? ~0ULL : 0) 要求编译器将高位广播到所有其他位位置,允许将其用作 AND 掩码为零(或不为) popcount 结果.请注意,即使对于大的 bitset 大小,它仍然只屏蔽 popcnt 的输出,而不是 bitset 本身,所以 ~0ULL 很好,我使用 ULL 来确保不是曾经要求编译器仅将位广播到寄存器的低 32b(例如,在 Windows 上使用 UL).

(A[63]? ~0ULL : 0) asks the compiler to broadcast the high bit to all other bit positions, allowing it to be used as an AND-mask to zero (or not) the popcount result. Note that even for large bitset sizes, it's still only masking the output of popcnt, not the bitset itself, so ~0ULL is fine I used ULL to make sure wasn't ever asking the compiler to broadcast the bit only to the low 32b of a register (with UL on Windows, for example).

这个广播可以通过算术右移 63 来完成,这会移动高位的副本.

This broadcast can be done with an arithmetic right shift by 63, which shifts in copies of the high bit.

clang 从原始版本生成此代码.在 Glenn 就 4 的不同实现提出一些建议后,我意识到我可以通过编写更像我想要的 ASM 的源代码来引导 gcc 走向 clang 的最佳解决方案.明显的 ((int64_t)something) >>63 更直接地请求算术右移不会严格可移植,因为带符号的右移是 实现定义为算术或逻辑.该标准不提供任何可移植的算术右移运算符.(这不是未定义行为,尽管如此.)无论如何,幸运的是编译器足够聪明:一旦你给它足够的提示,gcc 就会看到最好的方法.

clang generated this code from the original version. After some prodding from Glenn about different implementations for 4, I realized that I could lead gcc towards clang's optimal solution by writing the source more like the ASM I want. The obvious ((int64_t)something) >> 63 to more directly request an arithmetic right shift would not be strictly portable, because signed right-shifts are implementation-defined as either arithmetic or logical. The standard doesn't provide any portable arithmetic right-shift operator. (It's not undefined behaviour, though.) Anyway, fortunately compilers are smart enough: gcc sees the best way once you give it enough of a hint.

这个源代码在 x86-64 和 ARM64 上使用 gcc 和 clang 编写了很棒的代码.两者都只是在 popcnt 的输入上使用算术右移(因此移位可以与 popcnt 并行运行).它还可以在带有 gcc 的 32 位 x86 上很好地编译,因为屏蔽只发生在 32 位变量上(在添加了多个 popcnt 结果之后).其余的函数在 32 位上很糟糕(当位集大于寄存器时).

This source makes great code on x86-64 and ARM64 with gcc and clang. Both simply use an arithmetic right shift on the input to popcnt (so the shift can run in parallel with the popcnt). It also compiles great on 32bit x86 with gcc, because the masking only happens to a 32bit variable (after multiple popcnt results are added). It's the rest of the function that's nasty on 32bit (when the bitset is larger than a register).

带有 gcc 的原始三元运算符版本

使用 gcc 5.3.0 编译 -O3 -march=nehalem -mtune=haswell(较旧的 gcc,如 4.9.2,也仍然发出这个):

Compiled with gcc 5.3.0 -O3 -march=nehalem -mtune=haswell (older gcc, like 4.9.2, also still emit this):

; the original ternary-operator version.  See below for the optimal version we can coax gcc into emitting.
popcount_subset(std::bitset<64ul>, int):
    ; input bitset in rdi, input count in esi (SysV ABI)
    mov     ecx, esi    ; x86 variable-count shift requires the count in cl
    xor     edx, edx    ; edx=0 
    xor     eax, eax    ; gcc's workaround for popcnt's false dependency on the old value of dest, on Intel
    not     ecx         ; two's complement bithack for 63-pos (in the low bits of the register)
    sal     rdi, cl     ; rdi << ((63-pos) & 63);  same insn as shl (arithmetic == logical left shift)
    popcnt  rdx, rdi
    test    rdi, rdi    ; sets SF if the high bit is set.
    cmovs   rax, rdx    ; conditional-move on the sign flag
    ret

如何证明 C 语句 -x、~x+1 和 ~(x-1) 产生相同的结果? gcc 使用 -x == ~ 的背景x + 1 补码恒等式.(还有 哪个 2 的补整数如果只需要结果的低部分,则可以在不将输入中的高位归零的情况下使用操作? 切切地提到 shl 掩盖了移位计数,所以我们只需要低位6 位 ecx 保存 63 - pos.主要是链接它,因为我最近写了它,任何仍在阅读这一段的人可能会觉得它很有趣.)

See How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results? for background on gcc's use of the -x == ~x + 1 two's complement identity. (And Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? which tangentially mentions that shl masks the shift count, so we only need the low 6 bits of ecx to hold 63 - pos. Mostly linking that because I wrote it recently and anyone still reading this paragraph might find it interesting.)

内联时,其中一些指令将消失.(例如,gcc 会首先在 ecx 中生成计数.)

Some of those instructions will go away when inlining. (e.g. gcc would generate the count in ecx in the first place.)

使用 Glenn 的乘法而不是三元运算符 的想法(由 USE_mul 启用),gcc 做到了

With Glenn's multiply instead of ternary operator idea (enabled by USE_mul), gcc does

    shr     rdi, 63
    imul    eax, edi

在末尾而不是 xor/test/cmovs.

  • mov r,r:1 个融合域 uop,0 延迟,无执行单元
  • xor-zeroing:1 个融合域 uop,无执行单元
  • not:p0/p1/p5/p6 1 uop,1c 延迟,每 0.25c 吞吐量 1 个
  • shl(又名 sal)在 cl 中有计数:p0/p6 为 3 uops:2c 延迟,每 2c 吞吐量 1 个.(奇怪的是,Agner Fog 的数据表明 IvyBridge 只需要 2 个 uops.)
  • popcnt:p1 1 uop,3c 延迟,每 1c 吞吐量 1 个
  • shr​​ r,imm:p0/p6 1 uop,1c 延迟.每 0.5c 吞吐量 1 个.
  • imul r,r:p1、3c 延迟为 1uop.
  • 不算ret
  • mov r,r: 1 fused-domain uop, 0 latency, no execution unit
  • xor-zeroing: 1 fused-domain uop, no execution unit
  • not: 1 uop for p0/p1/p5/p6, 1c latency, 1 per 0.25c throughput
  • shl (aka sal) with count in cl: 3 uops for p0/p6: 2c latency, 1 per 2c throughput. (Agner Fog's data indicates that IvyBridge only takes 2 uops for this, strangely.)
  • popcnt: 1 uop for p1, 3c latency, 1 per 1c throughput
  • shr r,imm: 1 uop for p0/p6, 1c latency. 1 per 0.5c throughput.
  • imul r,r: 1uop for p1, 3c latency.
  • not counting the ret

总计:

  • 9 个融合域 uop,可以在 2.25 个周期内发出(理论上;uop 缓存行效应通常会略微限制前端).
  • p0/p6 的 4 个 uops(班次).p1 2 uop.1 个任意 ALU 端口 uop.可以每 2c 执行一个(使移位端口饱和),因此前端是最严重的瓶颈.
  • 9 fused-domain uops, can issue in 2.25 cycles (in theory; uop cache-line effects usually bottleneck the frontend slightly).
  • 4 uops (shifts) for p0/p6. 2 uops for p1. 1 any-ALU-port uop. Can execute at one per 2c (saturating the shift ports), so the frontend is the worst bottleneck.

延迟:从比特集准备好到结果为:shl(2) -> popcnt(3) -> imul(3).总共8 个周期.或者当 pos 准备好时 9c,因为 not 是它的额外 1c 延迟.

Latency: Critical path from when the bitset is ready to when the result is: shl(2) -> popcnt(3) -> imul(3). Total 8 cycles. Or 9c from when pos is ready, because the not is an extra 1c latency for it.

最佳bitbroadcast 版本shr​​ 替换为sar(相同的性能)和imul 带有 (1c 延迟而不是 3c,在任何端口上运行).因此,唯一的性能变化是将关键路径延迟减少到 6 个周期.前端的吞吐量仍然是瓶颈.and 能够在任何端口上运行并没有什么区别,除非您将其与在 port1 上出现瓶颈的代码混合在一起(而不是查看仅运行this 紧密循环的代码).

The optimal bitbroadcast version replaces shr with sar (same perf), and imul with and (1c latency instead of 3c, runs on any port). So the only perf change is reducing the critical path latency to 6 cycles. Throughput is still bottlenecked on the frontend. and being able to run on any port doesn't make a difference, unless you're mixing this with code that bottlenecks on port1 (instead of looking at the throughput for running just this code in a tight loop).

cmov(三元运算符)版本:11 个融合域 uops(前端:每 2.75c 一个).执行单元:仍然在每个 2c 一个的移位端口 (p0/p6) 上遇到瓶颈.延迟:从 bitset 到 result 为 7c,从 pos 到 result 为 8c.(cmov 是 2c 延迟,对于 p0/p1/p5/p6 中的任何一个都是 2 uops.)

cmov (ternary operator) version: 11 fused-domain uops (frontend: one per 2.75c). execution units: still bottlenecked on the shift ports (p0/p6) at one per 2c. Latency: 7c from bitset to result, 8c from pos to result. (cmov is 2c latency, 2 uops for any of p0/p1/p5/p6.)

Clang 有一些不同的技巧:代替 test/cmovs,它生成一个全一或全的掩码-zeros 通过使用算术右移将符号位广播到寄存器的所有位置.我喜欢它:在英特尔上使用 and 而不是 cmov 更有效.不过,它仍然具有数据依赖性,并为分支的两端工作(这是 cmov 的主要缺点).更新:如果源代码正确,gcc 也会使用这种方法.

Clang has some different tricks up its sleeve: Instead of test/cmovs, it generates a mask of either all-ones or all-zeros by using an arithmetic right-shift to broadcast the sign bit to all positions of a register. I love it: Using and instead of cmov is more efficient on Intel. It still has the data-dependency and does the work for both sides of the branch (which is the main downside to cmov in general), though. Update: with the right source code, gcc will use this method, too.

clang 3.7 -O3 -Wall -march=nehalem -mtune=haswell

popcount_subset(std::bitset<64ul>, int):
    mov     ecx, 63
    sub     ecx, esi      ; larger code size, but faster on CPUs without mov-elimination
    shl     rdi, cl       ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi      ; doesn't start a fresh dep chain before this, like gcc does
    sar     rdi, 63       ; broadcast the sign bit
    and     eax, edi      ; eax = 0 or its previous value
    ret

sar/and 替换了 xor/test/cmov,并且 cmov 是 Intel CPU 上的 2-uop 指令,所以这真的很好.(对于三元运算符版本).

sar / and replaces xor / test / cmov, and cmov is a 2-uop instruction on Intel CPUs, so that's really nice. (For the ternary-operator version).

在使用乘法源版本或bitbroadcast"源版本时,

Clang 仍然使用 sar/和 技巧,而不是实际的 imul.所以那些帮助 gcc 而不伤害 clang.(sar/and 肯定比 shr​​/imul 好:关键路径上的延迟减少了 2c.)pow_of_two_sub 版本确实伤害了 clang(参见第一个godbolt链接:从这个答案中省略,以避免与没有成功的想法混淆).

Clang still does the sar / and trick instead of an actual imul when using the multiply source version, or the "bitbroadcast" source version. So those help gcc without hurting clang. (sar/and is definitely better than shr/imul: 2c less latency on the critical path.) The pow_of_two_sub version does hurt clang (see the first godbolt link: omitted from this answer to avoid clutter with ideas that didn't pan out).

mov ecx, 63/sub ecx, esi 实际上在 CPU 上更快,无需消除 reg,reg 移动(零延迟和无执行端口,由寄存器重命名处理).这包括 Intel IvyBridge 之前的版本,但不包括更新的 Intel 和 AMD CPU.

The mov ecx, 63 / sub ecx, esi is actually faster on CPUs without mov-elimination for reg,reg moves (zero latency and no execution port, handled by register renaming). This includes Intel pre-IvyBridge, but not more recent Intel and AMD CPUs.

Clang 的 mov imm/sub 方法只将 pos 的一个周期延迟放到关键路径上(超出 bitset->result 延迟),而不是 mov ecx, esi/not ecxmov r,r 具有 1c 延迟的 CPU 上的两个.

Clang's mov imm / sub method puts only one cycle of latency for pos onto the critical path (beyond the bitset->result latency), instead of two for a mov ecx, esi / not ecx on CPUs where mov r,r has 1c latency.

使用 BMI2(Haswell 及更高版本),最佳 ASM 版本可以将 mov 保存到 ecx.其他一切都一样,因为 shlx 将其移位​​计数输入寄存器屏蔽到操作数大小,就像 shl 一样.

With BMI2 (Haswell and later), an optimal ASM version can save a mov to ecx. Everything else works the same, because shlx masks its shift-count input register down to the operand-size, just like shl.

x86 移位指令具有疯狂的 CISC 语义,如果移位计数为零,则标志不受影响.因此,可变计数移位指令对标志的旧值具有(潜在的)依赖性.普通"x86 shl r, cl 在 Haswell 上解码为 3 uops,但 BMI2 shlx r, r, r 仅为 1.所以 gcc 仍然发出 sal-march=haswell 一起使用,而不是使用 shlx(它确实在其他一些情况下使用).

x86 shift instructions have crazy CISC semantics where if the shift count is zero, the flags aren't affected. So variable-count shift instructions have a (potential) dependency on the old value of the flags. "Normal" x86 shl r, cl decodes to 3 uops on Haswell, but BMI2 shlx r, r, r is only 1. So it's too bad that gcc still emits sal with -march=haswell, instead of using shlx (which it does use in some other cases).

// hand-tuned BMI2 version using the NOT trick and the bitbroadcast
popcount_subset(std::bitset<64ul>, int):
    not     esi           ; The low 6 bits hold 63-pos.  gcc's two-s complement trick
    xor     eax, eax      ; break false dependency on Intel.  maybe not needed when inlined.
    shlx    rdi, rdi, rsi ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi
    sar     rdi, 63       ; broadcast the sign bit: rdi=0 or -1
    and     eax, edi      ; eax = 0 or its previous value
    ret

英特尔 Haswell 的性能分析:6 个融合域 uops(前端:每 1.5c 一个).执行单元:2 p0/p6 shift uops.1 p1 uop.2 个任意端口 uops:(从总执行端口限制中每 1.25c 一个).关键路径延迟:shlx(1) -> popcnt(3) -> and(1) = 5c bitset->result.(或来自 pos->result 的 6c).

Perf analysis for Intel Haswell: 6 fused-domain uops (frontend: one per 1.5c). Execution units: 2 p0/p6 shift uops. 1 p1 uop. 2 any-port uops: (one per 1.25c from total execution port limits). Critical path latency: shlx(1) -> popcnt(3) -> and(1) = 5c bitset->result. (or 6c from pos->result).

请注意,内联时,人类(或智能编译器)可以避免对 xor eax, eax 的需要.它只是因为 popcnt 对输出寄存器的错误依赖(在 Intel 上),我们需要 eax 中的输出(调用者可能最近使用了很长时间)链).使用 -mtune=bdver2 或其他东西,gcc 不会将用于 popcnt 输出的寄存器清零.

Note that when inlining, a human (or smart compiler) could avoid the need for the xor eax, eax. It's only there because of popcnt's false dependency on the output register (on Intel), and we need the output in eax (which the caller may have used recently for a long dep chain). With -mtune=bdver2 or something, gcc won't zero the register it's going to use for popcnt output.

内联时,我们可以使用一个输出寄存器,该寄存器至少早在 popcnt 的源代码中就已经准备好了,以避免出现问题.当以后不需要源时,编译器将执行就地 popcnt rdi,rdi ,但这里不是这种情况.相反,我们可以选择另一个在源之前已经准备好的寄存器.popcnt的输入依赖于63-pos,我们可以破坏它,所以popcnt rsi,rdi对rsi的依赖不能延迟它.或者如果我们在寄存器中有 63,我们可以 popcnt rsi,rdi/sarx rax, rsi, reg_63/ and eax,esi.或者 BMI2 3 操作数移位指令也可以让我们不破坏输入,以防以后需要它们.

When inlining, we could use an output register that already has to be ready at least as early as popcnt's source reg to avoid the problem. Compilers will do an in-place popcnt rdi,rdi when the source isn't needed later, but that's not the case here. Instead, we can pick another register that already has to be ready before the source. popcnt's input depends on 63-pos, and we can clobber it, so popcnt rsi,rdi's dependency on rsi can't delay it. Or if we had 63 in a register, we could popcnt rsi,rdi / sarx rax, rsi, reg_63 / and eax, esi. Or BMI2 3-operand shift instructions would also let us not clobber inputs in case they're needed afterwards.

这太轻了,以至于循环开销和设置输入操作数/存储结果将成为主要因素.(并且 63-pos 可以使用编译时常量进行优化,或者优化到变量计数的来源.)

This is so light-weight that loop overhead and setting up the input operands / storing the results are going to be major factors. (And the 63-pos can optimize away with a compile-time constant, or into wherever a variable count comes from.)

英特尔编译器很有趣,它没有利用 A[63] 是符号位这一事实.shl/bt rdi, 63/jc.它甚至以一种非常愚蠢的方式设置了分支.它可以将 eax 归零,然后根据 shl 设置的符号标志跳过 popcnt 或不跳过.

The Intel compiler amusingly shoots itself in the foot and doesn't take advantage of the fact that A[63] is the sign bit. shl / bt rdi, 63 / jc. It even sets up the branches in a really dumb way. It could zero eax, and then jump over popcnt or not based on the sign flag set by shl.

最佳分支实现,从 Godbolt 上 -O3 -march=corei7 的 ICC13 输出开始:

An optimal branching implementation, starting from ICC13 output from -O3 -march=corei7 on godbolt:

   // hand-tuned, not compiler output
        mov       ecx, esi    ; ICC uses neg/add/mov :/
        not       ecx
        xor       eax, eax    ; breaks the false dep, or is the return value in the taken-branch case
        shl       rdi, cl
        jns    .bit_not_set
        popcnt    rax, rdi
.bit_not_set:
        ret

这几乎是最优的:A[pos] == true 情况有一个未被采用的分支.不过,与无分支方法相比,它并没有节省多少.

That's pretty much optimal: The A[pos] == true case has one not-taken branch. It doesn't save very much over the branchless method, though.

如果 A[pos] == false 情况更常见:跳过 ret 指令,到 popcnt/>ret.(或在内联之后:跳转到执行 popcnt 并跳回的末尾块).

If the A[pos] == false case is more common: jump over a ret instruction, to a popcnt / ret. (Or after inlining: jump to a block at the end that does the popcnt and jumps back).

这篇关于在一个位置或更低位置计算设置位的有效方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆