What is the efficient way to count set bits at a position or lower?

Question

Given std::bitset<64> bits with any number of bits set and a bit position X (0-63)

What is the most efficient way to count the set bits at position X or lower, or to return 0 if the bit at X is not set?

Note: if the bit is set, the return value will always be at least 1.

The brute force method is very slow:

int countupto(std::bitset<64> bits, int X)
{
  if (!bits[X]) return 0;
  int total=1;
  for (int i=0; i < X; ++i)
  {
    total+=bits[i];
  }
  return total;
}
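
For example, a quick illustration of the intended behaviour (the test values here are mine, not from the question):

#include <bitset>
#include <iostream>

// countupto as defined above
int main() {
  std::bitset<64> bits("101101");            // bits 0, 2, 3 and 5 are set
  std::cout << countupto(bits, 3) << '\n';   // bit 3 is set, bits 0..3 hold 3 set bits -> 3
  std::cout << countupto(bits, 4) << '\n';   // bit 4 is not set -> 0
  std::cout << countupto(bits, 5) << '\n';   // bit 5 is set, bits 0..5 hold 4 set bits -> 4
}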

The count() method of bitset will give you the popcount of all the bits, but bitset does not support ranges.

Note: This is not a dup of How to count the number of set bits in a 32-bit integer?, as that asks about all bits, not just the range 0 through X.

Answer

This C++ gets g++ to emit very good x86 ASM (godbolt compiler explorer). I expect it will compile efficiently on other 64bit architectures, too (if there's a HW popcount for std::bitset::count to use, otherwise that'll always be the slow part):

#include <bitset>

int popcount_subset(std::bitset<64> A, int pos) {
  int high_bits_to_eliminate = 63 - pos;
  A <<= (high_bits_to_eliminate & 63);  // puts A[pos] at A[63].

  return (A[63]? ~0ULL : 0) & A.count();  // most efficient way: great code with gcc and clang
  // see the godbolt link for some #ifdefs with other ways to do the check, like
    // return A[BSET_SIZE-1] ? A.count() : 0;
}

This probably isn't optimal on 32bit architectures, so compare other alternatives if you need to make a 32bit build.

This will work for other sizes of bitset, as long as you do something about the hard-coded 63s, and change the & 63 mask for the shift count into a more general range-check. For optimal performance with strange size bitsets, make a template function with a specialization for size <= register width of the target machine. In that case, extract the bitset to an unsigned type of the appropriate width, and shift to the top of the register instead of the top of the bitset.
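
A minimal sketch of that specialization idea (my illustration, not code from the answer, under the assumption that the whole bitset fits in 64 bits so to_ullong() can't throw):

#include <bitset>
#include <cstddef>
#include <cstdint>

// Only for bitsets that fit in one 64-bit register: extract to an integer,
// shift bit 'pos' up to bit 63, and use the sign bit as the "was it set" test.
template <std::size_t N>
int popcount_subset_small(const std::bitset<N> &a, int pos) {
  static_assert(N <= 64, "this sketch only handles register-sized bitsets");
  uint64_t v = a.to_ullong();     // cannot throw for N <= 64
  v <<= (63 - pos) & 63;          // bit 'pos' is now bit 63
  return (v >> 63) ? (int)std::bitset<64>(v).count() : 0;
}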

You'd expect this to also generate ideal code for bitset<32>, but it doesn't quite. It still uses 64bit registers on x86-64.

For large bitsets, shifting the whole thing will be slower than just popcounting the words below the one containing pos, and using this on that word. (This is where a vectorized popcount really shines if you can assume SSSE3 but not the popcnt insn hardware support, or for 32bit targets. IIRC, 64bit popcnt is normally the fastest way to do bulk popcounts, except maybe AVX2 256bit pshufb.) See the comments for more discussion.
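
A rough sketch of that word-at-a-time idea (my illustration; std::bitset doesn't expose its words, so this assumes a little-endian uint64_t array and the GCC/clang __builtin_popcountll builtin):

#include <cstddef>
#include <cstdint>

// bit i lives in words[i / 64] at bit position i % 64.
int popcount_up_to(const uint64_t *words, int pos) {
  std::size_t word = pos / 64;
  uint64_t w = words[word] << (63 - (pos % 64));  // bit 'pos' moved up to bit 63
  if (!(w >> 63)) return 0;                       // the bit at 'pos' wasn't set
  int total = __builtin_popcountll(w);            // set bits from the bottom of this word up to 'pos'
  for (std::size_t i = 0; i < word; ++i)
    total += __builtin_popcountll(words[i]);      // plus all the whole words below
  return total;
}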

The asm instructions you want to get the compiler to output will:


  1. remove the unwanted bits from the 64-bit value

  2. test the highest of the wanted bits.

  3. popcount it.

  4. return 0 or the popcount, depending on the result of the test. (Branchless or branching implementations both have advantages. If the branch is predictable, a branchless implementation tends to be slower.)

The obvious way to do 1 is to generate a mask ((1<<(pos+1)) -1) and & it. A more efficient way is to left-shift by 63-pos, leaving the bits you want packed at the top of a register.
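
As a concrete illustration on a plain uint64_t (my sketch, with made-up function names), the two approaches look like:

#include <cstdint>

// Option 1: mask off the bits above 'pos'.  Note the pos == 63 special case:
// 1ULL << 64 would be undefined, which is one reason the shift form is nicer.
uint64_t keep_low_bits(uint64_t x, int pos) {
  uint64_t mask = (pos >= 63) ? ~0ULL : ((1ULL << (pos + 1)) - 1);
  return x & mask;               // popcount of this gives bits 0..pos
}

// Option 2: shift the unwanted high bits out; bit 'pos' ends up at bit 63.
// The popcount is unchanged, and bit 63 is now easy to test as a sign bit.
uint64_t pack_to_top(uint64_t x, int pos) {
  return x << ((63 - pos) & 63);
}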

This also has the interesting side-effect of putting the bit you want to test as the top bit in the register. Testing the sign bit, rather than any other arbitrary bit, takes slightly fewer instructions. An arithmetic right shift can broadcast the sign bit to the rest of the register, allowing for more-efficient-than-usual branchless code.

Doing the popcount is a much-discussed problem, but is actually the trickier part of the puzzle. On x86, there is extremely efficient hardware support for it, but only on recent-enough hardware. On Intel CPUs, the popcnt instruction is only available on Nehalem and newer. I forget when AMD added support.

So to use it safely, you need to either do CPU dispatching with a fallback that doesn't use popcnt, or make separate binaries that do/don't depend on some CPU features.
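
For example, with GCC or clang targeting x86, one possible dispatch looks roughly like this (my sketch; it relies on the __builtin_cpu_supports and target-attribute extensions, and the function names are made up):

#include <cstdint>

__attribute__((target("popcnt")))        // allow the popcnt instruction in this one function
static int popcount64_hw(uint64_t x) {
  return __builtin_popcountll(x);        // compiles to a single popcnt here
}

static int popcount64_sw(uint64_t x) {   // portable fallback (Kernighan's loop)
  int n = 0;
  for (; x; x &= x - 1) ++n;             // clears the lowest set bit each iteration
  return n;
}

static int popcount64(uint64_t x) {
  static const bool have_popcnt = __builtin_cpu_supports("popcnt");  // checked once
  return have_popcnt ? popcount64_hw(x) : popcount64_sw(x);
}

GCC's function multi-versioning (the target_clones attribute) can generate this kind of runtime dispatch automatically, too.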

popcount without the popcnt instruction can be done a few ways. One uses SSSE3 pshufb to implement a 4-bit LUT. This is most effective when used on a whole array, rather than a single 64b at a time, though. Scalar bithacks might be best here, and wouldn't require SSSE3 (and so would be compatible with ancient AMD CPUs that have 64bit but not pshufb.)
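
For reference, the usual scalar bithack is the well-known SWAR popcount (shown here as a generic sketch, not anything specific from the answer):

#include <cstdint>

// Classic SWAR popcount: add adjacent bits into 2-bit, then 4-bit fields,
// then sum all eight bytes with a multiply and read the top byte.
static int popcount64_swar(uint64_t x) {
  x -= (x >> 1) & 0x5555555555555555ULL;
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
  return (int)((x * 0x0101010101010101ULL) >> 56);
}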

(A[63]? ~0ULL : 0) asks the compiler to broadcast the high bit to all other bit positions, allowing it to be used as an AND-mask to zero (or not) the popcount result. Note that even for large bitset sizes, it's still only masking the output of popcnt, not the bitset itself, so ~0ULL is fine. I used ULL to make sure I wasn't ever asking the compiler to broadcast the bit only to the low 32b of a register (which could happen with UL on Windows, for example).

This broadcast can be done with an arithmetic right shift by 63, which shifts in copies of the high bit.

Clang generated this code from the original version. After some prodding from Glenn about different implementations for 4, I realized that I could lead gcc towards clang's optimal solution by writing the source more like the ASM I want. The obvious ((int64_t)something) >> 63 to more directly request an arithmetic right shift would be undefined behaviour, but fortunately compilers are smart enough that gcc sees the best way once you give it a hint.

This source makes great code on x86-64 and ARM64 with gcc and clang. Both simply use an arithmetic right shift on the input to popcnt (so the shift can run in parallel with the popcnt). It also compiles great on 32bit x86 with gcc, because the masking only happens to a 32bit variable (after multiple popcnt results are added). It's the rest of the function that's nasty on 32bit (when the bitset is larger than a register).

The original ternary-operator version with gcc

Compiled with gcc 5.3.0 -O3 -march=nehalem -mtune=haswell (older gcc, like 4.9.2, also still emit this):

; the original ternary-operator version.  See below for the optimal version we can coax gcc into emitting.
popcount_subset(std::bitset<64ul>, int):
    ; input bitset in rdi, input count in esi (SysV ABI)
    mov     ecx, esi    ; x86 variable-count shift requires the count in cl
    xor     edx, edx    ; edx=0 
    xor     eax, eax    ; gcc's workaround for popcnt's false dependency on the old value of dest, on Intel
    not     ecx         ; two's complement bithack for 63-pos (in the low bits of the register)
    sal     rdi, cl     ; rdi << ((63-pos) & 63);  same insn as shl (arithmetic == logical left shift)
    popcnt  rdx, rdi
    test    rdi, rdi    ; sets SF if the high bit is set.
    cmovs   rax, rdx    ; conditional-move on the sign flag
    ret

See How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results? for background on gcc's use of the -x == ~x + 1 two's complement identity. (And Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? which tangentially mentions that shl masks the shift count, so we only need the low 6 bits of ecx to hold 63 - pos. Mostly linking that because I wrote it recently and anyone still reading this paragraph might find it interesting.)

Some of those instructions will go away when inlining. (e.g. gcc would generate the count in ecx in the first place.)

With Glenn's multiply instead of ternary operator idea (enabled by USE_mul), gcc does

    shr     rdi, 63
    imul    eax, edi

at the end, instead of xor / test / cmovs.


  • mov r,r: 1 fused-domain uop, 0 latency, no execution unit
  • xor-zeroing: 1 fused-domain uop, no execution unit
  • not: 1 uop for p0/p1/p5/p6, 1c latency, 1 per 0.25c throughput
  • shl (aka sal) with count in cl: 3 uops for p0/p6: 2c latency, 1 per 2c throughput. (Agner Fog's data indicates that IvyBridge only takes 2 uops for this, strangely.)
  • popcnt: 1 uop for p1, 3c latency, 1 per 1c throughput
  • shr r,imm: 1 uop for p0/p6, 1c latency. 1 per 0.5c throughput.
  • imul r,r: 1 uop for p1, 3c latency.
  • not counting the ret

Total:


  • 9 fused-domain uops, can issue in 2.25 cycles (in theory; uop cache-line effects usually bottleneck the frontend slightly).
  • 4 uops (shifts) for p0/p6. 2 uops for p1. 1 any-ALU-port uop. Can execute at one per 2c (saturating the shift ports), so the frontend is the worst bottleneck.

Latency: Critical path from when the bitset is ready to when the result is ready: shl(2) -> popcnt(3) -> imul(3). Total 8 cycles. Or 9c from when pos is ready, because the not is an extra 1c of latency for it.

The optimal bitbroadcast version replaces shr with sar (same perf), and imul with and (1c latency instead of 3c, runs on any port). So the only perf change is reducing the critical path latency to 6 cycles. Throughput is still bottlenecked on the frontend. and being able to run on any port doesn't make a difference, unless you're mixing this with code that bottlenecks on port1 (instead of looking at the throughput for running just this code in a tight loop).

cmov (ternary operator) version: 11 fused-domain uops (frontend: one per 2.75c). execution units: still bottlenecked on the shift ports (p0/p6) at one per 2c. Latency: 7c from bitset to result, 8c from pos to result. (cmov is 2c latency, 2 uops for any of p0/p1/p5/p6.)

Clang has some different tricks up its sleeve: Instead of test/cmovs, it generates a mask of either all-ones or all-zeros by using an arithmetic right-shift to broadcast the sign bit to all positions of a register. I love it: Using and instead of cmov is more efficient on Intel. It still has the data-dependency and does the work for both sides of the branch (which is the main downside to cmov in general), though. Update: with the right source code, gcc will use this method, too.

clang 3.7 -O3 -Wall -march=nehalem -mtune=haswell (on the godbolt compiler explorer):

popcount_subset(std::bitset<64ul>, int):
    mov     ecx, 63
    sub     ecx, esi      ; larger code size, but faster on CPUs without mov-elimination
    shl     rdi, cl       ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi      ; doesn't start a fresh dep chain before this, like gcc does
    sar     rdi, 63       ; broadcast the sign bit
    and     eax, edi      ; eax = 0 or its previous value
    ret

sar / and replaces xor / test / cmov, and cmov is a 2-uop instruction on Intel CPUs, so that's really nice. (For the ternary-operator version).

Clang still does the sar / and trick instead of an actual imul when using the multiply source version, or the "bitbroadcast" source version. So those help gcc without hurting clang. (sar/and is definitely better than shr/imul: 2c less latency on the critical path.) The pow_of_two_sub version does hurt clang (see the first godbolt link: omitted from this answer to avoid clutter with ideas that didn't pan out).

The mov ecx, 63 / sub ecx, esi is actually faster on CPUs without mov-elimination for reg,reg moves (zero latency and no execution port, handled by register renaming). This includes Intel pre-IvyBridge, but not more recent Intel and AMD CPUs.

Clang's mov imm / sub method puts only one cycle of latency for pos onto the critical path (beyond the bitset->result latency), instead of two for a mov ecx, esi / not ecx on CPUs where mov r,r has 1c latency.

With BMI2 (Haswell and later), an optimal ASM version can save a mov to ecx. Everything else works the same, because shlx masks its shift-count input register down to the operand-size, just like shl.

x86 shift instructions have crazy CISC semantics where if the shift count is zero, the flags aren't affected. So variable-count shift instructions have a (potential) dependency on the old value of the flags. "Normal" x86 shl r, cl decodes to 3 uops on Haswell, but BMI2 shlx r, r, r is only 1. So it's too bad that gcc still emits sal with -march=haswell, instead of using shlx (which it does use in some other cases).

// hand-tuned BMI2 version using the NOT trick and the bitbroadcast
popcount_subset(std::bitset<64ul>, int):
    not     esi           ; The low 6 bits hold 63-pos.  gcc's two-s complement trick
    xor     eax, eax      ; avoid false dependency on Intel.  prob. not needed when inlined.
    shlx    rdi, rdi, rsi ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi
    sar     rdi, 63       ; broadcast the sign bit: rdi=0 or -1
    and     eax, edi      ; eax = 0 or its previous value
    ret

Perf analysis for Intel Haswell: 6 fused-domain uops (frontend: one per 1.5c). Execution units: 2 p0/p6 shift uops. 1 p1 uop. 2 any-port uops: (one per 1.25c from total execution port limits). Critical path latency: shlx(1) -> popcnt(3) -> and(1) = 5c bitset->result. (or 6c from pos->result).

This is so light-weight that loop overhead and setting up the input operands / storing the results are going to be major factors. Note that when inlining, setting up the count in eax will avoid the need for the xor eax, eax. It's only there to break the dep chain. popcnt can't run until after 63-pos is ready, so having pos in eax would mean popcnt's false dep (on Intel) is always ready. (With -mtune=bdver2 or something, gcc won't zero the register it's going to use for popcnt output.)

The Intel compiler amusingly shoots itself in the foot and doesn't take advantage of the fact that A[63] is the sign bit. shl / bt rdi, 63 / jc. It even sets up the branches in a really dumb way. It could zero eax, and then jump over popcnt or not based on the sign flag set by shl.

An optimal branching implementation, starting from ICC13 output from -O3 -march=corei7 on godbolt:

   // hand-tuned, not compiler output
        mov       ecx, esi    ; ICC uses neg/add/mov :/
        not       ecx
        xor       eax, eax    ; breaks the false dep, or is the return value in the taken-branch case
        shl       rdi, cl
        jns    .bit_not_set
        popcnt    rax, rdi
.bit_not_set:
        ret

That's pretty much optimal: The A[pos] == true case has one not-taken branch. It doesn't save very much over the branchless method, though.

If the A[pos] == false case is more common: jump over a ret instruction, to a popcnt / ret. (Or after inlining: jump to a block at the end that does the popcnt and jumps back).
