How to count the number of set bits in a 32-bit integer?

Question

8 bits representing the number 7 look like this:

00000111

Three bits are set.

What are algorithms to determine the number of set bits in a 32-bit integer?

Answer

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86's popcnt (on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed - hardware popcount is normally fast if it exists at all).

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20 std::popcount(), or C++ std::bitset<32>::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.

A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.
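A sketch of that per-byte lookup approach (function and table names are mine; the 256-entry table is filled once, e.g. at startup):

```c
#include <stdint.h>

static unsigned char bits_in_byte[256];

static void init_table(void)
{
    // bits_in_byte[n] = bits_in_byte[n/2] + (n & 1) fills the table in one pass:
    // dropping the low bit halves n, and the low bit itself is added back.
    for (int n = 1; n < 256; n++)
        bits_in_byte[n] = bits_in_byte[n >> 1] + (n & 1);
}

static int popcount_lut(uint32_t x)
{
    // Look up each byte separately so the table stays small (256 entries).
    return bits_in_byte[x & 0xFF]
         + bits_in_byte[(x >> 8) & 0xFF]
         + bits_in_byte[(x >> 16) & 0xFF]
         + bits_in_byte[x >> 24];
}
```
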

If you know that your bytes will be mostly 0's or mostly 1's then there are efficient algorithms for these scenarios, e.g. clearing the lowest set bit with a bithack in a loop until the value becomes zero.
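That bithack is the classic Kernighan loop: `x & (x - 1)` clears the lowest set bit, so the loop body runs once per set bit (fast for mostly-zero inputs). A sketch, with a name of my choosing:

```c
#include <stdint.h>

static int popcount_sparse(uint32_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   // clears the lowest set bit
        count++;      // one iteration per set bit
    }
    return count;
}
```
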

I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

GCC10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds. (https://godbolt.org/z/qGdh1dvKK)

int numberOfSetBits(uint32_t i)
{
     // Java: use int, and use >>> instead of >>. Or use Integer.bitCount()
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);        // add pairs of bits
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
     i = (i + (i >> 4)) & 0x0F0F0F0F;        // groups of 8
     return (i * 0x01010101) >> 24;          // horizontal sum of bytes
}

For JavaScript: coerce to integer with |0 for performance: change the first line to i = (i|0) - ((i >> 1) & 0x55555555);

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)

How it works:

i = i - ((i >> 1) & 0x55555555);

The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). Like (i & 0x55555555) + ((i>>1) & 0x55555555).
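The equivalence is easy to check: a 2-bit field holding bits (a,b) has value 2a + b, and subtracting the shifted-down high bit a leaves a + b, the field's popcount. A small self-checking sketch of both forms (helper names are mine):

```c
#include <stdint.h>

// Subtract form used by the algorithm: each 2-bit field 2a+b becomes a+b.
static uint32_t pairs_sub(uint32_t i) { return i - ((i >> 1) & 0x55555555u); }

// Straightforward mask-and-add form it is equivalent to.
static uint32_t pairs_add(uint32_t i) { return (i & 0x55555555u) + ((i >> 1) & 0x55555555u); }
```
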

The next step adds the odd and even members of those 16x 2-bit accumulators pairwise, producing 8x 4-bit sums. The i - ... optimization isn't possible this time, so it just masks before and after shifting. Using the same 0x33... constant both times, instead of 0xccc... before shifting, is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.

The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4 (when all 4 of the corresponding input bits were set). 4+4 = 8 still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).

So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:

Once we have few enough "elements", a multiply with a magic constant can sum all the elements into the top element. In this case, byte elements. Multiply is done by left-shifting and adding, so the multiply x * 0x01010101 results in x + (x<<8) + (x<<16) + (x<<24). Our 8-bit elements are wide enough (and holding small enough counts) that this doesn't produce carry into the top 8 bits.

A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with >>56. So it doesn't take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll on x86 systems when the hardware popcnt instruction isn't enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.
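The 64-bit version described above might be sketched like this (same steps, wider constants, `>> 56` to extract the high byte; the function name is mine):

```c
#include <stdint.h>

static int popcount64(uint64_t i)
{
    // Same SWAR steps as the 32-bit version, just with wider constants.
    i = i - ((i >> 1) & 0x5555555555555555ULL);
    i = (i & 0x3333333333333333ULL) + ((i >> 2) & 0x3333333333333333ULL);
    i = (i + (i >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (int)((i * 0x0101010101010101ULL) >> 56);  // sum of the 8 byte counts
}
```
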

This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).
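The idea behind the variable-shuffle lookup is easiest to see in scalar form: a 16-entry table gives the popcount of each 4-bit nibble, and an instruction like SSSE3 pshufb performs sixteen such lookups at once from a table held in a vector register. A scalar sketch of that same table lookup (not the vectorized version; the name is mine):

```c
#include <stdint.h>

// Popcount of each 4-bit value 0..15 - the same 16-entry table a vector
// shuffle would hold in a vector register as its lookup source.
static const unsigned char nibble_count[16] =
    { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 };

static int popcount_nibble_lut(uint32_t x)
{
    int count = 0;
    for (int shift = 0; shift < 32; shift += 4)   // eight nibbles in a uint32_t
        count += nibble_count[(x >> shift) & 0xF];
    return count;
}
```
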

On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

  • https://github.com/WojciechMula/sse-popcount state-of-the-art x86 SIMD popcount for SSSE3, AVX2, AVX512BW, AVX512VBMI, or AVX512 VPOPCNT. Using Harley-Seal across vectors to defer popcount within an element. (Also ARM NEON)
  • Counting 1 bits (population count) on large data using AVX-512 or AVX-2
  • related: https://github.com/mklarqvist/positional-popcount - separate counts for each bit-position of multiple 8, 16, 32, or 64-bit integers. (Again, x86 SIMD including AVX-512 which is really good at this, with vpternlogd making Harley-Seal very good.)
