Efficiently find least significant set bit in a large array?


Problem description

I have a huge memory block (bit-vector) of size N bits within one memory page; consider N to be around 5000 on average, i.e. 5k bits storing some flag information.
At certain points in time (super-frequent, critical) I need to find the first set bit in this whole big bit-vector. Currently I do it per 64-bit word, i.e. with the help of __builtin_ctzll(). But as N grows and the search algorithm itself can't be improved, perhaps this search can be scaled by widening the memory accesses. That is the main problem in a few words.
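For concreteness, a minimal sketch of that per-64-bit-word baseline, assuming the bit-vector is stored as an array of uint64_t words (the function name and layout are illustrative, not part of my actual code):

#include <stddef.h>
#include <stdint.h>

// Baseline: scan word by word, bit-scan the first non-zero word.
// Returns the index of the first set bit, or -1 if the vector is all zero.
static long first_set_bit_u64(const uint64_t *words, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        if (words[i] != 0)
            return (long)(i * 64) + __builtin_ctzll(words[i]);
    }
    return -1;
}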

There is a single assembly instruction, BSF, that gives the position of the lowest set bit (GCC's __builtin_ctzll()). So on x86-64 I can find the lowest set bit in a 64-bit word cheaply.

But what about scaling through memory width?
E.g. is there a way to do it efficiently with 128/256/512-bit registers?
Basically I'm interested in a C API function to achieve this, but I also want to know what the method is based on.

UPD: As for CPUs, I'm interested in this optimization supporting the following CPU lineups:
Intel Xeon E3-12XX, Intel Xeon E5-22XX/26XX/E56XX, Intel Core i3-5XX/4XXX/8XXX, Intel Core i5-7XX, Intel Celeron G18XX/G49XX (optionally Intel Atom N2600, Intel Celeron N2807, Cortex-A53/72)

P.S. In the mentioned algorithm, before the final bit-scan I need to AND together k (on average 20-40) N-bit vectors (the AND result is just a preparatory stage for the bit-scan). It would also be desirable to do this with wider memory accesses (i.e. more efficiently than ANDing per 64-bit word).
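A scalar sketch of that combined operation (ANDing k word-arrays and bit-scanning the first non-zero result), again with an assumed uint64_t layout and illustrative names:

#include <stddef.h>
#include <stdint.h>

// AND k bit-vectors word by word; bit-scan the first word whose AND is non-zero.
// vecs is an array of k pointers, each to nwords uint64_t words.
static long first_set_bit_of_and(const uint64_t *const *vecs, size_t k, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        uint64_t acc = vecs[0][i];
        for (size_t j = 1; j < k && acc; j++)   // early-out once the word is already 0
            acc &= vecs[j][i];
        if (acc)
            return (long)(i * 64) + __builtin_ctzll(acc);
    }
    return -1;
}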

See also: Find first set

Recommended answer

The best way to find the first set bit within a whole vector (AFAIK) involves finding the first non-zero SIMD element (e.g. a byte or dword), then using a bit-scan on that (__builtin_ctz / bsf / tzcnt / ffs - 1). As such, ctz(vector) is not itself a useful building block for searching an array, only for after the loop.

Instead you want to loop over the array searching for a non-zero vector, using a whole-vector check: either SSE4.1 ptest xmm0,xmm0 / jz .loop (3 uops), or SSE2 pcmpeqd v, zero / pmovmskb / cmp eax, 0xffff / je .loop (3 uops after cmp/jcc macro-fusion). https://uops.info/
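As an illustration, a minimal SSE2-only sketch of that loop plus the bit-scan cleanup (untested; the function name is made up, and it assumes len is a multiple of 16 bytes):

#include <emmintrin.h>   // SSE2
#include <stdint.h>
#include <string.h>

int bitscan_sse2(const char *p, size_t len)
{
    for (size_t off = 0; off < len; off += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + off));
        __m128i z = _mm_cmpeq_epi32(v, _mm_setzero_si128()); // 0xFFFFFFFF per zero dword
        unsigned zero_map = (unsigned)_mm_movemask_epi8(z);  // 4 identical mask bits per dword
        if (zero_map != 0xFFFF) {                            // some dword is non-zero
            unsigned byte_idx = __builtin_ctz(~zero_map);    // = 4 * dword_idx
            uint32_t chunk;
            memcpy(&chunk, p + off + byte_idx, sizeof(chunk));  // aliasing/alignment-safe load
            return (int)((off + byte_idx) * 8) + __builtin_ctz(chunk);
        }
    }
    return -1;
}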

Once you do find a non-zero vector, pcmpeqd / movmskps / bsf on that to find a dword index, then load that dword and bsf it. Add the start-bit position (CHAR_BIT*4*dword_idx) to the bsf bit-position within that element. This is a fairly long dependency chain for latency, including an integer L1d load latency. But since you just loaded the vector, at least you can be fairly confident you'll hit in cache when you load it again with an integer load. (If the vector was generated on the fly, it's probably still best to store / reload it and let store-forwarding work, instead of trying to generate a shuffle control for vpermilps/movd or SSSE3 pshufb/movd/movzx ecx, al.)

The loop problem is very much like strlen or memchr, except we're rejecting a single value (0) and looking for anything else. Still, we can take inspiration from hand-optimized asm strlen / memchr implementations like glibc's, for example loading multiple vectors and doing one check to see if any of them have what we're looking for. (For strlen, combine with pminub to get a 0 if any element is 0; for memchr, OR the pcmpeqb compare results together.) For our purposes, the reduction operation we want is OR: any non-zero input makes the output non-zero, and bitwise boolean ops can run on any vector ALU port.

(If the expected first-bit-position isn't very high, it's not worth being too aggressive with this: if the first set bit is in the first vector, sorting things out between 2 vectors you've loaded will be slower. 5000 bits is only 625 bytes, or 19.5 AVX2 __m256i vectors, and the first set bit probably isn't always right at the end.)

This checks pairs of 32-byte vectors (i.e. a whole cache line) for non-zero, and if one is found, sorts that out into one 64-bit bitmap for a single CTZ operation. The extra shift/OR costs latency in the critical path, but the hope is that we get to the first 1 bit sooner.

Combining 2 vectors down to one with OR means it's not super useful to know which element of the OR result was non-zero; we basically redo the work inside the if. That's the price we pay for keeping the uop count of the actual search part low.

(The if body ends with a return, so in the asm it's actually like an if()break, or really an if()goto out of the loop, since it goes to a different place than the not-found return -1 reached by falling out of the loop.)

// untested, especially the pointer end condition, but compiles to asm that looks good
// Assumes len is a multiple of 64 bytes

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// aliasing-safe: p can point to any C data type
int bitscan_avx2(const char *p, size_t len /* in bytes */)
{
    //assert(len % 64 == 0);
    //optimal if p is 64-byte aligned, so we're checking single cache-lines
    const char *p_init = p;
    const char *endp = p + len - 64;
    do {
        __m256i v1 = _mm256_loadu_si256((const __m256i*)p);
        __m256i v2 = _mm256_loadu_si256((const __m256i*)(p+32));
        __m256i or = _mm256_or_si256(v1,v2);
        if (!_mm256_testz_si256(or, or)){        // find the first non-zero cache line
            __m256i v1z = _mm256_cmpeq_epi32(v1, _mm256_setzero_si256());
            __m256i v2z = _mm256_cmpeq_epi32(v2, _mm256_setzero_si256());
            uint32_t zero_map = _mm256_movemask_ps(_mm256_castsi256_ps(v1z));
            zero_map |= _mm256_movemask_ps(_mm256_castsi256_ps(v2z)) << 8;

            unsigned idx = __builtin_ctz(~zero_map);  // Use ctzll for GCC, because GCC is dumb and won't optimize away a movsx
            uint32_t nonzero_chunk;
            memcpy(&nonzero_chunk, p+4*idx, sizeof(nonzero_chunk));  // aliasing / alignment-safe load

            return (p-p_init + 4*idx)*8 + __builtin_ctz(nonzero_chunk);
        }
        p += 64;
    }while(p <= endp);   // <= so the last 64-byte chunk is also checked (the asm below was compiled from the original p < endp)
    return -1;
}

On Godbolt with clang 12 -O3 -march=haswell:

bitscan_avx2:
        lea     rax, [rdi + rsi]
        add     rax, -64                 # endp
        xor     ecx, ecx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm1, ymmword ptr [rdi]  # do {
        vmovdqu ymm0, ymmword ptr [rdi + 32]
        vpor    ymm2, ymm0, ymm1
        vptest  ymm2, ymm2
        jne     .LBB0_2                       # if() goto out of the inner loop
        add     ecx, 512                      # bit-counter incremented in the loop, for (p-p_init) * 8
        add     rdi, 64
        cmp     rdi, rax
        jb      .LBB0_1                  # }while(p<endp)

        mov     eax, -1               # not-found return path
        vzeroupper
        ret

.LBB0_2:
        vpxor   xmm2, xmm2, xmm2
        vpcmpeqd        ymm1, ymm1, ymm2
        vmovmskps       eax, ymm1
        vpcmpeqd        ymm0, ymm0, ymm2
        vmovmskps       edx, ymm0
        shl     edx, 8
        or      edx, eax             # mov ah,dl  would be interesting, but compilers won't do it.
        not     edx                  # one_positions = ~zero_positions
        xor     eax, eax                # break false dependency
        tzcnt   eax, edx             # dword_idx
        xor     edx, edx
        tzcnt   edx, dword ptr [rdi + 4*rax]   # p[dword_idx]
        shl     eax, 5               # dword_idx * 4 * CHAR_BIT
        add     eax, edx
        add     eax, ecx
        vzeroupper
        ret

This is probably not optimal for all CPUs, e.g. maybe we could use a memory-source vpcmpeqd for at least one of the inputs, costing no extra front-end uops, only back-end ones, as long as compilers keep using pointer increments rather than indexed addressing modes that would un-laminate. That would reduce the amount of work needed after the branch (which probably mispredicts).

To still use vptest, you might have to take advantage of the CF result from the CF = (~dst & src == 0) operation against a vector of all-ones, so we could check that all elements matched (i.e. the input was all zeros). Unfortunately, as discussed in Can PTEST be used to test if two registers are both zero or some other condition? - no, I don't think we can usefully use vptest without a vpor.

Clang decided not to actually subtract pointers after the loop, and instead to do more work in the search loop. :/ The loop is 9 uops (after macro-fusion of cmp/jb), so unfortunately it can only run a bit less than 1 iteration per 2 cycles, managing less than half of L1d cache load bandwidth.

But apparently a single array isn't your real problem.

16-byte vectors mean we don't have to deal with the "in-lane" behaviour of AVX2 shuffles, so instead of OR we can combine with packssdw or packsswb. Any set bits in the high half of a pack input element will signed-saturate the result to 0x80 or 0x7f. (Signed saturation is the key; unsigned packuswb would saturate signed-negative inputs to 0.)

However, shuffles only run on port 5 on Intel CPUs, so beware of throughput limits. ptest on Skylake, for example, is 2 uops (p5 and p0), so using packsswb + ptest + jz would limit you to one iteration per 2 clocks. pcmpeqd + pmovmskb doesn't have that limit.

Unfortunately, using pcmpeq on each input separately before packing / combining would cost more uops. But it would reduce the amount of work left for the cleanup, and if the loop exit usually involves a branch mispredict, that might reduce overall latency.

2x pcmpeqd => packssdw => pmovmskb => not => bsf would give you a number you have to multiply by 2 to use as a byte offset to get to the non-zero dword, e.g. memcpy(&tmp_u32, p + (2*idx), sizeof(tmp_u32)); i.e. bsf eax, [rdi + rdx*2] at the asm level.
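A rough intrinsics sketch of that pack-based cleanup (assuming the search loop has already found a non-zero pair of 16-byte vectors v1, v2 at byte offset off; names are illustrative):

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

// Precondition: at least one dword of v1/v2 is non-zero.
static int cleanup_pack_sse2(__m128i v1, __m128i v2, const char *p, size_t off)
{
    __m128i z1 = _mm_cmpeq_epi32(v1, _mm_setzero_si128());   // 0xFFFFFFFF per zero dword
    __m128i z2 = _mm_cmpeq_epi32(v2, _mm_setzero_si128());
    __m128i packed = _mm_packs_epi32(z1, z2);                 // packssdw: 0xFFFF per zero dword
    unsigned zero_map = (unsigned)_mm_movemask_epi8(packed);  // 2 mask bits per dword
    unsigned idx = __builtin_ctz(~zero_map);                  // = 2 * dword_idx
    uint32_t chunk;
    memcpy(&chunk, p + off + 2*idx, sizeof(chunk));           // 2*idx = byte offset of that dword
    return (int)((off + 2*idx) * 8) + __builtin_ctz(chunk);
}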

You mentioned 512-bit vectors, but none of the CPUs you listed support AVX-512. Even if they did, you might want to avoid 512-bit vectors because of SIMD instructions lowering CPU frequency, unless your program spends a lot of time doing this and your data is hot in L1d cache so you can truly benefit, instead of still bottlenecking on L2 cache bandwidth. But even with 256-bit vectors, AVX-512 has new instructions that are useful for this:

  • integer compares (vpcmpb/w/d/q) have a choice of predicate, so you can compare for not-equal instead of having to invert later with NOT. Or even test-into-register (vptestmd), so you don't need a zeroed vector to compare against.
  • compare-into-mask is sort of like pcmpeq + movmsk, except the result is in a k register; you still need a kmovq rax, k0 before you can tzcnt it.
  • kortest - sets FLAGS according to whether the OR of two mask registers is non-zero. So the search loop could do vpcmpd k0, ymm0, [rdi] / vpcmpd k1, ymm0, [rdi+32] / kortestw k0, k1 (see the sketch below).
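A hedged sketch of how those AVX-512VL pieces might fit together, using vptestmd via intrinsics (a compiler may or may not turn the mask check into kortest; the function name is illustrative, and len is assumed to be a multiple of 64 bytes):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

int bitscan_avx512vl(const char *p, size_t len)
{
    const char *p_init = p;
    const char *endp = p + len;
    for (; p < endp; p += 64) {
        __m256i v1 = _mm256_loadu_si256((const __m256i*)p);
        __m256i v2 = _mm256_loadu_si256((const __m256i*)(p + 32));
        __mmask8 m1 = _mm256_test_epi32_mask(v1, v1);   // vptestmd: bit set per non-zero dword
        __mmask8 m2 = _mm256_test_epi32_mask(v2, v2);
        unsigned nonzero_map = (unsigned)m1 | ((unsigned)m2 << 8);
        if (nonzero_map) {
            unsigned idx = __builtin_ctz(nonzero_map);  // dword index within the cache line
            uint32_t chunk;
            memcpy(&chunk, p + 4*idx, sizeof(chunk));   // aliasing/alignment-safe load
            return (int)((p - p_init + 4*idx) * 8) + __builtin_ctz(chunk);
        }
    }
    return -1;
}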

You mention your real problem is that you have up to 20 arrays of bits, and you want to intersect them with AND and find the first set bit in the intersection.

You may want to do this in blocks of a few vectors, optimistically hoping that there will be a set bit somewhere early.

AND groups of 4 or 8 inputs, accumulating across the results with OR so you can tell if there were any 1s in this block of maybe 4 vectors from each input. (If there weren't any 1 bits, do another block of 4 vectors, 64 or 128 bytes, while you still have the pointers loaded, because the intersection would definitely be empty if you moved on to the other inputs now.) Tuning these chunk sizes depends on how sparse your 1s are, e.g. maybe always work in chunks of 6 or 8 vectors. Power-of-2 sizes are nice, though, because you can pad your allocations out to a multiple of 64 or 128 bytes so you don't have to worry about stopping early.
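For example, a sketch of that blocked AND+OR check for exactly 4 inputs and 64-byte blocks (an illustrative interface, not a drop-in implementation; it only reports which block has a non-empty intersection, after which you'd re-AND that block and bit-scan it as shown earlier):

#include <immintrin.h>
#include <stddef.h>

// Returns the byte offset of the first 64-byte block whose 4-way AND is
// non-zero, or -1 if the whole intersection is empty.
long first_nonzero_and_block(const char *a, const char *b,
                             const char *c, const char *d, size_t len)
{
    for (size_t off = 0; off < len; off += 64) {
        __m256i and_lo = _mm256_and_si256(
            _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(a + off)),
                             _mm256_loadu_si256((const __m256i*)(b + off))),
            _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(c + off)),
                             _mm256_loadu_si256((const __m256i*)(d + off))));
        __m256i and_hi = _mm256_and_si256(
            _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(a + off + 32)),
                             _mm256_loadu_si256((const __m256i*)(b + off + 32))),
            _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(c + off + 32)),
                             _mm256_loadu_si256((const __m256i*)(d + off + 32))));
        __m256i any = _mm256_or_si256(and_lo, and_hi);   // OR-accumulate across the block
        if (!_mm256_testz_si256(any, any))
            return (long)off;            // some 1 bit survived the intersection here
    }
    return -1;
}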

(For odd numbers of inputs, maybe pass the same pointer twice to a function expecting 4 inputs, instead of dispatching to special versions of the loop for every possible number.)

L1d cache is 8-way associative (12-way starting with Ice Lake), and a limited number of integer/pointer registers can make it a bad idea to try to read too many streams at once. You probably don't want a level of indirection that makes the compiler loop over an actual in-memory array of pointers, either.

