使用AVX-512或AVX-2对大数据计数1位(填充计数) [英] Counting 1 bits (population count) on large data using AVX-512 or AVX-2

查看:173
本文介绍了使用AVX-512或AVX-2对大数据计数1位(填充计数)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有很长的内存,例如256 KiB或更长时间.我想计算整个块中1位的数目,或者换句话说:将所有字节的填充计数"值相加.

I have a long chunk of memory, say, 256 KiB or longer. I want to count the number of 1 bits in this entire chunk, or in other words: Add up the "population count" values for all bytes.

我知道AVX-512具有 VPOPCNTDQ指令计算512位向量中每个连续64位中1位的数目,而IIANM应该可以在每个周期中发布一个(如果有合适的SIMD矢量寄存器)-但是我没有任何经验编写SIMD代码(我是GPU专家).另外,我不确定100%是否支持AVX-512目标的编译器.

I know that AVX-512 has a VPOPCNTDQ instruction which counts the number of 1 bits in each consecutive 64 bits within a 512-bit vector, and IIANM it should be possible to issue one of these every cycle (if an appropriate SIMD vector register is available) - but I don't have any experience writing SIMD code (I'm more of a GPU guy). Also, I'm not 100% sure about compiler support for AVX-512 targets.

在大多数CPU上,AVX-512仍然不受(完全)支持;但是AVX-2广泛可用.我找不到类似于VPOPCNTDQ的少于512位的向量化指令,因此即使从理论上讲,我也不确定如何使用具有AVX-2功能的CPU快速计数位数.也许这样的东西存在,我只是以某种方式错过了它?

On most CPUs, still, AVX-512 is not (fully) supported; but AVX-2 is widely-available. I've not been able to find an less-than-512-bit vectorized instruction similar to VPOPCNTDQ, so even theoretically I'm not sure how to count bits fast with AVX-2 capable CPUs; maybe something like this exists and I just missed it somehow?

无论如何,我会为两个指令集的每一个提供一个简短的C/C ++函数-使用一些intristics-wrapper库或使用内联汇编.签名是

Anyway, I'd appreciate a short C/C++ function - either using some intristics-wrapper library or with inline assembly - for each of the two instruction sets. The signature is

uint64_t count_bits(void* ptr, size_t size);

注意:

  • Related to How to quickly count bits into separate bins in a series of ints on Sandy Bridge? , but not a dupe.
  • We can assume the input is well-aligned, if that matters.
  • Forget about multiple cores or sockets, I want code for a single (thread on a single) core.

推荐答案

AVX-2

@HadiBreis的评论链接到文章上WojciechMuła使用SSSE3进行快速人口计数;该文章链接到此GitHub存储库;并且存储库具有以下AVX- 2实施.它基于矢量化的查找指令,并使用16值查找表来对半字节的位数进行计数.

AVX-2

@HadiBreis' comment links to an article on fast population-count with SSSE3, by Wojciech Muła; the article links to this GitHub repository; and the repository has the following AVX-2 implementation. It's based on a vectorized lookup instruction, and using a 16-value lookup table for the bit counts of nibbles.

#   include <immintrin.h>
#   include <x86intrin.h>

std::uint64_t popcnt_AVX2_lookup(const uint8_t* data, const size_t n) {

    size_t i = 0;

    const __m256i lookup = _mm256_setr_epi8(
        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4,

        /* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
        /* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
        /* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
        /* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4
    );

    const __m256i low_mask = _mm256_set1_epi8(0x0f);

    __m256i acc = _mm256_setzero_si256();

#define ITER { \
        const __m256i vec = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i)); \
        const __m256i lo  = _mm256_and_si256(vec, low_mask); \
        const __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(vec, 4), low_mask); \
        const __m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo); \
        const __m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi); \
        local = _mm256_add_epi8(local, popcnt1); \
        local = _mm256_add_epi8(local, popcnt2); \
        i += 32; \
    }

    while (i + 8*32 <= n) {
        __m256i local = _mm256_setzero_si256();
        ITER ITER ITER ITER
        ITER ITER ITER ITER
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));
    }

    __m256i local = _mm256_setzero_si256();

    while (i + 32 <= n) {
        ITER;
    }

    acc = _mm256_add_epi64(acc, _mm256_sad_epu8(local, _mm256_setzero_si256()));

#undef ITER

    uint64_t result = 0;

    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 0));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 1));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 2));
    result += static_cast<uint64_t>(_mm256_extract_epi64(acc, 3));

    for (/**/; i < n; i++) {
        result += lookup8bit[data[i]];
    }

    return result;
}

AVX-512

同一存储库还具有基于VPOPCNT的AVX-512实现:

AVX-512

The same repository also has a VPOPCNT-based AVX-512 implementation:

#   include <immintrin.h>
#   include <x86intrin.h>

uint64_t avx512_vpopcnt(const uint8_t* data, const size_t size) {

    const size_t chunks = size / 64;

    uint8_t* ptr = const_cast<uint8_t*>(data);
    const uint8_t* end = ptr + size;

    // count using AVX512 registers
    __m512i accumulator = _mm512_setzero_si512();
    for (size_t i=0; i < chunks; i++, ptr += 64) {

        // Note: a short chain of dependencies, likely unrolling will be needed.
        const __m512i v = _mm512_loadu_si512((const __m512i*)ptr);
        const __m512i p = _mm512_popcnt_epi64(v);

        accumulator = _mm512_add_epi64(accumulator, p);
    }

    // horizontal sum of a register
    uint64_t tmp[8] __attribute__((aligned(64)));
    _mm512_store_si512((__m512i*)tmp, accumulator);

    uint64_t total = 0;
    for (size_t i=0; i < 8; i++) {
        total += tmp[i];
    }

    // popcount the tail
    while (ptr + 8 < end) {
        total += _mm_popcnt_u64(*reinterpret_cast<const uint64_t*>(ptr));
        ptr += 8;
    }

    while (ptr < end) {
        total += lookup8bit[*ptr++];
    }

    return total;
}

lookup8bit是用于字节而不是位的popcnt查找表,并且已定义此处. 编辑:正如评论者所指出的,在末尾使用8位查找表并不是一个好主意,可以对其进行改进.

The lookup8bit is a popcnt lookup table for bytes rather than bits, and is defined here. edit: As commenters note, using an 8-bit lookup table at the end is not a very good idea and can be improved on.

这篇关于使用AVX-512或AVX-2对大数据计数1位(填充计数)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆