快速计数在__m128i寄存器组的比特数 [英] Fast counting the number of set bits in __m128i register

查看:512
本文介绍了快速计数在__m128i寄存器组的比特数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我应该算一个__m128i寄存器的设置的位数。
特别是,我应该写两个函数都能够计数寄存器的比特数,用以下方式

I should count the number of set bits of a __m128i register. In particular, I should write two functions that are able to count the number of bits of the register, using the following ways.


  1. 寄存器的组的比特的总数。

  2. 设置位为寄存器的每个字节数。

有没有可以执行全部或部分内部函数,上述操作?

Are there intrinsic functions that can perform, wholly or partially, the above operations?

推荐答案

下面是一些codeS我的老项目(<使用A HREF =htt​​ps://dl.acm.org/citation.cfm? ID = 2389097>有关于它)的研究论文。功能 popcnt8 以下计算中的每个字节设置的位数。

Here are some codes I used in an old project (there is a research paper about it). The function popcnt8 below computes the number of bits set in each byte.

只有SSE2版本(在黑客的喜悦书基于算法3):

SSE2-only version (based on Algorithm 3 in Hacker's Delight book):

static const __m128i popcount_mask1 = _mm_set1_epi8(0x77);
static const __m128i popcount_mask2 = _mm_set1_epi8(0x0F);
static inline __m128i popcnt8(__m128i x) {
    __m128i n;
    // Count bits in each 4-bit field.
    n = _mm_srli_epi64(x, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    n = _mm_srli_epi64(n, 1);
    n = _mm_and_si128(popcount_mask1, n);
    x = _mm_sub_epi8(x, n);
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4));
    x = _mm_and_si128(popcount_mask2, x);
    return x;
}

SSSE3版本(由于沃伊切赫穆拉):

static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
static inline __m128i popcnt8(__m128i n) {
    const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(n, popcount_mask));
    const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(n, 4), popcount_mask));
    return _mm_add_epi8(pcnt0, pcnt1);
}

XOP版本(相当于SSSE3,但使用XOP指令,速度更快的AMD推土机)

XOP version (equivalent to SSSE3, but uses XOP instructions which are faster on AMD Bulldozer)

static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
static const __m128i popcount_shift = _mm_set1_epi8(-4);
static inline __m128i popcount8(__m128i n) {
    const __m128i pcnt0 = _mm_perm_epi8(popcount_table, popcount_table, _mm_and_si128(n, popcount_mask));
    const __m128i pcnt1 = _mm_perm_epi8(popcount_table, popcount_table, _mm_shl_epi8(n, popcount_shift));
    return _mm_add_epi8(pcnt0, pcnt1);
}

功能 popcnt64 位中的低和高64位的零部件数量SSE寄存器下方计数

SSE2版本:

SSE2 version:

static inline __m128i popcnt64(__m128i n) {
    const __m128i cnt8 = popcnt8(n);
    return _mm_sad_epu8(cnt8, _mm_setzero_si128());
}

XOP版本:

static inline __m128i popcnt64(__m128i n) {
    const __m128i cnt8 = popcnt8(n);
    return _mm_haddq_epi8(cnt8);
}

最后,函数 popcnt128 计数低于整个128位寄存器的位数:

Finally, the function popcnt128 below count the number of bits in the whole 128-bit register:

static inline int popcnt128(__m128i n) {
    const __m128i cnt64 = popcnt64(n);
    const __m128i cnt64_hi = _mm_unpackhi_epi64(cnt64, cnt64);
    const __m128i cnt128 = _mm_add_epi32(cnt64, cnt64_hi);
    return _mm_cvtsi128_si32(cnt128);
}

不过,更有效的方式来实施 popcnt128 是使用硬件POPCNT指令(上支持它的处理器):

However, a more efficient way to implement popcnt128 is to use hardware POPCNT instruction (on processors which support it):

static inline int popcnt128(__m128i n) {
    const __m128i n_hi = _mm_unpackhi_epi64(n, n);
    #ifdef _MSC_VER
        return __popcnt64(_mm_cvtsi128_si64(n)) + __popcnt64(_mm_cvtsi128_si64(n_hi));
    #else
        return __popcntq(_mm_cvtsi128_si64(n)) + __popcntq(_mm_cvtsi128_si64(n_hi));
    #endif
}

这篇关于快速计数在__m128i寄存器组的比特数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆