如何使用 SIMD 计算字符出现次数 [英] How to count character occurrences using SIMD

查看:22
本文介绍了如何使用 SIMD 计算字符出现次数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我得到了一组小写字符(最多 1.5Gb)和一个字符 c.我想使用 AVX 指令找出字符 c 出现了多少次.

I am given a array of lowercase characters (up to 1.5Gb) and a character c. And I want to find how many occurrences are of the character c using AVX instructions.

unsigned long long char_count_AVX2(char * vector, int size, char c){
unsigned long long sum =0;
int i, j;
const int con=3;
__m256i ans[con];
for(i=0; i<con; i++)
    ans[i]=_mm256_setzero_si256();

__m256i Zer=_mm256_setzero_si256();
__m256i C=_mm256_set1_epi8(c);
__m256i Assos=_mm256_set1_epi8(0x01);
__m256i FF=_mm256_set1_epi8(0xFF);
__m256i shield=_mm256_set1_epi8(0xFF);
__m256i temp;
int couter=0;
for(i=0; i<size; i+=32){
    couter++;
    shield=_mm256_xor_si256(_mm256_cmpeq_epi8(ans[0], Zer), FF);
    temp=_mm256_cmpeq_epi8(C, *((__m256i*)(vector+i)));
    temp=_mm256_xor_si256(temp, FF);
    temp=_mm256_add_epi8(temp, Assos);
    ans[0]=_mm256_add_epi8(temp, ans[0]);
    for(j=1; j<con; j++){
        temp=_mm256_cmpeq_epi8(ans[j-1], Zer);
        shield=_mm256_and_si256(shield, temp);
        temp=_mm256_xor_si256(shield, FF);
        temp=_mm256_add_epi8(temp, Assos);
        ans[j]=_mm256_add_epi8(temp, ans[j]);
    }
}
for(j=con-1; j>=0; j--){
    sum<<=8;
    unsigned char *ptr = (unsigned char*)&(ans[j]);
    for(i=0; i<32; i++){
        sum+=*(ptr+i);
    }
}
return sum;

}

推荐答案

我故意省略了一些你需要自己弄清楚的部分(例如处理不是 4*255 倍数的长度)*32 字节),但最内层的循环应该类似于以 for(int i...) 开头的循环:

I'm intentionally leaving out some parts, which you need to figure out yourself (e.g. handling lengths that aren't a multiple of 4*255*32 bytes), but your most inner loop should look something like the one starting with for(int i...):

_mm256_cmpeq_epi8 会在每个字节中得到 -1,您可以将其用作整数.如果从计数器中减去它(使用 _mm256_sub_epi8),您可以直接计数到 255 或 128.内部循环仅包含这两个内部函数.你必须停下来

_mm256_cmpeq_epi8 will get you a -1 in each byte, which you can use as an integer. If you subtract that from a counter (using _mm256_sub_epi8) you can directly count up to 255 or 128. The inner loop contains just these two intrinsics. You have to stop and

#include <immintrin.h>
#include <stdint.h>

static inline
__m256i hsum_epu8_epu64(__m256i v) {
    return _mm256_sad_epu8(v, _mm256_setzero_si256());  // SAD against zero is a handy trick
}

static inline
uint64_t hsum_epu64_scalar(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    __m128i sum2x64 = _mm_add_epi64(lo, hi);   // narrow to 128

    hi = _mm_unpackhi_epi64(sum2x64, sum2x64);
    __m128i sum = _mm_add_epi64(hi, sum2x64);  // narrow to 64
    return _mm_cvtsi128_si64(sum);
}


unsigned long long char_count_AVX2(char const* vector, size_t size, char c)
{
    __m256i C=_mm256_set1_epi8(c);

    // todo: count elements and increment `vector` until it is aligned to 256bits (=32 bytes)
    __m256i const * simd_vector = (__m256i const *) vector;
     // *simd_vector is an alignment-required load, unlike _mm256_loadu_si256()

    __m256i sum64 = _mm256_setzero_si256();
    size_t unrolled_size_limit = size - 4*255*32 + 1;
    for(size_t k=0; k<unrolled_size_limit ; k+=4*255*32) // outer loop: TODO
    {
        __m256i counter[4]; // multiple counter registers to hide latencies
        for(int j=0; j<4; j++)
            counter[j]=_mm256_setzero_si256();
        // inner loop: make sure that you don't go beyond the data you can read
        for(int i=0; i<255; ++i)
        {   // or limit this inner loop to ~22 to avoid branch mispredicts
            for(int j=0; j<4; ++j)
            {
                counter[j]=_mm256_sub_epi8(counter[j],           // count -= 0 or -1
                                           _mm256_cmpeq_epi8(*simd_vector, C));
                ++simd_vector;
            }
        }

        // only need one outer accumulator: OoO exec hides the latency of adding into it
        sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[0]));
        sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[1]));
        sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[2]));
        sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[3]));
    }

    uint64_t sum = hsum_epu64_scalar(sum64);

    // TODO add up remaining bytes with sum.
    // Including a rolled-up vector loop before going scalar
    //  because we're potentially a *long* way from the end

    // Maybe put some logic into the main loop to shorten the 255 inner iterations
    // if we're close to the end.  A little bit of scalar work there shouldn't hurt every 255 iters.

    return sum;
}

Godbolt 链接:https://godbolt.org/z/do5e3-(clang 是在展开最内层循环方面比 gcc 略好:gcc 包含一些无用的 vmovdqa 指令,如果数据在 L1d 缓存中很热,这些指令将限制前端,阻止我们运行接近 2x 32 字节每个时钟负载)

Godbolt link: https://godbolt.org/z/do5e3- (clang is slightly better than gcc at unrolling the most inner loop: gcc includes some useless vmovdqa instructions that will bottleneck the front-end if the data is hot in L1d cache, preventing us from running close to 2x 32-byte loads per clock)

这篇关于如何使用 SIMD 计算字符出现次数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆