parallel prefix (cumulative) sum with SSE

Question

I'm looking for some advice on how to do a parallel prefix sum with SSE. I'm interested in doing this on an array of ints, floats, or doubles.

I have come up with two solutions. A special case and a general case. In both cases the solution runs over the array in two passes in parallel with OpenMP. For the special case I use SSE on both passes. For the general case I use it only on the second pass.

My main question is: how can I use SSE on the first pass in the general case? The following link, simd-prefix-sum-on-intel-cpu, shows an improvement for bytes but not for 32-bit data types.

The reason the special case is called special is that it requires the array to be in a special format. For example, let's assume there were only 16 elements of an array a of floats. Then if the array was rearranged like this (array of structs to struct of arrays):

a[0] a[1] ... a[15] -> a[0] a[4] a[8] a[12] a[1] a[5] a[9] a[13] ... a[3] a[7] a[11] a[15]

SSE vertical sums could be used on both passes. However, this would only be efficient if the arrays were already in the special format and the output could be used in the special format. Otherwise an expensive rearrangement would have to be done on both the input and the output, which would make it much slower than the general case.
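
For illustration, here is a minimal sketch of what that could look like, assuming the input is already transposed as above, n is a multiple of 4, and both buffers are 16-byte aligned. prefix_sum_transposed_SSE is a hypothetical name, not code from the question:

#include <immintrin.h>

// Sketch only. Assumes the data is transposed so that lane j of vector i
// holds a[j*(n/4) + i], i.e. each SSE lane owns one contiguous quarter of
// the logical array. The result s is produced in the same transposed layout.
void prefix_sum_transposed_SSE(const float *a, float *s, const int n) {
    const int m = n / 4;                  // elements per lane
    __m128 acc = _mm_setzero_ps();
    // Pass 1: vertical running sum; lane j scans its own quarter.
    for (int i = 0; i < m; i++) {
        acc = _mm_add_ps(acc, _mm_load_ps(&a[4 * i]));
        _mm_store_ps(&s[4 * i], acc);
    }
    // acc now holds the four quarter totals (S0, S1, S2, S3). Turn them into
    // per-lane offsets (0, S0, S0+S1, S0+S1+S2) with a shift-and-add scan.
    __m128 off = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(acc), 4));
    off = _mm_add_ps(off, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(off), 4)));
    off = _mm_add_ps(off, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(off), 8)));
    // Pass 2: add each lane's carry-in offset vertically.
    for (int i = 0; i < m; i++)
        _mm_store_ps(&s[4 * i], _mm_add_ps(_mm_load_ps(&s[4 * i]), off));
}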

Maybe I should consider a different algorithm for the prefix sum (e.g. a binary tree)?

Code for the general case:

#include <immintrin.h>
#include <omp.h>

// General case: pass 1 does a sequential partial scan per thread, pass 2 adds
// each thread's carry-in offset with SSE. Assumes n is a multiple of 4 and,
// for the two static schedules to assign matching element ranges, that n is
// divisible by 4*nthreads. _mm_load_pd also requires s to be 16-byte aligned.
void prefix_sum_omp_sse(double a[], double s[], int n) {
    double *suma;
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        #pragma omp single
        {
            suma = new double[nthreads + 1];
            suma[0] = 0;
        }
        double sum = 0;
        #pragma omp for schedule(static) nowait //first parallel pass
        for (int i = 0; i < n; i++) {
            sum += a[i];
            s[i] = sum;
        }
        suma[ithread + 1] = sum;
        #pragma omp barrier
        #pragma omp single
        {
            // scan of the per-thread totals: suma[t] becomes the
            // carry-in offset for thread t
            double tmp = 0;
            for (int i = 0; i < (nthreads + 1); i++) {
                tmp += suma[i];
                suma[i] = tmp;
            }
        }
        __m128d offset = _mm_set1_pd(suma[ithread]);
        #pragma omp for schedule(static) //second parallel pass with SSE as well
        for (int i = 0; i < n/4; i++) {
            __m128d tmp1 = _mm_load_pd(&s[4*i]);
            tmp1 = _mm_add_pd(tmp1, offset);
            __m128d tmp2 = _mm_load_pd(&s[4*i+2]);
            tmp2 = _mm_add_pd(tmp2, offset);
            _mm_store_pd(&s[4*i], tmp1);
            _mm_store_pd(&s[4*i+2], tmp2);
        }
    }
    delete[] suma;
}
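
A minimal driver for the function above might look like this (hypothetical; the compile flags and use of plain new are my assumptions — on x86-64, new double[] is normally 16-byte aligned as _mm_load_pd requires):

// Build with e.g.: g++ -O3 -fopenmp prefix.cpp
#include <cstdio>

int main() {
    const int n = 1024;                 // multiple of 4, as the code assumes
    double *a = new double[n], *s = new double[n];
    for (int i = 0; i < n; i++) a[i] = 1.0;
    prefix_sum_omp_sse(a, s, n);
    printf("s[n-1] = %g (expected %d)\n", s[n-1], n);
    delete[] a;
    delete[] s;
}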

Answer

This is the first time I'm answering my own question, but it seems appropriate. Based on hirschhornsalz's answer for a prefix sum on 16 bytes (simd-prefix-sum-on-intel-cpu), I have come up with a solution for using SIMD on the first pass for 4, 8, and 16 32-bit words.

The general theory goes as follows. A sequential scan of n words takes n additions (n-1 to scan the n words and one more addition carried from the previous set of words scanned). However, using SIMD, n words can be scanned in log2(n) additions and an equal number of shifts, plus one more addition and a broadcast to carry from the previous SIMD scan. So for some value of n the SIMD method will win.
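
To make the counting concrete, here is a scalar model (illustration only, not actual SIMD code) of the two shift-and-add steps for 4 lanes; the extra addition and the broadcast carry the running offset from one vector to the next:

// Scalar model of a 4-lane shift-and-add scan: 2 shifts, 2 in-register adds.
void scan4_model(float x[4]) {
    float t1[4] = {0, x[0], x[1], x[2]};        // shift by 1 lane
    for (int i = 0; i < 4; i++) x[i] += t1[i];  // x = a, a+b, b+c, c+d
    float t2[4] = {0, 0, x[0], x[1]};           // shift by 2 lanes
    for (int i = 0; i < 4; i++) x[i] += t2[i];  // x = a, a+b, a+b+c, a+b+c+d
}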

Let's look at 32-bit words with SSE, AVX, and AVX-512:

4 32-bit words (SSE):      2 shifts, 3 adds, 1 broadcast       sequential: 4 adds
8 32-bit words (AVX):      3 shifts, 4 adds, 1 broadcast       sequential: 8 adds
16 32-bit words (AVX-512): 4 shifts, 5 adds, 1 broadcast       sequential: 16 adds

Based on that it appears SIMD won't be useful for a scan for 32-bit words until AVX-512. This also assumes that the shifts and broadcast can be done in only 1 instruction. This is true for SSE but not for AVX and maybe not even for AVX2.

In any case I put together some working and tested code which does a prefix sum using SSE.

#include <immintrin.h>

inline __m128 scan_SSE(__m128 x) {
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4))); // shift in zeros by one float, add
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8))); // shift by two floats, add
    return x;
}

// Assumes n is a multiple of 4 and that a and s are 16-byte aligned
// (required by _mm_load_ps/_mm_store_ps).
void prefix_sum_SSE(float *a, float *s, const int n) {
    __m128 offset = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(&a[i]);
        __m128 out = scan_SSE(x);
        out = _mm_add_ps(out, offset);
        _mm_store_ps(&s[i], out);
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3)); // broadcast last element as carry
    }
}
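
For example, calling it on a small aligned buffer (a hypothetical usage sketch, not part of the original answer):

#include <cstdio>

int main() {
    alignas(16) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(16) float s[8];
    prefix_sum_SSE(a, s, 8);
    for (int i = 0; i < 8; i++) printf("%g ", s[i]); // prints: 1 3 6 10 15 21 28 36
    printf("\n");
}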

Notice that the scan_SSE function has two additions (_mm_add_ps) and two shifts (_mm_slli_si128). The casts are only used to make the compiler happy and don't get converted to instructions. Then inside the main loop over the array in prefix_sum_SSE another addition and one shuffle is used. That's 6 operations total compared to only 4 additions with the sequential sum.

Here is a working solution for AVX:

inline __m256 scan_AVX(__m256 x) {
    __m256 t0, t1;
    // _mm256_permute2f128_ps(t0, t0, 41): imm 41 = 0x29 zeroes the low
    // 128-bit lane and moves t0's low lane into the high lane; blending it
    // with the in-lane rotate yields whole-register shifts by one/two floats.
    //shift1_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x11));
    //shift2_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x33));
    //shift3_AVX + add
    x = _mm256_add_ps(x, _mm256_permute2f128_ps(x, x, 41));
    return x;
}

void prefix_sum_AVX(float *a, float *s, const int n) {
    __m256 offset = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 x = _mm256_loadu_ps(&a[i]);
        __m256 out = scan_AVX(x);
        out = _mm256_add_ps(out, offset);
        _mm256_storeu_ps(&s[i], out);
        //broadcast last element
        __m256 t0 = _mm256_permute2f128_ps(out, out, 0x11);
        offset = _mm256_permute_ps(t0, 0xff);
    }   
}

The three shifts need 7 intrinsics. The broadcast needs 2 intrinsics. So with the 4 additions that's 13 intrinsics. For AVX2 only 5 intrinsics are needed for the shifts, so 11 intrinsics total. The sequential sum only needs 8 additions. Therefore it's likely that neither AVX nor AVX2 will be useful for the first pass.

So I finally benchmarked this and the results are unexpected. The SSE and AVX code are both about twice as fast as the following sequential code:

void scan(float a[], float s[], int n) {
    float sum = 0;
    for (int i = 0; i<n; i++) {
        sum += a[i];
        s[i] = sum;
    }
}
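
A minimal timing harness along these lines could reproduce the comparison (the buffer size matches the 512k floats mentioned at the end of the answer; the repeat count is arbitrary, and the SSE version additionally needs 16-byte-aligned data):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 19;                   // 512k floats
    std::vector<float> a(n, 1.0f), s(n);
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < 100; r++)
        scan(a.data(), s.data(), n);         // swap in prefix_sum_SSE to compare
    auto t1 = std::chrono::steady_clock::now();
    printf("%.3f ms per scan\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count() / 100);
}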

I guess this is due to instruction level parallelism: the sequential loop carries a dependency through sum (each iteration must wait for the previous add), so it runs at the latency of a floating-point add per element, while the SIMD versions expose more independent additions.

So this answers my own question: I succeeded in using SIMD for the first pass in the general case. When I combined it with OpenMP on my 4-core Ivy Bridge system, the total speedup for 512k floats is about 7.
