How to vectorize a[i] = a[i-1] + c with AVX2

Question

I want to vectorize a[i] = a[i-1] + c using AVX2 instructions. At first glance it seems unvectorizable because of the loop-carried dependency: each element depends on the one before it. I have vectorized it anyway and want to share my answer here, to see whether there is a better approach or whether my solution is sound.
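
For reference, the loop in question can be written as the scalar sketch below. The function name scalar_baseline and its signature are illustrative assumptions, not part of the original post; seeding a[0] with b follows what the answer's code does.

```c
#include <stddef.h>

/* Hypothetical scalar baseline: a[0] = b, then a[i] = a[i-1] + c. */
void scalar_baseline(int *a, size_t n, int b, int c)
{
    if (n == 0)
        return;
    a[0] = b;
    /* Each iteration reads the value written by the previous iteration. */
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + c;
}
```

Every iteration needs the result of the one before it, which is the dependency that blocks a straightforward 8-wide vectorization.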

Recommended answer

I have implemented the following function to vectorize this, and it seems to work. In my measurements the speedup is 2.5x over the plain loop compiled with gcc -O3. Here is the solution:

#include <immintrin.h>

// vectorized
// LEN is a compile-time constant defined elsewhere; this code assumes LEN >= 8
inline void vec(int a[LEN], int b, int c)
{
    // b=1 and c=2 in the worked values below
    int i = 0;
    a[i++] = b;          // a[0] = 1
    // step 1:
    // resolve the dependency by hand for the first 8 elements (vectorization factor is 8)
    a[i++] = a[0] + 1*c; // a[1] = 1 + 2  = 3
    a[i++] = a[0] + 2*c; // a[2] = 1 + 4  = 5
    a[i++] = a[0] + 3*c; // a[3] = 1 + 6  = 7
    a[i++] = a[0] + 4*c; // a[4] = 1 + 8  = 9
    a[i++] = a[0] + 5*c; // a[5] = 1 + 10 = 11
    a[i++] = a[0] + 6*c; // a[6] = 1 + 12 = 13
    a[i++] = a[0] + 7*c; // a[7] = 1 + 14 = 15
    // step 2:
    // vectorization factor reached: from here on a[i] = a[i-8] + 8*c in every lane
    __m256i dep1, dep2; // first block of a = { 1, 3, 5, 7, 9, 11, 13, 15 }
    __m256i coeff = _mm256_set1_epi32(8*c); // coeff = { 16, 16, 16, 16, 16, 16, 16, 16 }

    // unrolled by 2: each iteration writes a[i..i+15], so loop only while 16 elements still fit
    // unaligned load/store is used so that a does not have to be 32-byte aligned
    for(; i + 16 <= LEN; i += 16){

        dep1 = _mm256_loadu_si256((const __m256i *) &a[i-8]);
        dep1 = _mm256_add_epi32(dep1, coeff);
        _mm256_storeu_si256((__m256i *) &a[i], dep1);

        // dep1 already holds a[i..i+7], so there is no need to reload it from memory
        dep2 = _mm256_add_epi32(dep1, coeff);
        _mm256_storeu_si256((__m256i *) &a[i+8], dep2);

    }

    // scalar tail for whatever is left when (LEN - 8) is not a multiple of 16
    for(; i < LEN; i++)
        a[i] = a[i-1] + c;
}
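
One possible refinement, offered as a sketch and not taken from the answer above, is to keep the running block of eight values in a register so the loop never reads a back from memory. The names fill, fill_avx2 and fill_scalar are hypothetical. The target attribute and __builtin_cpu_supports are GCC/Clang extensions, and the sketch assumes an x86 build with one of those compilers.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical scalar fallback: a[0] = b, a[i] = a[i-1] + c. */
static void fill_scalar(int *a, size_t n, int b, int c)
{
    for (size_t i = 0; i < n; i++)
        a[i] = (i == 0) ? b : a[i - 1] + c;
}

/* Hypothetical AVX2 version; the attribute lets it compile without -mavx2. */
__attribute__((target("avx2")))
static void fill_avx2(int *a, size_t n, int b, int c)
{
    /* Lane k starts as b + k*c, i.e. the first eight elements of a. */
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i cur  = _mm256_add_epi32(
        _mm256_set1_epi32(b),
        _mm256_mullo_epi32(lane, _mm256_set1_epi32(c)));
    /* Moving one block forward adds 8*c to every lane. */
    __m256i step = _mm256_set1_epi32(8 * c);

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_si256((__m256i *)(a + i), cur);
        cur = _mm256_add_epi32(cur, step);
    }
    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; i++)
        a[i] = (i == 0) ? b : a[i - 1] + c;
}

/* Picks the AVX2 path only when the running CPU reports support for it. */
void fill(int *a, size_t n, int b, int c)
{
    if (__builtin_cpu_supports("avx2"))
        fill_avx2(a, n, b, c);
    else
        fill_scalar(a, n, b, c);
}
```

The only loop-carried value left is cur, which advances with a single vector add per eight elements. Whether this is actually faster than the load-based version would need to be measured on the target machine.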
