Performance with SSE is the same

Problem description

I vectorized the following loop, which crops up in an application that I am developing:

void vecScl(Node** A, Node* B, long val){

    int fact = round( dot / const);

    for(int i=0; i<SIZE; i++)
        (*A)->vector[i] -= fact * B->vector[i];

}

This is the SSE code:

void vecSclSSE(Node** A, Node* B, long val){

    int fact = round( dot / const);

    __m128i vecPi, vecQi, vecCi, vecQCi, vecResi;

    int sseBound = SIZE/4;
    int i, j;

    for(i=0,j=0;  j<sseBound  ; i+=4,j++){

        vecPi = _mm_loadu_si128((__m128i *)&((*A)->vector)[i] );
        vecQi = _mm_set_epi32(fact,fact,fact,fact);
        vecCi = _mm_loadu_si128((__m128i *)&((B)->vector)[i] );
        vecQCi = _mm_mullo_epi32(vecQi,vecCi);
        vecResi = _mm_sub_epi32(vecPi,vecQCi);               
        _mm_storeu_si128((__m128i *) (((*A)->vector) + i), vecResi );

    }

    //Compute remaining positions if SIZE % 4 != 0
    for(; i<SIZE ;i++)
        (*A)->vector[i] -= fact * B->vector[i];

}

While this works in terms of correctness, the performance is exactly the same with and without SSE. I am compiling the code with:

 g++ *.cpp *.h -msse4.1 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -Warray-bounds -O2

Is this because I am not allocating aligned memory (and using the aligned SSE load/store functions accordingly)? The code is very complicated to change, so I have been avoiding that for now.

BTW, in terms of further improvements, and considering that I am limited to the Sandy Bridge architecture, what is the best that I can do?

EDIT: The compiler is not vectorizing the code yet. First, I changed the data types of the vectors to shorts, which doesn't change performance. Then, I compiled with -fno-tree-vectorize and the performance is the same.

Thanks a lot

Recommended answer

If your data is large then you may simply be memory-bound, since you are doing very few ALU operations per load/store: each group of four elements needs two 16-byte loads and one 16-byte store, but only one multiply and one subtract.
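One rough way to check this is to time the same kernel on a working set that fits in cache and on one that clearly does not. The sketch below is only an illustration: `scaleSubtract` and `nsPerElement` are hypothetical names, the flat `int32_t` arrays stand in for the `Node` vectors from the question, and `n` is assumed to be greater than zero.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar kernel with the same shape as the loop in the question:
// a[i] -= fact * b[i]. (Hypothetical stand-in for vecScl.)
void scaleSubtract(int32_t* a, const int32_t* b, std::size_t n, int32_t fact) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] -= fact * b[i];
}

// Average nanoseconds per element over `reps` passes on arrays of n elements.
// Assumes n > 0 and reps > 0.
double nsPerElement(std::size_t n, int reps) {
    std::vector<int32_t> a(n, 0), b(n, 1);
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        scaleSubtract(a.data(), b.data(), n, 1);
    auto stop = std::chrono::steady_clock::now();
    // Read one result so the compiler is less likely to discard the work.
    volatile int32_t sink = a[n / 2];
    (void)sink;
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / (static_cast<double>(n) * reps);
}
```

Comparing something like `nsPerElement(4096, 100000)` with `nsPerElement(std::size_t(1) << 24, 10)` gives a first impression. If the per-element time is much higher for the large arrays, memory bandwidth is probably the limit and SIMD on its own is unlikely to help much.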

However there are a few minor improvements you can try:

inline void vecSclSSE(Node** A, Node* B, long val){
                                            // make function inline, for cases where `val` is small

    const int fact = (dot + const / 2 - 1) / const;
                                            // use integer arithmetic here if possible

    const __m128i vecQi = _mm_set1_epi32(fact);
                                            // hoist constant initialisation out of loop

    int32_t * const pA = (*A)->vector;      // hoist invariant de-references out of loop
    int32_t * const pB = B->vector;

    __m128i vecPi, vecCi, vecQCi, vecResi;

    int i;                                  // declared outside so the remainder loop can reuse it
    for(i = 0; i < SIZE - 3; i += 4) {      // use one loop variable
        vecPi = _mm_loadu_si128((__m128i *)&(pA[i]));
        vecCi = _mm_loadu_si128((__m128i *)&(pB[i]));
        vecQCi = _mm_mullo_epi32(vecQi,vecCi);
        vecResi = _mm_sub_epi32(vecPi,vecQCi);
        _mm_storeu_si128((__m128i *)&(pA[i]), vecResi);
    }

    //Compute remaining positions if SIZE % 4 != 0
    for(; i<SIZE ;i++)
        pA[i] -= fact * pB[i];

}
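
For anyone who wants to try this outside the original code base, here is a minimal self-contained sketch of the same kernel. It makes several assumptions that are not in the question: it takes flat `int32_t` pointers and an explicit `size` and `fact` instead of `Node`, `SIZE`, `dot` and `const`; the names `vecSclScalar` and `vecSclSSE41` are made up; and it uses the GCC/Clang `target("sse4.1")` function attribute so it compiles without `-msse4.1`, falling back to the scalar loop on non-x86 targets.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_SSE41 1   // hypothetical macro used only in this sketch
#endif

// Scalar reference: pA[i] -= fact * pB[i].
void vecSclScalar(int32_t* pA, const int32_t* pB, int size, int32_t fact) {
    for (int i = 0; i < size; ++i)
        pA[i] -= fact * pB[i];
}

#ifdef SKETCH_HAVE_SSE41
__attribute__((target("sse4.1")))   // GCC/Clang extension; enables SSE4.1 for this function only
#endif
void vecSclSSE41(int32_t* pA, const int32_t* pB, int size, int32_t fact) {
    int i = 0;
#ifdef SKETCH_HAVE_SSE41
    // Broadcast the factor once, outside the loop.
    const __m128i vecQi = _mm_set1_epi32(fact);
    // Process four 32-bit elements per iteration with unaligned loads/stores.
    for (; i + 3 < size; i += 4) {
        __m128i vecPi  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pA + i));
        __m128i vecCi  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pB + i));
        __m128i vecQCi = _mm_mullo_epi32(vecQi, vecCi);   // SSE4.1 32-bit multiply
        __m128i vecRes = _mm_sub_epi32(vecPi, vecQCi);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(pA + i), vecRes);
    }
#endif
    // Remaining positions when size % 4 != 0 (or everything on non-x86 builds).
    for (; i < size; ++i)
        pA[i] -= fact * pB[i];
}
```

Running both functions on copies of the same input and comparing the results is a cheap way to confirm the vector path and the remainder loop agree before measuring anything.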
