Performance with SSE is the same
Problem description
I vectorized the following loop, that crops up in an application that I am developing:
void vecScl(Node** A, Node* B, long val){
    // `dot` and `CONST` are assumed defined elsewhere; the divisor is
    // written CONST here because `const` is a C++ keyword
    int fact = round( dot / CONST );
    for(int i = 0; i < SIZE; i++)
        (*A)->vector[i] -= fact * B->vector[i];
}
This is the SSE code:
void vecSclSSE(Node** A, Node* B, long val){
    int fact = round( dot / CONST );   // `dot` and `CONST` assumed defined elsewhere
    __m128i vecPi, vecQi, vecCi, vecQCi, vecResi;
    int sseBound = SIZE/4;
    int i, j;
    for(i=0, j=0; j<sseBound; i+=4, j++){
        vecPi  = _mm_loadu_si128((__m128i *)&((*A)->vector)[i]);
        vecQi  = _mm_set_epi32(fact, fact, fact, fact);
        vecCi  = _mm_loadu_si128((__m128i *)&((B)->vector)[i]);
        vecQCi = _mm_mullo_epi32(vecQi, vecCi);
        vecResi = _mm_sub_epi32(vecPi, vecQCi);
        _mm_storeu_si128((__m128i *)(((*A)->vector) + i), vecResi);
    }
    // Compute remaining positions if SIZE % 4 != 0
    for(; i<SIZE; i++)
        (*A)->vector[i] -= fact * B->vector[i];
}
While this works in terms of correctness, the performance is exactly the same with and without SSE. I am compiling the code with:
g++ *.cpp *.h -msse4.1 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -Warray-bounds -O2
Is this because I am not allocating aligned memory (and using the aligned SSE intrinsics accordingly)? The code is very complicated to change, so I have been avoiding that for now.
BTW, in terms of further improvements, and considering that I am bound to the Sandy Bridge architecture, what is the best that I can do?
EDIT: The compiler is not vectorizing the code yet. First, I changed the data types of the vectors to shorts, which doesn't change performance. Then, I compiled with -fno-tree-vectorize and the performance is the same.
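Incidentally, the switch to short elements would let the multiply step drop back to SSE2, where _mm_mullo_epi16 is available on every x86-64 CPU (the 32-bit _mm_mullo_epi32 is what forces SSE4.1). A minimal sketch of that variant, with hypothetical array arguments instead of the Node structure, and assuming the element count is a multiple of 8 for brevity:

```cpp
#include <emmintrin.h>  // SSE2 only; needs no extra -m flags on x86-64
#include <cstdint>

// Sketch: a[i] -= fact * b[i] over 16-bit elements, 8 lanes per iteration.
// `n` is assumed to be a multiple of 8 here; a scalar remainder loop
// would handle the leftover elements otherwise.
void vecSclSSE2_short(int16_t* a, const int16_t* b, int16_t fact, int n) {
    const __m128i vfact = _mm_set1_epi16(fact);        // broadcast the factor
    for (int i = 0; i < n; i += 8) {
        __m128i va   = _mm_loadu_si128((const __m128i*)&a[i]);
        __m128i vb   = _mm_loadu_si128((const __m128i*)&b[i]);
        __m128i prod = _mm_mullo_epi16(vfact, vb);     // low 16 bits of fact*b[i]
        _mm_storeu_si128((__m128i*)&a[i], _mm_sub_epi16(va, prod));
    }
}
```

Note that _mm_mullo_epi16 keeps only the low 16 bits of each product, so this matches the scalar loop only while fact * b[i] fits in a short.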
Thanks a lot
Recommended answer
If your data is large then you may just be memory-bound, since you are doing very few ALU operations per load/store.
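To put rough numbers on that: each element loads two 4-byte ints and stores one, i.e. 12 bytes of memory traffic, against only a multiply and a subtract. A back-of-the-envelope sketch of the arithmetic intensity (illustrative figures, not measurements):

```cpp
// Per element the loop touches: load A[i] (4 bytes) + load B[i] (4 bytes)
// + store A[i] (4 bytes) = 12 bytes of memory traffic.
constexpr int kBytesPerElem = 4 + 4 + 4;

// Against that, only two ALU operations: one multiply and one subtract.
constexpr int kOpsPerElem = 2;

// Arithmetic intensity in ops per byte. At roughly 0.17 ops/byte, arrays
// larger than the last-level cache are limited by DRAM bandwidth, so
// speeding up the ALU work with SSE barely changes the total runtime.
constexpr double kIntensity =
    static_cast<double>(kOpsPerElem) / kBytesPerElem;
```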
However there are a few minor improvements you can try:
inline void vecSclSSE(Node** A, Node* B, long val){
    // make function inline, for cases where `val` is small
    const int fact = (dot + CONST / 2) / CONST;
    // use integer arithmetic for the rounded division (matches round() for
    // non-negative dot); `dot` and `CONST` assumed defined elsewhere
    const __m128i vecQi = _mm_set1_epi32(fact);
    // hoist constant initialisation out of loop
    int32_t * const pA = (*A)->vector;  // hoist invariant de-references out of loop
    int32_t * const pB = B->vector;
    __m128i vecPi, vecCi, vecQCi, vecResi;
    int i;
    for(i = 0; i < SIZE - 3; i += 4){   // use one loop variable
        vecPi  = _mm_loadu_si128((__m128i *)&(pA[i]));
        vecCi  = _mm_loadu_si128((__m128i *)&(pB[i]));
        vecQCi = _mm_mullo_epi32(vecQi, vecCi);
        vecResi = _mm_sub_epi32(vecPi, vecQCi);
        _mm_storeu_si128((__m128i *)&(pA[i]), vecResi);
    }
    // Compute remaining positions if SIZE % 4 != 0
    for(; i<SIZE; i++)
        pA[i] -= fact * pB[i];
}
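On the alignment question: if you can change the allocation, 16-byte alignment lets the _mm_loadu_si128 / _mm_storeu_si128 calls above become _mm_load_si128 / _mm_store_si128. A sketch of aligned allocation for the vectors, using C++17's std::aligned_alloc (posix_memalign or _mm_malloc are alternatives; the helper name is hypothetical):

```cpp
#include <cstdlib>
#include <cstdint>

// Allocate `n` int32_t elements on a 16-byte boundary, as SSE's aligned
// loads/stores require. aligned_alloc needs the total size to be a
// multiple of the alignment, so round the byte count up to 16.
int32_t* alloc_vec16(std::size_t n) {
    std::size_t bytes = ((n * sizeof(int32_t) + 15) / 16) * 16;
    return static_cast<int32_t*>(std::aligned_alloc(16, bytes));
}
```

That said, on Sandy Bridge the penalty for the unaligned-load instructions is small when the data happens to be aligned anyway, so measure before committing to the refactor.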