Loop unrolling using SSE2 intrinsics
Question
My SSE2 code is long and slow. How can I make it faster? Also, _mm_store_si128() fails but _mm_storeu_si128() is accepted. Why?
void tom::add(void* ptr)
{
    __declspec(align(16)) short* b = (short*)ptr;
    int j;
#if cplusplus
    for(j = 0; j < 4; j++)
    {
        /// 1st stage transform.
        int x0 = (int)(b[j] + b[j+12]);
        int x3 = (int)(b[j] - b[j+12]);
        int x1 = (int)(b[j+4] + b[j+8]);
        int x2 = (int)(b[j+4] - b[j+8]);
        /// 2nd stage transform.
        b[j] = (short)(x0 + x1);
        b[j+8] = (short)(x0 - x1);
        b[j+4] = (short)(x2 + (x3 << 1));
        b[j+12] = (short)(x3 - (x2 << 1));
    }//end for j...
#else
    __m128i f0, f1, f2, f3;
    j = 0;
    f0 = _mm_set_epi32(b[j+3], b[j+2], b[j+1], b[j]);
    f1 = _mm_set_epi32(b[j+7], b[j+6], b[j+5], b[j+4]);
    f2 = _mm_set_epi32(b[j+11], b[j+10], b[j+9], b[j+8]);
    f3 = _mm_set_epi32(b[j+15], b[j+14], b[j+13], b[j+12]);
    __declspec(align(16)) __m128i* b = (__m128i*)ptr;
    __m128i temp0, temp1, temp2, temp3, temp4;
    temp0 = f0;
    temp1 = f1;
    temp2 = f2;
    temp3 = f3;
    temp0 = _mm_add_epi16(temp0, f3);
    temp1 = _mm_add_epi16(temp1, f2);
    f0 = _mm_sub_epi16(f0, f3);
    f1 = _mm_sub_epi16(f1, f2);
    temp4 = temp0;
    temp4 = _mm_add_epi16(temp4, temp1);
    _mm_storeu_si128(b, temp4);
    temp0 = _mm_sub_epi16(temp0, temp1);
    _mm_storeu_si128(b+2, temp0);
    temp1 = f0;
    temp4 = f1;
    temp1 = _mm_slli_epi16(temp1, 1);
    temp4 = _mm_slli_epi16(temp4, 1);
    f0 = _mm_add_epi16(f0, temp4);
    f1 = _mm_sub_epi16(f1, temp1);
    _mm_storeu_si128(b+1, f0);
    _mm_storeu_si128(b+3, f1);
#endif
}
Recommended answer
_mm_store_si128() does not work because the data is not aligned. I believe your declaration(s) of the pointer variable b give that variable itself 16-byte alignment, not the memory it points to. For _mm_store_si128() to work, you would have to make sure that the data pointed to by the ptr parameter is 16-byte aligned. The _mm_storeu_si128() intrinsic works because it accepts unaligned addresses.
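To make the alignment point concrete, here is a minimal sketch rather than a definitive rewrite. It assumes the block is 16 contiguous shorts laid out as four rows of four, as the indexing in the question suggests, and it follows the arithmetic of the scalar branch. The names is_aligned16, add_scalar, add_sse2 and self_test are hypothetical and do not come from the original code. The sketch keeps everything in 16-bit lanes, so the whole block fits in two __m128i values, and it only uses the aligned load/store when the pointer really sits on a 16-byte boundary.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when p lies on a 16-byte boundary.
static inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Scalar reference with the same arithmetic as the loop in the question.
// Uses * 2 instead of << 1 to avoid left-shifting negative ints.
void add_scalar(short* b) {
    for (int j = 0; j < 4; j++) {
        int x0 = b[j] + b[j + 12];
        int x3 = b[j] - b[j + 12];
        int x1 = b[j + 4] + b[j + 8];
        int x2 = b[j + 4] - b[j + 8];
        b[j]      = (short)(x0 + x1);
        b[j + 8]  = (short)(x0 - x1);
        b[j + 4]  = (short)(x2 + x3 * 2);
        b[j + 12] = (short)(x3 - x2 * 2);
    }
}

// Hypothetical SSE2 version: all four columns at once, 16-bit lanes throughout.
void add_sse2(short* b) {
    __m128i* v = reinterpret_cast<__m128i*>(b);
    const bool aligned = is_aligned16(b);

    // Rows 0-1 and rows 2-3. The aligned load is only legal on a 16-byte boundary.
    __m128i r01 = aligned ? _mm_load_si128(v)     : _mm_loadu_si128(v);
    __m128i r23 = aligned ? _mm_load_si128(v + 1) : _mm_loadu_si128(v + 1);

    // Put each row in the low 64 bits; the upper 64 bits are scratch.
    __m128i r0 = r01;
    __m128i r1 = _mm_unpackhi_epi64(r01, r01);
    __m128i r2 = r23;
    __m128i r3 = _mm_unpackhi_epi64(r23, r23);

    // 1st stage transform.
    __m128i x0 = _mm_add_epi16(r0, r3);
    __m128i x3 = _mm_sub_epi16(r0, r3);
    __m128i x1 = _mm_add_epi16(r1, r2);
    __m128i x2 = _mm_sub_epi16(r1, r2);

    // 2nd stage transform.
    __m128i o0 = _mm_add_epi16(x0, x1);
    __m128i o2 = _mm_sub_epi16(x0, x1);
    __m128i o1 = _mm_add_epi16(x2, _mm_slli_epi16(x3, 1));
    __m128i o3 = _mm_sub_epi16(x3, _mm_slli_epi16(x2, 1));

    // Re-pack the low halves into rows 0-1 and rows 2-3, then store.
    __m128i out01 = _mm_unpacklo_epi64(o0, o1);
    __m128i out23 = _mm_unpacklo_epi64(o2, o3);
    if (aligned) {
        _mm_store_si128(v, out01);
        _mm_store_si128(v + 1, out23);
    } else {
        _mm_storeu_si128(v, out01);
        _mm_storeu_si128(v + 1, out23);
    }
}

// Compares both versions on a buffer whose storage is 16-byte aligned.
void self_test() {
    alignas(16) short ref[16];
    alignas(16) short simd[16];
    for (int i = 0; i < 16; i++) {
        ref[i] = simd[i] = (short)(i * 7 - 40);
    }
    add_scalar(ref);
    add_sse2(simd);
    assert(std::memcmp(ref, simd, sizeof ref) == 0);
}
```

Note that alignas(16) in self_test is applied to the arrays themselves, which is what the aligned intrinsics care about; aligning a pointer variable, as in the question, says nothing about the address stored in it. If the caller can always guarantee aligned storage, the runtime check and the unaligned branch can be dropped.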