Loop unrolling using SSE2 intrinsics

Problem description

My SSE2 code is long and slow. How can I make it faster? Also, _mm_store_si128() fails but _mm_storeu_si128() is accepted. Why?

void tom::add(void* ptr)
{
    __declspec(align(16)) short* b = (short*)ptr;
    int j;
#if cplusplus
    for (j = 0; j < 4; j++)
    {
        /// 1st stage transform.
        int x0 = (int)(b[j]   + b[j+12]);
        int x3 = (int)(b[j]   - b[j+12]);
        int x1 = (int)(b[j+4] + b[j+8]);
        int x2 = (int)(b[j+4] - b[j+8]);

        /// 2nd stage transform.
        b[j]    = (short)(x0 + x1);
        b[j+8]  = (short)(x0 - x1);
        b[j+4]  = (short)(x2 + (x3 << 1));
        b[j+12] = (short)(x3 - (x2 << 1));
    } // end for j...
#else
    __m128i f0, f1, f2, f3;

    j = 0;
    f0 = _mm_set_epi32(b[j+3],  b[j+2],  b[j+1],  b[j]);
    f1 = _mm_set_epi32(b[j+7],  b[j+6],  b[j+5],  b[j+4]);
    f2 = _mm_set_epi32(b[j+11], b[j+10], b[j+9],  b[j+8]);
    f3 = _mm_set_epi32(b[j+15], b[j+14], b[j+13], b[j+12]);
    __declspec(align(16)) __m128i* b = (__m128i*)ptr;
    __m128i temp0, temp1, temp2, temp3, temp4;
    temp0 = f0;
    temp1 = f1;
    temp2 = f2;
    temp3 = f3;
    temp0 = _mm_add_epi16(temp0, f3);
    temp1 = _mm_add_epi16(temp1, f2);
    f0 = _mm_sub_epi16(f0, f3);
    f1 = _mm_sub_epi16(f1, f2);
    temp4 = temp0;
    temp4 = _mm_add_epi16(temp4, temp1);
    _mm_storeu_si128(b, temp4);
    temp0 = _mm_sub_epi16(temp0, temp1);
    _mm_storeu_si128(b+2, temp0);
    temp1 = f0;
    temp4 = f1;
    temp1 = _mm_slli_epi16(temp1, 1);
    temp4 = _mm_slli_epi16(temp4, 1);
    f0 = _mm_add_epi16(f0, temp4);
    f1 = _mm_sub_epi16(f1, temp1);
    _mm_storeu_si128(b+1, f0);
    _mm_storeu_si128(b+3, f1);
#endif
}

Recommended answer

_mm_store_si128() does not work because the data is not 16-byte aligned. I believe your __declspec(align(16)) declarations of the pointer variable b give the pointer variable itself 16-byte alignment, not the memory that it points to. For _mm_store_si128() to work, you would have to make sure that the data pointed to by the ptr parameter is 16-byte aligned. The _mm_storeu_si128() intrinsic works because it accepts unaligned addresses.
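
To make the distinction concrete, here is a minimal sketch, not taken from the original post. It assumes a C++11 compiler with SSE2 support, and the helper names is_aligned16 and double_shorts are hypothetical, introduced only for illustration. It checks the alignment of the pointed-to address at run time and picks the aligned or unaligned store accordingly.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>      // std::uintptr_t

// Hypothetical helper: true when p is a multiple of 16, which is what
// _mm_load_si128 / _mm_store_si128 require of the address they are given.
bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: doubles the 8 shorts starting at p, in place.
// The load is always the unaligned variant; the store is chosen by
// the alignment of the data that p points to.
void double_shorts(short* p)
{
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    v = _mm_slli_epi16(v, 1);  // each 16-bit lane shifted left by 1
    if (is_aligned16(p))
        _mm_store_si128(reinterpret_cast<__m128i*>(p), v);   // aligned store
    else
        _mm_storeu_si128(reinterpret_cast<__m128i*>(p), v);  // unaligned store
}
```

With this sketch, a caller that declares its buffer as alignas(16) short block[16]; (or __declspec(align(16)) short block[16]; on MSVC) puts the alignment on the data itself, so double_shorts(block) takes the aligned path, while double_shorts(block + 1) starts 2 bytes past an aligned address and takes the unaligned path.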

