Translating SSE to Neon: How to pack and then extract 32-bit result
I have to translate the following instruction from SSE to Neon:
uint32_t r = _mm_cvtsi128_si32(_mm_shuffle_epi8(a, SHUFFLE_MASK));
Where:
static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1);
So basically I have to take the 4th, 8th, 12th and 16th bytes from the register and pack them into a uint32_t. It looks like a packing operation (in SSE, I seem to remember I used a shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions).
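To pin down what any Neon translation has to reproduce, here is a minimal scalar sketch of the SSE expression above. The function name pack_bytes_3_7_11_15 is a hypothetical helper introduced only for illustration, and the input is assumed to be the 16 register bytes in memory order.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of _mm_cvtsi128_si32(_mm_shuffle_epi8(v, SHUFFLE_MASK))
// for the mask {3, 7, 11, 15, -1, ...}.
// `bytes` holds the 16 register bytes in memory order (assumption).
// The name is hypothetical; it does not come from any library.
inline uint32_t pack_bytes_3_7_11_15(const uint8_t* bytes) {
    assert(bytes != nullptr);
    // Result byte 0 (least significant) comes from source byte 3,
    // result byte 1 from source byte 7, and so on.
    return  static_cast<uint32_t>(bytes[3])
         | (static_cast<uint32_t>(bytes[7])  << 8)
         | (static_cast<uint32_t>(bytes[11]) << 16)
         | (static_cast<uint32_t>(bytes[15]) << 24);
}
```

A reference like this can be used to check a vectorized version against known inputs.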
How does this operation translate to Neon?
Should I use packing instructions?
How do I then extract the 32 bits? (Is there anything equivalent to _mm_cvtsi128_si32?)
Edit:
To start with, vgetq_lane_u32 should allow me to replace _mm_cvtsi128_si32 (but I will have to cast my uint8x16_t to uint32x4_t):
uint32_t vgetq_lane_u32(uint32x4_t vec, __constrange(0,3) int lane);
or directly store the lane with vst1q_lane_u32:
void vst1q_lane_u32(__transfersize(1) uint32_t * ptr, uint32x4_t val, __constrange(0,3) int lane); // VST1.32 {d0[0]}, [r0]
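The cast mentioned above can be done with vreinterpretq_u32_u8, which changes only the vector type and leaves the bits untouched. A minimal sketch follows; the function names are hypothetical, and the non-Neon branch is only a stand-in so the snippet also builds off-target.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Neon counterpart of _mm_cvtsi128_si32 for a byte vector:
// reinterpret the 16 bytes as four 32-bit lanes and read lane 0.
inline uint32_t low_u32(uint8x16_t v) {
    return vgetq_lane_u32(vreinterpretq_u32_u8(v), 0);
}

// Convenience wrapper (hypothetical name): load 16 bytes, return lane 0.
inline uint32_t low_u32_from_memory(const uint8_t* bytes) {
    assert(bytes != nullptr);
    return low_u32(vld1q_u8(bytes));
}
#else
// Portable stand-in with the same result: the first four bytes,
// interpreted in the host's byte order.
inline uint32_t low_u32_from_memory(const uint8_t* bytes) {
    assert(bytes != nullptr);
    uint32_t r;
    std::memcpy(&r, bytes, sizeof r);
    return r;
}
#endif
```

If the value is only going to be written to memory, storing the lane directly with vst1q_lane_u32 avoids the transfer to a general-purpose register.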
I found this excellent guide: http://community.arm.com/groups/processors/blog/2012/03/13/coding-for-neon--part-5-rearranging-vectors. I am working through it. It seems that my operation could be done with one VTBL instruction (table lookup), but I will implement it with two deinterleaving operations because for the moment that looks simpler.
uint8x8x2_t vuzp_u8(uint8x8_t a, uint8x8_t b);
So something like:
uint8x16_t a;
uint8_t* out;
[...]
//a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0
vst1q_lane_u32(out,a,0);
The vst1q_lane_u32 call does not give a warning when using __attribute__((optimize("lax-vector-conversions"))).
But, because of the type mismatch (vuzp_u8 returns a uint8x8x2_t, not a uint8x16_t), the two assignments do not compile. One workaround is like this (Edit: This breaks strict aliasing rules! The compiler could assume that a does not change when a value is assigned through d):
uint8x8x2_t* d = (uint8x8x2_t*) &a;
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
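The pointer cast can be avoided by keeping the uint8x8x2_t that vuzp_u8 returns and feeding its two halves into the second pass. The following is a minimal sketch under that assumption, using the byte layout of the example above (bytes of interest at positions 0, 4, 8 and 12). The function name is hypothetical, and the non-Neon branch is only a stand-in so the snippet builds off-target.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Gathers bytes 0, 4, 8, 12 of a 16-byte block into one 32-bit value
// with two deinterleaving passes and no type punning through pointers.
inline uint32_t pack_bytes_0_4_8_12(const uint8_t* bytes) {
    assert(bytes != nullptr);
    const uint8x16_t a = vld1q_u8(bytes);
    // Pass 1: val[0] holds the even-indexed bytes 0, 2, 4, ..., 14.
    uint8x8x2_t d = vuzp_u8(vget_low_u8(a), vget_high_u8(a));
    // Pass 2: val[0] now starts with the original bytes 0, 4, 8, 12.
    d = vuzp_u8(d.val[0], d.val[1]);
    return vget_lane_u32(vreinterpret_u32_u8(d.val[0]), 0);
}
#else
// Portable stand-in producing the same four bytes in memory order.
inline uint32_t pack_bytes_0_4_8_12(const uint8_t* bytes) {
    assert(bytes != nullptr);
    const uint8_t picked[4] = {bytes[0], bytes[4], bytes[8], bytes[12]};
    uint32_t r;
    std::memcpy(&r, picked, sizeof r);
    return r;
}
#endif
```

To gather bytes 3, 7, 11 and 15 instead, as in the SSE mask, the odd halves (val[1]) would be read after each pass rather than val[0].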
I have implemented a more general workaround through a flexible data type:
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
Edit:
Here is the version with a shuffle mask/lookup table. It does indeed make my inner loop a little faster. Again, I have used the data type described here: http://stackoverflow.com/a/29213705/2436175.
static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
[...]
res = vtbl2_u8(a, MASK);
vst1_lane_u32(out,res,0);
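For readers without the NeonVectorType wrapper, the same lookup can be sketched with plain intrinsics by building the uint8x8x2_t table explicitly from the two halves of the register. The function name is hypothetical, the indices follow the MASK above, and the non-Neon branch is only a stand-in that models the VTBL rule that out-of-range indices yield 0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Table-lookup version: one vtbl2_u8 picks bytes 0, 4, 8, 12.
inline uint32_t pack_with_table(const uint8_t* bytes) {
    assert(bytes != nullptr);
    const uint8x16_t a = vld1q_u8(bytes);
    // The 16-byte table is passed as two 8-byte halves.
    uint8x8x2_t table;
    table.val[0] = vget_low_u8(a);
    table.val[1] = vget_high_u8(a);
    // Indices >= 16 (here 0xff) produce 0 in the result.
    static const uint8_t idx_bytes[8] = {0x00, 0x04, 0x08, 0x0C,
                                         0xff, 0xff, 0xff, 0xff};
    const uint8x8_t idx = vld1_u8(idx_bytes);
    const uint8x8_t res = vtbl2_u8(table, idx);
    return vget_lane_u32(vreinterpret_u32_u8(res), 0);
}
#else
// Portable stand-in modelling the lookup: out-of-range index gives 0.
inline uint32_t pack_with_table(const uint8_t* bytes) {
    assert(bytes != nullptr);
    static const uint8_t idx[8] = {0x00, 0x04, 0x08, 0x0C,
                                   0xff, 0xff, 0xff, 0xff};
    uint8_t res[8];
    for (int i = 0; i < 8; ++i) {
        res[i] = (idx[i] < 16) ? bytes[idx[i]] : 0;
    }
    uint32_t r;
    std::memcpy(&r, res, sizeof r);
    return r;
}
#endif
```

In a hot loop the index vector would normally be loaded once outside the loop, so that each iteration costs a single lookup plus the lane store.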