Translating SSE to Neon: How to pack and then extract 32bit result


Problem description


I have to translate the following instruction from SSE to Neon:

 uint32_t result = _mm_cvtsi128_si32(_mm_shuffle_epi8(a, SHUFFLE_MASK));

Where:

static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3,  7,  11, 15, -1, -1, -1, -1,
                                                  -1, -1, -1, -1, -1, -1, -1, -1);

So basically I have to take the 4th, 8th, 12th and 16th bytes from the register and put them into a uint32_t. It looks like a packing operation (in SSE I seem to remember that I used a shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions).
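To pin down exactly which bytes end up where, here is a minimal scalar sketch of the two SSE steps. The function names (shuffle_bytes_ref, low32_ref, pack_bytes_3_7_11_15) are hypothetical helpers introduced only for illustration, and the sketch assumes the result is read as a little-endian 32-bit value, as _mm_cvtsi128_si32 does on x86.

```c
#include <stdint.h>

/* Scalar model of _mm_shuffle_epi8: a mask byte with its top bit set
   produces 0, otherwise its low 4 bits select a source byte. */
static void shuffle_bytes_ref(const uint8_t src[16], const int8_t mask[16],
                              uint8_t dst[16])
{
    for (int i = 0; i < 16; ++i) {
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
}

/* Scalar model of _mm_cvtsi128_si32: the low 4 bytes, little-endian. */
static uint32_t low32_ref(const uint8_t v[16])
{
    return (uint32_t)v[0] | ((uint32_t)v[1] << 8) |
           ((uint32_t)v[2] << 16) | ((uint32_t)v[3] << 24);
}

/* Hypothetical helper: both steps with the SHUFFLE_MASK from the question. */
static uint32_t pack_bytes_3_7_11_15(const uint8_t src[16])
{
    static const int8_t mask[16] = {3, 7, 11, 15, -1, -1, -1, -1,
                                    -1, -1, -1, -1, -1, -1, -1, -1};
    uint8_t tmp[16];
    shuffle_bytes_ref(src, mask, tmp);
    return low32_ref(tmp);
}
```

A reference like this can be handy for checking a Neon port against known inputs before looking at performance.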

How does this operation translate in Neon?
Should I use packing instructions?
How do I then extract the 32 bits? (Is there anything equivalent to _mm_cvtsi128_si32?)

Edit:
To start with, vgetq_lane_u32 should be able to replace _mm_cvtsi128_si32 (but I will have to cast my uint8x16_t to uint32x4_t):

uint32_t  vgetq_lane_u32(uint32x4_t vec, __constrange(0,3) int lane);

or I can store the lane directly with vst1q_lane_u32:

void  vst1q_lane_u32(__transfersize(1) uint32_t * ptr, uint32x4_t val, __constrange(0,3) int lane); // VST1.32 {d0[0]}, [r0]
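The cast mentioned in the edit can be done with the vreinterpretq_u32_u8 intrinsic, which changes only the type and not the bits. The following is a small sketch under the assumption of a little-endian target; low32_of_vector is a hypothetical name, and the non-Neon branch exists only so that the snippet also builds on other machines.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns the low 32 bits of a 16-byte vector,
   i.e. a possible counterpart of _mm_cvtsi128_si32. */
static uint32_t low32_of_vector(const uint8_t bytes[16])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t v = vld1q_u8(bytes);            /* load 16 bytes            */
    uint32x4_t w = vreinterpretq_u32_u8(v);    /* reinterpret, no data move */
    return vgetq_lane_u32(w, 0);               /* read lane 0               */
#else
    /* Portable fallback (assumption: little-endian lane layout). */
    return (uint32_t)bytes[0] | ((uint32_t)bytes[1] << 8) |
           ((uint32_t)bytes[2] << 16) | ((uint32_t)bytes[3] << 24);
#endif
}
```

If the value only needs to go to memory, vst1q_lane_u32 on the reinterpreted vector avoids the move to a general-purpose register.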

Solution

I found this excellent guide (http://community.arm.com/groups/processors/blog/2012/03/13/coding-for-neon--part-5-rearranging-vectors) and I am working through it. It seems that my operation could be done with one VTBL instruction (table lookup), but I will implement it with 2 deinterleaving operations because for the moment that looks simpler.

uint8x8x2_t   vuzp_u8(uint8x8_t a, uint8x8_t b);

So something like:

uint8x16_t a;
uint8_t* out;
[...]

//a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0

a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0

a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0

vst1q_lane_u32(out,a,0);

The final vst1q_lane_u32 call does not give a warning when compiled with __attribute__((optimize("lax-vector-conversions"))).

But because of the type mismatch (vuzp_u8 returns a uint8x8x2_t, while a is a uint8x16_t), the 2 assignments are not possible as written. One workaround is like this (Edit: This breaks strict aliasing rules! The compiler could assume that a does not change when a value is assigned through d.):

uint8x8x2_t* d = (uint8x8x2_t*) &a;
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
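One way to avoid the pointer cast altogether is to keep the vuzp_u8 result in a uint8x8x2_t temporary and rebuild the 128-bit vector with vcombine_u8. This is only a sketch of that alternative, not the approach taken in this answer; pack_bytes_0_4_8_12 is a hypothetical name, and the non-Neon branch is there just so the snippet builds elsewhere.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: gathers bytes 0, 4, 8 and 12 of src into one
   32-bit value (little-endian lane layout assumed). */
static uint32_t pack_bytes_0_4_8_12(const uint8_t src[16])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t a = vld1q_u8(src);

    /* First deinterleave: even-indexed bytes go to the low half. */
    uint8x8x2_t t = vuzp_u8(vget_low_u8(a), vget_high_u8(a));
    a = vcombine_u8(t.val[0], t.val[1]);

    /* Second deinterleave: bytes 0, 4, 8, 12 are now the first four. */
    t = vuzp_u8(vget_low_u8(a), vget_high_u8(a));
    a = vcombine_u8(t.val[0], t.val[1]);

    return vgetq_lane_u32(vreinterpretq_u32_u8(a), 0);
#else
    /* Portable fallback computing the same result directly. */
    return (uint32_t)src[0] | ((uint32_t)src[4] << 8) |
           ((uint32_t)src[8] << 16) | ((uint32_t)src[12] << 24);
#endif
}
```

Because every conversion goes through an intrinsic, no aliasing assumptions are involved and lax vector conversions are not needed.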

I have implemented a more general workaround through a flexible data type:

NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
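To check the byte movement of the two vuzp_u8 steps without a Neon target, a scalar model can help. The sketch below treats the 16-byte vector as a plain array; unzip_halves_ref is a hypothetical name introduced for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a = vuzp_u8(vget_low_u8(a), vget_high_u8(a)) written back
   over the whole vector: bytes 0..7 receive the even-indexed elements
   (val[0]) and bytes 8..15 the odd-indexed ones (val[1]). */
static void unzip_halves_ref(uint8_t v[16])
{
    uint8_t out[16];
    for (int i = 0; i < 8; ++i) {
        out[i]     = v[2 * i];
        out[8 + i] = v[2 * i + 1];
    }
    memcpy(v, out, sizeof out);
}
```

Applied twice to the bytes 0..15, this model gives 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15. Bytes 0, 4, 8 and 12 land in 32-bit lane 0, as in the example above, while bytes 3, 7, 11 and 15 (the ones selected by the SSE mask in the question) land in lane 3.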

Edit:

Here is the version with a shuffle mask/lookup table. It does indeed make my inner loop a little bit faster. Again, I have used the data type described here (http://stackoverflow.com/a/29213705/2436175).

static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
[...]
res = vtbl2_u8(a, MASK);
vst1_lane_u32(out,res,0);
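For readers without the NeonVectorType wrapper, the same lookup can be sketched with plain intrinsics by filling the uint8x8x2_t table explicitly. The name lookup_low32 is hypothetical, and the index array is passed in so that either the 0, 4, 8, 12 mask used here or the 3, 7, 11, 15 mask from the question can be tried. The non-Neon branch models the vtbl2_u8 rule that an index outside 0..15 yields 0.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: looks up 8 bytes of src by index and returns the
   first four as one 32-bit value (little-endian lane layout assumed). */
static uint32_t lookup_low32(const uint8_t src[16], const uint8_t idx[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t a = vld1q_u8(src);
    uint8x8x2_t table;
    table.val[0] = vget_low_u8(a);    /* table entries 0..7  */
    table.val[1] = vget_high_u8(a);   /* table entries 8..15 */
    uint8x8_t res = vtbl2_u8(table, vld1_u8(idx));
    return vget_lane_u32(vreinterpret_u32_u8(res), 0);
#else
    /* Portable fallback: out-of-range indices produce 0, like VTBL. */
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint8_t b = (idx[i] < 16) ? src[idx[i]] : 0;
        r |= (uint32_t)b << (8 * i);
    }
    return r;
#endif
}
```

In a loop, the index vector would normally be loaded once outside the loop rather than on every call.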
