Translating SSE to Neon: How to pack and then extract 32bit result

Views: 749 Posted: 2016/5/29 14:36:28 Tags: c++ arm sse neon intrinsics


Problem description

I have to translate the following instructions from SSE to Neon

 uint32_t result = _mm_cvtsi128_si32(_mm_shuffle_epi8(a, SHUFFLE_MASK));

Where:

static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3,  7,  11, 15, -1, -1, -1, -1,
                                                  -1, -1, -1, -1, -1, -1, -1, -1);

So basically I have to take the 4th, 8th, 12th and 16th bytes from the register and put them into a uint32_t. It looks like a packing operation (in SSE I seem to remember using a shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions).
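To pin down what any Neon translation has to reproduce, here is a minimal scalar sketch of the SSE expression above. The function name `pack_bytes_3_7_11_15` and the use of `std::array` are illustrative assumptions, not part of the original code. It relies only on the documented behaviour that a mask byte of -1 yields a zero byte and that `_mm_cvtsi128_si32` returns the lowest 32 bits of the register.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of _mm_cvtsi128_si32(_mm_shuffle_epi8(v, SHUFFLE_MASK)).
// Mask entries 3, 7, 11, 15 select source bytes for result bytes 0..3;
// the -1 entries zero the remaining bytes, which are discarded anyway.
// (Hypothetical helper name, for illustration only.)
inline std::uint32_t pack_bytes_3_7_11_15(const std::array<std::uint8_t, 16>& v) {
    return static_cast<std::uint32_t>(v[3])
         | (static_cast<std::uint32_t>(v[7])  << 8)
         | (static_cast<std::uint32_t>(v[11]) << 16)
         | (static_cast<std::uint32_t>(v[15]) << 24);
}
```

A function like this can serve as a reference when checking a vectorised version against known inputs.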

How does this operation translate to Neon?
Should I use packing instructions?
How do I then extract the 32 bits? (Is there an equivalent of _mm_cvtsi128_si32?)

Edit:
To start with, vgetq_lane_u32 should be able to replace _mm_cvtsi128_si32 (but I will have to cast my uint8x16_t to uint32x4_t):

uint32_t  vgetq_lane_u32(uint32x4_t vec, __constrange(0,3) int lane);

or store the lane directly with vst1q_lane_u32:

void  vst1q_lane_u32(__transfersize(1) uint32_t * ptr, uint32x4_t val, __constrange(0,3) int lane); // VST1.32 {d0[0]}, [r0]
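As a rough sketch of the extraction step, the cast mentioned above can be written with `vreinterpretq_u32_u8`, which reinterprets the bits without moving data. The helper name `low_u32` and the pointer-based interface are assumptions made for illustration. The non-Neon branch is only a portable stand-in so the snippet builds on other targets, and both branches assume a little-endian machine.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Returns the low 32 bits of a 16-byte vector held in memory.
// (Hypothetical helper, for illustration only.)
inline std::uint32_t low_u32(const std::uint8_t bytes[16]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t v = vld1q_u8(bytes);
    // View the same 128 bits as four 32-bit lanes and read lane 0,
    // which plays the role of _mm_cvtsi128_si32.
    return vgetq_lane_u32(vreinterpretq_u32_u8(v), 0);
#else
    // Portable stand-in: copy the first four bytes.
    std::uint32_t r;
    std::memcpy(&r, bytes, sizeof r);
    return r;
#endif
}
```

If the value is needed in memory rather than in a general-purpose register, `vst1q_lane_u32` on the reinterpreted vector avoids the extra transfer.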

Solution

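I found this excellent guide: http://community.arm.com/groups/processors/blog/2012/03/13/coding-for-neon--part-5-rearranging-vectors. I am still working through it. It seems that my operation could be done with a single VTBL instruction (table lookup), but I will implement it with two deinterleaving operations because that looks simpler for the moment.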

uint8x8x2_t   vuzp_u8(uint8x8_t a, uint8x8_t b);

So something like:

uint8x16_t a;
uint32_t* out; // vst1q_lane_u32 stores through a uint32_t*
[...]

//a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0

a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0

a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0

vst1q_lane_u32(out,a,0);
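The two deinterleaving steps can also be written so that each `vuzp_u8` result stays in its own `uint8x8x2_t` variable, which means no conversion between vector types is needed. This is only a sketch: the helper name `gather_every_4th` and the array-based interface are assumptions, the scalar branch exists solely so the snippet builds on non-Neon targets, and a little-endian machine is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Gathers bytes 0, 4, 8 and 12 of `in` into out[0..3].
// (Hypothetical helper, for illustration only.)
inline void gather_every_4th(const std::uint8_t in[16], std::uint8_t out[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t a = vld1q_u8(in);
    // Step 1: val[0] = bytes 0,2,4,...,14 of a; val[1] = bytes 1,3,...,15.
    uint8x8x2_t s1 = vuzp_u8(vget_low_u8(a), vget_high_u8(a));
    // Step 2: val[0] = bytes 0,4,8,12,1,5,9,13 of a.
    uint8x8x2_t s2 = vuzp_u8(s1.val[0], s1.val[1]);
    // The first four bytes of s2.val[0] form 32-bit lane 0.
    std::uint32_t r = vget_lane_u32(vreinterpret_u32_u8(s2.val[0]), 0);
    std::memcpy(out, &r, sizeof r);
#else
    // Scalar stand-in with the same result.
    for (int i = 0; i < 4; ++i) out[i] = in[4 * i];
#endif
}
```

With the input from the comments above (138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0), `out` ends up holding 138 140 146 147.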

The last call does not give a warning when using __attribute__((optimize("lax-vector-conversions")))

But because of the type mismatch (vuzp_u8 returns a uint8x8x2_t, while a is a uint8x16_t), the two assignments are not possible. One workaround is the following (Edit: this breaks strict aliasing rules! The compiler may assume that a does not change when a value is assigned through d.):

uint8x8x2_t* d = (uint8x8x2_t*) &a;
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);

I have implemented a more general workaround through a flexible data type:

NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);

Edit:

Here is the version with a shuffle mask/lookup table. It does make my inner loop a little faster. Again, I have used the data type described here: http://stackoverflow.com/a/29213705/2436175.

static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
[...]
res = vtbl2_u8(a, MASK);
vst1_lane_u32(out,res,0);
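For readers without the NeonVectorType wrapper, the same lookup can be sketched with plain Neon types. The helper name `gather_with_table` and the array-based interface are assumptions, and a little-endian machine is assumed. The scalar branch models the documented VTBL behaviour that an out-of-range index (here 0xFF, with a 16-byte table) yields zero, and it lets the snippet build on non-Neon targets. To match the original SSE mask, the indices would be 3, 7, 11, 15 instead of 0, 4, 8, 12.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Gathers bytes 0, 4, 8 and 12 of `in` into out[0..3] via a table lookup.
// (Hypothetical helper, for illustration only.)
inline void gather_with_table(const std::uint8_t in[16], std::uint8_t out[4]) {
    static const std::uint8_t kMask[8] = {0x00, 0x04, 0x08, 0x0C,
                                          0xFF, 0xFF, 0xFF, 0xFF};
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // vtbl2_u8 takes the 16-byte table as two 8-byte halves.
    uint8x8x2_t table;
    table.val[0] = vld1_u8(in);
    table.val[1] = vld1_u8(in + 8);
    uint8x8_t res = vtbl2_u8(table, vld1_u8(kMask));
    std::uint32_t r = vget_lane_u32(vreinterpret_u32_u8(res), 0);
    std::memcpy(out, &r, sizeof r);
#else
    // Scalar model of the lookup: indices >= 16 produce zero.
    std::uint8_t res[8];
    for (int i = 0; i < 8; ++i) {
        res[i] = kMask[i] < 16 ? in[kMask[i]] : 0;
    }
    std::memcpy(out, res, 4);
#endif
}
```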
