Translating SSE to Neon: How to pack and then extract a 32-bit result
Question
I have to translate the following instruction from SSE to Neon:
uint32_t a = _mm_cvtsi128_si32(_mm_shuffle_epi8(a,SHUFFLE_MASK) );
where:
static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1);
So basically I have to take the 4th, 8th, 12th and 16th bytes of the register and put them into a uint32_t. It looks like a packing instruction (in SSE, I seem to remember I used shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions).
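To pin down what the SSE expression computes before porting it, here is a minimal scalar sketch in C. The helper name pack_bytes_ref is hypothetical (it is not part of any library), and the input is assumed to be the 16 register bytes in memory order on a little-endian target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for
 *   _mm_cvtsi128_si32(_mm_shuffle_epi8(v, SHUFFLE_MASK))
 * with SHUFFLE_MASK = {3, 7, 11, 15, -1, ...}.
 * Hypothetical helper; v holds the 16 vector bytes in memory order. */
static uint32_t pack_bytes_ref(const uint8_t v[16])
{
    assert(v != NULL);
    /* Byte 3 lands in the lowest 8 bits, byte 15 in the highest. */
    return (uint32_t)v[3]
         | ((uint32_t)v[7]  << 8)
         | ((uint32_t)v[11] << 16)
         | ((uint32_t)v[15] << 24);
}
```

A reference like this can be used to check any Neon version against known inputs.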
How does this operation translate to Neon?
Should I use packing instructions?
How do I then extract the 32 bits? (Is there anything equivalent to _mm_cvtsi128_si32?)
To start with, vgetq_lane_u32 should be able to replace _mm_cvtsi128_si32 (but I will have to cast my uint8x16_t to uint32x4_t):
uint32_t vgetq_lane_u32(uint32x4_t vec, __constrange(0,3) int lane);
or store the lane directly with vst1q_lane_u32:
void vst1q_lane_u32(__transfersize(1) uint32_t * ptr, uint32x4_t val, __constrange(0,3) int lane); // VST1.32 {d0[0]}, [r0]
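As a rough illustration of the cast mentioned above, the sketch below reinterprets a uint8x16_t as uint32x4_t with vreinterpretq_u32_u8 and reads lane 0. The function name low32_of_vector and the USE_NEON macro are made up for this example, a little-endian target is assumed, and the non-Neon branch exists only so the snippet also builds on other machines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1   /* hypothetical macro, local to this sketch */
#else
#define USE_NEON 0
#endif

/* Hypothetical helper: return the low 32 bits of a 16-byte vector,
 * playing the role of _mm_cvtsi128_si32. */
static uint32_t low32_of_vector(const uint8_t bytes[16])
{
    assert(bytes != NULL);
#if USE_NEON
    uint8x16_t v = vld1q_u8(bytes);
    /* View the same 128 bits as four 32-bit lanes, then read lane 0.
     * The lane index must be a compile-time constant. */
    return vgetq_lane_u32(vreinterpretq_u32_u8(v), 0);
#else
    /* Portable fallback for non-Neon builds: copy the first four bytes. */
    uint32_t r;
    memcpy(&r, bytes, sizeof r);
    return r;
#endif
}
```

Using vreinterpretq_u32_u8 instead of an implicit conversion keeps the code valid without lax vector conversions.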
Recommended answer
I found this excellent guide. While working through it, it seemed that my operation could be done with one VTBL instruction (table lookup), but I will implement it with two deinterleaving operations because, for the moment, that looks simpler.
uint8x8x2_t vuzp_u8(uint8x8_t a, uint8x8_t b);
For example:
uint8x16_t a;
uint8_t* out;
[...]
//a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0
vst1q_lane_u32(out,a,0);
The last line compiles without warnings when __attribute__((optimize("lax-vector-conversions"))) is used.
However, because of the data conversion, the two assignments are not possible: vuzp_u8 returns a uint8x8x2_t, which cannot be assigned to a uint8x16_t. One workaround is the following (this breaks strict aliasing rules! The compiler could assume that a does not change when it is written through the pointer d):
uint8x8x2_t* d = (uint8x8x2_t*) &a;
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
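One way to avoid both the invalid assignments and the pointer cast is to keep each vuzp_u8 result in its own uint8x8x2_t variable and feed its two halves to the next step. The sketch below is an illustration under assumptions, not the answer's code: the function names and the USE_NEON macro are hypothetical, a little-endian target is assumed, and the scalar branch is only there so the snippet builds without Neon. Note that the same two unzips yield bytes 0, 4, 8, 12 (as in the example comments above) in the low half of val[0], and bytes 3, 7, 11, 15 (as in the question's mask) in the high half of val[1].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1   /* hypothetical macro, local to this sketch */
#else
#define USE_NEON 0
#endif

#if USE_NEON
/* Two deinterleaving steps, with every intermediate correctly typed. */
static uint8x8x2_t unzip_twice(uint8x16_t a)
{
    /* Step 1: val[0] = b0 b2 ... b14, val[1] = b1 b3 ... b15 */
    uint8x8x2_t s1 = vuzp_u8(vget_low_u8(a), vget_high_u8(a));
    /* Step 2: val[0] = b0 b4 b8 b12 b1 b5 b9 b13,
     *         val[1] = b2 b6 b10 b14 b3 b7 b11 b15 */
    return vuzp_u8(s1.val[0], s1.val[1]);
}
#else
/* Scalar fallback: gather bytes first, first+4, first+8, first+12. */
static uint32_t gather4(const uint8_t b[16], int first)
{
    return (uint32_t)b[first]
         | ((uint32_t)b[first + 4]  << 8)
         | ((uint32_t)b[first + 8]  << 16)
         | ((uint32_t)b[first + 12] << 24);
}
#endif

/* Bytes 0, 4, 8, 12: matches the commented example in the answer. */
static uint32_t pack_bytes_0_4_8_12(const uint8_t bytes[16])
{
    assert(bytes != NULL);
#if USE_NEON
    uint8x8x2_t s2 = unzip_twice(vld1q_u8(bytes));
    return vget_lane_u32(vreinterpret_u32_u8(s2.val[0]), 0);
#else
    return gather4(bytes, 0);
#endif
}

/* Bytes 3, 7, 11, 15: matches the SSE SHUFFLE_MASK in the question. */
static uint32_t pack_bytes_3_7_11_15(const uint8_t bytes[16])
{
    assert(bytes != NULL);
#if USE_NEON
    uint8x8x2_t s2 = unzip_twice(vld1q_u8(bytes));
    return vget_lane_u32(vreinterpret_u32_u8(s2.val[1]), 1);
#else
    return gather4(bytes, 3);
#endif
}
```

Because no object is accessed through a pointer of a different type, this form does not depend on strict aliasing being disabled.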
I have implemented a more general workaround through a flexible data type:
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
Here is the version with a shuffle mask (lookup table). It does make my inner loop a little faster. Again, I have used the data type described here.
static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
[...]
res = vtbl2_u8(a, MASK);
vst1_lane_u32(out,res,0);
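For readers without the NeonVectorType wrapper, the table lookup can be sketched with plain Neon types as below. The function name pack_with_table and the USE_NEON macro are hypothetical, a little-endian target is assumed, and the scalar branch only mirrors the lookup so the snippet builds without Neon. The index array is a parameter because the MASK above selects bytes 0, 4, 8, 12, whereas the SSE mask in the question selects bytes 3, 7, 11, 15; indices outside 0..15 (such as 0xff) produce a zero byte with vtbl2_u8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1   /* hypothetical macro, local to this sketch */
#else
#define USE_NEON 0
#endif

/* Hypothetical helper: look up 8 bytes in a 16-byte table and return
 * the first four result bytes as one 32-bit value. */
static uint32_t pack_with_table(const uint8_t bytes[16], const uint8_t idx[8])
{
    assert(bytes != NULL && idx != NULL);
#if USE_NEON
    uint8x16_t a = vld1q_u8(bytes);
    /* vtbl2_u8 expects the 16-byte table as two 8-byte halves. */
    uint8x8x2_t table;
    table.val[0] = vget_low_u8(a);
    table.val[1] = vget_high_u8(a);
    uint8x8_t r = vtbl2_u8(table, vld1_u8(idx));
    return vget_lane_u32(vreinterpret_u32_u8(r), 0);
#else
    /* Scalar fallback: out-of-range indices yield 0, like VTBL. */
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint8_t b = (idx[i] < 16) ? bytes[idx[i]] : 0;
        r |= (uint32_t)b << (8 * i);
    }
    return r;
#endif
}
```

Passing {3, 7, 11, 15, 0xff, 0xff, 0xff, 0xff} as idx reproduces the selection of the original SSE shuffle, while {0, 4, 8, 12, 0xff, 0xff, 0xff, 0xff} reproduces the MASK used in the answer.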