Shuffle even and odd values in SSE register

Question

I load two 128-bit SSE registers with 16-bit values. The values are in the following order:

src[0] = [E_3, O_3, E_2, O_2, E_1, O_1, E_0, O_0]
src[1] = [E_7, O_7, E_6, O_6, E_5, O_5, E_4, O_4]

What I want to achieve is an order like this:

src[0] = [E_7, E_6, E_5, E_4, E_3, E_2, E_1, E_0]
src[1] = [O_7, O_6, O_5, O_4, O_3, O_2, O_1, O_0]

Do you know of a good way to do this (using SSE intrinsics up to SSE 4.2)?

I'm stuck at the moment because I can't shuffle 16-bit values between the upper and lower halves of the 128-bit register. I found only the _mm_shufflelo_epi16 and _mm_shufflehi_epi16 intrinsics.

Update:

Thanks to Paul, I have thought about using the epi8 intrinsics for the 16-bit values.

My solution is as follows:

shuffle_split = _mm_set_epi8(15, 14, 11, 10,  7,  6,  3,  2, 13, 12,  9,  8,  5,  4,  1,  0);

xtmp[0] = _mm_load_si128(src_vec);
xtmp[1] = _mm_load_si128(src_vec+1);
xtmp[0] = _mm_shuffle_epi8(xtmp[0], shuffle_split);
xtmp[1] = _mm_shuffle_epi8(xtmp[1], shuffle_split);

xsrc[0] = _mm_unpacklo_epi16(xtmp[0], xtmp[1]);
xsrc[0] = _mm_shuffle_epi8(xsrc[0], shuffle_split);
xsrc[1] = _mm_unpackhi_epi16(xtmp[0], xtmp[1]);
xsrc[1] = _mm_shuffle_epi8(xsrc[1], shuffle_split);
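
For readers decoding the shuffle_split constant: _mm_set_epi8 lists its arguments from byte 15 down to byte 0, and 16-bit lane i occupies bytes 2i and 2i+1. The following is a minimal scalar sketch, not part of the original post, of how such a byte mask can be derived from an order of 16-bit lanes. The helper name words_to_byte_mask is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): expand a permutation of
 * eight 16-bit lanes into the sixteen byte indices used by _mm_shuffle_epi8.
 * Output lane i takes input lane word_order[i]. mask[0] is the lowest byte,
 * which is the argument order of _mm_setr_epi8. */
static void words_to_byte_mask(const int word_order[8], uint8_t mask[16])
{
    for (int i = 0; i < 8; ++i) {
        assert(word_order[i] >= 0 && word_order[i] < 8);
        mask[2 * i]     = (uint8_t)(2 * word_order[i]);     /* low byte of the lane  */
        mask[2 * i + 1] = (uint8_t)(2 * word_order[i] + 1); /* high byte of the lane */
    }
}
```

Passing the lane order {0, 2, 4, 6, 1, 3, 5, 7} yields the bytes 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15, which is shuffle_split read from its last argument to its first.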

Is there a better solution?

Recommended answer

Permutations in SSE are not easy. There are many ways to achieve the same result with various combinations of instructions, and different combinations may require different numbers of instructions, registers, or memory accesses. Rather than struggle to solve puzzles like this manually, I prefer to just see what the LLVM compiler does, so I wrote a simple version of your desired permutation in LLVM's intermediate language, which takes advantage of an extremely flexible vector shuffle instruction:

define void @shuffle_even_odd(<8 x i16>* %src0) {
  %src1 = getelementptr <8 x i16>* %src0, i64 1
  %a = load <8 x i16>* %src0, align 16
  %b = load <8 x i16>* %src1, align 16
  %x = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %y = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  store <8 x i16> %x, <8 x i16>* %src0, align 16
  store <8 x i16> %y, <8 x i16>* %src1, align 16
  ret void
}
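
As a reading aid for the two shufflevector lines in the IR, here is a scalar C sketch of the index convention: indices 0 to 7 select elements of the first operand and indices 8 to 15 select elements of the second. The function name shufflevector_model is an assumption made for illustration and is not an LLVM API.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model (illustrative only) of LLVM's shufflevector applied to two
 * <8 x i16> operands: an index below 8 picks from a, an index of 8..15
 * picks from b at position index - 8. */
static void shufflevector_model(const uint16_t a[8], const uint16_t b[8],
                                const int idx[8], uint16_t out[8])
{
    for (int i = 0; i < 8; ++i) {
        assert(idx[i] >= 0 && idx[i] < 16);
        out[i] = (idx[i] < 8) ? a[idx[i]] : b[idx[i] - 8];
    }
}
```

With a holding 0 to 7 and b holding 8 to 15, the index list used for %x (1, 3, 5, ..., 15) selects every odd-indexed element of the concatenated pair, and the list used for %y selects every even-indexed one.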

Compile the IR with the LLVM IR-to-ASM compiler, llc shuffle_even_odd.ll -o shuffle_even_odd.s, and you get something like the following x86 assembly:

movdqa  (%rdi), %xmm0
movdqa  16(%rdi), %xmm1
movdqa  %xmm1, %xmm2
pshufb  LCPI0_0(%rip), %xmm2
movdqa  %xmm0, %xmm3
pshufb  LCPI0_1(%rip), %xmm3
por %xmm2, %xmm3
movdqa  %xmm3, (%rdi)
pshufb  LCPI0_2(%rip), %xmm1
pshufb  LCPI0_3(%rip), %xmm0
por %xmm1, %xmm0
movdqa  %xmm0, 16(%rdi)

I've excluded the constant data sections referenced by LCPI0_* above, but this roughly translates to the following C code:

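#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (build with -mssse3 or higher) */

void shuffle_even_odd(__m128i * src) {
    /* A mask byte with its high bit set (-128, i.e. 0x80) zeroes that output byte. */
    __m128i shuffle0 = _mm_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, 2, 3, 6, 7, 10, 11, 14, 15);
    __m128i shuffle1 = _mm_setr_epi8(2, 3, 6, 7, 10, 11, 14, 15, -128, -128, -128, -128, -128, -128, -128, -128);
    __m128i shuffle2 = _mm_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, 0, 1, 4, 5, 8, 9, 12, 13);
    __m128i shuffle3 = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13, -128, -128, -128, -128, -128, -128, -128, -128);
    __m128i a = src[0];
    __m128i b = src[1];
    /* Odd-indexed 16-bit lanes: those of a in the low half, those of b in the high half. */
    src[0] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle0), _mm_shuffle_epi8(a, shuffle1));
    /* Even-indexed 16-bit lanes, arranged the same way. */
    src[1] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle2), _mm_shuffle_epi8(a, shuffle3));
}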
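
To check masks like these without SSSE3 hardware, the shuffle can be modelled in scalar C. The sketch below is an illustration rather than part of the original answer. It assumes only the documented pshufb rule (a mask byte with its top bit set yields zero, otherwise its low four bits index the source), and the names pshufb_model, shuffle_or_model and check_lane_masks are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of _mm_shuffle_epi8 (pshufb): top bit set -> 0,
 * otherwise the low four bits of the mask byte index into src. */
static void pshufb_model(const uint8_t src[16], const uint8_t mask[16],
                         uint8_t dst[16])
{
    for (int i = 0; i < 16; ++i)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Model of one output register: shuffle(a, mask_a) OR shuffle(b, mask_b). */
static void shuffle_or_model(const uint8_t a[16], const uint8_t b[16],
                             const uint8_t mask_a[16], const uint8_t mask_b[16],
                             uint8_t out[16])
{
    uint8_t ta[16], tb[16];
    pshufb_model(a, mask_a, ta);
    pshufb_model(b, mask_b, tb);
    for (int i = 0; i < 16; ++i)
        out[i] = ta[i] | tb[i];
}

/* Assert that the masks gather 16-bit lanes first, first + 2, first + 4,
 * first + 6 of a into the low half and the same lanes of b into the high half. */
static void check_lane_masks(const uint8_t mask_a[16], const uint8_t mask_b[16],
                             int first)
{
    uint8_t a[16], b[16], out[16];
    for (int i = 0; i < 16; ++i) {
        a[i] = (uint8_t)i;          /* distinct byte values so every move is visible */
        b[i] = (uint8_t)(16 + i);
    }
    shuffle_or_model(a, b, mask_a, mask_b, out);
    for (int lane = 0; lane < 8; ++lane) {
        const uint8_t *src = (lane < 4) ? a : b;
        int src_lane = 2 * (lane % 4) + first;
        assert(out[2 * lane] == src[2 * src_lane]);
        assert(out[2 * lane + 1] == src[2 * src_lane + 1]);
    }
}
```

Calling check_lane_masks with the byte values of shuffle1 and shuffle0 and first = 1 exercises the odd-lane output, and calling it with shuffle3 and shuffle2 and first = 0 exercises the even-lane output.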

That's only four shuffle and two bitwise-OR instructions. I suspect those bitwise instructions can be scheduled more efficiently in the CPU pipeline than your proposed unpack instructions.

You can find the "llc" compiler in the "Clang Binaries" package on LLVM's download page: http://www.llvm.org/releases/download.html
