Deinterleave a vector of nibbles using SIMD

Question

I have an input vector of 16384 signed four-bit integers, packed into 8192 bytes. I need to deinterleave the values and unpack them into signed 8-bit integers in two separate arrays.

a,b,c,d are 4 bit values.
A,B,C,D are 8 bit values.

Input = [ab,cd,...]
Out_1 = [A,C, ...]
Out_2 = [B,D, ...]

I can do this quite easily in C++.

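#include <cstddef>
#include <cstdint>

constexpr size_t size = 32768;
int8_t input[size]; // raw packed 4-bit integers, two per byte
int8_t out_1[size];
int8_t out_2[size];

for (size_t i = 0; i < size; i++) {
    out_1[i] = input[i] << 4; // move the low nibble into the top four bits
    out_1[i] = out_1[i] >> 4; // shift back; sign-extends on compilers with arithmetic right shift
    out_2[i] = input[i] >> 4; // high nibble, sign-extended the same way
}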

I would like this to run as fast as possible on general-purpose processors. Good SIMD implementations of deinterleaving 8-bit values into 16-bit integers exist, for example in VOLK, but I cannot find even basic bytewise SIMD shift operators.

https://github.com/gnuradio/volk/blob/master/kernels/volk/volk_8ic_deinterleave_16i_x2.h#L63

Thanks!

Recommended answer

Here is an example. Your question contained code that used unsigned operations, but the question asked about signed values, so I was not sure which you wanted. If you want unsigned values, just remove the parts that implement sign extension.

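#include <emmintrin.h> // SSE2 intrinsics

const __m128i mm_mask = _mm_set1_epi32(0x0F0F0F0F);       // keeps the low nibble of every byte
const __m128i mm_signed_max = _mm_set1_epi32(0x07070707); // largest non-negative 4-bit value, per byte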

for (size_t i = 0u, n = size / 16u; i < n; ++i)
{
    // Load and deinterleave input half-bytes
    __m128i mm_input_even = _mm_loadu_si128(reinterpret_cast< const __m128i* >(input) + i);
    __m128i mm_input_odd = _mm_srli_epi32(mm_input_even, 4);

    mm_input_even = _mm_and_si128(mm_input_even, mm_mask);
    mm_input_odd = _mm_and_si128(mm_input_odd, mm_mask);

    // If you need sign extension, you need the following
    // Get the sign bits
    __m128i mm_sign_even = _mm_cmpgt_epi8(mm_input_even, mm_signed_max);
    __m128i mm_sign_odd = _mm_cmpgt_epi8(mm_input_odd, mm_signed_max);

    // Combine sign bits with deinterleaved input
    mm_input_even = _mm_or_si128(mm_input_even, _mm_andnot_si128(mm_mask, mm_sign_even));
    mm_input_odd = _mm_or_si128(mm_input_odd, _mm_andnot_si128(mm_mask, mm_sign_odd));

    // Store the results
    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_1) + i, mm_input_even);
    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_2) + i, mm_input_odd);
}
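
For the unsigned case mentioned above, the loop reduces to load, shift, mask and store. The following is only a minimal sketch under stated assumptions: the function name `unpack_nibbles_unsigned` and the macro `NIBBLE_HAVE_SSE2` are hypothetical, the buffers are taken as `uint8_t`, and a scalar loop covers both the tail bytes and builds without SSE2.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define NIBBLE_HAVE_SSE2 1 // hypothetical macro, used only in this sketch
#endif

// Hypothetical helper: writes the low nibble of every input byte to out_lo
// and the high nibble to out_hi, both zero-extended to 8 bits.
void unpack_nibbles_unsigned(const std::uint8_t* input, std::uint8_t* out_lo,
                             std::uint8_t* out_hi, std::size_t size)
{
    std::size_t i = 0;
#ifdef NIBBLE_HAVE_SSE2
    const __m128i mask = _mm_set1_epi8(0x0F);
    for (; i + 16 <= size; i += 16)
    {
        __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input + i));
        // The 32-bit shift drags bits across byte boundaries; the mask removes them.
        __m128i hi = _mm_srli_epi32(lo, 4);
        lo = _mm_and_si128(lo, mask);
        hi = _mm_and_si128(hi, mask);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_lo + i), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_hi + i), hi);
    }
#endif
    // Scalar loop: handles the tail, or everything when SSE2 is unavailable.
    for (; i < size; ++i)
    {
        out_lo[i] = static_cast<std::uint8_t>(input[i] & 0x0F);
        out_hi[i] = static_cast<std::uint8_t>(input[i] >> 4);
    }
}
```

Indexing by byte offset rather than by vector lets the same counter run through the vector loop and the scalar loop without recomputing where the tail starts.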

If your size is not a multiple of 16 then you need to also add handling of the tail bytes. You could use your non-vectorized code for that.
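
One possible shape for that tail handling is sketched below. The names `sign_extend_nibble` and `unpack_tail_signed` are hypothetical, and the output convention (low nibble to `out_1`, high nibble to `out_2`) is assumed to match the vector loop above. Sign extension uses the XOR-and-subtract idiom rather than shifts, so it does not depend on how the compiler shifts negative values.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sign-extends a 4-bit two's-complement value (0..15)
// to int8_t. Flipping the sign bit and subtracting 8 maps 8..15 to -8..-1.
inline std::int8_t sign_extend_nibble(std::uint8_t nibble)
{
    return static_cast<std::int8_t>((nibble ^ 0x08) - 0x08);
}

// Hypothetical helper: processes input bytes in [begin, size) one at a time.
void unpack_tail_signed(const std::int8_t* input, std::int8_t* out_1,
                        std::int8_t* out_2, std::size_t begin, std::size_t size)
{
    for (std::size_t i = begin; i < size; ++i)
    {
        const std::uint8_t byte = static_cast<std::uint8_t>(input[i]);
        out_1[i] = sign_extend_nibble(static_cast<std::uint8_t>(byte & 0x0F));
        out_2[i] = sign_extend_nibble(static_cast<std::uint8_t>(byte >> 4));
    }
}
```

After the vector loop, a call such as `unpack_tail_signed(input, out_1, out_2, (size / 16u) * 16u, size)` would cover the remaining bytes.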

Note that the code above does not need byte-granular shifts, because you have to apply the mask anyway. A wider shift moves bits from each byte into its neighbor, but the mask discards exactly those bits, so any coarser-grained shift (16-, 32- or 64-bit) works here.
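
The same argument can be checked on an ordinary 32-bit word, without any intrinsics. This is only an illustrative sketch with hypothetical function names: one function shifts the whole word and masks, the other extracts the high nibble byte by byte, and the two agree.

```cpp
#include <cstdint>

// Shifts all four bytes of the word with a single 32-bit shift, then masks
// away the bits that crossed over from the neighboring byte.
constexpr std::uint32_t high_nibbles_wide_shift(std::uint32_t word)
{
    return (word >> 4) & 0x0F0F0F0Fu;
}

// Reference version: takes the high nibble of each byte separately.
inline std::uint32_t high_nibbles_bytewise(std::uint32_t word)
{
    std::uint32_t result = 0;
    for (int b = 0; b < 4; ++b)
    {
        const std::uint32_t byte = (word >> (8 * b)) & 0xFFu;
        result |= (byte >> 4) << (8 * b);
    }
    return result;
}

static_assert(high_nibbles_wide_shift(0xA1B2C3D4u) == 0x0A0B0C0Du,
              "wide shift plus mask yields the per-byte high nibbles");
```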