QWORD使用SIMD SSE ... AVX将顺序7位字节对齐 [英] QWORD shuffle sequential 7-bits to byte-alignment with SIMD SSE...AVX

查看:97
本文介绍了QWORD使用SIMD SSE ... AVX将顺序7位字节对齐的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想知道在SIMD的任何指令系列中是否可以进行以下操作.

I would like to know if the following is possible in any of the SIMD families of instructions.

我有一个带有63个有效位的qword输入(绝不为负).从LSB开始的每个连续的7位将与一个字节进行混洗对齐,左填充为1(最高有效的非零字节除外).为了说明,为了清楚起见,我将使用字母.

I have a qword input with 63 significant bits (never negative). Each sequential 7 bits starting from the LSB is shuffle-aligned to a byte, with a left-padding of 1 (except for the most significant non-zero byte). To illustrate, I'll use letters for clarity's sake.

结果仅是有效字节,因此大小为0-9,并将其转换为字节数组.

The result is only the significant bytes, thus 0 - 9 in size, which is converted to a byte array.

In:         0|kjihgfe|dcbaZYX|WVUTSRQ|PONMLKJ|IHGFEDC|BAzyxwv|utsrqpo|nmlkjih|gfedcba
Out: 0kjihgfe|1dcbaZYX|1WVUTSRQ|1PONMLKJ|1IHGFEDC|1BAzyxwv|1utsrqpo|1nmlkjih|1gfedcba

大小= 9

In:  00|nmlkjih|gfedcba
Out: |0nmlkjih|1gfedcba

大小= 2

我知道填充是分开的.随机调整是我的问题. 这可能吗?

I do understand the padding is separate. The shuffle-aligning is my question. Is this possible?

编辑2

这是我更新的代码.在单线程Core 2 Duo 2 GHz(64位)上的随机长度输入中获得持续的46 M/sec的速度.

Here is my updated code. Gets a sustained 46 M / sec for random-length input on single thread Core 2 Duo 2 GHz, 64 bit.

private static int DecodeIS8(long j, ref byte[] result)
{
    if (j <= 0)
    {
        return 0;
    }

    int size;

    // neater code: gives something to break out of
    while (true)
    {
        result[0] = (byte)((j & 0x7F) | 0x80);
        size = 0;
        j >>= 7;

        if (j == 0) break;

        result[1] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[2] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[3] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[4] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[5] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[6] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[7] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[8] = (byte)j;

        return 9;
    }

    result[size] ^= 0x80;

    return size + 1;
}

推荐答案

是的,可以使用MMX/SSE的pmullw指令(本征函数:_mm_mullo_pi16)进行逐元素转换.

Yes, it's possible to use MMX/SSE's pmullw instruction (intrinsic function: _mm_mullo_pi16) to do per-element shifts.

基本思想是使用AND指令提取交替的7位元素,并执行pmullw将元素移位到适当位置.这样就可以完成一半元素的任务,因此需要重复进行两次额外的操作才能重复该过程.

The basic idea is to extract alternating 7-bit elements with an AND instruction and perform the pmullw to shift the elements into place. This will accomplish the task for half of the elements, so the process will need to be repeated with a couple of extra shifts.

#include <stdio.h>
#include <stdint.h>
#include <mmintrin.h>

__m64 f(__m64 input) {
    static const __m64 mask = (__m64) 0xfe03f80fe03f80UL;
    static const __m64 multiplier = (__m64) 0x0080002000080002UL;

    __m64 t0 = _mm_and_si64 (input, mask);
    __m64 t1 = _mm_and_si64 (_mm_srli_si64 (input, 7), mask);

    t0 = _mm_mullo_pi16 (t0, multiplier);
    t1 = _mm_mullo_pi16 (t1, multiplier);

    __m64 res =  _mm_or_si64 (t0, _mm_slli_si64 (t1, 8));
    /* set most significant bits, except for in most significant byte */
    return _mm_or_si64 (res, (__m64) 0x0080808080808080UL);
}

int main(int argc, char *argv[])
{
    int i;
    typedef union {
            __m64 m64;
            unsigned char _8x8[8];
    } type_t;

    /* 0x7f7e7c7870608080 = {127, 63, 31, 15, 7, 3, 2, 1, 0} */
    type_t res0 = { .m64 = f((__m64) 0x7f7e7c7870608080UL) };

    for (i = 0; i < 8; i++) {
            printf("%3u ", res0._8x8[i]);
    }
    puts("");

    return 0;
}

mask提取交替的7位元素. multiplier是一个常量,允许我们指定每个元素的移位.它是通过查看屏蔽的输入得出的:

The mask extracts alternating 7-bit elements. The multiplier is a constant which allows us to specify per-element shifts. It's derived from looking at the masked input:

00000000|dcbaZYX0|000000PO|NMLKJ000|0000BAzy|xwv00000|00nmlkji|h0000000

并意识到这一点

00000000|dcbaZYX0 needs to be shifted by 7 (or multiplied by 2^7, 128, 0x0080)
000000PO|NMLKJ000 needs to be shifted by 5 (or multiplied by 2^5,  32, 0x0020)
0000BAzy|xwv00000 needs to be shifted by 3 (or multiplied by 2^3,   8, 0x0008)
00nmlkji|h0000000 needs to be shifted by 1 (or multiplied by 2^1,   2, 0x0002)

此函数一次写入8个字节(而不是9个7位元素将解压缩到的9个字节),因此每次迭代后,您只需要将源指针前进7个字节即可.因此,转换为SSE2有点复杂.

This function writes 8-bytes at a time (instead of the 9-bytes your 9 7-bit elements would unpack to), so you'll have to advance the source pointer by only 7-bytes after each iteration. Because of this, a conversion to SSE2 is a bit more complicated.

我认为不可能为t1使用不同的掩码和乘数以避免移位,因为t1的元素将跨越16位边界,这将阻止pmullw工作.但是,仍然有可能以某种方式进行优化.

I don't think it's possible to use a different mask and multiplier for t1 in order to avoid the shifts, since t1's elements will cross 16-bit boundaries, which will prevent pmullw from working. But, it still might be possible to optimize somehow.

我还没有对此进行基准测试,但是我怀疑它的速度明显快于您的标量版本.如果您对其进行基准测试,请发布结果.我会很感兴趣看到他们的.

I haven't benchmarked this, but I suspect it's significantly faster than your scalar version. If you benchmark it, please post the results. I'd be very interested to see them.

总而言之,该算法得出的结果是2个移位,2个ors,2个ands,以及两个乘法(以及几步)以生成8字节.

All in all, the algorithm comes out to be 2 shifts, 2 ors, 2 ands, and two multiplies (and a few moves) to generate 8-bytes.

这篇关于QWORD使用SIMD SSE ... AVX将顺序7位字节对齐的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆