Trying to convert big to little endian with x86 asm SSSE3


Question

I have been writing ARM asm for a while and tried to optimize simple loops with x86 asm using SSSE3. I cannot find a way to convert big endian to little endian.

ARM NEON has a single vector instruction that does exactly this, but SSSE3 does not. I tried two shifts and an OR, but that seems to require going to 32 bits per slot instead of 16 when shifting left by 8 (the data gets saturated).

I looked into PSHUFB, but when I use it, the first half of the 16-bit words is always 0.

I am using inline asm on x86 for Android. Sorry for any incorrect syntax or other errors; I hope the intent is clear (it is hard to rip this out of my code).

// Data
uint16_t dataSrc[] = {0x7000, 0x4401, 0x3801, 0xf002, 0x4800, 0xb802, 0x1800, 
0x3c00, 0xd800.....
uint16_t* src = dataSrc;
uint8_t * dst = new uint8_t[16]();
uint8_t * map = new uint8_t[16]{ 9,8, 11,10, 13,12, 15,14, 1,0,3,2,5,4,7,6 };

// Goal: byte-swap every 16-bit word, vectorized (e.g. 0x7000 becomes 0x0070).

asm volatile (
        "movdqu     (%0),%%xmm1\n"
        "pshufb     %2,%%xmm1\n"
        "movdqu     %%xmm1,(%1)\n"
:   "+r" (src),
"+r" (dst),
"+r" (map)
:
:   "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4"
);

If I loop through the dataSrc variable, my output for the first 8 values is:

0: 0
1: 0
2: 0
3: 0
4: 72
5: 696
6: 24
7: 60

Only the last 4 are swapped, and even those are in the wrong order. Why are the first 4 all zeros? No matter how I change the map, the first is sometimes 0 and the next 3 are always zero. Why? Am I doing something wrong?

Edit

I figured out why it didn't work: the map was not passed into the inline asm correctly. I had to free up an input variable for it.

For the other questions about intrinsics vs. hand-written asm: the code below converts 16-byte chunks of video frame data from YUV42010BE to YUVP420 (8 bit). The problem is with the shuffle; if I use little endian, I do not need that section.

static const char map[16] = { 9, 8, 11, 10, 13, 12, 15, 14, 1, 0, 3, 2, 5, 4, 7, 6 };
int dstStrideOffset = (dstStride - srcStride / 2);
asm volatile (
    "push       %%ebp\n"

    // All 0s for packing
    "xorps      %%xmm0, %%xmm0\n"

    "movdqu     (%5),%%xmm4\n"

    "yloop:\n"

    // Set the counter for the stride
    "mov %2,    %%ebp\n"

    "xloop:\n"

    // Load source data
    "movdqu     (%0),%%xmm1\n"
    "movdqu     16(%0),%%xmm2\n"
    "add        $32,%0\n"

    // The first four 16-bit words come out as 0,0,0,0; this is the issue.
    "pshufb      %%xmm4, %%xmm1\n"
    "pshufb      %%xmm4, %%xmm2\n"

    // Shift each 16 bit to the right to convert
    "psrlw      $0x2,%%xmm1\n"
    "psrlw      $0x2,%%xmm2\n"

    // Merge both 16bit vectors into 1 8bit vector
    "packuswb   %%xmm0, %%xmm1\n"
    "packuswb   %%xmm0, %%xmm2\n"
    "unpcklpd   %%xmm2, %%xmm1\n"

    // Write the data
    "movdqu     %%xmm1,(%1)\n"
    "add        $16, %1\n"

    // End loop, x = srcStride; x > 0; x -= 32
    "sub        $32, %%ebp\n"
    "jg         xloop\n"

    // End loop, y = height; y > 0; --y
    "add %4,    %1\n"
    "sub $1,    %3\n"
    "jg         yloop\n"

    "pop        %%ebp\n"
:   "+r" (src),
    "+r" (dst),
    "+r" (srcStride),
    "+r" (height),
    "+r"(dstStrideOffset)
:   "x"(map)
:   "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4"
);

I haven't gotten around to implementing the shuffle with intrinsics yet, so this version uses little endian:

const int dstStrideOffset = (dstStride - srcStride / 2);
__m128i mdata, mdata2;
const __m128i zeros = _mm_setzero_si128();
for (int y = height; y > 0; --y) {
    for (int x = srcStride; x > 0; x -= 32) {
        mdata = _mm_loadu_si128((const __m128i *)src);
        mdata2 = _mm_loadu_si128((const __m128i *)(src + 8));
        mdata = _mm_packus_epi16(_mm_srli_epi16(mdata, 2), zeros);
        mdata2 = _mm_packus_epi16(_mm_srli_epi16(mdata2, 2), zeros);
        // Combine the low 64 bits of both vectors (integer form of unpcklpd)
        _mm_storeu_si128((__m128i *)dst, _mm_unpacklo_epi64(mdata, mdata2));
        src += 16;
        dst += 16;
    }
    dst += dstStrideOffset;
}

It is probably not written correctly, but I benchmarked on the Android emulator (API 27), x86 (SSSE3 is the highest available, i686), with the default compiler settings plus added optimizations such as -Ofast -O3 -funroll-loops -mssse3 -mfpmath=sse (which made no difference in performance). On average:

Intrinsics: 1.9-2.1 ms
Hand-written: 0.7-1 ms

Is there a way to speed this up? Maybe I wrote the intrinsics wrong. Is it possible to get closer to the hand-written speed with intrinsics?

Recommended answer

Your code doesn't work because you pass the address of map to pshufb. I'm not sure what code gcc generates for this; I can't imagine it compiles at all.
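To make the role of the control operand concrete, here is a small scalar sketch of what PSHUFB computes on 16 bytes. The function name `pshufb_model` is hypothetical (it is not a library call); it only models the documented behaviour: each control byte either zeroes the output byte (bit 7 set) or selects a source byte with its low 4 bits. The control operand therefore has to hold the map's 16 bytes themselves; anything else (such as a pointer value) selects unrelated bytes.

```c
#include <stdint.h>

/* Scalar reference model of PSHUFB for one 16-byte vector.
 * Hypothetical helper for illustration only. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t ctrl[16])
{
    uint8_t tmp[16];

    for (int i = 0; i < 16; i++) {
        /* Bit 7 set: the output byte is zero.
         * Otherwise the low 4 bits index into the source vector. */
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
    }

    /* Copy out afterwards so dst may alias src, like the real instruction. */
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}
```

Feeding this model the map from the question shows that output byte 0 comes from source byte 9, output byte 1 from source byte 8, and so on.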

It is usually not a good idea to use inline assembly for this sort of thing. Instead, use intrinsic functions:

#include <immintrin.h>

void byte_swap(char dst[16], const char src[16])
{
    __m128i msrc, map, mdst;

    msrc = _mm_loadu_si128((const __m128i *)src);
    map = _mm_setr_epi8(9, 8, 11, 10, 13, 12, 15, 14, 1, 0, 3, 2, 5, 4, 7, 6);
    mdst = _mm_shuffle_epi8(msrc, map);
    _mm_storeu_si128((__m128i *)dst, mdst);
}

Apart from being easier to maintain, this optimizes better because, unlike inline assembly, the compiler can introspect intrinsic functions and make informed decisions about which instructions to emit. For example, on an AVX target, it might emit the VEX-encoded vpshufb instead of pshufb to avoid a stall due to an AVX/SSE transition.
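One detail of the map is worth spelling out. The map taken from the question (9, 8, 11, 10, ..., 1, 0, 3, 2, ...) swaps the bytes inside each 16-bit word and also exchanges the two 64-bit halves of the vector. If only an in-place byte swap of each word is wanted, a map of 1, 0, 3, 2, ..., 15, 14 does that. The sketch below assumes a GCC or Clang toolchain on x86; the function name `swap16x8` is hypothetical, and the `target("ssse3")` attribute is used only so the example builds without passing -mssse3.

```c
#include <immintrin.h>
#include <stdint.h>

/* Byte-swap eight 16-bit words in place order (word i of dst is the
 * byte-swapped word i of src). Hypothetical helper for illustration. */
__attribute__((target("ssse3")))
void swap16x8(uint16_t dst[8], const uint16_t src[8])
{
    /* Control bytes: output byte 0 takes source byte 1, output byte 1
     * takes source byte 0, and so on for every 16-bit word. */
    const __m128i map = _mm_setr_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                      9, 8, 11, 10, 13, 12, 15, 14);
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    v = _mm_shuffle_epi8(v, map);
    _mm_storeu_si128((__m128i *)dst, v);
}
```

Whether the half-swapping map or the in-place map is the right one depends on how the following pack and unpack steps expect the data to be laid out.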

If for any reason you cannot use intrinsic functions, use inline assembly like this:

void byte_swap(char dst[16], const char src[16])
{
    typedef long long __m128i_u __attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));
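    /* 16-byte aligned so that pshufb may safely take it as a memory operand */
    static const char map[16] __attribute__ ((__aligned__ (16))) = { 9, 8, 11, 10, 13, 12, 15, 14, 1, 0, 3, 2, 5, 4, 7, 6 };
    __m128i_u data = *(const __m128i_u *)src;

    asm ("pshufb %1, %0" : "+x"(data) : "xm"(*(const __m128i_u *)map));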
   *(__m128i_u *)dst = data;
}
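As an aside on the shift-and-OR idea mentioned in the question: the 16-bit shifts operate per lane and simply discard the bits shifted out, so the swap can stay in 16-bit lanes and needs only SSE2. The following is a minimal sketch under the same toolchain assumptions (GCC or Clang on x86); the function name `swap16x8_shift` is hypothetical.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Byte-swap eight 16-bit words with two shifts and an OR (SSE2 only).
 * Hypothetical helper for illustration. */
__attribute__((target("sse2")))
void swap16x8_shift(uint16_t dst[8], const uint16_t src[8])
{
    __m128i v  = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_slli_epi16(v, 8);   /* low byte moves to the high byte  */
    __m128i lo = _mm_srli_epi16(v, 8);   /* high byte moves to the low byte  */

    _mm_storeu_si128((__m128i *)dst, _mm_or_si128(hi, lo));
}
```

This costs three instructions instead of one pshufb, so it is mainly of interest where SSSE3 is not available.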
