将8个字符从内存加载到m256变量作为打包的单精度浮点数 [英] Loading 8 chars from memory into an m256 variable as packed single precision floats

查看：3060 发布时间：2016/10/14 11:08:54 c++ sse simd avx avx2

本文介绍了将8个字符从内存加载到__m256变量作为打包的单精度浮点数的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在优化一个图像上的高斯模糊的算法，我想用下面的代码中的float缓冲区[8]替换一个__m256固有变量。

I am optimizing an algorithm for Gaussian blur on an image and I want to replace the usage of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task?

// unsigned char *new_image is loaded with data
...
  float buffer[8];

  buffer[x ]      = new_image[x];       
  buffer[x + 1] = new_image[x + 1]; 
  buffer[x + 2] = new_image[x + 2]; 
  buffer[x + 3] = new_image[x + 3]; 
  buffer[x + 4] = new_image[x + 4]; 
  buffer[x + 5] = new_image[x + 5]; 
  buffer[x + 6] = new_image[x + 6]; 
  buffer[x + 7] = new_image[x + 7]; 
 // buffer is then used for further operations
...

//What I want instead in pseudocode:
 __m256 b = [float(new_image[x+7]), float(new_image[x+6]), ... , float(new_image[x])];

推荐答案

如果您使用AVX2，可以使用PMOVZX在256b寄存器中将字符零扩展为32位整数。

If you're using AVX2, you can use PMOVZX to zero-extend your chars into 32bit integers in a 256b register. From there, conversion to float can happen in-place.

; rsi = new_image
VPMOVZXBD   ymm0,  [rsi]   ; or SX to sign-extend  (Byte to DWord)
VCVTDQ2PS   ymm0, ymm0     ; convert to packed foat

我的答案在使用SSE2（作为浮动）缩放字节像素值（y = ax + b）？可能是相关的。如果使用AVX2 packssdw / packuswb ，因为它们在内部工作，而不像那样，pack-back-to-bytes之后的部分是半巧妙的vpmovzx 。

My answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)? may be relevant. The pack-back-to-bytes afterward part is semi-tricky if doing it with AVX2 packssdw/packuswb, because they work in-lane, unlike vpmovzx.

由于这个答案，提交了两个gcc错误：

Two gcc bugs submitted because of this answer:

SSE / AVX movq加载（_mm_cvtsi64_si128）未折叠到pmovzx

x86 MOVQ m64，％xmm 在32位模式。（TODO：为clang / LLVM也报告此事）

SSE/AVX movq load (_mm_cvtsi64_si128) not being folded into pmovzx
No intrinsic for x86 MOVQ m64, %xmm in 32bit mode. (TODO: report this for clang/LLVM as well?)

，而不是AVX2，应该：

With only AVX1, not AVX2, you should do:

VPMOVZXBD   xmm0,  [rsi]
VPMOVZXBD   xmm1,  [rsi+4]
VINSERTF128 ymm0, ymm0, xmm1, 1   ; put the 2nd load of data into the high128 of ymm0
VCVTDQ2PS   ymm0, ymm0     ; convert to packed foat.  Yes, works without AVX2

你当然不需要float数组，只是 __ m256 向量。

You of course never need an array of float, just __m256 vectors.

我实际上找不到一种方法来做到这一点内在性是安全的（避免加载期望的8B -O0 ）和最佳（使用 -O3 创建好代码）。

I can't actually find a way to do this with intrinsics that is both safe (avoids loading outside the desired 8B with -O0) and optimal (makes good code with -O3).

没有固有的使用SSE4.1 pmovsx / pmovzx __ m128i 源操作数。然而，他们只读取他们实际使用的数据量。与 punpck * 不同，您可以在页面的最后一个8B上使用它，而不会出错。（和非AVX版本的未对齐地址）。

There's no intrinsic to use SSE4.1 pmovsx / pmovzx as a load, only with a __m128i source operand. However, they only read the amount of data they actually use. Unlike punpck*, you can use this on the last 8B of a page without faulting. (And on unaligned addresses even with the non-AVX version).

movq 5.3没有看到它，仍然将负载折叠成 vpmovzx 的内存操作数。所以函数编译成3个指令。 clang 3.6将movq折叠成pmovzx的内存操作数，但是clang 3.5.1不。 ICC13也制作了最佳代码。

There an intrinsic for movq, but gcc 5.3 doesn't see through it and still fold the load into a memory operand for vpmovzx. So the function is compiled to 3 instructions. clang 3.6 does fold the movq into a memory operand for pmovzx, but clang 3.5.1 does't. ICC13 also makes optimal code.

这里是我提出的邪恶的解决方案。不要使用此， #ifdef __OPTIMIZE __ 是错误。

So here's the evil solution I've come up with. Don't use this, #ifdef __OPTIMIZE__ is Bad.

#if !defined(__OPTIMIZE__)
// Making your code compile differently with/without optimization is a TERRIBLE idea
// great way to create Heisenbugs that disappear when you try to debug them.
// Even if you *plan* to always use -Og for debugging, instead of -O0, this is still evil
#define USE_MOVQ
#endif

__m256 load_bytes_to_m256(uint8_t *p)
{
#ifdef  USE_MOVQ  // compiles to an actual movq then movzx reg, reg with gcc -O3
    __m128i small_load = _mm_cvtsi64_si128( *(uint64_t*)p );
#else  // USE_LOADU // compiles to a 128b load with gcc -O0, potentially segfaulting
    __m128i small_load = _mm_loadu_si128( (__m128i*)p );
#endif

    __m256i intvec = _mm256_cvtepu8_epi32( small_load );
    //__m256i intvec = _mm256_cvtepu8_epi32( *(__m128i*)p );  // compiles to an aligned load with -O0
    return _mm256_cvtepi32_ps(intvec);
}

启用USE_MOVQ后， gcc -O3 （v5.3.0）emit

With USE_MOVQ enabled, gcc -O3 (v5.3.0) emits

load_bytes_to_m256(unsigned char*):
        vmovq   xmm0, QWORD PTR [rdi]
        vpmovzxbd       ymm0, xmm0
        vcvtdq2ps       ymm0, ymm0
        ret

愚蠢的 vmovq 。如果你让它使用 loadu_si128 版本，它会做出很好的优化代码。

The stupid vmovq is what we want to avoid. If you let it use the loadu_si128 version, it will make good optimized code.

使用内在函数编写仅含AVX1的版本对于读者来说是一个无趣的练习。你要求指令，而不是内在性，这是一个地方，在内在性有一个差距。必须使用 _mm_cvtsi64_si128 ，以避免可能从界外地址加载是愚蠢的，IMO。我想要能够根据它们映射到的指令来考虑内在函数，加载/存储内在函数通知编译器关于对齐保证或缺少它们。不必使用内在的指令我不想是很蠢。

Writing the AVX1-only version with intrinsics is left as an un-fun exercise for the reader. You asked for "instructions", not "intrinsics", and this is one place where there's a gap in the intrinsics. Having to use _mm_cvtsi64_si128 to avoid potentially loading from out-of-bounds addresses is stupid, IMO. I want to be able to think of intrinsics in terms of the instructions they map to, with the load/store intrinsics as informing the compiler about alignment guarantees or lack thereof. Having to use the intrinsic for an instruction I don't want is pretty dumb.

另请注意，如果您正在查看Intel insn参考手册，movq有两个单独的条目：

Also note that if you're looking in the Intel insn ref manual, there are two separate entries for movq:

movd / movq，可以有整数寄存器作为src / dest操作数的版本（ 66 REX.W 0F 6E （或 VEX.128.66.0F.W1 6E ）（V）MOVQ xmm，r / m64）。这是您将找到可以接受64位整数的内在。

movd/movq, the version that can have an integer register as a src/dest operand (66 REX.W 0F 6E (or VEX.128.66.0F.W1 6E) for (V)MOVQ xmm, r/m64). That's where you'll find the intrinsic that can accept a 64bit integer.

movq：可以有两个xmm寄存器作为操作数的版本。这一个是MMXreg - > MMXreg指令的扩展，也可以像MOVDQU一样加载/存储。其操作码 F3 0F 7E （ VEX.128.F3.0F.WIG 7E ）for MOVQ xmm，xmm / m64）。这里列出的唯一内在的是 m128i _mm_mov_epi64（__ m128i a）用于在复制向量时将向量的高64b置零。

movq: the version that can have two xmm registers as operands. This one is an extension of the MMXreg -> MMXreg instruction, which can also load/store, like MOVDQU. Its opcode F3 0F 7E (VEX.128.F3.0F.WIG 7E) for MOVQ xmm, xmm/m64). The only intrinsic listed here is the m128i _mm_mov_epi64(__m128i a) for zeroing the high 64b of a vector while copying it.

这真的很蠢。 gcc甚至不为32位目标定义 _mm_cvtsi64_si128 。 vmovq xmm，r / m64 当然不能在32位模式下编码，因为它依赖于VEX.W（或非AVX编码的REX前缀）可以将64位寄存器编码为源，而不是64位存储器位置。你可以使用内部函数来加载到MMX寄存器，然后mmx - > xmm，然后 _mm_mov_epi64 ，但它可能不会优化通过mmx寄存器的反弹。

This is really dumb. gcc doesn't even define _mm_cvtsi64_si128 for 32bit targets. vmovq xmm, r/m64 of course isn't encodable in 32bit mode, since it relies on VEX.W (or a REX prefix for the non-AVX encoding), and could encode a 64bit register as a source, instead of a 64bit memory location. You could maybe use the intrinsic for a load into an MMX register, then mmx -> xmm, then _mm_mov_epi64, but it probably wouldn't optimize away the bounce through the mmx register.

ICC13定义 _mm_cvtsi64_si128 为32位，但是 -O0 它编译为2x vmovd + vpunpckldq 。但它确实可以使用 vmovq 和 -O3 ，（用于单独的测试功能），甚至在32位模式下。所以它不卡住模仿的braindead方式。

ICC13 does define _mm_cvtsi64_si128 for 32bit, but with -O0 it compiles to 2xvmovd + vpunpckldq. It does manage to use vmovq with -O3, though, (for a separate test function) even in 32bit mode. So it isn't stuck emulating the braindead way.

这篇关于将8个字符从内存加载到__m256变量作为打包的单精度浮点数的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

将8个字符从内存加载到m256变量作为打包的单精度浮点数 [英] Loading 8 chars from memory into an m256 variable as packed single precision floats

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

将8个字符从内存加载到__m256变量作为打包的单精度浮点数 [英] Loading 8 chars from memory into an __m256 variable as packed single precision floats

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

将8个字符从内存加载到m256变量作为打包的单精度浮点数 [英] Loading 8 chars from memory into an m256 variable as packed single precision floats

登录关闭