Integer SIMD Instruction AVX in C

Question

I am trying to run SIMD instructions over the data types int, float and double. I need multiply, add and load operations.

For float and double I managed to get these intrinsics working:

_mm256_add_ps, _mm256_mul_ps and _mm256_load_ps (with the suffix pd instead of ps for double). (A direct FMADD operation isn't supported.)

But for integers I couldn't find a working instruction. Everything shown in the Intel AVX manual gives a similar error with GCC 4.7, such as "'_mm256_mul_epu32' was not declared in this scope".

For loading integers I use _mm256_set_epi32, and GCC accepts that. I don't know why the other instructions aren't defined. Do I need to update something?

I include all of these: <pmmintrin.h>, <immintrin.h> and <x86intrin.h>.

My processor is an Intel Core i5 3570K (Ivy Bridge).

Recommended answer

256-bit integer operations were only added in AVX2, so you'll have to use 128-bit __m128i vectors for integer intrinsics if you only have AVX1. Ivy Bridge supports AVX1 but not AVX2.
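As a rough illustration of the 128-bit route, the sketch below computes a[i] * b[i] + c[i] on int32 data four lanes at a time. The function name and the multiple-of-four length restriction are assumptions made for this example. _mm_mullo_epi32 requires SSE4.1, which is enabled here per function with the GCC/Clang target attribute; compiling with -msse4.1 or -mavx would be an alternative. Note that the _mm256_mul_epu32 from the question is a widening multiply; its 128-bit counterpart is _mm_mul_epu32 (SSE2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Element-wise out[i] = a[i] * b[i] + c[i] on int32, four lanes at a time.
   Hypothetical helper for illustration; not taken from the original post. */
__attribute__((target("sse4.1")))
void muladd_epi32_128(const int32_t *a, const int32_t *b, const int32_t *c,
                      int32_t *out, size_t n)
{
    assert(n % 4 == 0);  /* sketch only: no scalar tail loop */
    for (size_t i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vc = _mm_loadu_si128((const __m128i *)(c + i));
        /* Keeps the low 32 bits of each 32x32 product (SSE4.1). */
        __m128i prod = _mm_mullo_epi32(va, vb);
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(prod, vc));
    }
}
```

Eight int32 values then take two iterations instead of one 256-bit operation.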

AVX1 does have integer loads/stores, and intrinsics like _mm256_set_epi32 can be implemented with FP shuffles or a simple load of a compile-time constant.
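A minimal sketch of what AVX1 alone allows on __m256i: building a vector and moving it to and from memory. The function names are made up for this example, and the target attribute is a GCC/Clang feature used in place of -mavx. One detail worth noting is that _mm256_set_epi32 takes its arguments from the highest element down to the lowest.

```c
#include <stdint.h>
#include <immintrin.h>

/* Build a 256-bit integer vector and store it using AVX1 only.
   The LAST argument of _mm256_set_epi32 ends up in out[0]. */
__attribute__((target("avx")))
void store_0_to_7(int32_t out[8])
{
    __m256i v = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Copy eight int32 values through one YMM register:
   an unaligned integer load followed by an unaligned integer store. */
__attribute__((target("avx")))
void copy8_epi32(const int32_t *src, int32_t *dst)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, v);
}
```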

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2

Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions,[2] is an expansion of the AVX instruction set introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:

  • expansion of most vector integer SSE and AVX instructions to 256 bits
  • three-operand general-purpose bit manipulation and multiply
  • three-operand fused multiply-accumulate support (FMA3)
  • gather support, enabling vector elements to be loaded from non-contiguous memory locations
  • DWORD- and QWORD-granularity any-to-any permutes
  • vector shifts

FMA3 is actually a separate feature from AVX2: AMD Piledriver/Steamroller have FMA3 but not AVX2.
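Because AVX, AVX2 and FMA3 are reported as separate CPU features, a program can query them independently at run time. The sketch below uses the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports; the wrapper names and the dispatcher are hypothetical, and the "fma" feature string assumes a reasonably recent compiler.

```c
#include <stdbool.h>

/* Each feature is queried separately; having one does not imply another,
   except that AVX2-capable CPUs also support AVX. */
bool cpu_has_avx(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
}

bool cpu_has_avx2(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
}

bool cpu_has_fma3(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("fma") != 0;
}

/* Hypothetical dispatcher: width in bits of the widest integer vector path.
   Without AVX2, integer SIMD stays at 128 bits (SSE2 is baseline on x86-64). */
int integer_vector_bits(void)
{
    return cpu_has_avx2() ? 256 : 128;
}
```

On an Ivy Bridge CPU such as the one in the question, this check would be expected to report AVX but neither AVX2 nor FMA3.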

Nevertheless, if the int value range fits in 24 bits then you can use float instead. Note, however, that if you need the exact result or the low bits of the result, you'll have to convert float to double, because a 24x24-bit multiplication produces a 48-bit result, which can only be stored exactly in a double. At that point you have only 4 elements per vector, and might have been better off with XMM vectors of int32. (But note that FMA throughput is typically better than integer multiply throughput.)
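To make the precision trade-off concrete, here is a sketch with two illustrative helpers (the names are assumptions for this example): one widens four int32 values to double before multiplying, the other multiplies eight values in single precision. Both use only AVX1 intrinsics, enabled through the GCC/Clang target attribute.

```c
#include <stdint.h>
#include <immintrin.h>

/* Four exact products: 24-bit inputs give products of at most 48 bits,
   which fit in the 53-bit significand of a double. */
__attribute__((target("avx")))
void mul4_via_double(const int32_t *a, const int32_t *b, double *out)
{
    __m256d va = _mm256_cvtepi32_pd(_mm_loadu_si128((const __m128i *)a));
    __m256d vb = _mm256_cvtepi32_pd(_mm_loadu_si128((const __m128i *)b));
    _mm256_storeu_pd(out, _mm256_mul_pd(va, vb));
}

/* Eight products in single precision: exact only while each product
   itself fits in float's 24-bit significand; larger ones are rounded. */
__attribute__((target("avx")))
void mul8_via_float(const int32_t *a, const int32_t *b, float *out)
{
    __m256 va = _mm256_cvtepi32_ps(_mm256_loadu_si256((const __m256i *)a));
    __m256 vb = _mm256_cvtepi32_ps(_mm256_loadu_si256((const __m256i *)b));
    _mm256_storeu_ps(out, _mm256_mul_ps(va, vb));
}
```

For example, 16777215 * 16777215 is reproduced exactly by the double version, while the float version loses the low bits of the product.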

AVX1 has VEX encodings of 128-bit integer operations, so you can use them in the same function as 256-bit FP intrinsics without causing SSE-AVX transition stalls. (In C you generally don't have to worry about that; your compiler will take care of using vzeroupper where needed.)
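A sketch of mixing the two widths in one function, under the assumption that the code is built with AVX enabled (here via the GCC/Clang target attribute, so the 128-bit integer adds can be emitted in VEX form). The function name is hypothetical. Integer addition runs on two 128-bit halves, which are then recombined into a 256-bit vector for the floating-point part.

```c
#include <stdint.h>
#include <immintrin.h>

/* out[i] = (float)(a[i] + b[i]) * scale for eight elements.
   Integer adds use 128-bit halves (AVX1 has no 256-bit integer add);
   the conversion and multiply use full 256-bit vectors. */
__attribute__((target("avx")))
void add_then_scale8(const int32_t *a, const int32_t *b, float scale,
                     float *out)
{
    __m128i lo = _mm_add_epi32(_mm_loadu_si128((const __m128i *)a),
                               _mm_loadu_si128((const __m128i *)b));
    __m128i hi = _mm_add_epi32(_mm_loadu_si128((const __m128i *)(a + 4)),
                               _mm_loadu_si128((const __m128i *)(b + 4)));
    /* Recombine: lo becomes the low lane, hi is inserted as the high lane. */
    __m256i sum = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    __m256 f = _mm256_cvtepi32_ps(sum);
    _mm256_storeu_ps(out, _mm256_mul_ps(f, _mm256_set1_ps(scale)));
}
```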

You could try to simulate an integer addition with AVX bitwise instructions like VANDPS and VXORPS, but without a bitwise left shift for ymm vectors it won't work.

If you're sure FTZ/DAZ are not set, you can use small integers as denormal (subnormal) float values, where the bits outside the mantissa are all zero. Then FP addition and integer addition are the same bitwise operation. (And VADDPS doesn't need a microcode assist on Intel hardware when the inputs and result are both denormal.)
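The denormal trick can be sketched as follows. This is an illustration under stated assumptions, not a recommended general technique: FTZ and DAZ must be off (the default unless something like -ffast-math changes it), and every input and every sum must be non-negative and below 2^23 so that only mantissa bits are set. The function name is made up for this example.

```c
#include <stdint.h>
#include <immintrin.h>

/* Adds eight small non-negative integers with one 256-bit VADDPS by
   reinterpreting their bit patterns as subnormal floats.
   Assumes FTZ/DAZ are clear and all inputs and sums are below 2^23. */
__attribute__((target("avx")))
void add8_small_via_addps(const int32_t *a, const int32_t *b, int32_t *out)
{
    /* The casts only reinterpret bits; no numeric conversion happens. */
    __m256 fa = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)a));
    __m256 fb = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)b));
    __m256 sum = _mm256_add_ps(fa, fb);
    _mm256_storeu_si256((__m256i *)out, _mm256_castps_si256(sum));
}
```

This covers addition only; multiplying two subnormals would underflow to zero rather than give the integer product.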
