C++ SSE Intrinsics: Storing results in variables

Question

I have trouble understanding how to use SSE intrinsics to store the results of a SIMD calculation back into "normal variables". For example, the _mm_store_ps intrinsic is described in the "Intel Intrinsics Guide" as follows:

void _mm_store_ps (float* mem_addr, __m128 a)

Store 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.

The first argument is a pointer to a float, which is 32 bits wide. But the description states that the intrinsic will copy 128 bits from a into the target mem_addr.

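  • Does mem_addr need to be an array of 4 floats?
  • How can I access only a specific 32-bit element in a and store it in a single float?
  • What am I missing conceptually?
  • Are there better options than the _mm_store_ps intrinsic?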

Here is a simple struct where doSomething() adds 1 to x/y of the struct. What's missing is the part that stores the result back into x/y, given that only the higher 32-bit wide elements 2 & 3 are used, while 1 & 0 are unused.

struct vec2 {
   union {
      struct {
         float data[2];
      };
      struct {
         float x, y;
      };
   };

   void doSomething() {
      __m128 v1 = _mm_setr_ps(x, y, 0, 0);
      __m128 v2 = _mm_setr_ps(1, 1, 0, 0);
      __m128 result = _mm_add_ps(v1, v2);
      // ?? How to store results in x,y ??
   }
};

Recommended answer

It's a 128-bit load or store, so yes, the arg is like float mem[4]. Remember that in C, passing an array to a function / intrinsic is the same as passing a pointer to its first element.
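As a minimal sketch of that point, the destination below is simply storage for 4 consecutive floats. The names add4, add4_aligned, and Float4 are hypothetical, chosen for illustration. The first helper uses _mm_storeu_ps, which has no alignment requirement; the second uses _mm_store_ps and therefore assumes a 16-byte-aligned destination, which alignas(16) provides here.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_store_ps, _mm_storeu_ps

// Hypothetical helper: out must point to at least 4 floats.
// _mm_storeu_ps writes all 128 bits and accepts any alignment.
void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

// Hypothetical wrapper type that guarantees the 16-byte alignment
// required by _mm_store_ps.
struct alignas(16) Float4 {
    float v[4];
};

Float4 add4_aligned(const float* a, const float* b) {
    Float4 r;
    _mm_store_ps(r.v, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    return r;
}
```

Passing a plain float x; as the destination of either store would write 12 bytes past the end of x, which is why a single float is not a valid target for a 128-bit store.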

Intel's intrinsics are somewhat special because they don't follow the normal strict-aliasing rules, at least for integer. (For example, _mm_loadu_si128((const __m128i*)some_pointer) doesn't violate strict-aliasing even if it's a pointer to long.) I think the same applies to the float/double load/store intrinsics, so you can safely use them to load/store from/to whatever you want. Usually you'd use _mm_load_ps to load single-precision FP bit patterns, though, and usually you'd be keeping those in C objects of type float.

How can I access only a specific 32bit element in a and store it in a single float?

Use a vector shuffle to bring the element you want into the lowest position, and then _mm_cvtss_f32 to cast the vector to scalar.
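A minimal sketch of the shuffle-then-convert idea follows. The helper name extract is hypothetical; it assumes the element index is known at compile time, because _mm_shuffle_ps needs an immediate control value.

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _mm_cvtss_f32, _MM_SHUFFLE

// Hypothetical helper: returns element I of v (0 = lowest) as a scalar float.
template <int I>
float extract(__m128 v) {
    static_assert(I >= 0 && I < 4, "element index must be 0..3");
    // Broadcast element I into every lane, so it ends up in lane 0.
    __m128 shuffled = _mm_shuffle_ps(v, v, _MM_SHUFFLE(I, I, I, I));
    // Read lane 0 as a plain float.
    return _mm_cvtss_f32(shuffled);
}
```

For example, with v = _mm_setr_ps(x, y, 0, 0), extract<0>(v) yields x and extract<1>(v) yields y, since _mm_setr_ps places its first argument in the lowest element.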

Ideally you could operate on 2 vectors at once packed together, or an array of X values and an array of Y values, so with a pair of vectors you'd have the X and Y values for 4 pairs of XY coordinates. (struct-of-arrays instead of array-of-structs).
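To illustrate the struct-of-arrays layout, here is a rough sketch in which all X values live in one array and all Y values in another. The function name add_one_soa and the xs/ys parameters are assumptions made for this example. Each loop iteration updates 4 points with full-width loads and stores, and a scalar loop handles any leftover points.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_storeu_ps, _mm_add_ps, _mm_set1_ps

// Hypothetical SoA layout: xs[i] and ys[i] together form point i.
void add_one_soa(float* xs, float* ys, std::size_t n) {
    const __m128 one = _mm_set1_ps(1.0f);
    std::size_t i = 0;
    // Main loop: 4 X values and 4 Y values per iteration, no wasted lanes.
    for (; i + 4 <= n; i += 4) {
        _mm_storeu_ps(xs + i, _mm_add_ps(_mm_loadu_ps(xs + i), one));
        _mm_storeu_ps(ys + i, _mm_add_ps(_mm_loadu_ps(ys + i), one));
    }
    // Scalar tail for the remaining n % 4 points.
    for (; i < n; ++i) {
        xs[i] += 1.0f;
        ys[i] += 1.0f;
    }
}
```

With this layout no shuffles or partial stores are needed, which is the main reason it tends to suit SIMD better than an array of vec2 structs.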

But you can express what you're trying to do efficiently like this:

#include <immintrin.h>

struct vec2 {
    float x,y;
};

void foo(const struct vec2 *in, struct vec2 *out) {
    __m128d tmp = _mm_load_sd( (const double*)in );  //64-bit zero-extending load with MOVSD
    __m128  inv = _mm_castpd_ps(tmp);             // keep the compiler happy
    __m128  result = _mm_add_ps(inv,  _mm_setr_ps(1, 1, 0, 0) );

    _mm_storel_pi( (__m64*)out, result );         // 64-bit store of the low two floats (MOVLPS)
}

GCC 8.2 compiles it like this (on Godbolt), for x86-64 System V, strangely using movq instead of movsd for the load. gcc 6.3 uses movsd.

foo(vec2 const*, vec2*):
        movq    xmm0, QWORD PTR [rdi]           # 64-bit integer load
        addps   xmm0, XMMWORD PTR .LC0[rip]     # packed 128-bit float add
        movlps  QWORD PTR [rsi], xmm0           # 64-bit store
        ret

For a 64-bit store of the low half of a vector (2 floats or 1 double), you can use _mm_store_sd. Or better _mm_storel_pi (movlps). Unfortunately the intrinsic for it wants a __m64* arg instead of float*, but that's just a design quirk of Intel's intrinsics. They often require type casting.

Notice that instead of _mm_setr, I used _mm_load_sd((const double*)&(in->x)) to do a 64-bit load that zero-extends to a 128-bit vector. You don't want a movlps load because that merges into an existing vector. That would create a false dependency on whatever value was there before, and costs an extra ALU uop.
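Applied to the doSomething() member from the question, the same idea might look like the sketch below. This is not the answer's exact code. It swaps the pointer-cast load for a std::memcpy into a double followed by _mm_set_sd, which sidesteps any question about pointer casts while still producing a zero-extended 64-bit load. It assumes an x86-64 target (where a double is moved through SSE registers bit-for-bit) and a vec2 with no padding between x and y.

```cpp
#include <cstring>
#include <emmintrin.h>  // SSE2: _mm_set_sd, _mm_castpd_ps (also pulls in SSE)

struct vec2 {
    float x, y;

    void doSomething() {
        // Copy the 8 bytes of {x, y} into a double; this needs no pointer cast.
        double bits;
        std::memcpy(&bits, this, sizeof bits);
        // Low 64 bits = {x, y}; upper 64 bits are zeroed, so nothing is merged in.
        __m128 v = _mm_castpd_ps(_mm_set_sd(bits));
        __m128 result = _mm_add_ps(v, _mm_setr_ps(1.0f, 1.0f, 0.0f, 0.0f));
        // Store only the low two floats back into x and y.
        _mm_storel_pi(reinterpret_cast<__m64*>(this), result);
    }
};

// Assumption checked at compile time: x and y are adjacent with no padding.
static_assert(sizeof(vec2) == 2 * sizeof(float), "vec2 must be two packed floats");
```

Calling doSomething() on vec2{1.5f, -2.0f} leaves x == 2.5f and y == -1.0f. The upper two lanes only ever hold 0 + 0, so they never affect the stored result.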
