C++ SSE Intrinsics: Storing results in variables
Question
I have trouble understanding the usage of SSE intrinsics to store results of some SIMD calculation back into "normal variables". For example the _mm_store_ps intrinsic is described in the "Intel Intrinsics Guide" as follows:
void _mm_store_ps (float* mem_addr, __m128 a)
Store 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
The first argument is a pointer to a float, which has a size of 32 bits. But the description states that the intrinsic will copy 128 bits from a into the target mem_addr.
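- Does mem_addr need to be an array of 4 floats?
- How can I access only a specific 32-bit element in a and store it in a single float?
- What am I missing conceptually?
- Are there better options than the _mm_store_ps intrinsic?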
Here is a simple struct where doSomething() adds 1 to x/y of the struct. What's missing is the part on how to store the result back into x/y while only the higher 32-bit-wide elements 2 & 3 are used, while 1 & 0 are unused.
#include <xmmintrin.h>  // SSE intrinsics

struct vec2 {
    union {
        struct {
            float data[2];
        };
        struct {
            float x, y;
        };
    };
    void doSomething() {
        __m128 v1 = _mm_setr_ps(x, y, 0, 0);
        __m128 v2 = _mm_setr_ps(1, 1, 0, 0);
        __m128 result = _mm_add_ps(v1, v2);
        // ?? How to store results in x,y ??
    }
};
Recommended answer
It's a 128-bit load or store, so yes, the arg is like float mem[4]. Remember that in C, passing an array to a function / intrinsic is the same as passing a pointer.
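To make the "float mem[4]" point concrete, here is a minimal sketch. The helper names add_one_x4 and add_one_x4_aligned are hypothetical and not part of the original answer; the intrinsics themselves are the standard SSE ones.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_store_ps, ...

// Hypothetical helper: adds 1.0f to four floats.
// `in` and `out` must each point to at least 4 floats. No alignment is
// assumed, so the unaligned load/store variants are used.
void add_one_x4(const float* in, float* out) {
    __m128 v = _mm_loadu_ps(in);                   // reads in[0..3]
    __m128 r = _mm_add_ps(v, _mm_set1_ps(1.0f));   // four adds at once
    _mm_storeu_ps(out, r);                         // writes out[0..3] (16 bytes)
}

// Same computation, but the caller promises that `out16` is 16-byte
// aligned, which is the requirement _mm_store_ps places on its pointer.
void add_one_x4_aligned(const float* in, float* out16) {
    __m128 v = _mm_loadu_ps(in);
    __m128 r = _mm_add_ps(v, _mm_set1_ps(1.0f));
    _mm_store_ps(out16, r);                        // aligned 16-byte store
}
```

A destination declared as alignas(16) float out[4]; satisfies the alignment requirement of the second variant.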
Intel's intrinsics are somewhat special because they don't follow the normal strict-aliasing rules, at least for integer. (For example, _mm_loadu_si128((const __m128i*)some_pointer) doesn't violate strict aliasing even if it's a pointer to long.) I think the same applies to the float/double load/store intrinsics, so you can safely use them to load/store from/to whatever you want. Usually you'd use _mm_load_ps to load single-precision FP bit patterns, though, and you'd usually be keeping those in C objects of type float.
How can I access only a specific 32bit element in a and store it in a single float?
Use a vector shuffle and then _mm_cvtss_f32 to cast the vector to scalar.
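The shuffle-then-convert idea can be sketched as follows. The template name extract_lane is an assumption made for illustration; _mm_shuffle_ps, _MM_SHUFFLE and _mm_cvtss_f32 are the real SSE intrinsics and macro.

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _MM_SHUFFLE, _mm_cvtss_f32

// Hypothetical helper: returns element I of v as a scalar float,
// where element 0 is the lowest 32 bits of the vector.
// The lane index must be a compile-time constant because the shuffle
// control is encoded as an immediate in the instruction.
template <int I>
float extract_lane(__m128 v) {
    static_assert(I >= 0 && I < 4, "lane index out of range");
    // Copy lane I into every position, so it ends up in the low element.
    __m128 shuffled = _mm_shuffle_ps(v, v, _MM_SHUFFLE(I, I, I, I));
    // Read the low element as a scalar float.
    return _mm_cvtss_f32(shuffled);
}
```

For element 0 no shuffle is needed at all; _mm_cvtss_f32(v) alone returns it.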
Ideally you could operate on 2 vectors at once packed together, or an array of X values and an array of Y values, so with a pair of vectors you'd have the X and Y values for 4 pairs of XY coordinates. (struct-of-arrays instead of array-of-structs).
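As a rough illustration of the struct-of-arrays layout, the sketch below keeps all X values in one array and all Y values in another, so each vector add handles four coordinates. The type Vec2Array and the function add_one_to_all are hypothetical names, not something from the original answer.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical struct-of-arrays view: xs[i] and ys[i] together form point i.
struct Vec2Array {
    float* xs;
    float* ys;
    std::size_t count;
};

// Adds 1.0f to every x and every y.
void add_one_to_all(Vec2Array& a) {
    const __m128 ones = _mm_set1_ps(1.0f);
    std::size_t i = 0;
    // Main loop: four X values and four Y values per iteration,
    // with all four lanes of each vector doing useful work.
    for (; i + 4 <= a.count; i += 4) {
        _mm_storeu_ps(a.xs + i, _mm_add_ps(_mm_loadu_ps(a.xs + i), ones));
        _mm_storeu_ps(a.ys + i, _mm_add_ps(_mm_loadu_ps(a.ys + i), ones));
    }
    // Scalar tail for the remaining 0..3 points.
    for (; i < a.count; ++i) {
        a.xs[i] += 1.0f;
        a.ys[i] += 1.0f;
    }
}
```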
But you can express what you're trying to do efficiently like this:
#include <immintrin.h>

struct vec2 {
    float x, y;
};

void foo(const struct vec2 *in, struct vec2 *out) {
    __m128d tmp = _mm_load_sd( (const double*)in );   // 64-bit zero-extending load with MOVSD
    __m128 inv = _mm_castpd_ps(tmp);                  // keep the compiler happy
    __m128 result = _mm_add_ps(inv, _mm_setr_ps(1, 1, 0, 0) );
    _mm_storel_pi( (__m64*)out, result );             // 64-bit store of the low half (MOVLPS)
}
GCC 8.2 compiles it like this (on Godbolt), for x86-64 System V, strangely using movq instead of movsd for the load. GCC 6.3 uses movsd.
foo(vec2 const*, vec2*):
movq xmm0, QWORD PTR [rdi] # 64-bit integer load
addps xmm0, XMMWORD PTR .LC0[rip] # packed 128-bit float add
movlps QWORD PTR [rsi], xmm0 # 64-bit store
ret
For a 64-bit store of the low half of a vector (2 floats or 1 double), you can use _mm_store_sd. Or better, _mm_storel_pi (movlps). Unfortunately the intrinsic for it wants a __m64* arg instead of float*, but that's just a design quirk of Intel's intrinsics. They often require type casting.
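Applied to the doSomething() member from the question, the _mm_storel_pi store might look like the sketch below. This is an illustration, not the answer's own code: the struct is simplified to two plain float members (no anonymous union), and it assumes x and y are laid out contiguously, which the static_assert checks.

```cpp
#include <xmmintrin.h>  // SSE: _mm_setr_ps, _mm_add_ps, _mm_storel_pi

struct vec2 {
    float x, y;

    void doSomething() {
        // x and y occupy the two low elements (0 and 1) of the vector.
        __m128 v1 = _mm_setr_ps(x, y, 0.0f, 0.0f);
        __m128 v2 = _mm_setr_ps(1.0f, 1.0f, 0.0f, 0.0f);
        __m128 result = _mm_add_ps(v1, v2);
        // 64-bit store of the low two elements: writes x and y in one go.
        // The cast is needed only because the intrinsic takes a __m64*.
        _mm_storel_pi(reinterpret_cast<__m64*>(&x), result);
    }
};

// The single 8-byte store relies on y directly following x.
static_assert(sizeof(vec2) == 2 * sizeof(float), "x and y must be contiguous");
```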
Notice that instead of _mm_setr_ps, I used _mm_load_sd((const double*)&(in->x)) to do a 64-bit load that zero-extends to a 128-bit vector. You don't want a movlps load because that merges into an existing vector. That would create a false dependency on whatever value was there before, and cost an extra ALU uop.
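The difference between a zero-extending and a merging 64-bit load can be seen in the sketch below. Both function names are hypothetical. Note one deliberate substitution: the zero-extending version uses _mm_loadl_epi64 (the movq form that GCC 8.2 emitted above) rather than the answer's _mm_load_sd, simply to avoid the cast to double*; the resulting vector should be the same.

```cpp
#include <emmintrin.h>  // SSE2: _mm_loadl_epi64, _mm_castsi128_ps (also pulls in SSE)

struct vec2 {
    float x, y;
};

// Zero-extending 64-bit load: elements 0 and 1 come from memory,
// elements 2 and 3 are set to zero. The result does not depend on
// any previous register contents.
__m128 load_xy_zero_extended(const vec2* p) {
    __m128i bits = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));
    return _mm_castsi128_ps(bits);  // reinterpret the bits, no instruction needed
}

// Merging 64-bit load (movlps): elements 0 and 1 come from memory,
// elements 2 and 3 are carried over from `old`. The result therefore
// depends on whatever `old` held.
__m128 load_xy_merging(__m128 old, const vec2* p) {
    return _mm_loadl_pi(old, reinterpret_cast<const __m64*>(p));
}
```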