Why does "+=" give me an unexpected result with SSE intrinsics?

Question

There are two ways to implement accumulation with SSE intrinsics, but one of them gives the wrong result.

#include <stdint.h>     // int32_t
#include <smmintrin.h>

int main(int argc, const char * argv[]) {

    int32_t A[4] = {10, 20, 30, 40};
    int32_t B[8] = {-1, 2, -3, -4, -5, -6, -7, -8};
    int32_t C[4] = {0, 0, 0, 0};
    int32_t D[4] = {0, 0, 0, 0};

    __m128i lv = _mm_load_si128((__m128i *)A);
    __m128i rv = _mm_load_si128((__m128i *)B);

    // way 1 unexpected
    rv += lv;
    _mm_store_si128((__m128i *)C, rv);

    // way 2 expected
    rv = _mm_load_si128((__m128i *)B);
    rv = _mm_add_epi32(lv, rv);
    _mm_store_si128((__m128i *)D, rv);

    return 0;
}

The expected result is:

9 22 27 36

C is:

9 23 27 37

D is:

9 22 27 36

Recommended answer

In GNU C, __m128i is defined as a vector of 64-bit integers, with something like

typedef long long __m128i __attribute__((vector_size(16), may_alias));

Using GNU C native vector syntax (the + operator) on __m128i therefore does a per-element add with 64-bit element size, i.e. the same operation as _mm_add_epi64.

In your case, the carry-out from the top of one 32-bit element added an extra one to the 32-bit element above it, because a 64-bit element size propagates carry between the two 32-bit halves of each 64-bit element. (Adding a negative number to a positive number of equal or larger magnitude produces a carry-out of the low 32 bits, which is what happens with 10 + (-1) and 30 + (-3) here.)
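To see the carry concretely, here is a minimal sketch that adds the same four 32-bit values once with 64-bit lanes and once with 32-bit lanes. The helper names add_as_epi64 and add_as_epi32 are made up for illustration, and the sketch uses unaligned loads and stores so the arrays need no special alignment.

```c
#include <emmintrin.h>  /* SSE2: _mm_add_epi32, _mm_add_epi64 */
#include <stdint.h>

/* Hypothetical helper: add four int32 values as two 64-bit lanes.
   This is the operation that `rv += lv` performs on a GNU C __m128i. */
void add_as_epi64(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}

/* Hypothetical helper: add the same data as four independent 32-bit lanes,
   so no carry crosses from one int32 into the next. */
void add_as_epi32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

With a = {10, 20, 30, 40} and b = {-1, 2, -3, -4}, add_as_epi64 gives 9 23 27 37 (the C array in the question) and add_as_epi32 gives 9 22 27 36 (the D array).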

The Intel intrinsics API doesn't define the + operator for __m128 / __m128d / __m128i. Your code won't compile on MSVC, for example.

So the behaviour you're getting comes only from an implementation detail of the intrinsic types in GCC's headers. Operators are useful for float vectors, where there is an obvious element size, but for integer vectors you'd want to define your own type unless you really do have 64-bit integers.

If you want to be able to use v1 += v2; you can define your own GNU C native vector types, like

typedef uint32_t v4ui __attribute__((vector_size(16), aligned(4)));

Note that I left out may_alias, so it's only safe to cast pointers to unsigned (uint32_t) data to v4ui *, not to read arbitrary data like char[] through it.
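As a rough sketch of how such a type might be used, the function below (add4_u32 is a hypothetical name, not part of any library) copies data in and out with memcpy, which sidesteps the aliasing question entirely, and lets += work on four 32-bit lanes. It assumes a compiler that supports the GNU C vector_size attribute, such as GCC or Clang.

```c
#include <stdint.h>
#include <string.h>

/* Same typedef as in the answer: four 32-bit unsigned lanes. */
typedef uint32_t v4ui __attribute__((vector_size(16), aligned(4)));

/* Hypothetical helper: element-wise 32-bit add using the native += operator. */
void add4_u32(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    v4ui va, vb;
    memcpy(&va, a, sizeof va);   /* copy in, no pointer casts needed */
    memcpy(&vb, b, sizeof vb);
    va += vb;                    /* per-lane add; each lane wraps modulo 2^32 */
    memcpy(out, &va, sizeof va);
}
```

Because the element type is uint32_t, each lane wraps independently and no carry reaches the neighbouring lane.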

In fact, GCC's emmintrin.h (SSE2) does define a bunch of types:

/* SSE2 */
typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef long long __v2di __attribute__ ((__vector_size__ (16)));
typedef unsigned long long __v2du __attribute__ ((__vector_size__ (16)));
typedef int __v4si __attribute__ ((__vector_size__ (16)));
typedef unsigned int __v4su __attribute__ ((__vector_size__ (16)));
typedef short __v8hi __attribute__ ((__vector_size__ (16)));
typedef unsigned short __v8hu __attribute__ ((__vector_size__ (16)));
typedef char __v16qi __attribute__ ((__vector_size__ (16)));
typedef unsigned char __v16qu __attribute__ ((__vector_size__ (16)));

I'm not sure if they're intended for external use.
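If you do decide to rely on those header-internal types, casting to one with 32-bit elements makes + operate on 32-bit lanes. This is only a sketch: it assumes GCC or Clang, it assumes __v4su keeps the definition shown above, and the function name add_epi32_via_cast is hypothetical.

```c
#include <emmintrin.h>

/* Hypothetical helper: reinterpret both operands as 4 x unsigned int, add
   lane-wise, and reinterpret the sum as __m128i. Unsigned lanes are used so
   that wraparound is well defined. Relies on the header-internal __v4su. */
__m128i add_epi32_via_cast(__m128i a, __m128i b)
{
    return (__m128i)((__v4su)a + (__v4su)b);
}
```

For portable code, calling _mm_add_epi32 directly remains the safer choice, since it does not depend on any header internals.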

GNU C native vectors are most useful when you want the compiler to emit efficient code for division by a compile-time constant, or something like that. For example, digit = v1 % 10; and v1 /= 10; with 16-bit unsigned integers will compile to pmulhuw and a right shift. But they're also just handy for readable code.
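A minimal sketch of that digit-splitting idea, assuming GCC or Clang, might look like the following. The type name v8u16 and the function name divmod10_u16x8 are made up for illustration, and the divisor is written as an explicit vector constant rather than a scalar to stay within what both compilers accept.

```c
#include <stdint.h>
#include <string.h>

/* Eight 16-bit unsigned lanes (hypothetical type name). */
typedef uint16_t v8u16 __attribute__((vector_size(16)));

/* Hypothetical helper: per-lane quotient and remainder of division by 10.
   The divisor is a compile-time constant, so the compiler is free to
   replace the division with a multiply and shift. */
void divmod10_u16x8(const uint16_t in[8], uint16_t quot[8], uint16_t rem[8])
{
    const v8u16 ten = {10, 10, 10, 10, 10, 10, 10, 10};
    v8u16 v;
    memcpy(&v, in, sizeof v);
    v8u16 digit = v % ten;   /* lowest decimal digit of each lane */
    v /= ten;                /* remaining higher digits of each lane */
    memcpy(quot, &v, sizeof v);
    memcpy(rem, &digit, sizeof digit);
}
```

The exact instructions emitted depend on the compiler version and optimization flags, so it is worth checking the generated assembly if performance matters.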

There are some C++ wrapper libraries that portably provide operator overloads, and have types like Vec4i (4x signed int) / Vec4u (4x unsigned int) / Vec16c (16x signed char) to give you a type system for different kinds of integer vectors, so you know what you're getting from v1 += v2; or v1 >>= 2; (Right shifts are one case where the signedness matters.)
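The signedness point can be illustrated without a wrapper library by using GNU C native vectors directly. This is a sketch assuming GCC or Clang, where >> on a signed lane is an arithmetic shift; the type and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef int32_t  v4i32 __attribute__((vector_size(16)));  /* signed lanes   */
typedef uint32_t v4u32 __attribute__((vector_size(16)));  /* unsigned lanes */

/* Hypothetical helper: shift right by 2 with signed lanes.
   The sign bit is replicated into the vacated high bits. */
void shr2_signed(const int32_t in[4], int32_t out[4])
{
    const v4i32 count = {2, 2, 2, 2};
    v4i32 v;
    memcpy(&v, in, sizeof v);
    v >>= count;
    memcpy(out, &v, sizeof v);
}

/* Hypothetical helper: shift right by 2 with unsigned lanes.
   Zeros are shifted into the vacated high bits. */
void shr2_unsigned(const uint32_t in[4], uint32_t out[4])
{
    const v4u32 count = {2, 2, 2, 2};
    v4u32 v;
    memcpy(&v, in, sizeof v);
    v >>= count;
    memcpy(out, &v, sizeof v);
}
```

Feeding the same bit pattern to both shows the difference: -8 becomes -2 in the signed version, while the identical bits 0xFFFFFFF8 become 0x3FFFFFFE in the unsigned one.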

e.g. Agner Fog's VCL (GPL license), or DirectXMath (MIT license).
