如何实现“_mm_storeu_epi64”没有混叠问题? [英] How to implement "_mm_storeu_epi64" without aliasing problems?

查看:344
本文介绍了如何实现“_mm_storeu_epi64”没有混叠问题?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

(注意:虽然这个问题是关于store的,但是loadcase有相同的问题,是完全对称的。)



SSE内在函数提供一个 _mm_storeu_pd 函数具有以下签名:

  void _mm_storeu_pd(double * p,__m128d a); 

所以如果我有两个双精度的向量,我想把它存储到一个两个双精度数组,我可以使用这个内在的。



然而,我的向量不是两个双精度;它是两个64位整数,我想把它存储到一个两个64位整数的数组。也就是说,我想要一个具有以下签名的函数:

  void _mm_storeu_epi64(int64_t * p,__m128i a); 

但是内在函数不提供这样的函数。他们最近的是 _mm_storeu_si128

  void _mm_storeu_si128(__m128i * p,__m128i a); 

问题是这个函数接受一个 __ m128i ,而我的数组是 int64_t 的数组。通过错误类型的指针写入对象违反了严格别名化,并且肯定是未定义的行为。我担心我的编译器现在或将来会重新排序或以其他方式优化商店,从而以奇怪的方式打破我的程序。



要清楚,想要的是一个我可以像这样调用的函数:

  __ m128i v = _mm_set_epi64x(2,1) 
int64_t ra [2];
_mm_storeu_epi64(& ra [0],v); //不存在,所以我想实现它

这里有六个尝试来创建这样的函数。



尝试#1



  void _mm_storeu_epi64(int64_t * p, __m128i a){
_mm_storeu_si128(reinterpret_cast< __ m128i *>(p),a);
}

这似乎有我担心的严重别名问题。



尝试#2



  void _mm_storeu_epi64(int64_t * p,__m128i a){ 
_mm_storeu_si128(static_cast< __ m128i *>(static_cast< void *>(p)),a)
}

可能更好一般,但我不认为这在这种情况下有什么区别。



尝试#3



  void _mm_storeu_epi64(int64_t * p,__m128i a){
union TypePun {
int64_t a [2]
__m128i v;
};
TypePun * p_u = reinterpret_cast< TypePun *>(p);
p_u-> v = a;
}

这会在我的编译器(GCC 4.9.0)上生成不正确的代码,指令 movaps 而不是未对齐的 movups 。 (联合是对齐的,因此 reinterpret_cast 使GCC成为假设 p_u 也对齐。)



尝试#4



  void _mm_storeu_epi64(int64_t * p,__m128i a){
union TypePun {
int64_t a [2];
__m128i v;
};
TypePun * p_u = reinterpret_cast< TypePun *>(p);
_mm_storeu_si128(& p_u-> v,a);
}

这看起来像我想要的代码。 类型打击通过联合的伎俩,虽然在C ++ 技术上未定义,广泛支持。但是在这个例子中 - 我传递一个指向联合的元素的指针,而不是通过联合本身访问 - 真的是一个有效的方式使用联合类型冲击?



尝试#5



  void _mm_storeu_epi64(int64_t * p,__m128i a){
p [0] = _mm_extract_epi64(a,0);
p [1] = _mm_extract_epi64(a,1);
}

这是有效的,但它发出两个指令,而不是一个。



尝试#6



  void _mm_storeu_epi64(int64_t * p,__m128i a){
std :: memcpy(p,& a,sizeof(a));
}

这个工作是完全有效的...我想。但它在我的系统上发出坦率的可怕代码。 GCC通过对齐存储将 a 溢出到对齐的堆栈插槽,然后手动将组件字移动到目标。非常奇怪。)



...



有任何方法来编写这个函数,将(a)在典型的现代编译器生成最优代码和(b)具有最小的严格混叠运行的风险。



由于这些内在函数是编译器扩展(),因此这些内在函数是编译器扩展有点由英特尔标准化),它们已经超出了C和C ++语言标准的规范。



尽管事实上SSE内在函数库试图表现得像






意图:

strong>



SSE内在函数可能从头开始设计,以允许向量和标量类型之间的别名,因为向量实际上只是标量类型的聚合。 / p>

但是设计SSE内在函数的人可能不是一种语言。
(这并不奇怪,硬核低级性能程序员和语言律师爱好者往往是非常不同的人,谁不总是相处。)



我们可以看到在加载/存储内在证据:




  • __ m128i _mm_stream_load_si128(__ m128i * mem_addr)非const指针?

  • void _mm_storeu_pd(double * mem_addr,__m128d a) - 如果我要存储到 __ m128i *



严格的别名问题是这些可怜原型的直接结果。 / p>

从AVX512开始,内在函数已全部转换为 void * ,以解决此问题:




  • __ m512d _mm512_load_pd(void const * mem_addr)

  • code> void _mm512_store_epi64(void * mem_addr,__m512i a)





b $ b

编译器详细信息:




  • Visual Studio定义了每个SSE / AVX类型作为标量类型的联合。这本身允许严格的混叠。此外,Visual Studio没有严格的别名,所以这一点是无效的:


  • 英特尔编译器从来没有失败过我的各种别名。


  • GCC做了严格的别名,但是从这个角度来看,我的经验,而不是功能界限。它从来没有失败我投掷传递的指针(任何类型)。 GCC还声明SSE类型为 __ may_alias __ ,从而明确允许它将其他类型别名。







我的建议:





  • 对于在堆栈上声明和别名的变量,使用联合。该联合已经对齐,所以你可以直接读写他们没有内在性。 (但要注意交叉向量/标量访问所带来的存储转发问题。)

  • 如果需要作为一个整体和它的标量组件来访问一个向量,

  • 使用GCC时,请打开 -Wall -Wstrict-别名。它会告诉您有严重别名冲突。


(Note: Although this question is about "store", the "load" case has the same issues and is perfectly symmetric.)

The SSE intrinsics provide an _mm_storeu_pd function with the following signature:

void _mm_storeu_pd (double *p, __m128d a);

So if I have vector of two doubles, and I want to store it to an array of two doubles, I can just use this intrinsic.

However, my vector is not two doubles; it is two 64-bit integers, and I want to store it to an array of two 64-bit integers. That is, I want a function with the following signature:

void _mm_storeu_epi64 (int64_t *p, __m128i a);

But the intrinsics provide no such function. The closest they have is _mm_storeu_si128:

void _mm_storeu_si128 (__m128i *p, __m128i a);

The problem is that this function takes a pointer to __m128i, while my array is an array of int64_t. Writing to an object via the wrong type of pointer is a violation of strict aliasing and is definitely undefined behavior. I am concerned that my compiler, now or in the future, will reorder or otherwise optimize away the store thus breaking my program in strange ways.

To be clear, what I want is a function I can invoke like this:

__m128i v = _mm_set_epi64x(2,1);
int64_t ra[2];
_mm_storeu_epi64(&ra[0], v); // does not exist, so I want to implement it

Here are six attempts to create such a function.

Attempt #1

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p), a);
}

This appears to have the strict aliasing problem I am worried about.

Attempt #2

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    _mm_storeu_si128(static_cast<__m128i *>(static_cast<void *>(p)), a);
}

Possibly better in general, but I do not think it makes any difference in this case.

Attempt #3

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    union TypePun {
        int64_t a[2];
        __m128i v;
     };
    TypePun *p_u = reinterpret_cast<TypePun *>(p);
    p_u->v = a;
}

This generates incorrect code on my compiler (GCC 4.9.0), which emits an aligned movaps instruction instead of an unaligned movups. (The union is aligned, so the reinterpret_cast tricks GCC into assuming p_u is aligned, too.)

Attempt #4

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    union TypePun {
        int64_t a[2];
        __m128i v;
     };
    TypePun *p_u = reinterpret_cast<TypePun *>(p);
    _mm_storeu_si128(&p_u->v, a);
}

This appears to emit the code I want. The "type-punning via union" trick, although technically undefined in C++, is widely-supported. But is this example -- where I pass a pointer to an element of a union rather than access via the union itself -- really a valid way to use the union for type-punning?

Attempt #5

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    p[0] = _mm_extract_epi64(a, 0);
    p[1] = _mm_extract_epi64(a, 1);
}

This works and is perfectly valid, but it emits two instructions instead of one.

Attempt #6

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    std::memcpy(p, &a, sizeof(a));
}

This works and is perfectly valid... I think. But it emits frankly terrible code on my system. GCC spills a to an aligned stack slot via an aligned store, then manually moves the component words to the destination. (Actually it spills it twice, once for each component. Very strange.)

...

Is there any way to write this function that will (a) generate optimal code on a typical modern compiler and (b) have minimal risk of running afoul of strict aliasing?

解决方案

SSE intrinsics is one of those niche corner cases where you have to push the rules a bit.

Since these intrinsics are compiler extensions (somewhat standardized by Intel), they are already outside the specification of the C and C++ language standards. So it's somewhat self-defeating to try to be "standard compliant" while using a feature that clearly is not.

Despite the fact that the SSE intrinsic libraries try to act like normal 3rd party libraries, underneath, they are all specially handled by the compiler.


The Intent:

The SSE intrinsics were likely designed from the beginning to allow aliasing between the vector and scalar types - since a vector really is just an aggregate of the scalar type.

But whoever designed the SSE intrinsics probably wasn't a language pedant.
(That's not too surprising. Hard-core low-level performance programmers and language lawyering enthusiasts tend to be very different groups of people who don't always get along.)

We can see evidence of this in the load/store intrinsics:

  • __m128i _mm_stream_load_si128(__m128i* mem_addr) - A load intrinsic that takes a non-const pointer?
  • void _mm_storeu_pd(double* mem_addr, __m128d a) - What if I want to store to __m128i*?

The strict aliasing problems are a direct result of these poor prototypes.

Starting from AVX512, the intrinsics have all been converted to void* to address this problem:

  • __m512d _mm512_load_pd(void const* mem_addr)
  • void _mm512_store_epi64 (void* mem_addr, __m512i a)

Compiler Specifics:

  • Visual Studio defines each of the SSE/AVX types as a union of the scalar types. This by itself allows strict-aliasing. Furthermore, Visual Studio doesn't do strict-aliasing so the point is moot:

  • The Intel Compiler has never failed me with all sorts of aliasing. It probably doesn't do strict-aliasing either - though I've never found any reliable source for this.

  • GCC does do strict-aliasing, but from my experience, not across function boundaries. It has never failed me to cast pointers which are passed in (on any type). GCC also declares SSE types as __may_alias__ thereby explicitly allowing it to alias other types.


My Recommendation:

  • For function parameters that are of the wrong pointer type, just cast it.
  • For variables declared and aliased on the stack, use a union. That union will already be aligned so you can read/write to them directly without intrinsics. (But be aware of store-forwarding issues that come with interleaving vector/scalar accesses.)
  • If you need to access a vector both as a whole and by its scalar components, consider using insert/extract intrinsics instead of aliasing.
  • When using GCC, turn on -Wall or -Wstrict-aliasing. It will tell you about strict-aliasing violations.

这篇关于如何实现“_mm_storeu_epi64”没有混叠问题?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆