Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?


Question

Is it legal to reinterpret_cast a float* to a __m256* and access float objects through a different pointer type?

constexpr size_t _m256_float_step_sz = sizeof(__m256) / sizeof(float);
alignas(__m256) float stack_store[100 * _m256_float_step_sz ]{};
__m256& hwvec1 = *reinterpret_cast<__m256*>(&stack_store[0 * _m256_float_step_sz]);

using arr_t = float[_m256_float_step_sz];
arr_t& arr1 = *reinterpret_cast<float(*)[_m256_float_step_sz]>(&hwvec1);

Do hwvec1 and arr1 depend on undefined behavior?

Do they violate strict aliasing rules? [basic.lval]/11

Or there is only one defined way of intrinsic:

__m256 hwvec2 = _mm256_load_ps(&stack_store[0 * _m256_float_step_sz]);
_mm256_store_ps(&stack_store[1 * _m256_float_step_sz], hwvec2);

godbolt

Answer

ISO C++ doesn't define __m256, so we need to look at what does define their behaviour on the implementations that support them.

Intel's intrinsics define vector-pointers like __m256* as being allowed to alias anything else, the same way ISO C++ defines char* as being allowed to alias.

So yes, it's safe to dereference a __m256* instead of using a _mm256_load_ps() aligned-load intrinsic.

But especially for float/double, it's often easier to use the intrinsics because they take care of casting from float*, too. For integers, the AVX512 load/store intrinsics are defined as taking void*, but before that you need an extra (__m256i*) which is just a lot of clutter.

In gcc, this is implemented by defining __m256 with a may_alias attribute: from gcc7.3's avxintrin.h (one of the headers that <immintrin.h> includes):

/* The Intel API is flexible enough that we must allow aliasing with other
   vector types, and their scalar components.  */
typedef float __m256 __attribute__ ((__vector_size__ (32),
                                     __may_alias__));
typedef long long __m256i __attribute__ ((__vector_size__ (32),
                                          __may_alias__));
typedef double __m256d __attribute__ ((__vector_size__ (32),
                                       __may_alias__));

/* Unaligned version of the same types.  */
typedef float __m256_u __attribute__ ((__vector_size__ (32),
                                       __may_alias__,
                                       __aligned__ (1)));
typedef long long __m256i_u __attribute__ ((__vector_size__ (32),
                                            __may_alias__,
                                            __aligned__ (1)));
typedef double __m256d_u __attribute__ ((__vector_size__ (32),
                                         __may_alias__,
                                         __aligned__ (1)));

(In case you were wondering, this is why dereferencing a __m256* is like _mm256_store_ps, not storeu.)

GNU C native vectors without may_alias are allowed to alias their scalar type, e.g. even without the may_alias, you could safely cast between float* and a hypothetical v8sf type. But may_alias makes it safe to load from an array of int[], char[], or whatever.

I'm talking about how GCC implements Intel's intrinsics only because that's what I'm familiar with. I've heard from gcc developers that they chose that implementation because it was required for compatibility with Intel.

Using Intel's API for _mm_storeu_si128( (__m128i*)&arr[i], vec); requires you to create potentially-unaligned pointers which would fault if you dereferenced them. And _mm_storeu_ps to a location that isn't 4-byte aligned requires creating an under-aligned float*.

Just creating unaligned pointers, or pointers outside an object, is UB in ISO C++, even if you don't dereference them. I guess this allows implementations on exotic hardware which do some kinds of checks on pointers when creating them (possibly instead of when dereferencing), or maybe which can't store the low bits of pointers. (I have no idea if any specific hardware exists where more efficient code is possible because of this UB.)

But implementations which support Intel's intrinsics must define the behaviour, at least for the __m* types and float*/double*. This is trivial for compilers targeting any normal modern CPU, including x86 with a flat memory model (no segmentation); pointers in asm are just integers kept in the same registers as data. (m68k has address vs. data registers, but it never faults from keeping bit-patterns that aren't valid addresses in A registers, as long as you don't deref them.)

Note that may_alias, like the char* aliasing rule, only goes one way: it is not guaranteed to be safe to use int32_t* to read a __m256. It might not even be safe to use float* to read a __m256. Just like it's not safe to do char buf[1024]; int *p = (int*)buf;.

Reading/writing through a char* can alias anything, but when you have a char object, strict-aliasing does make it UB to read it through other types. (I'm not sure if the major implementations on x86 do define that behaviour, but you don't need to rely on it because they optimize away memcpy of 4 bytes into an int32_t. You can and should use memcpy to express an unaligned load from a char[] buffer, because auto-vectorization with a wider type is allowed to assume 2-byte alignment for int16_t*, and make code that fails if it's not: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?)

To insert/extract vector elements, use shuffle intrinsics, SSE2 _mm_insert_epi16 / _mm_extract_epi16 or SSE4.1 insert / _mm_extract_epi8/32/64. For float, there are no insert/extract intrinsics that you should use with scalar float.

Or store to an array and read the array. (print a __m128i variable). This does actually optimize away to vector extract instructions.

GNU C vector syntax provides the [] operator for vectors, like __m256 v = ...; v[3] = 1.25;. MSVC defines vector types as a union with a .m128_f32[] member for per-element access.

There are wrapper libraries like Agner Fog's (GPL licensed) Vector Class Library which provide portable operator[] overloads for their vector types, and operator + / - / * / << and so on. It's quite nice, especially for integer types, where having a different type for each element width makes v1 + v2 work with the right size. (GNU C native vector syntax does that for float/double vectors, and defines __m128i as a vector of signed int64_t, but MSVC doesn't provide operators on the base __m128 types.)

You can also use union type-punning between a vector and an array of some type, which is safe in ISO C99 and in GNU C++, but not in ISO C++. I think it's officially safe in MSVC, too, because of the way they define __m128 as a normal union.

There's no guarantee you'll get efficient code from any of these element-access methods, though. Do not use inside inner loops, and have a look at the resulting asm if performance matters.
