How to solve the 32-byte-alignment issue for AVX load/store operations?


Problem description

I am having an alignment issue while using ymm registers, with some snippets of code that seem fine to me. Here is a minimal working example:

#include <iostream> 
#include <immintrin.h>

inline void ones(float *a)
{
     __m256 out_aligned = _mm256_set1_ps(1.0f);
     _mm256_store_ps(a,out_aligned);
}

int main()
{
     size_t ss = 8;
     float *a = new float[ss];
     ones(a);

     delete [] a;

     std::cout << "All Good!" << std::endl;
     return 0;
}

Certainly, sizeof(float) is 4 on my architecture (Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz), and I'm compiling with gcc using the -O3 -march=native flags. Of course, the error goes away with unaligned memory access, i.e. specifying _mm256_storeu_ps. I also do not have this problem with xmm registers, i.e.

inline void ones_sse(float *a)
{
     __m128 out_aligned = _mm_set1_ps(1.0f);
     _mm_store_ps(a,out_aligned);
}

Am I doing anything foolish? What is the work-around for this?

Solution

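The standard allocators are probably only aligning to 8B (the width of the widest standard type), or maybe 16B. A 16B-aligned pointer satisfies _mm_store_ps but not necessarily _mm256_store_ps, which needs 32B alignment; that would explain why the xmm version works while the ymm version faults.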
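To see what alignment a given pointer actually has, a small check like the following can help. It is only a sketch: is_aligned, default_alignment, and check_avx_pointer are hypothetical names introduced here, not part of any library, and the alignment argument is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
// `alignment` is assumed to be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Alignment that malloc must provide for any standard type.
// This is often 16 on x86-64, but the value is implementation-defined.
constexpr std::size_t default_alignment = alignof(std::max_align_t);

// Hypothetical guard to call before an aligned 256-bit store.
inline void check_avx_pointer(const float* a) {
    assert(is_aligned(a, 32) && "pointer must be 32-byte aligned");
}
```

Calling check_avx_pointer(a) at the top of ones() in a debug build would turn the fault from the question into a readable assertion failure.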

Options:

  • aligned_alloc: ISO C11, and part of ISO C++ since C++17 as std::aligned_alloc (in <cstdlib>); available in some but not all C++ implementations. (Commenters report it's unavailable in MSVC++, but see best cross-platform method to get aligned memory for a viable #ifdef for Windows.)

  • posix_memalign: Part of POSIX 2001, not any ISO C or C++ standard. Clunky prototype/interface compared to aligned_alloc.

#include <stdlib.h>
int posix_memalign(void **memptr, size_t alignment, size_t size);  // POSIX 2001
void *aligned_alloc(size_t alignment, size_t size);                // C11, C++17

  • _mm_malloc: Available on any platform where _mm_whatever_ps is available, but you can't pass pointers from it to free; use _mm_free. On many C and C++ implementations _mm_free and free are compatible, but that's not guaranteed to be portable. (And unlike a missing aligned_alloc or posix_memalign, which fails at compile time, passing an _mm_malloc pointer to free fails at run time.)

  • In C++11 and later: use alignas(32) float avx_array[1234] as the first member of a struct/class (or on a plain array directly) so static and automatic storage objects of that type will have 32B alignment. The std::aligned_storage documentation has an example of this technique to explain what std::aligned_storage does.

    Before C++17, this doesn't actually work for dynamically-allocated storage (like a std::vector<my_class_with_aligned_member_array>); see Making std::vector allocate aligned memory. C++17 added aligned operator new, so new and std::allocator are expected to honor the alignment of over-aligned types there.
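Two of the options above can be sketched together as follows, assuming a POSIX system for posix_memalign. AvxBlock, FreeDeleter, AlignedFloats, and make_aligned_floats are hypothetical names introduced for illustration. The ones() function mirrors the one in the question, and falls back to a scalar loop when the code is not compiled with AVX enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>   // posix_memalign, free (POSIX systems)
#include <memory>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Static and automatic objects of this type get 32-byte alignment.
struct AvxBlock {
    alignas(32) float data[8];
};

// Deleter so std::unique_ptr releases posix_memalign memory with free().
struct FreeDeleter {
    void operator()(void* p) const { std::free(p); }
};
using AlignedFloats = std::unique_ptr<float[], FreeDeleter>;

// Hypothetical helper: allocate n floats on a 32-byte boundary.
// Returns an empty pointer if the allocation fails.
inline AlignedFloats make_aligned_floats(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, 32, n * sizeof(float)) != 0) {
        return AlignedFloats(nullptr);
    }
    return AlignedFloats(static_cast<float*>(p));
}

// The question's ones(): writes eight 1.0f values through an aligned pointer.
inline void ones(float* a) {
#if defined(__AVX__)
    _mm256_store_ps(a, _mm256_set1_ps(1.0f));   // faults if a is not 32B-aligned
#else
    for (int i = 0; i < 8; ++i) a[i] = 1.0f;    // scalar fallback without AVX
#endif
}
```

In the question's main(), replacing new float[ss] with make_aligned_floats(ss) and passing the result's get() to ones() should avoid the fault, and the deleter takes the place of delete [].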


And finally, the last option is so bad it's not even part of the list: allocate a larger buffer and do p+=31; p&=~31ULL with appropriate casting. It has too many drawbacks (hard to free, wastes memory) to be worth discussing, since aligned-allocation functions are available on every platform that supports Intel _mm256 intrinsics. But there are even library functions that will help you do this, IIRC.
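For reference, the rounding trick looks roughly like this. One library function that does the same job is std::align from <memory>; whether that is the function the answer had in mind is an assumption. round_up_32 and align_32_in are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>    // std::align

// Manual version of "p += 31; p &= ~31ULL":
// round an address up to the next multiple of 32.
inline void* round_up_32(void* p) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    addr = (addr + 31) & ~static_cast<std::uintptr_t>(31);
    return reinterpret_cast<void*>(addr);
}

// Library version: std::align returns the first 32B-aligned address in
// [buf, buf + space) that leaves room for `bytes`, or nullptr if none fits.
inline void* align_32_in(void* buf, std::size_t space, std::size_t bytes) {
    return std::align(32, bytes, buf, space);
}
```

Either way the buffer must be over-allocated by up to 31 bytes, and the original pointer has to be kept around to free it, which is the drawback described above.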

The requirement to use _mm_free instead of free probably exists to allow for the possibility of implementing _mm_malloc on top of a plain old malloc using this technique.
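A rough sketch of what such an implementation could look like is below. This is not how any particular compiler implements _mm_malloc; sketch_aligned_malloc and sketch_aligned_free are hypothetical names, and the alignment is assumed to be a power of two no smaller than sizeof(void*).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical aligned allocator layered on malloc.
// Assumes `alignment` is a power of two and at least sizeof(void*).
inline void* sketch_aligned_malloc(std::size_t size, std::size_t alignment) {
    // Room for the payload, worst-case padding, and one stashed pointer.
    void* raw = std::malloc(size + alignment - 1 + sizeof(void*));
    if (raw == nullptr) return nullptr;
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;
    // Remember what malloc returned, just below the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Must be used instead of free(): the pointer handed to the caller is
// generally not the pointer that malloc returned.
inline void sketch_aligned_free(void* p) {
    if (p != nullptr) std::free(static_cast<void**>(p)[-1]);
}
```

Passing the aligned pointer straight to free would hand the allocator an address it never returned, which is why a matching free function is required under this scheme.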
