How to solve the 32-byte-alignment issue for AVX load/store operations?
I am having an alignment issue while using ymm registers, with some snippets of code that seem fine to me. Here is a minimal working example:
#include <iostream>
#include <immintrin.h>

inline void ones(float *a)
{
    __m256 out_aligned = _mm256_set1_ps(1.0f);
    _mm256_store_ps(a, out_aligned);
}

int main()
{
    size_t ss = 8;
    float *a = new float[ss];
    ones(a);
    delete [] a;
    std::cout << "All Good!" << std::endl;
    return 0;
}
Certainly, sizeof(float) is 4 on my architecture (Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz) and I'm compiling with gcc using the -O3 -march=native flags. Of course the error goes away with unaligned memory access, i.e. specifying _mm256_storeu_ps. I also do not have this problem with xmm registers, i.e.
inline void ones_sse(float *a)
{
    __m128 out_aligned = _mm_set1_ps(1.0f);
    _mm_store_ps(a, out_aligned);
}
Am I doing anything foolish? What is the work-around for this?
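The standard allocators only guarantee alignment suitable for the widest standard type, alignof(max_align_t): commonly 16B on x86-64, and possibly only 8B on some 32-bit targets. That is enough for the 16B alignment _mm_store_ps needs, which is why the SSE version happens to work, but not for the 32B alignment _mm256_store_ps requires.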
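A quick way to see what an allocator actually hands back is to test the pointer value against the required alignment. The sketch below is illustrative only: is_aligned and default_new_alignment are hypothetical names, not something from the question or from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
// `alignment` is assumed to be a non-zero power of two.
inline bool is_aligned(const void *p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Alignment of the widest standard scalar type on this platform.
constexpr std::size_t default_new_alignment = alignof(std::max_align_t);
```

Calling is_aligned(a, 32) on the pointer returned by new float[8] may or may not return true; when it does, that is luck rather than a guarantee, which is why the aligned store can fault on one run or platform and not another.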
Options:

- aligned_alloc: ISO C11, and available in some but not all C++ compilers. It was not part of ISO C++ until C++17 added std::aligned_alloc. (Commenters report it's unavailable in MSVC++, but see "best cross-platform method to get aligned memory" for a viable #ifdef for Windows.)

- posix_memalign: Part of POSIX 2001, not any ISO C or C++ standard. Clunky prototype/interface compared to aligned_alloc.

    #include <stdlib.h>
    int posix_memalign(void **memptr, size_t alignment, size_t size);  // POSIX 2001
    void *aligned_alloc(size_t alignment, size_t size);                // C11 (and C++17)

- _mm_malloc: Available on any platform where _mm_whatever_ps is available, but you can't pass pointers from it to free. On many C and C++ implementations _mm_free and free are compatible, but it's not guaranteed to be portable. (Unlike the portability problems of the other two, which show up at compile time, this mismatch only fails at run time.)

- In C++11 and later: use alignas(32) float avx_array[1234] as the first member of a struct/class (or on a plain array directly) so static and automatic storage objects of that type will have 32B alignment. The std::aligned_storage documentation has an example of this technique to explain what std::aligned_storage does. This doesn't actually work for dynamically-allocated storage (like a std::vector<my_class_with_aligned_member_array>) before C++17, which added aligned new; see "Making std::vector allocate aligned memory".
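Two of the options above can be sketched as follows, assuming a POSIX system where posix_memalign is available. The names FreeDeleter, AlignedFloats, make_aligned_floats and AvxBuffer are hypothetical, chosen only for this illustration. The buffer is owned by a std::unique_ptr so that it is released with free(), the matching deallocation function for posix_memalign.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdlib.h>   // posix_memalign, free (POSIX)

// Deleter so std::unique_ptr releases the buffer with free(),
// which is the correct counterpart to posix_memalign.
struct FreeDeleter {
    void operator()(void *p) const noexcept { free(p); }
};

using AlignedFloats = std::unique_ptr<float[], FreeDeleter>;

// Hypothetical helper: allocate `count` floats on a 32-byte boundary.
// Returns an empty pointer if the allocation fails.
inline AlignedFloats make_aligned_floats(std::size_t count)
{
    void *raw = nullptr;
    if (posix_memalign(&raw, 32, count * sizeof(float)) != 0)
        return AlignedFloats();
    // Sanity check of the post-condition: the address is a multiple of 32.
    assert(reinterpret_cast<std::uintptr_t>(raw) % 32 == 0);
    return AlignedFloats(static_cast<float *>(raw));
}

// alignas on a member raises the alignment of the whole struct, so
// static and automatic objects of this type start on a 32-byte boundary.
struct AvxBuffer {
    alignas(32) float data[8];
};
```

With either approach, the pointer (p.get() for the heap buffer, or buffer.data for an AvxBuffer object) should be safe to pass to an aligned store such as _mm256_store_ps. On Windows, a different allocation function would be needed in place of posix_memalign.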
And finally, the last option is so bad it's not even part of the list: allocate a larger buffer and do p+=31; p&=~31ULL with appropriate casting. It has too many drawbacks (hard to free, wastes memory) to be worth discussing, since aligned-allocation functions are available on every platform that supports Intel _mm256 intrinsics. But there are even library functions that will help you do this, IIRC.
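For completeness, the round-up arithmetic with the "appropriate casting" looks roughly like this. It is a sketch of the discouraged technique, not a recommendation; round_up_32 and alloc_floats_padded are hypothetical names. It also shows the "hard to free" drawback: the caller has to keep the original pointer around.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Round an address up to the next multiple of 32 (unchanged if already aligned).
inline std::uintptr_t round_up_32(std::uintptr_t addr)
{
    return (addr + 31) & ~static_cast<std::uintptr_t>(31);
}

// Hypothetical helper: over-allocate by 31 bytes and return an aligned
// pointer into the block. *base receives the pointer that must later be
// passed to std::free; the aligned pointer itself must never be freed.
inline float *alloc_floats_padded(std::size_t count, void **base)
{
    assert(base != nullptr);
    *base = std::malloc(count * sizeof(float) + 31);
    if (*base == nullptr)
        return nullptr;
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(*base);
    return reinterpret_cast<float *>(round_up_32(addr));
}
```

Up to 31 bytes per allocation are wasted, and passing the aligned pointer to free instead of the saved base pointer is undefined behaviour.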
The requirement to use _mm_free instead of free probably exists to allow for the possibility of implementing _mm_malloc on top of a plain old malloc using this technique.
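To make that concrete, here is one way such a pair could be built on top of malloc. This is a guess at the general shape, not the actual implementation in any compiler; sketch_aligned_malloc and sketch_aligned_free are hypothetical names. The pointer returned by malloc is stored just below the aligned block, which is why the block has to be released through the matching free function rather than plain free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for _mm_malloc. `alignment` is assumed to be a
// power of two and at least sizeof(void *).
inline void *sketch_aligned_malloc(std::size_t size, std::size_t alignment)
{
    assert(alignment >= sizeof(void *) && (alignment & (alignment - 1)) == 0);
    // Room for the payload, worst-case padding, and the saved base pointer.
    void *base = std::malloc(size + alignment - 1 + sizeof(void *));
    if (base == nullptr)
        return nullptr;
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(base) + sizeof(void *);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;
    // Stash the pointer malloc returned just below the aligned block.
    reinterpret_cast<void **>(aligned)[-1] = base;
    return reinterpret_cast<void *>(aligned);
}

// Hypothetical stand-in for _mm_free: recover and free the original base.
inline void sketch_aligned_free(void *p)
{
    if (p != nullptr)
        std::free(static_cast<void **>(p)[-1]);
}
```

Passing the result of sketch_aligned_malloc straight to free would hand free an address it never returned, which is the kind of run-time failure the answer warns about.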