How can I apply __attribute__(( aligned(32))) to an int *?
Question
In my program I need to apply __attribute__(( aligned(32))) to an int * or float *. I tried it like this, but I'm not sure it will work:
int *rarray __attribute__(( aligned(32)));
I saw this, but did not find an answer there.
Recommended answer
So you want to tell the compiler that your pointers are aligned? That is, all callers of this function will pass pointers that are guaranteed to be aligned: either pointers to aligned static or local storage, or pointers they got from C11 aligned_alloc or POSIX posix_memalign. (If those aren't available, _mm_malloc is one option, but free isn't guaranteed to be safe on _mm_malloc results: you need _mm_free.) This allows the compiler to auto-vectorize without generating a bunch of bloated code to handle unaligned inputs.
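As a rough illustration of the allocation side, here is a minimal sketch that gets 32-byte-aligned storage from C11 aligned_alloc. The helper name alloc_aligned_ints is made up for this example, and the size is rounded up because C11 expects the size passed to aligned_alloc to be a multiple of the alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original answer): allocate room for
 * n ints on a 32-byte boundary using C11 aligned_alloc. */
int *alloc_aligned_ints(size_t n)
{
    /* Round the byte count up to a multiple of the 32-byte alignment. */
    size_t bytes = (n * sizeof(int) + 31) & ~(size_t)31;
    int *p = aligned_alloc(32, bytes);

    /* A successful allocation must start on a 32-byte boundary. */
    assert(p == NULL || (uintptr_t)p % 32 == 0);
    return p;   /* release with plain free(), unlike _mm_malloc results */
}
```

Memory obtained this way is released with free(); only _mm_malloc results need _mm_free.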
When you manually vectorize with intrinsics, you use _mm256_loadu_si256 or _mm256_load_si256 to inform the compiler whether memory is or isn't aligned. Communicating alignment information is the main point of load/store intrinsics, as opposed to simply dereferencing __m256i pointers.
I'm not sure if there's a portable way to inform the compiler that a pointer points to aligned memory (see below for mention of C++11 alignas, which doesn't seem to be usable for this, even if this was a C++ question).
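One non-portable possibility, which the original answer does not mention, is the GNU builtin __builtin_assume_aligned, available in GCC and Clang. The sketch below is only an illustration under that assumption; the function name diff_sum and its parameter names are made up here.

```c
#include <assert.h>
#include <stdint.h>

/* GNU extension (GCC and Clang), not ISO C: promise the compiler that
 * both pointers are 32-byte aligned. Passing a pointer that is not
 * actually aligned is undefined behaviour. */
int diff_sum(const int *restrict a_raw, const int *restrict b_raw, int n)
{
    const int *a = __builtin_assume_aligned(a_raw, 32);
    const int *b = __builtin_assume_aligned(b_raw, 32);

    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] - b[i];
    return sum;
}
```

The promise is made inside the function rather than in its signature, so callers see ordinary int pointers and must honour the alignment requirement by convention.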
With GNU C __attribute__ syntax, it seems to be necessary to use a typedef to get the attribute to apply to the pointed-to type, rather than to the pointer itself. It's definitely easier to type and easier to read if you declare an aligned_int type or something similar.
typedef __attribute__(( aligned(32))) int aligned_int;

int my_func(const aligned_int *restrict a, const aligned_int *restrict b) {
    int sum = 0;
    for (int i=0 ; i<1024 ; i++) {
        sum += a[i] - b[i];
    }
    return sum;
}
This auto-vectorizes without any bloat for handling unaligned inputs (gcc 5.3 with -O3 on godbolt):
pxor xmm0, xmm0
xor eax, eax
.L2:
psubd xmm0, XMMWORD PTR [rsi+rax]
paddd xmm0, XMMWORD PTR [rdi+rax]
add rax, 16
cmp rax, 4096
jne .L2 # end of vector loop
... # horizontal sum with psrldq omitted, see the godbolt link if you're curious
movd eax, xmm0
ret
Without the aligned attribute, you get a big block of scalar intro/outro code, which would be even worse with -march=haswell to make AVX2 code with a wider inner loop.
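For completeness, here is a sketch of how a caller might satisfy the alignment promise made by the aligned_int typedef. The caller call_my_func and its fill values are invented for this example; the arrays carry the attribute on the whole array, since an array of aligned_int itself is rejected by gcc.

```c
#include <assert.h>
#include <stdint.h>

typedef __attribute__((aligned(32))) int aligned_int;

int my_func(const aligned_int *restrict a, const aligned_int *restrict b)
{
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i] - b[i];
    return sum;
}

/* Hypothetical caller: the attribute is applied to each array as a whole,
 * so the first element (and therefore the pointer passed) is 32-byte aligned. */
int call_my_func(void)
{
    static __attribute__((aligned(32))) int a[1024];
    static __attribute__((aligned(32))) int b[1024];

    for (int i = 0; i < 1024; i++) {
        a[i] = 3;
        b[i] = 1;
    }

    /* Check the promise we are about to make to my_func. */
    assert((uintptr_t)a % 32 == 0 && (uintptr_t)b % 32 == 0);
    return my_func(a, b);
}
```

Passing a pointer that is not really 32-byte aligned would break the promise encoded in the parameter type, so the assert is a cheap guard in debug builds.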
Clang's normal strategy for unaligned inputs is to use unaligned loads/stores, instead of fully-unrolled intro/outro loops. Without AVX, this means the loads can't be folded into memory operands for SSE ALU operations. Strangely, the aligned attribute doesn't help clang-3.8 in this case: it still uses separate movdqu loads. Note that clang's loop is bigger because it defaults to unrolling by 4, whereas gcc doesn't unroll at all without -funroll-loops (which is enabled by -fprofile-use).
Note that you can't make an array of aligned_int. (See comments for discussion of sizeof(aligned_int), and the fact that it's still 4, not 32.) GNU C refuses to treat it as an int-with-padding, so with gcc 5.3:
static aligned_int arr[1024];
// error: alignment of array elements is greater than element size
int tmp = sizeof(arr);
clang-3.8 compiles it and initializes tmp to 4096.
The gcc docs claim that using the aligned attribute on a struct does let you make an array, and that this is one of the main use-cases. However, as @user3528438 pointed out in comments, this is not the case: you get the same error as when trying to declare an array of aligned_int. This has been the case since 2005.
To define aligned local or static/global arrays, the aligned attribute should be applied to the entire array, rather than to every element.
In portable C++11, you can use things like alignas(32) int myarray[1024];. See also the question "Struggling with alignas syntax": alignas seems to only be useful for aligning things themselves, not for declaring that pointers point to aligned memory. std::align is more like ((uintptr_t)ptr) & ~63 or something: forcibly aligning a pointer rather than telling the compiler it was already aligned.
alignas(32) int foo[1000]; // C++11 syntax; C11 spells it _Alignas(32), or alignas via <stdalign.h>
__attribute__((aligned(32))) int foo[1000]; // GNU C
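To make the "forcibly aligning a pointer" idea concrete, here is a small C sketch. The helper name align_up is hypothetical; it rounds an address up to the next boundary (as std::align does), whereas the bare mask expression quoted above rounds down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round p up to the next multiple of align,
 * which must be a power of two. This changes the pointer value; it does
 * not tell the compiler anything about pointers that are already aligned. */
void *align_up(void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t addr = (uintptr_t)p;
    addr = (addr + (align - 1)) & ~(uintptr_t)(align - 1);
    return (void *)addr;
}
```

The caller has to over-allocate by up to align - 1 bytes so that the rounded-up pointer still lies inside the buffer.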
CPP macros can be useful to choose between GNU C __attribute__ syntax and MSVC __declspec syntax for alignment if you want portability.
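A minimal sketch of such a macro follows. The macro name ALIGNED32 and the names table and get_table are made up for illustration; the two spellings are the GNU C attribute and MSVC's __declspec(align(32)).

```c
#include <assert.h>
#include <stdint.h>

/* ALIGNED32 is a hypothetical macro name chosen for this example. */
#if defined(_MSC_VER)
#  define ALIGNED32 __declspec(align(32))        /* MSVC spelling */
#elif defined(__GNUC__)
#  define ALIGNED32 __attribute__((aligned(32))) /* GCC / Clang spelling */
#else
#  error "no known alignment syntax for this compiler"
#endif

/* The attribute applies to the whole array, not to each element. */
ALIGNED32 int table[1000];

int *get_table(void)
{
    /* The array's first element should sit on a 32-byte boundary. */
    assert((uintptr_t)table % 32 == 0);
    return table;
}
```

Both spellings are accepted as a prefix before the declaration, which is why a single macro placed at the front works for either compiler.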
For example, with this code, which declares a local array with more alignment than can be assumed for the stack pointer, the compiler has to make space and then AND the stack pointer to get an aligned pointer:
void foo(int *p);
void bar(void) {
    __attribute__((aligned(32))) int a[1000];
    foo (a);
}
This compiles to (clang-3.8 -O3 -std=gnu11 for x86-64):
push rbp
mov rbp, rsp # stack frame with base pointer since we're doing unpredictable things to rsp
and rsp, -32 # 32B-align the stack
sub rsp, 4032 # reserve up to 32B more space than needed
lea rdi, [rsp] # this is weird: mov rdi,rsp is a shorter insn to set up foo's arg
call foo
mov rsp, rbp
pop rbp
ret
gcc (later than 4.8.2) makes significantly larger code doing a bunch of extra work for no reason, the strangest being push QWORD PTR [r10-8] to copy some stack memory to another place on the stack. (Check it out on the godbolt link: flip clang to gcc.)