How can I apply __attribute__((aligned(32))) to an int *?

Question

In my program I need to apply __attribute__((aligned(32))) to an int * or float *. I tried it like this, but I'm not sure it will work.

int  *rarray __attribute__(( aligned(32)));

I looked around but did not find an answer.

Recommended answer

So you want to tell the compiler that your pointers are aligned? For example, that all callers of this function will pass pointers that are guaranteed to be aligned: either pointers to aligned static or local storage, or pointers they got from C11 aligned_alloc or POSIX posix_memalign. (If those aren't available, _mm_malloc is one option, but free isn't guaranteed to be safe on _mm_malloc results: you need _mm_free.) This allows the compiler to auto-vectorize without generating a bunch of bloated code to handle unaligned inputs.
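A minimal sketch of the allocation side, assuming a C11 libc that provides aligned_alloc. The helper names alloc_aligned_ints, is_aligned_32 and fill_and_sum are made up for illustration and are not part of the original answer:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns storage for n ints on a 32-byte boundary,
 * or NULL on failure. C11 aligned_alloc wants the size to be a multiple
 * of the alignment, so the byte count is rounded up to a multiple of 32.
 * (No overflow check on n * sizeof(int); this is only a sketch.) */
static int *alloc_aligned_ints(size_t n)
{
    size_t bytes = n * sizeof(int);
    size_t rounded = (bytes + 31) / 32 * 32;
    return aligned_alloc(32, rounded);
}

/* Hypothetical helper: nonzero if the address is a multiple of 32. */
static int is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}

/* Allocates n aligned ints, sets each to 1, and returns their sum
 * (or -1 if the allocation failed). */
static long fill_and_sum(size_t n)
{
    int *a = alloc_aligned_ints(n);
    if (a == NULL)
        return -1;
    assert(is_aligned_32(a));
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        a[i] = 1;
        sum += a[i];
    }
    free(a);   /* plain free is fine for aligned_alloc results */
    return sum;
}
```

Memory from _mm_malloc would need _mm_free instead of free, as noted above.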

When you manually vectorize with intrinsics, you use _mm256_loadu_si256 or _mm256_load_si256 to tell the compiler whether memory is or isn't aligned. Communicating alignment information is the main point of load/store intrinsics, as opposed to simply dereferencing __m256i pointers.
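A rough sketch of that distinction. To keep it buildable without -mavx2, it swaps in the 128-bit SSE2 counterparts _mm_load_si128 and _mm_loadu_si128 rather than the 256-bit intrinsics named above, and falls back to a scalar loop when SSE2 is not available. The function names sum_aligned and sum_unaligned are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Sums 4 * nvec ints starting at p. The caller promises that p is
 * 16-byte aligned, which is what _mm_load_si128 requires. */
static int sum_aligned(const int *p, int nvec)
{
    assert((uintptr_t)p % 16 == 0);
    int total = 0;
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < nvec; i++)
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(p + 4 * i)));
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    for (int i = 0; i < 4 * nvec; i++)
        total += p[i];
#endif
    return total;
}

/* Same sum, but makes no alignment promise: _mm_loadu_si128 accepts
 * any address. */
static int sum_unaligned(const int *p, int nvec)
{
    int total = 0;
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < nvec; i++)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(p + 4 * i)));
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    for (int i = 0; i < 4 * nvec; i++)
        total += p[i];
#endif
    return total;
}
```

Passing a misaligned pointer to sum_aligned would be a bug; the unaligned variant is the safe choice when nothing is known about the pointer.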

I'm not sure if there's a portable way to tell the compiler that a pointer points to aligned memory. (See below for mention of C++11 alignas, which doesn't seem to be usable for this, even if this were a C++ question.)

With GNU C __attribute__ syntax, it seems to be necessary to use a typedef to get the attribute to apply to the pointed-to type rather than to the pointer itself. It's definitely easier to type and easier to read if you declare an aligned_int type or something similar.

typedef __attribute__(( aligned(32)))  int aligned_int;
int my_func(const aligned_int *restrict a, const aligned_int *restrict b) {
    int sum = 0;
    for (int i=0 ; i<1024 ; i++) {
        sum += a[i] - b[i];
    }
    return sum;
}

This auto-vectorizes without any bloat for handling unaligned inputs (gcc 5.3 with -O3 on godbolt: http://gcc.godbolt.org/#%7B%22version%22%3A3%2C%22filterAsm%22%3A%7B%22labels%22%3Atrue%2C%22directives%22%3Atrue%2C%22commentOnly%22%3Atrue%2C%22intel%22%3Atrue%7D%2C%22compilers%22%3A%5B%7B%22sourcez%22%3A%22C4TwDgpgJhBmAEB9RBDYwBOBLARgV2AmQApj4UAbLAcwDtpiBmAJgEp354tbhyq7oibsADcAKAD0E0JBgIuPPjVooAzkzZKBUIT3Fjh8VVgBeEAPaxKy6IYC8R0xdjFr23cFYjOU%2BMAAWWKpcwarAWBQU8AAsAHScYADUAIwh8LTm8BTmtNQQGOQAxsB4lBQgWvRQYjW%2BGBBh2MUhtADkvCjFpZEV%2FhAUYNzU8H31ADSOtIUQXLxQ5g1tvAC2aIQFAO59Afnk5LRQ8DhKapISRxCFKHiqMxsz80tGwOb1fv4Y5njU%2FvsVYOZhPlgtw9vUoHhilgcrEaoZliBELA8FNiIUcmFKoJDAAqeqNLDNFATdG0TFuKoeeB4hqYQm8HCseAAbzEnE4hlUeGW8AcAAZxOz4LBXvBiIYsHY%2BfBvFgADzJPnMaIyriJRJM1lCoVcnmJBwoADaWAAuvAALRHY0mwXsgC%2BbPZ9RKGFoRm54gdcMUCKRKMKiBRFIYpMxuPxdKJJIxvHDtKaDM1jo5il1vPgAuTwtF4sUkulsoVSpVsvVSe17LT%2BvI1otVtNts4DqFzrwrvdy09YiAA%22%2C%22compiler%22%3A%22g530%22%2C%22options%22%3A%22-xc%20-O3%22%7D%5D%7D):

    pxor    xmm0, xmm0
    xor     eax, eax
.L2:
    psubd   xmm0, XMMWORD PTR [rsi+rax]
    paddd   xmm0, XMMWORD PTR [rdi+rax]
    add     rax, 16
    cmp     rax, 4096
    jne     .L2          # end of vector loop

    ...   # horizontal sum with psrldq omitted, see the godbolt link if you're curious
    movd    eax, xmm0
    ret

Without the aligned attribute, you get a big block of scalar intro/outro code, which would be even worse with -march=haswell, where AVX2 code has a wider inner loop.
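The promise made through aligned_int only helps if callers actually keep it. A small sketch of a caller that does, assuming GNU C; diff_sum repeats the shape of my_func above so the block is self-contained, and the names xs, ys and run_diff_sum are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

typedef __attribute__((aligned(32))) int aligned_int;

/* Same loop as my_func above: the parameter types tell the compiler
 * that both inputs start on a 32-byte boundary. */
static int diff_sum(const aligned_int *restrict a, const aligned_int *restrict b)
{
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i] - b[i];
    return sum;
}

/* The arrays are plain int arrays; the attribute aligns each array as a
 * whole, not every element. */
static __attribute__((aligned(32))) int xs[1024];
static __attribute__((aligned(32))) int ys[1024];

/* Fills the arrays and calls diff_sum, checking the promise at run time. */
static int run_diff_sum(void)
{
    for (int i = 0; i < 1024; i++) {
        xs[i] = 3;
        ys[i] = 1;
    }
    assert((uintptr_t)xs % 32 == 0 && (uintptr_t)ys % 32 == 0);
    return diff_sum(xs, ys);
}
```

Passing a pointer that is not 32-byte aligned to diff_sum would break the promise, and the vectorized code could then fault on aligned loads.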

Clang's normal strategy for unaligned inputs is to use unaligned loads/stores, instead of fully-unrolled intro/outro loops. Without AVX, this means the loads can't be folded into memory operands for SSE ALU operations. Strangely, the aligned attribute doesn't help clang-3.8 in this case: it still uses separate movdqu loads (godbolt: http://gcc.godbolt.org/#%7B%22version%22%3A3%2C%22filterAsm%22%3A%7B%22labels%22%3Atrue%2C%22directives%22%3Atrue%2C%22commentOnly%22%3Atrue%2C%22intel%22%3Atrue%7D%2C%22compilers%22%3A%5B%7B%22sourcez%22%3A%22C4TwDgpgJhBmAEB9RBDYwBOBLARgV2AmQApj4UAbLAcwDtpiBmAJgEp354tbhyq7oibsADcAKDEB6SfAwQAzpiwBjXlnm0A5LxSq8lCiHgALCBTDdqJiHIA08ed2UQuvKAHsFW3gFs0hDHgAd1NgU0CUclooeBw%2BLBR5KRkcCGUUPHkXIJcPbwdgdzl4MIx3PGpjKKMwd2EbeS5aclloPFUsd1oAOglheB8QRFg8WmViZS7FeIEoIR54ACo5RWxVcntJ2mnKGno5%2FuWFJXWcVngAbzFOTn75PB94AF54AAZxG%2FhYIvhifqwnq94CIuAAeACMr2YABZgVwANTw85XT6fe6PeEvFAAbSwAF14ABaWK4vEfG4AX2uNzkwDwGGa6PEVL6C0Gw1GykQo12swmUzUCyOqxUOk2AqavGFJ14Z0u1NuC3Rzze5M430CfwWAKBIKwEKhsL1iORCtRysx5FJRJJ%2BLV8Cpn1p9MZD2ZYiAAAA%3D%3D%22%2C%22compiler%22%3A%22clang380%22%2C%22options%22%3A%22-xc%20-O3%22%7D%5D%7D). Note that clang's loop is bigger because it defaults to unrolling by 4, whereas gcc doesn't unroll at all without -funroll-loops (which is enabled by -fprofile-use).

Note that you can't make an array of aligned_int. (See the comments for discussion of sizeof(aligned_int), and the fact that it's still 4, not 32.) GNU C refuses to treat it as an int-with-padding, so with gcc 5.3:

static aligned_int arr[1024];
// error: alignment of array elements is greater than element size
int tmp = sizeof(arr);

clang-3.8 compiles it, and initializes tmp to 4096.
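The size/alignment mismatch behind that error can be checked directly. A short sketch, assuming GNU C with C11 _Alignof and a 4-byte int; the two accessor functions are hypothetical and exist only to expose the values:

```c
#include <assert.h>
#include <stddef.h>

typedef __attribute__((aligned(32))) int aligned_int;

/* The attribute raises the alignment requirement but adds no padding,
 * so the size stays that of a plain int. */
static_assert(sizeof(aligned_int) == sizeof(int),
              "aligned_int is not padded out to its alignment");

/* Hypothetical accessors for the two properties. */
static size_t aligned_int_size(void)  { return sizeof(aligned_int); }
static size_t aligned_int_align(void) { return _Alignof(aligned_int); }
```

Because consecutive array elements sit sizeof(aligned_int) bytes apart, they cannot all start on a 32-byte boundary, which is what the gcc error message complains about.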

The gcc docs claim that using the aligned attribute on a struct does let you make an array, and that this is one of the main use-cases. However, as @user3528438 pointed out in comments, this is not the case: you get the same error as when trying to declare an array of aligned_int. This has been the case since 2005.

To define aligned local or static/global arrays, the aligned attribute should be applied to the entire array, rather than to every element.

In portable C++11, you can use things like alignas(32) int myarray[1024];. See also the question "Struggling with alignas syntax": alignas seems to be useful only for aligning things themselves, not for declaring that pointers point to aligned memory. std::align is more like ((uintptr_t)ptr) & ~63 or something: forcibly aligning a pointer rather than telling the compiler it was already aligned.

alignas(32) int foo[1000];  // C++11 syntax; C11 spells it _Alignas, or alignas via <stdalign.h>
__attribute__((aligned(32))) int foo[1000];  // GNU C
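For contrast with declaring alignment, here is a sketch of forcibly aligning a pointer, which is roughly what the std::align comparison above describes. It rounds up to the next 32-byte boundary rather than masking down with ~63, and the helper name align_up_32 is made up for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns the first address at or after p that is
 * a multiple of 32. The caller must make sure the underlying buffer has
 * up to 31 spare bytes to absorb the adjustment. */
static void *align_up_32(void *p)
{
    uintptr_t addr = (uintptr_t)p;
    addr = (addr + 31) & ~(uintptr_t)31;
    assert(addr % 32 == 0);
    return (void *)addr;
}
```

This changes the pointer value at run time; it tells the compiler nothing about pointers that were already aligned, which is why it doesn't answer the original question.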

CPP macros can be useful to choose between GNU C __attribute__ syntax and MSVC __declspec syntax for alignment if you want portability.
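One way such a macro might look, as a sketch only. ALIGNED32, coeffs and get_coeffs are invented names, and the _MSC_VER test is an assumption about how the two compiler families would be told apart:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portability macro: pick the alignment spelling that the
 * current compiler understands. Both forms can precede the declaration. */
#if defined(_MSC_VER)
#  define ALIGNED32 __declspec(align(32))
#else
#  define ALIGNED32 __attribute__((aligned(32)))
#endif

/* The whole array is placed on a 32-byte boundary. */
ALIGNED32 static float coeffs[1024];

/* Returns the array after checking its alignment at run time. */
static float *get_coeffs(void)
{
    assert((uintptr_t)coeffs % 32 == 0);
    return coeffs;
}
```

The macro is written as a prefix because that position works for both spellings.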

When code declares a local array with more alignment than can be assumed for the stack pointer, the compiler has to reserve extra space and then AND the stack pointer to get an aligned pointer, e.g.:

void foo(int *p);
void bar(void) {
  __attribute__((aligned(32))) int a[1000];
  foo (a);
}

This compiles to the following (clang-3.8 -O3 -std=gnu11 for x86-64, on godbolt: http://gcc.godbolt.org/#%7B%22version%22%3A3%2C%22filterAsm%22%3A%7B%22labels%22%3Atrue%2C%22directives%22%3Atrue%2C%22commentOnly%22%3Atrue%2C%22intel%22%3Atrue%7D%2C%22compilers%22%3A%5B%7B%22sourcez%22%3A%22PTCGBsEsHMDtQM4AoDMAmAlAAkrALlgGYD2xA2gIwAMNAugNwBQAbsZACZGlK4EBUABwxNWHLACNQAJySj22AN6MsWAPqrQePFMjiArngCm6pEggxYh9qkwZsvLKEo0qDZV2JYzwxgF9GQAA%22%2C%22compiler%22%3A%22clang380%22%2C%22options%22%3A%22-xc%20-O3%20-std%3Dgnu11%22%7D%5D%7D):

    push    rbp
    mov     rbp, rsp       # stack frame with base pointer since we're doing unpredictable things to rsp
    and     rsp, -32       # 32B-align the stack
    sub     rsp, 4032      # reserve up to 32B more space than needed
    lea     rdi, [rsp]     # this is weird:  mov rdi,rsp  is a shorter insn to set up foo's arg
    call    foo
    mov     rsp, rbp
    pop     rbp
    ret

gcc (later than 4.8.2) makes significantly larger code, doing a bunch of extra work for no reason, the strangest being push QWORD PTR [r10-8] to copy some stack memory to another place on the stack. (Check it out on the godbolt link: flip clang to gcc.)
