Combining __restrict__ and __attribute__((aligned(32)))

Question

I want to make sure gcc knows that:

  1. the pointers point to non-overlapping blocks of memory
  2. the pointers have 32-byte alignment

Is the following correct?

template<typename T, typename T2>
void f(const  T* __restrict__ __attribute__((aligned(32))) x,
       T2* __restrict__ __attribute__((aligned(32))) out) {}

Thanks.

Update:

I try to use one read and lots of writes to saturate the CPU ports for writing. I hope that would make the performance gain from aligned moves more significant.

But the assembly still uses unaligned moves instead of aligned moves.

Code (also at godbolt.org):

void square(const float* __restrict__ __attribute__((aligned(32))) x,
            const int size,
            float* __restrict__ __attribute__((aligned(32))) out0,
            float* __restrict__ __attribute__((aligned(32))) out1,
            float* __restrict__ __attribute__((aligned(32))) out2,
            float* __restrict__ __attribute__((aligned(32))) out3,
            float* __restrict__ __attribute__((aligned(32))) out4) {
    for (int i = 0; i < size; ++i) {
        out0[i] = x[i];
        out1[i] = x[i] * x[i];
        out2[i] = x[i] * x[i] * x[i];
        out3[i] = x[i] * x[i] * x[i] * x[i];
        out4[i] = x[i] * x[i] * x[i] * x[i] * x[i];
    }
}

Assembly compiled with gcc 8.2 and "-march=haswell -O3". It is full of vmovups, which are unaligned moves.

.L3:
        vmovups ymm1, YMMWORD PTR [rbx+rax]
        vmulps  ymm0, ymm1, ymm1
        vmovups YMMWORD PTR [r14+rax], ymm0
        vmulps  ymm0, ymm1, ymm0
        vmovups YMMWORD PTR [r15+rax], ymm0
        vmulps  ymm0, ymm1, ymm0
        vmovups YMMWORD PTR [r12+rax], ymm0
        vmulps  ymm0, ymm1, ymm0
        vmovups YMMWORD PTR [rbp+0+rax], ymm0
        add     rax, 32
        cmp     rax, rdx
        jne     .L3
        and     r13d, -8
        vzeroupper

Same behavior even for sandybridge:

.L3:
        vmovups xmm2, XMMWORD PTR [rbx+rax]
        vinsertf128     ymm1, ymm2, XMMWORD PTR [rbx+16+rax], 0x1
        vmulps  ymm0, ymm1, ymm1
        vmovups XMMWORD PTR [r14+rax], xmm0
        vextractf128    XMMWORD PTR [r14+16+rax], ymm0, 0x1
        vmulps  ymm0, ymm1, ymm0
        vmovups XMMWORD PTR [r13+0+rax], xmm0
        vextractf128    XMMWORD PTR [r13+16+rax], ymm0, 0x1
        vmulps  ymm0, ymm1, ymm0
        vmovups XMMWORD PTR [r12+rax], xmm0
        vextractf128    XMMWORD PTR [r12+16+rax], ymm0, 0x1
        vmulps  ymm0, ymm1, ymm0
        vmovups XMMWORD PTR [rbp+0+rax], xmm0
        vextractf128    XMMWORD PTR [rbp+16+rax], ymm0, 0x1
        add     rax, 32
        cmp     rax, rdx
        jne     .L3
        and     r15d, -8
        vzeroupper

Using addition instead of multiplication (godbolt). Still unaligned moves.

Answer

No, using float *__attribute__((aligned(32))) x means that the pointer itself is stored in aligned memory, not that it points to aligned memory.¹

There is a way to do this, but it only helps for gcc, not clang or ICC.

See How to tell GCC that a pointer argument is always double-word-aligned? for __builtin_assume_aligned, which works on all GNU C compatible compilers, and How can I apply __attribute__((aligned(32))) to an int *? for more details about __attribute__((aligned(32))), which does work for GCC.
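
For example, here is a minimal sketch of my own (a cut-down two-output version of the question's function, not the original signature) showing __builtin_assume_aligned in use; the casts are needed in C++ because the builtin returns void*:

void square2(const float* __restrict x, int size,
             float* __restrict out0, float* __restrict out1)
{
    // Promise that the pointed-to data (not the pointer objects) is 32-byte aligned.
    const float* xa = static_cast<const float*>(__builtin_assume_aligned(x, 32));
    float* o0 = static_cast<float*>(__builtin_assume_aligned(out0, 32));
    float* o1 = static_cast<float*>(__builtin_assume_aligned(out1, 32));

    for (int i = 0; i < size; ++i) {
        o0[i] = xa[i] * xa[i];           // with the guarantee, gcc/clang use aligned moves (vmovaps)
        o1[i] = xa[i] * xa[i] * xa[i];
    }
}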

I used __restrict instead of __restrict__ because that C++ extension name for C99 restrict is portable to all the mainstream x86 C++ compilers, including MSVC.

typedef float aligned32_float __attribute__((aligned(32)));

void prod(const aligned32_float  * __restrict x,
          const aligned32_float  * __restrict y,
          int size,
          aligned32_float* __restrict out0)
{
    size &= -16ULL;

#if 0   // this works for clang, ICC, and GCC
    x = (const float*)__builtin_assume_aligned(x, 32);  // have to cast the result in C++
    y = (const float*)__builtin_assume_aligned(y, 32);
    out0 = (float*)__builtin_assume_aligned(out0, 32);
#endif

    for (int i = 0; i < size; ++i) {
        out0[i] = x[i] * y[i];  // auto-vectorized with a memory operand for mulps
      // note clang using two separate movups loads
      // instead of a memory operand for mulps
    }
}

(GCC, clang, and ICC output on the Godbolt compiler explorer.)

GCC and clang will use movaps / vmovaps instead of the ups versions any time there is a compile-time alignment guarantee. (Unlike MSVC and ICC, which never use movaps for loads/stores, a missed optimization for anything that runs on Core2 / K10 or older.) And as you noticed, gcc applies the -mavx256-split-unaligned-load/store effects for tunings other than Haswell (Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?), another clue that your syntax didn't work.

vmovups is not a performance problem when used on aligned memory; it performs identically to vmovaps on all AVX-supporting CPUs when the address is aligned at runtime. So in practice there's no real problem with your -march=haswell output. Only older CPUs (before Nehalem and Bulldozer) always decoded movups to multiple uops.

The real benefit (these days) of telling the compiler about alignment guarantees is that compilers sometimes emit extra code for startup/cleanup loops to reach an alignment boundary. Or without AVX, compilers can't fold a load into a memory operand for mulps unless it's aligned.

A good test case for this is out0[i] = x[i] * y[i], where the load result is only needed once, or out0[i] *= x[i]. Knowing the alignment enables movaps / mulps xmm0, [rsi]; otherwise it's 2x movups + mulps. You can check for this optimization even on compilers like ICC or MSVC, which use movups even when they do know they have an alignment guarantee, but they will still make alignment-required code when they can fold a load into an ALU operation.
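
As a concrete sketch of that test case (my own example, not code from the answer; compile without AVX, e.g. plain -O3 with the SSE2 baseline, to see the effect):

void scale(float* __restrict out0, const float* __restrict x, int size)
{
    // Remove these two lines to compare: without the alignment promise the
    // x[i] load can't be folded into mulps and shows up as a separate movups.
    out0 = static_cast<float*>(__builtin_assume_aligned(out0, 32));
    x = static_cast<const float*>(__builtin_assume_aligned(x, 32));

    for (int i = 0; i < size; ++i)
        out0[i] *= x[i];
}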

It seems __builtin_assume_aligned is the only really portable (to GNU C compilers) way to do this. You can do hacks like passing pointers to struct aligned_floats { alignas(32) float f[8]; };, but that's just cumbersome to use, and unless you actually access memory through objects of that type, it doesn't get compilers to assume alignment. (e.g. casting a pointer to that struct back to float * doesn't help.)
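
For completeness, a sketch of what that struct hack looks like (the names are just illustrative, not from the answer); the alignment assumption only applies because every access goes through the struct member:

struct aligned_floats { alignas(32) float f[8]; };

void square_blocks(const aligned_floats* __restrict in,
                   aligned_floats* __restrict out, int nblocks)
{
    for (int b = 0; b < nblocks; ++b)
        for (int i = 0; i < 8; ++i)
            out[b].f[i] = in[b].f[i] * in[b].f[i];  // compiler may assume 32-byte alignment per block
}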

I try to use one read and lots of write to saturate the cpu ports for writing.

Using more than 4 output streams can hurt by resulting in more conflict misses in the cache. Skylake's L2 cache is only 4-way associative, for example. But L1d is 8-way, so you're probably OK for small buffers.

If you want to saturate store-port uop throughput, use narrower stores (e.g. scalar), not wide SIMD stores that need more bandwidth per uop. Back-to-back stores to the same cache line may be able to merge in the store buffer before committing to L1d, so it depends what you want to test.
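
As a rough illustration (my sketch, not code from the answer), a store-port test kernel along those lines could use one load feeding several scalar stores; gcc's auto-vectorizer would have to be disabled (e.g. with -fno-tree-vectorize) to keep the stores scalar:

void scalar_stores(const float* __restrict x, int size,
                   float* __restrict out0, float* __restrict out1,
                   float* __restrict out2, float* __restrict out3)
{
    for (int i = 0; i < size; ++i) {
        float v = x[i];
        out0[i] = v;   // four independent 4-byte stores per load,
        out1[i] = v;   // stressing store uop throughput rather than
        out2[i] = v;   // store data width
        out3[i] = v;
    }
}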

Semi-related: a 2x load + 1x store memory access pattern like c[i] = a[i] + b[i] or a STREAM triad will come closest to maxing out total L1d cache load+store bandwidth on Intel Sandybridge-family CPUs. On SnB/IvB, 256-bit vectors take 2 cycles per load/store, leaving time for store-address uops to use the AGUs on ports 2 or 3 during the 2nd cycle of a load. On Haswell and later (256-bit wide load/store ports), the stores need to use a non-indexed addressing mode so they can use the simple-addressing-mode store AGU on port 7.

But AMD CPUs can do up to 2 memory ops per clock, with at most one of them being a store, so they'd max out with a copy-and-operate pattern where stores = loads.

BTW, Intel recently announced Sunny Cove (the core microarchitecture used in Ice Lake), which will have 2x load + 2x store throughput per clock, a 2nd vector shuffle ALU, and 5-wide issue/rename. So that's fun! Compilers will need to unroll loops by at least 2 to avoid bottlenecking on 1-per-clock loop branches.

Footnote 1: That's why (if you compile without AVX) you get a warning, and gcc omits an and rsp,-32, because it assumes RSP is already aligned. (It doesn't actually spill any YMM regs, so it should have optimized this out anyway, but gcc has had this missed-optimization bug for a while with locals or auto-vectorization-created objects that have extra alignment.)

<source>:4:6: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
