Why is the speedup lower than expected when using AVX2?

Problem description


I have vectorized the inner loop of matrix addition using AVX2 intrinsics, and I also have the latency table from here. I expect the speedup to be about a factor of 5, because the scalar version does 1024 iterations with a latency of roughly 4 each, while the vector version does 128 iterations with a latency of 6 each, but the speedup is only a factor of 3. So the question is: what else is going on here that I don't see? I'm using gcc, coding in C with intrinsics, and the CPU is a Skylake 6700HQ.

Here is the C source and the assembly output of the inner loop.

global data:

int __attribute__((aligned(32))) a[MAX1][MAX2];
int __attribute__((aligned(32))) b[MAX2][MAX3];
int __attribute__((aligned(32))) c_result[MAX1][MAX3];

Sequential:

for( i = 0 ; i < MAX1 ; i++)
        for(j = 0 ; j < MAX2 ; j++)
            c_result[i][j] = a[i][j] + b[i][j];

.L16:
    movl    (%r9,%rax), %edx           // latency: 2, throughput: 0.5, execution units: 4 ALU
    addl    (%r8,%rax), %edx           // latency: don't know, throughput: 0.5, execution units: 4 ALU
    movl    %edx, c_result(%rcx,%rax)  // latency: 2, throughput: 1, execution units: 4 ALU
    addq    $4, %rax
    cmpq    $4096, %rax
    jne .L16

AVX2:

for( i = 0 ; i < MAX1 ; i++){
   for(j = 0 ; j < MAX2 ; j += 8){
      a0_i = _mm256_add_epi32(_mm256_load_si256((__m256i *)&a[i][j]),
                              _mm256_load_si256((__m256i *)&b[i][j]));
      _mm256_store_si256((__m256i *)&c_result[i][j], a0_i);
   }
}

.L22:
    vmovdqa (%rcx,%rax), %ymm0           // latency: 3, throughput: 0.5, execution units: 4 ALU
    vpaddd  (%r8,%rax), %ymm0, %ymm0     // latency: don't know, throughput: 0.5, execution units: 3 VEC-ALU
    vmovdqa %ymm0, c_result(%rdx,%rax)   // latency: 3, throughput: 1, execution units: 4 ALU
    addq    $32, %rax
    cmpq    $4096, %rax
    jne .L22

Solution

Other than the loop counter, there's no loop-carried dependency chain. So operations from different loop iterations can be in flight at once. This means latency isn't the bottleneck, just throughput (of execution units, and the frontend (up to 4 fused-domain uops per clock)).
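
To make that concrete (my illustration, not part of the original answer): the only value carried from one iteration into the next is the loop counter in %rax, and its 1-cycle addq is trivial to hide, so the loads, add, and store of several iterations can be in flight at the same time:

    %rax chain:    addq -> addq -> addq -> ...                  (1c per step, loop-carried)
    iteration i:   load a[i], load b[i] -> add -> store c[i]    (feeds nothing in iteration i+1)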

Also, your numbers are totally insane. mov loads don't take 4 ALU execution units! And the load/store latency numbers are wrong / meaningless (see the last section).

# Scalar  (serial is the wrong word.  Both versions are serial, not parallel)
.L16:
    movl    (%r9,%rax), %edx           // fused-domain uops: 1.  Unfused domain: a load port
    addl    (%r8,%rax), %edx           // fused-domain uops: 2   Unfused domain: a load port and any ALU port
    movl    %edx, c_result(%rcx,%rax)  // fused-domain uops: 2   Unfused domain: store-address and store-data ports.  port7 can't handle 2-reg addresses
    addq    $4, %rax                   // fused-domain uops: 1   unfused: any ALU
    cmpq    $4096, %rax                // fused-domain uops: 0 (fused with jcc)
    jne .L16                           // fused-domain uops: 1   unfused: port6 (predicted-taken branch)

Total: 7 fused-domain uops means the loop can issue from the loop buffer at one iteration per 2c (not one per 1.75c). Since we're using a mix of loads, stores, and ALU uops, execution ports aren't the bottleneck, just the fused-domain 4-wide issue width. Two loads per 2c and one store per 2c is only half the throughput of the load and store execution units.

Note that 2-register addressing modes can't micro-fuse on Intel SnB-family. This isn't a problem for pure loads, because they're 1 uop even without micro-fusion.

The analysis is identical for the vector loop. (vpaddd has a latency of 1c on Skylake, and on almost every other CPU. The table doesn't list anything in the latency column for padd with a memory operand because the latency of the load is separate from the latency of the add. It only adds one cycle to the dep chain involving the register src/dest, as long as the load address is known far enough ahead of time.)
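
Putting numbers on that (my arithmetic, just restating the uop counts above): both loops are about 7 fused-domain uops per iteration, so roughly one iteration per 2c either way, but they do very different amounts of work per iteration:

    scalar loop:   4 bytes of results per ~2c   ->   ~2 bytes per cycle
    vector loop:  32 bytes of results per ~2c   ->  ~16 bytes per cycle
    expected speedup:  16 / 2 = 8x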


Agner Fog's store and load latency numbers are kinda bogus, too. He arbitrarily divides the total load-store round trip latency (with store-forwarding) into a latency number for load and for store. IDK why he didn't list load latency as measured by a pointer-chasing test (e.g. repeated mov (%rsi), %rsi). That shows you that Intel SnB-family CPUs have 4 cycle load-use latency.
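
For reference, a minimal pointer-chasing test along those lines (my sketch, not from the original answer; the use of __rdtsc and the iteration count are arbitrary choices, and TSC ticks are reference cycles rather than core clock cycles, so use performance counters if you want the exact 4c figure):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
    static void *chain[1024];
    for (int i = 0; i < 1024; i++)          /* circular chain of pointers, stays in L1 */
        chain[i] = &chain[(i + 1) % 1024];

    void *p = chain[0];
    const long iters = 200000000;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        p = *(void **)p;                    /* serial dependent loads: mov (%rsi),%rsi */
    uint64_t t1 = __rdtsc();

    /* each load's address comes from the previous load, so the loop runs at
       one load per load-use latency (~4 core cycles on SnB-family) */
    printf("ended at %p, ~%.2f TSC ticks per load\n", p, (double)(t1 - t0) / iters);
    return 0;
}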

I meant to send him a note about that, but haven't gotten around to it.


You should be seeing an AVX2 speedup of 32/4, i.e. 8x. Your problem size is only 4096B, which is small enough for three arrays of that size to fit in L1 cache. (EDIT: the question was misleading: the loop shown is the inner loop of a nested loop. See the comments: apparently even with 4k arrays (not 4M), OP was still only seeing a 3x speedup (vs. 1.5x with 4M arrays), so there's some kind of bottleneck in the AVX version.)

All 3 arrays are aligned, so the problem isn't a cache-line split in the memory operand that doesn't require alignment (the (%r8,%rax) source of vpaddd).

My other theory on that doesn't seem very likely either, but are your array addresses offset from each other by exactly 4096B? From Agner Fog's microarch PDF:

"It is not possible to read and write simultaneously from addresses that are spaced by a multiple of 4 Kbytes."

The example shows a store then load, though, so IDK if that truly explains it. Even if the memory-ordering hardware thinks the load and store might be to the same address, I'm not sure why that would stop the code from sustaining as many memory ops, or why it would affect the AVX2 code worse than the scalar code.
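
A quick way to check that (my sketch; it assumes the global arrays from the question are in scope): print each base address modulo 4096 and see whether the two loads and the store land on the same 4K offset:

#include <stdio.h>
#include <stdint.h>

/* call once before timing the loops */
static void print_4k_offsets(void)
{
    printf("addr mod 4096:  a=%#lx  b=%#lx  c_result=%#lx\n",
           (unsigned long)((uintptr_t)a        % 4096),
           (unsigned long)((uintptr_t)b        % 4096),
           (unsigned long)((uintptr_t)c_result % 4096));
}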

It's worth trying offsetting your arrays from each other by an extra 128B or 256B or something.
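
One way to try that (my sketch, not from the original answer; the exact padding sizes are arbitrary, and the toolchain is free to re-order or group globals, so re-check the real addresses afterwards): declare some padding between the arrays, keeping the 32-byte alignment intact:

int  __attribute__((aligned(32))) a[MAX1][MAX2];
char pad_ab[128];                 /* skews b by an extra 128B (a multiple of 32, so alignment is preserved) */
int  __attribute__((aligned(32))) b[MAX2][MAX3];
char pad_bc[256];                 /* skews c_result by a different extra amount */
int  __attribute__((aligned(32))) c_result[MAX1][MAX3];

If the AVX2 version speeds up once the arrays are no longer a multiple of 4096B apart, 4K aliasing between a load and the store was the limiting factor.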
