Difference in performance between MSVC and GCC for highly optimized matrix multiplication code


I'm seeing a big difference in performance between code compiled in MSVC (on Windows) and GCC (on Linux) for an Ivy Bridge system. The code does dense matrix multiplication. I'm getting 70% of the peak flops with GCC and only 50% with MSVC. I think I may have isolated the difference to how they both convert the following three intrinsics.

__m256 breg0 = _mm256_loadu_ps(&b[8*i]);
tmp0 = _mm256_add_ps(_mm256_mul_ps(arge0, breg0), tmp0);

GCC does this:

vmovups ymm9, YMMWORD PTR [rax-256]
vmulps  ymm9, ymm0, ymm9
vaddps  ymm8, ymm8, ymm9

MSVC does this:

vmulps   ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps   ymm3, ymm1, ymm3

Could somebody please explain to me if and why these two solutions could give such a big difference in performance?

Despite MSVC using one fewer instruction, it ties the load to the mult, and maybe that makes it more dependent (maybe the load can't be done out of order)? I mean, Ivy Bridge can do one AVX load, one AVX mult, and one AVX add in one clock cycle, but this requires each operation to be independent.
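
(For reference, that peak works out to 8-wide SIMD × (1 AVX mult + 1 AVX add) × 3.7 GHz ≈ 59.2 GFLOPs/s per core, or 236.8 GFLOPs/s over 4 cores, which matches the maximum quoted in the benchmark output below.)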

Maybe the problem lies elsewhere? You can see the full assembly code for GCC and MSVC for the innermost loop below. You can see the C++ code for the loop here: Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell (http://stackoverflow.com/questions/21090873/loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell).
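
For readers without the link handy, here is a minimal self-contained sketch (not the exact code from the linked question) of the kind of inner kernel the assembly below implies: one broadcast from a, then eight unrolled load/mul/add steps into eight accumulators. The names, the kernel shape, and the 64*k stride (matching "add rax, 256", i.e. 256 bytes = 64 floats, in the GCC listing) are all assumptions.

#include <immintrin.h>

void inner_kernel(const float *a, const float *b, float *c, int n)
{
    __m256 acc[8];                                   // one accumulator per ymm register
    for (int j = 0; j < 8; ++j)
        acc[j] = _mm256_setzero_ps();

    for (int k = 0; k < n; ++k) {
        __m256 areg0 = _mm256_broadcast_ss(&a[k]);   // vbroadcastss
        for (int j = 0; j < 8; ++j) {                // unrolled 8x by the compiler
            __m256 breg0 = _mm256_loadu_ps(&b[64*k + 8*j]);  // vmovups (or folded into vmulps)
            acc[j] = _mm256_add_ps(_mm256_mul_ps(areg0, breg0), acc[j]);
        }
    }
    for (int j = 0; j < 8; ++j)
        _mm256_storeu_ps(&c[8*j], acc[j]);
}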

g++ -S -masm=intel matrix.cpp -O3 -mavx -fopenmp

.L4:
    vbroadcastss    ymm0, DWORD PTR [rcx+rdx*4]
    add rdx, 1
    add rax, 256
    vmovups ymm9, YMMWORD PTR [rax-256]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm8, ymm8, ymm9
    vmovups ymm9, YMMWORD PTR [rax-224]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm7, ymm7, ymm9
    vmovups ymm9, YMMWORD PTR [rax-192]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm6, ymm6, ymm9
    vmovups ymm9, YMMWORD PTR [rax-160]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm5, ymm5, ymm9
    vmovups ymm9, YMMWORD PTR [rax-128]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm4, ymm4, ymm9
    vmovups ymm9, YMMWORD PTR [rax-96]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm3, ymm3, ymm9
    vmovups ymm9, YMMWORD PTR [rax-64]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm2, ymm2, ymm9
    vmovups ymm9, YMMWORD PTR [rax-32]
    cmp esi, edx
    vmulps  ymm0, ymm0, ymm9
    vaddps  ymm1, ymm1, ymm0
    jg  .L4

MSVC /FAc /O2 /openmp /arch:AVX ...

vbroadcastss ymm2, DWORD PTR [r10]    
lea  rax, QWORD PTR [rax+256]
lea  r10, QWORD PTR [r10+4] 
vmulps   ymm1, ymm2, YMMWORD PTR [rax-320]
vaddps   ymm3, ymm1, ymm3    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-288]
vaddps   ymm4, ymm1, ymm4    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps   ymm5, ymm1, ymm5    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-224]
vaddps   ymm6, ymm1, ymm6    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-192]
vaddps   ymm7, ymm1, ymm7    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-160]
vaddps   ymm8, ymm1, ymm8    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-128]
vaddps   ymm9, ymm1, ymm9    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-96]
vaddps   ymm10, ymm1, ymm10    
dec  rdx
jne  SHORT $LL3@AddDot4x4_

EDIT:

I benchmark the code by calculating the total floating point operations as 2.0*n^3, where n is the width of the square matrix, and dividing by the time measured with omp_get_wtime(). I repeat the loop several times. In the output below I repeated it 100 times.
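
For reference, a minimal sketch of that timing scheme; gemm() and bench() are assumed names standing in for the actual benchmark code:

#include <omp.h>
#include <cstdio>

void gemm(int n);  // the matrix-multiply routine under test (assumed)

void bench(int n, int reps)
{
    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; ++r)
        gemm(n);
    double t1 = omp_get_wtime();

    double gflop   = 2.0*n*n*n * 1e-9;          // total GFLOPs for one multiply
    double gflop_s = gflop * reps / (t1 - t0);  // sustained rate over all repetitions
    printf("n %4d, %8.2f ms, GFLOPs %7.3f, GFLOPs/s %7.2f\n",
           n, (t1 - t0)*1000.0/reps, gflop, gflop_s);
}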

Output from MSVC 2012 on an Intel Xeon E5 1620 (Ivy Bridge); the all-core turbo frequency is 3.7 GHz.

maximum GFLOPS = 236.8 = (8-wide SIMD) * (1 AVX mult + 1 AVX add) * (4 cores) * 3.7 GHz

n   64,     0.02 ms, GFLOPs   0.001, GFLOPs/s   23.88, error 0.000e+000, efficiency/core   40.34%, efficiency  10.08%, mem 0.05 MB
n  128,     0.05 ms, GFLOPs   0.004, GFLOPs/s   84.54, error 0.000e+000, efficiency/core  142.81%, efficiency  35.70%, mem 0.19 MB
n  192,     0.17 ms, GFLOPs   0.014, GFLOPs/s   85.45, error 0.000e+000, efficiency/core  144.34%, efficiency  36.09%, mem 0.42 MB
n  256,     0.29 ms, GFLOPs   0.034, GFLOPs/s  114.48, error 0.000e+000, efficiency/core  193.37%, efficiency  48.34%, mem 0.75 MB
n  320,     0.59 ms, GFLOPs   0.066, GFLOPs/s  110.50, error 0.000e+000, efficiency/core  186.66%, efficiency  46.67%, mem 1.17 MB
n  384,     1.39 ms, GFLOPs   0.113, GFLOPs/s   81.39, error 0.000e+000, efficiency/core  137.48%, efficiency  34.37%, mem 1.69 MB
n  448,     3.27 ms, GFLOPs   0.180, GFLOPs/s   55.01, error 0.000e+000, efficiency/core   92.92%, efficiency  23.23%, mem 2.30 MB
n  512,     3.60 ms, GFLOPs   0.268, GFLOPs/s   74.63, error 0.000e+000, efficiency/core  126.07%, efficiency  31.52%, mem 3.00 MB
n  576,     3.93 ms, GFLOPs   0.382, GFLOPs/s   97.24, error 0.000e+000, efficiency/core  164.26%, efficiency  41.07%, mem 3.80 MB
n  640,     5.21 ms, GFLOPs   0.524, GFLOPs/s  100.60, error 0.000e+000, efficiency/core  169.93%, efficiency  42.48%, mem 4.69 MB
n  704,     6.73 ms, GFLOPs   0.698, GFLOPs/s  103.63, error 0.000e+000, efficiency/core  175.04%, efficiency  43.76%, mem 5.67 MB
n  768,     8.55 ms, GFLOPs   0.906, GFLOPs/s  105.95, error 0.000e+000, efficiency/core  178.98%, efficiency  44.74%, mem 6.75 MB
n  832,    10.89 ms, GFLOPs   1.152, GFLOPs/s  105.76, error 0.000e+000, efficiency/core  178.65%, efficiency  44.66%, mem 7.92 MB
n  896,    13.26 ms, GFLOPs   1.439, GFLOPs/s  108.48, error 0.000e+000, efficiency/core  183.25%, efficiency  45.81%, mem 9.19 MB
n  960,    16.36 ms, GFLOPs   1.769, GFLOPs/s  108.16, error 0.000e+000, efficiency/core  182.70%, efficiency  45.67%, mem 10.55 MB
n 1024,    17.74 ms, GFLOPs   2.147, GFLOPs/s  121.05, error 0.000e+000, efficiency/core  204.47%, efficiency  51.12%, mem 12.00 MB

Solution

Since we've covered the alignment issue, I would guess it's this: http://en.wikipedia.org/wiki/Out-of-order_execution

Since g++ issues a standalone load instruction, your processor can reorder the instructions to prefetch the next data it will need while also adding and multiplying. MSVC throwing a pointer at mul ties the load and the mul to the same instruction, so changing the execution order of the instructions doesn't help anything.

EDIT: Intel's server(s) with all the docs are less angry today, so here's more research on why out of order execution is (part of) the answer.

First of all, it looks like your comment is completely right about it being possible for the MSVC version of the multiplication instruction to decode to separate µ-ops that can be optimized by a CPU's out-of-order engine. The fun part here is that modern microcode sequencers are programmable, so the actual behavior is both hardware and firmware dependent. The differences in the generated assembly seem to come from GCC and MSVC each trying to fight different potential bottlenecks. The GCC version tries to give leeway to the out-of-order engine (as we've already covered). However, the MSVC version ends up taking advantage of a feature called "micro-op fusion". This is because of the µ-op retirement limitations: the end of the pipeline can only retire 4 fused µ-ops per tick. Micro-op fusion, in specific cases, takes two µ-ops that must be done on two different execution units (i.e. memory read and arithmetic) and ties them into a single µ-op for most of the pipeline. The fused µ-op is only split into the two real µ-ops right before execution-unit assignment. After execution, the ops are fused again, allowing them to be retired as one.

The out-of-order engine only sees the fused µ-op, so it can't pull the load op away from the multiplication. This causes the pipeline to stall while waiting for the next operand to finish its bus ride.

ALL THE LINKS!!!: http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

http://www.agner.org/optimize/microarchitecture.pdf

http://www.agner.org/optimize/optimizing_assembly.pdf

http://www.agner.org/optimize/instruction_tables.ods (NOTE: Excel complains that this spreadsheet is partially corrupted or otherwise sketchy, so open it at your own risk. It doesn't seem to be malicious, though, and according to the rest of my research, Agner Fog is awesome. After I opted in to the Excel recovery step, I found it full of tons of great data.)

http://cs.nyu.edu/courses/fall13/CSCI-GA.3033-008/Microprocessor-Report-Sandy-Bridge-Spans-Generations-243901.pdf

http://www.syncfusion.com/Content/downloads/ebook/Assembly_Language_Succinctly.pdf


MUCH LATER EDIT: Wow, there have been some interesting updates to the discussion here. I guess I was mistaken about how much of the pipeline is actually affected by micro-op fusion. Maybe there is more perf gain than I expected from the differences in the loop-condition check, where the unfused instructions allow GCC to interleave the compare and jump with the last vector load and arithmetic steps?

vmovups ymm9, YMMWORD PTR [rax-32]
cmp esi, edx
vmulps  ymm0, ymm0, ymm9
vaddps  ymm1, ymm1, ymm0
jg  .L4
