Multiply-add vectorization slower with AVX than with SSE


Problem description

I have a piece of code that is being run under a heavily contended lock, so it needs to be as fast as possible. The code is very simple - it's a basic multiply-add on a bunch of data which looks like this:

for( int i = 0; i < size; i++ )
{
    c[i] += (double)a[i] * (double)b[i];
}

Under -O3 with SSE support enabled, the code is vectorized the way I would expect. However, with AVX code generation turned on I get about a 10-15% slowdown instead of a speedup, and I can't figure out why.

Here's the benchmark code:

#include <chrono>
#include <cstdio>
#include <cstdlib>

int main()
{
    int size = 1 << 20;

    float *a = new float[size];
    float *b = new float[size];
    double *c = new double[size];

    for (int i = 0; i < size; i++)
    {
        a[i] = rand();
        b[i] = rand();
        c[i] = rand();
    }

    for (int j = 0; j < 10; j++)
    {
        auto begin = std::chrono::high_resolution_clock::now();

        for( int i = 0; i < size; i++ )
        {
            c[i] += (double)a[i] * (double)b[i];
        }

        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();

        printf("%lluus\n", duration);
    }
}

Here's the generated assembly under SSE:

0x100007340 <+144>:  cvtps2pd (%r13,%rbx,4), %xmm0
0x100007346 <+150>:  cvtps2pd 0x8(%r13,%rbx,4), %xmm1
0x10000734c <+156>:  cvtps2pd (%r15,%rbx,4), %xmm2
0x100007351 <+161>:  mulpd  %xmm0, %xmm2
0x100007355 <+165>:  cvtps2pd 0x8(%r15,%rbx,4), %xmm0
0x10000735b <+171>:  mulpd  %xmm1, %xmm0
0x10000735f <+175>:  movupd (%r14,%rbx,8), %xmm1
0x100007365 <+181>:  addpd  %xmm2, %xmm1
0x100007369 <+185>:  movupd 0x10(%r14,%rbx,8), %xmm2
0x100007370 <+192>:  addpd  %xmm0, %xmm2
0x100007374 <+196>:  movupd %xmm1, (%r14,%rbx,8)
0x10000737a <+202>:  movupd %xmm2, 0x10(%r14,%rbx,8)
0x100007381 <+209>:  addq   $0x4, %rbx
0x100007385 <+213>:  cmpq   $0x100000, %rbx           ; imm = 0x100000 
0x10000738c <+220>:  jne    0x100007340               ; <+144> at main.cpp:26:20

Results from running SSE benchmark:

1411us
1246us
1243us
1267us
1242us
1237us
1246us
1242us
1250us
1229us

Generated assembly with AVX enabled:

0x1000070b0 <+144>:  vcvtps2pd (%r13,%rbx,4), %ymm0
0x1000070b7 <+151>:  vcvtps2pd 0x10(%r13,%rbx,4), %ymm1
0x1000070be <+158>:  vcvtps2pd 0x20(%r13,%rbx,4), %ymm2
0x1000070c5 <+165>:  vcvtps2pd 0x30(%r13,%rbx,4), %ymm3
0x1000070cc <+172>:  vcvtps2pd (%r15,%rbx,4), %ymm4
0x1000070d2 <+178>:  vmulpd %ymm4, %ymm0, %ymm0
0x1000070d6 <+182>:  vcvtps2pd 0x10(%r15,%rbx,4), %ymm4
0x1000070dd <+189>:  vmulpd %ymm4, %ymm1, %ymm1
0x1000070e1 <+193>:  vcvtps2pd 0x20(%r15,%rbx,4), %ymm4
0x1000070e8 <+200>:  vcvtps2pd 0x30(%r15,%rbx,4), %ymm5
0x1000070ef <+207>:  vmulpd %ymm4, %ymm2, %ymm2
0x1000070f3 <+211>:  vmulpd %ymm5, %ymm3, %ymm3
0x1000070f7 <+215>:  vaddpd (%r14,%rbx,8), %ymm0, %ymm0
0x1000070fd <+221>:  vaddpd 0x20(%r14,%rbx,8), %ymm1, %ymm1
0x100007104 <+228>:  vaddpd 0x40(%r14,%rbx,8), %ymm2, %ymm2
0x10000710b <+235>:  vaddpd 0x60(%r14,%rbx,8), %ymm3, %ymm3
0x100007112 <+242>:  vmovupd %ymm0, (%r14,%rbx,8)
0x100007118 <+248>:  vmovupd %ymm1, 0x20(%r14,%rbx,8)
0x10000711f <+255>:  vmovupd %ymm2, 0x40(%r14,%rbx,8)
0x100007126 <+262>:  vmovupd %ymm3, 0x60(%r14,%rbx,8)
0x10000712d <+269>:  addq   $0x10, %rbx
0x100007131 <+273>:  cmpq   $0x100000, %rbx           ; imm = 0x100000 
0x100007138 <+280>:  jne    0x1000070b0               ; <+144> at main.cpp:26:20

Results from running AVX benchmark:

1532us
1404us
1480us
1464us
1410us
1383us
1333us
1362us
1494us
1526us

Note that it doesn't really matter that the generated AVX code has twice as many instructions as the SSE code - I've tried a smaller unroll by hand (to match SSE) and AVX was still slower.
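
(Editor's sketch, not the original poster's experiment: the question doesn't show how the smaller unroll was done. One way to cap clang's vectorizer unrolling without rewriting the loop by hand is a loop hint; clang is free to ignore it, so the disassembly should be re-checked.)

// Drop-in replacement for the inner loop of the benchmark above.
// interleave_count(1) asks the vectorizer for one vector per iteration
// instead of the 4x interleaving seen in the AVX disassembly.
#pragma clang loop vectorize(enable) interleave_count(1)
for( int i = 0; i < size; i++ )
{
    c[i] += (double)a[i] * (double)b[i];
}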

For context, I'm using macOS 11 and Xcode 12, on a Mac Pro 6.1 (trashcan) with an Intel Xeon CPU E5-1650 v2 @ 3.50GHz.

Solution

Update: alignment didn't help much/at all. There may also be another bottleneck, e.g. in packed float->double conversion? Also, vcvtps2pd (%r13,%rbx,4), %ymm0 only has a 16-byte memory source, so only the stores are 32-byte. We don't have any 32-byte split loads. (I wrote the answer below before looking carefully enough at the code.)


That's an IvyBridge CPU. Is your data aligned by 32? If not, it's well known that cache-line splits on 32-byte loads or stores are a serious bottleneck for those old microarchitectures. Those early Intel AVX-supporting CPUs have full-width ALUs, but they run 32-byte loads and stores as 2 separate data cycles in the execution units from the same uop (footnote 1), making a cache-line split an extra special (and extra slow) case (https://www.realworldtech.com/sandy-bridge/7/). Unlike Haswell (and Zen 2) and later, which have 32-byte data paths (footnote 2).
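
(Editor's aside: a quick way to answer the alignment question for the benchmark above. report_alignment is a helper name introduced here, not something from the original post; in this loop only the accesses to c are 32 bytes wide, per the update at the top of this answer.)

#include <cstdint>
#include <cstdio>

// Print how far a pointer is from 32-byte alignment. A non-zero result for c
// means the 32-byte vaddpd loads and vmovupd stores on c are misaligned, and
// some of them straddle a cache-line boundary.
static void report_alignment(const char *name, const void *p)
{
    printf("%s = %p, address mod 32 = %u\n",
           name, p, unsigned(reinterpret_cast<std::uintptr_t>(p) & 31u));
}

// Usage, right after the allocations in main():
//   report_alignment("a", a);
//   report_alignment("b", b);
//   report_alignment("c", c);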

So slow that GCC's default -mtune=generic code generation will even split 256-bit AVX loads and stores that aren't known at compile time to be aligned. (This is way overkill and hurts especially on newer CPUs, and/or when the data actually is aligned but the compiler doesn't know it, or when the data is aligned in the common case but the function still needs to work on the occasional misaligned array; in that case it's better to let the hardware deal with the special case than to run more instructions in the common case just to check for it.)
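
(Editor's illustration of what that split looks like, expressed as intrinsics rather than GCC's actual code generation; the exact instruction choice is GCC's and is not guaranteed here.)

#include <immintrin.h>

// The two-halves sequence GCC's default tuning uses for a 256-bit load it
// can't prove is aligned: a 16-byte vmovups plus a vinsertf128 of the upper
// half, instead of a single (possibly cache-line-splitting) 32-byte vmovups.
static inline __m256 load_256_as_two_halves(const float *p)
{
    __m256 v = _mm256_castps128_ps256(_mm_loadu_ps(p));
    return _mm256_insertf128_ps(v, _mm_loadu_ps(p + 4), 1);
}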

But you're using clang, which makes somewhat nice code here (unrolled by 4x) that would perform well with aligned arrays, or on a newer CPU like Haswell. Unfortunately it uses indexed addressing modes, defeating much of the purpose of unrolling (especially for Intel Sandybridge / Ivy Bridge), because the load and ALU uop un-laminate and go through the front-end separately; see Micro fusion and addressing modes. (Haswell can keep some of them micro-fused in the SSE case, but not AVX, e.g. the stores.)
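
(Editor's sketch of the pointer-increment alternative; madd_avx is a name introduced here, not code from the answer, and clang may still transform the loop back into an indexed form.)

#include <immintrin.h>
#include <cstddef>

// Manually vectorized multiply-add that advances the pointers themselves,
// so the compiler can use simple (%reg) / disp(%reg) addressing modes instead
// of the (%base,%index,scale) forms that un-laminate on SnB / IvB.
// Assumes size is a multiple of 4; a real version needs a cleanup loop.
void madd_avx(const float *a, const float *b, double *c, size_t size)
{
    for (const float *end = a + size; a != end; a += 4, b += 4, c += 4)
    {
        __m256d va = _mm256_cvtps_pd(_mm_loadu_ps(a));  // 4 floats -> 4 doubles
        __m256d vb = _mm256_cvtps_pd(_mm_loadu_ps(b));
        __m256d vc = _mm256_loadu_pd(c);
        _mm256_storeu_pd(c, _mm256_add_pd(vc, _mm256_mul_pd(va, vb)));
    }
}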

You can use aligned_alloc, or maybe do something with C++17 aligned new to get an aligned allocation that's compatible with delete.

Plain new may be giving you a pointer aligned by 16, but misaligned by 32. I don't know about macOS, but on Linux glibc's allocator for large-ish allocations typically keeps 16 bytes for bookkeeping at the start of a page, so large allocations typically come back 16 bytes away from being aligned to anything larger than 16.
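
(Editor's sketch of the aligned_alloc route suggested above, not code from the answer. It assumes a toolchain where std::aligned_alloc is available, i.e. C++17 with a macOS 10.15+ deployment target; posix_memalign is an older fallback. Note the matching release is free(), not delete[].)

#include <cstdlib>

int main()
{
    int size = 1 << 20;

    // 4 MiB / 8 MiB requests; both are multiples of the requested 32-byte
    // alignment, as aligned_alloc formally requires.
    float  *a = static_cast<float *>(std::aligned_alloc(32, size * sizeof(float)));
    float  *b = static_cast<float *>(std::aligned_alloc(32, size * sizeof(float)));
    double *c = static_cast<double *>(std::aligned_alloc(32, size * sizeof(double)));

    // ... fill the arrays and run the timed loops from the question ...

    std::free(a);
    std::free(b);
    std::free(c);
}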


Footnote 2: That single uop that spends a 2nd cycle in a load port still only does address-generation once. That allows another uop (e.g. a store-address uop) to use the AGU while the 2nd data cycle is happening, so the address-handling part stays fully pipelined.

SnB / IvB only have 2 AGU/load ports, so they can normally execute up to 2 memory operations per clock, at most one of which is a store. But with 32-byte loads and stores only needing an address every 2 data cycles (and store-data already being a separate uop on a different port from store-address), SnB / IvB can sustain 2 loads + 1 store per clock for the special case of 32-byte loads and stores. (That uses most of the front-end bandwidth, so those loads typically need to be micro-fused as a memory source operand of another instruction.)

See also my answer on How can cache be that fast? on electronics.SE.

Footnote 1: Zen 1 (and Bulldozer-family) decode all 32-byte operations into 2 separate uops, so there's no special case. One half of the load can be split across a cache line, and that's handled exactly like a cache-line split on a 16-byte xmm load.
