Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled

Question

I am trying to understand the benefit of SIMD vectorization, so I wrote simple demonstrator code to see how much faster an algorithm that uses vectorization (SIMD) is than one that does not. Here are the 2 algorithms:

Alg_A - No vector support:

#include <stdio.h>

#define SIZE 1000000000

int main() {
    printf("Algorithm with NO vector support\n");

    int a[] = {1, 2, 3, 4};
    int b[] = {5, 6, 7, 8};
    int i = 0;

    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
        a[0] = a[0] + b[0];
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

    printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);
}

Alg_B - With vector support:

#include <stdio.h>

#define SIZE 1000000000

typedef int v4_i __attribute__ ((vector_size(16)));
union Vec4 {
    v4_i v;
    int i[4];
};

int main() {
    printf("Algorithm with vector support\n\n");

    union Vec4 a, b;
    a.i[0] = 1, a.i[1] = 2, a.i[2] = 3, a.i[3] = 4;
    b.i[0] = 5, b.i[1] = 5, b.i[2] = 7, b.i[3] = 8;
    int i = 0;
    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
        a.v = a.v + b.v;
    }

    printf("A: [%d %d %d %d]\n", a.i[0], a.i[1], a.i[2], a.i[3]);
}

The compilation is done as follows:

Alg_A:

gcc -ggdb -mno-sse -mno-sse2 -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -c non_vector_support.c
gcc non_vector_support.o -o non_vector_support

Alg_B:

gcc -ggdb -c vector_support.c
gcc vector_support.o -o vector_support

Looking at the generated code for both algorithms, I can see that the compiler did not apply any tricks such as auto-vectorization to Alg_A (there are no xmm registers in its output):

Alg_A:

    for (; i < SIZE; i++) {
  74:   eb 30                   jmp    a6 <main+0xa6>
        a[0] = a[0] + b[0];
  76:   8b 55 d8                mov    -0x28(%rbp),%edx
  79:   8b 45 e8                mov    -0x18(%rbp),%eax
  7c:   01 d0                   add    %edx,%eax
  7e:   89 45 d8                mov    %eax,-0x28(%rbp)
        a[1] = a[1] + b[1];
  81:   8b 55 dc                mov    -0x24(%rbp),%edx
  84:   8b 45 ec                mov    -0x14(%rbp),%eax
  87:   01 d0                   add    %edx,%eax
  89:   89 45 dc                mov    %eax,-0x24(%rbp)
        a[2] = a[2] + b[2];
  8c:   8b 55 e0                mov    -0x20(%rbp),%edx
  8f:   8b 45 f0                mov    -0x10(%rbp),%eax
  92:   01 d0                   add    %edx,%eax
  94:   89 45 e0                mov    %eax,-0x20(%rbp)
        a[3] = a[3] + b[3];
  97:   8b 55 e4                mov    -0x1c(%rbp),%edx
  9a:   8b 45 f4                mov    -0xc(%rbp),%eax
  9d:   01 d0                   add    %edx,%eax
  9f:   89 45 e4                mov    %eax,-0x1c(%rbp)
    int a[] = {1, 2, 3, 4};
    int b[] = {5, 6, 7, 8};
    int i = 0;

    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
  a2:   83 45 d4 01             addl   $0x1,-0x2c(%rbp)
  a6:   81 7d d4 ff c9 9a 3b    cmpl   $0x3b9ac9ff,-0x2c(%rbp)
  ad:   7e c7                   jle    76 <main+0x76>
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

    printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);

Alg_B:

    for (; i < SIZE; i++) {
  74:   eb 16                   jmp    8c <main+0x8c>
        a.v = a.v + b.v;
  76:   66 0f 6f 4d d0          movdqa -0x30(%rbp),%xmm1
  7b:   66 0f 6f 45 e0          movdqa -0x20(%rbp),%xmm0
  80:   66 0f fe c1             paddd  %xmm1,%xmm0
  84:   0f 29 45 d0             movaps %xmm0,-0x30(%rbp)
    union Vec4 a, b;
    a.i[0] = 1, a.i[1] = 2, a.i[2] = 3, a.i[3] = 4;
    b.i[0] = 5, b.i[1] = 5, b.i[2] = 7, b.i[3] = 8;
    int i = 0;
    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
  88:   83 45 cc 01             addl   $0x1,-0x34(%rbp)
  8c:   81 7d cc ff c9 9a 3b    cmpl   $0x3b9ac9ff,-0x34(%rbp)
  93:   7e e1                   jle    76 <main+0x76>
        a.v = a.v + b.v;
    }

    printf("A: [%d %d %d %d]\n", a.i[0], a.i[1], a.i[2], a.i[3]);

When I run the programs, I was hoping to see a speed improvement by a factor of 4. However, the gain appears to be about 1 s on average for this data size, and about 5 s if SIZE is increased to around 8000000000. This is the timing for the value in the above code:
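The timings below come from the shell's time command, which measures the whole process. As a minimal sketch of timing only the loop from inside the program, the helper below wraps the Alg_A loop body with clock(). The name time_scalar_adds and its parameters are hypothetical and not part of the original code. It assumes the file is compiled without optimization, as in the question. It also uses unsigned int so that wraparound is well defined, and a 64-bit loop counter, because a value such as 8000000000 does not fit in a 32-bit int.

```c
#include <time.h>

/* Hypothetical helper (not from the question): runs the Alg_A loop body
 * n times, copies the final values of a[] into out[], and returns the
 * CPU time spent in the loop, in seconds. */
double time_scalar_adds(long long n, unsigned int out[4]) {
    unsigned int a[4] = {1, 2, 3, 4};
    unsigned int b[4] = {5, 6, 7, 8};

    clock_t start = clock();
    for (long long i = 0; i < n; i++) {   /* 64-bit counter: n may exceed INT_MAX */
        a[0] = a[0] + b[0];
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }
    clock_t end = clock();

    for (int k = 0; k < 4; k++) {
        out[k] = a[k];                    /* expose the result to the caller */
    }
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling it from main with n set to SIZE and printing the returned value gives a loop-only figure that can be compared between the two variants.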

Alg_A:

Algorithm with NO vector support
Running loop 1000000000 times
A: [705032705 1705032706 -1589934589 -589934588]

real    0m3.279s
user    0m3.282s
sys     0m0.000s

Alg_B:

Algorithm with vector support

Running loop 1000000000 times
A: [705032705 705032706 -1589934589 -589934588]

real    0m2.609s
user    0m2.607s
sys     0m0.004s

To account for the overhead of the loop itself, I ran an empty loop for the given SIZE and obtained about 2.2 s on average. Subtracting that from both timings gives an average speed-up of around 2.5 times.
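The arithmetic behind that estimate can be written out as a small sketch. The function name speedup_without_overhead is hypothetical. It also assumes the loop overhead is purely additive, which is a simplification, since on an out-of-order CPU the counter update can overlap with the adds.

```c
/* Hypothetical helper: speed-up of B over A after subtracting the same
 * fixed loop overhead (t_loop) from both measured times, all in seconds. */
double speedup_without_overhead(double t_a, double t_b, double t_loop) {
    return (t_a - t_loop) / (t_b - t_loop);
}
```

With the figures quoted above (3.279 s, 2.609 s and 2.2 s) this evaluates to (3.279 - 2.2) / (2.609 - 2.2), roughly 2.6, which is in line with the stated "around 2.5 times".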

Have I missed something in the code or the compilation lines? Or is this supposed to be correct, in which case would someone know why there isn't a 4-fold gain in speed when I am processing 4 data points with 1 instruction at each iteration?

Thanks.

Recommended answer

That is most likely instruction latency (a RAW, read-after-write, dependency): each add has to wait for the result of the previous one. While scalar ALU instructions have little to no latency, i.e. the result can be the operand of the next instruction without any delay, SIMD instructions tend to have longer latencies before their results are available, even for simple ones like add.

Extend the arrays to 16 or even 32 entries, spanning 4 or 8 SIMD vectors, and you will see a huge difference thanks to instruction scheduling.

NOW: add v, latency, add v, latency, ...

4 vector rotation: add v1, add v2, add v3, add v4, add v1, add v2, ...
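As a rough sketch of the rotation described above, the function below keeps four independent SIMD vectors (16 entries in total) and adds to each in turn, so no add has to wait for the one issued just before it. It relies on the same GCC vector extension as the question. The type v4_u and the function add_rotating are hypothetical names, and unsigned elements are an assumption made here so that wraparound is well defined. Whether this shows a speed-up, and how large, depends on the CPU and compiler settings, so nothing measured is claimed.

```c
/* Same GCC vector extension as in the question, but with unsigned
 * elements (an assumption of this sketch, not the original typedef). */
typedef unsigned int v4_u __attribute__ ((vector_size(16)));

/* Hypothetical helper: performs n rounds of a[k] += b[k] for four
 * independent vectors. The four adds in one round do not depend on
 * each other, only on the same vector's add from the previous round. */
void add_rotating(v4_u a[4], const v4_u b[4], long long n) {
    for (long long i = 0; i < n; i++) {
        a[0] = a[0] + b[0];   /* add v1 */
        a[1] = a[1] + b[1];   /* add v2 */
        a[2] = a[2] + b[2];   /* add v3 */
        a[3] = a[3] + b[3];   /* add v4 */
    }
}
```

Each round does four times the work of the single-vector loop in Alg_B, so comparing the time per round of the two loops gives an indication of how much of the original time was spent waiting on the dependency.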

Search for "instruction scheduling" and "RAW dependency" for more details.
