Two very similar functions involving sin() exhibit vastly different performance -- why?


Consider the following two programs that perform the same computations in two different ways:

// v1.c
#include <stdio.h>
#include <math.h>
int main(void) {
   int i, j;
   int nbr_values = 8192;
   int n_iter = 100000;
   float x;
   for (j = 0; j < nbr_values; j++) {
      x = 1;
      for (i = 0; i < n_iter; i++)
         x = sin(x);
   }
   printf("%f\n", x);
   return 0;
}

and

// v2.c
#include <stdio.h>
#include <math.h>
int main(void) {
   int i, j;
   int nbr_values = 8192;
   int n_iter = 100000;
   float x[nbr_values];
   for (i = 0; i < nbr_values; ++i) {
      x[i] = 1;
   }
   for (i = 0; i < n_iter; i++) {
      for (j = 0; j < nbr_values; ++j) {
         x[j] = sin(x[j]);
      }
   }
   printf("%f\n", x[0]);
   return 0;
}

When I compile them using gcc 4.7.2 with -O3 -ffast-math and run on a Sandy Bridge box, the second program is twice as fast as the first one.

Why is that?

One suspect is the data dependency between successive iterations of the i loop in v1. However, I don't quite see what the full explanation might be.

(Question inspired by Why is my python/numpy example faster than pure C implementation?)

EDIT:

Here is the generated assembly for v1:

        movl    $8192, %ebp
        pushq   %rbx
LCFI1:
        subq    $8, %rsp
LCFI2:
        .align 4
L2:
        movl    $100000, %ebx
        movss   LC0(%rip), %xmm0
        jmp     L5
        .align 4
L3:
        call    _sinf
L5:
        subl    $1, %ebx
        jne     L3
        subl    $1, %ebp
        .p2align 4,,2
        jne     L2

and for v2:

        movl    $100000, %r14d
        .align 4
L8:
        xorl    %ebx, %ebx
        .align 4
L9:
        movss   (%r12,%rbx), %xmm0
        call    _sinf
        movss   %xmm0, (%r12,%rbx)
        addq    $4, %rbx
        cmpq    $32768, %rbx
        jne     L9
        subl    $1, %r14d
        jne     L8

Solution

Ignore the loop structure altogether, and think only about the sequence of calls to sin. v1 does the following:

x <-- sin(x)
x <-- sin(x)
x <-- sin(x)
...

that is, each computation of sin() cannot begin until the result of the previous call is available; it must wait for the entirety of the previous computation. This means that for the 8192 × 100000 = 819,200,000 calls to sin, the total time required is 819,200,000 times the latency of a single sin evaluation.

In v2, by contrast, you do the following:

x[0] <-- sin(x[0])
x[1] <-- sin(x[1])
x[2] <-- sin(x[2])
...

Notice that each call to sin does not depend on the previous call. Effectively, the calls to sin are all independent, and the processor can begin each one as soon as the necessary register and ALU resources are available (without waiting for the previous computation to complete). Thus, the time required is a function of the throughput of the sin function rather than its latency, and so v2 can finish in significantly less time.


I should also note that DeadMG is right that v1 and v2 are formally equivalent, and in a perfect world the compiler would optimize both of them into a single chain of 100000 sin evaluations (or simply evaluate the result at compile time). Sadly, we live in an imperfect world.
