Why is my program so slow?

Question

Someone decided to do a quick test to see how Native Client compares to JavaScript in terms of speed. They did that by running 10,000,000 sqrt calculations and measuring the time it took. The result with JavaScript: 0.096 seconds, and with NaCl: 4.241 seconds... How can that be? Isn't speed one of the reasons to use NaCl in the first place? Or am I missing some compiler flags or something?

Here is the code that was run:

clock_t t = clock();
float result = 0;
for(int i = 0; i < 10000000; ++i) {
    result += sqrt(i);
}
t = clock() - t;      
float tt = ((float)t)/CLOCKS_PER_SEC;
pp::Var var_reply = pp::Var(tt);
PostMessage(var_reply);

PS: This question comes from the Native Client mailing list.

Recommended answer

Microbenchmarks are tricky: unless you understand what you are doing VERY well, it's easy to produce apples-to-oranges comparisons that have nothing to do with the behavior you actually want to observe or measure.

I'll elaborate a bit using your own example (I'll exclude NaCl and stick to the existing, "tried and true" technologies).

Here is your test as a native C program:

$ cat test1.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += sqrt(i);
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g %g\n", result, tt);

}
$ gcc -std=c99 -O2 test1.c -lm -o test1
$ ./test1
5.49756e+11 25.43

OK. We can do a billion iterations in 25.43 seconds. But let's see what takes the time: let's replace "result += sqrt(i);" with "result += i;"

$ cat test2.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += i;
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g %g\n", result, tt);
}
$ gcc -std=c99 -O2 test2.c -lm -o test2
$ ./test2
1.80144e+16 1.21

Wow! 95% of the time was actually spent in the CPU-provided sqrt function; everything else took less than 5%. But what if we change the code just a bit: replace "printf("%g %g\n", result, tt);" with "printf("%g\n", tt);"?

$ cat test3.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  float result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += sqrt(i);
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%g\n", tt);
}
$ gcc -std=c99 -O2 test3.c -lm -o test3
$ ./test3
1.44

Hmm... Looks like "sqrt" is now almost as fast as "+". How can this be? How can the printf affect the preceding loop AT ALL?

Let's see:

$ gcc -std=c99 -O2 test1.c -S -o -
...
.L3:
        cvtsi2sd        %ebp, %xmm1
        sqrtsd  %xmm1, %xmm0
        ucomisd %xmm0, %xmm0
        jp      .L7
        je      .L2
.L7:
        movapd  %xmm1, %xmm0
        movss   %xmm2, (%rsp)
        call    sqrt
        movss   (%rsp), %xmm2
.L2:
        unpcklps        %xmm2, %xmm2
        addl    $1, %ebp
        cmpl    $1000000000, %ebp
        cvtps2pd        %xmm2, %xmm2
        addsd   %xmm0, %xmm2
        unpcklpd        %xmm2, %xmm2
        cvtpd2ps        %xmm2, %xmm2
        jne     .L3
 ...
$ gcc -std=c99 -O2 test3.c -S -o -
...
        xorpd   %xmm1, %xmm1
...
.L5:
        cvtsi2sd        %ebp, %xmm0
        ucomisd %xmm0, %xmm1
        ja      .L14
.L10:
        addl    $1, %ebp
        cmpl    $1000000000, %ebp
        jne     .L5
...
.L14:
        sqrtsd  %xmm0, %xmm2
        ucomisd %xmm2, %xmm2
        jp      .L12
        .p2align 4,,2
        je      .L10
.L12:
        movsd   %xmm1, (%rsp)
        .p2align 4,,5
        call    sqrt
        movsd   (%rsp), %xmm1
        .p2align 4,,4
        jmp     .L10
...

The first version actually computes a square root a billion times (the sqrtsd instruction inside the loop), but the second one does not do that at all! Instead it checks whether the number is negative and computes sqrt only in that case! Why? What is the compiler (or, rather, what are the compiler authors) trying to do here?

Well, it's simple: since we don't use "result" in this particular version, the compiler can safely omit the "sqrt" computation... if the value is not negative, that is! If it's negative then (depending on FPU flags) sqrt can do different things (return a nonsensical result, crash the program, etc.), so that case has to be kept. That's why this version is more than a dozen times faster (1.44 seconds instead of 25.43), but it does not calculate square roots at all!
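One common way to keep a compiler from throwing away the measured work is to make the result observable. The following is only a minimal sketch under stated assumptions, not the original test: `sum_halves` is a hypothetical stand-in workload, and `sink` is an assumed name for a `volatile` variable that receives the result so that the value has to be produced.

```c
#include <time.h>

/* Hypothetical sink: a store to a volatile object is an observable
   side effect, so the value stored into it has to be produced. */
static volatile double sink;

/* Stand-in workload (an assumption for this sketch, not the original
   test): adds up i * 0.5 for every i in [0, n). */
static double sum_halves(int n) {
    double result = 0.0;
    for (int i = 0; i < n; ++i) {
        result += i * 0.5;
    }
    return result;
}

/* Times the workload and returns the elapsed CPU time in seconds.
   Storing into sink keeps the result "used" even if nobody prints it. */
static double time_workload(int n) {
    clock_t start = clock();
    sink = sum_halves(n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Note that this only guarantees that the value is produced, not that it is produced by running the loop as written; an optimizer may still find a cheaper way to arrive at the same number.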

Here is a final example which shows how wrong microbenchmarks can go:

$ cat test4.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int main() {
  clock_t t = clock();
  int result = 0;
  for(int i = 0; i < 1000000000; ++i) {
      result += 2;
  }
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%d %g\n", result, tt);
}
$ gcc -std=c99 -O2 test4.c -lm -o test4
$ ./test4
2000000000 0

Execution time is... ZERO? How can that be? A billion calculations in less than the blink of an eye? Let's see:

$ gcc -std=c99 -O2 test4.c -S -o -
...
        call    clock
        movq    %rax, %rbx
        call    clock
        subq    %rbx, %rax
        movl    $2000000000, %edx
        movl    $.LC1, %esi
        cvtsi2ssq       %rax, %xmm0
        movl    $1, %edi
        movl    $1, %eax
        divss   .LC0(%rip), %xmm0
        unpcklps        %xmm0, %xmm0
        cvtps2pd        %xmm0, %xmm0
...

Uh-oh, the loop has been eliminated completely! All the calculations happened at compile time (note the constant $2000000000), and to add insult to injury, the two "clock" calls now run back to back, with nothing left between them to measure!

What if we put the loop in a separate function?

$ cat test5.c
#include <math.h>
#include <time.h>
#include <stdio.h>

int testfunc(int num, int max) {
  int result = 0;
  for(int i = 0; i < max; ++i) {
      result += num;
  }
  return result;
}

int main() {
  clock_t t = clock();
  int result = testfunc(2, 1000000000);
  t = clock() - t;
  float tt = ((float)t)/CLOCKS_PER_SEC;
  printf("%d %g\n", result, tt);
}
$ gcc -std=c99 -O2 test5.c -lm -o test5
$ ./test5
2000000000 0

Still the same??? How can this be?

$ gcc -std=c99 -O2 test5.c -S -o -
...
.globl testfunc
        .type   testfunc, @function
testfunc:
.LFB16:
        .cfi_startproc
        xorl    %eax, %eax
        testl   %esi, %esi
        jle     .L3
        movl    %esi, %eax
        imull   %edi, %eax
.L3:
        rep
        ret
        .cfi_endproc
...

Uh-oh: the compiler is clever enough to replace the loop with a multiplication!
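To see the difference, one option is to accumulate into a `volatile` object, because every read and write of a volatile is an observable access that the compiler has to perform. The sketch below is an illustration only; `testfunc_volatile` is a hypothetical name, not something from the original test.

```c
/* Plain version: an optimizer is free to turn this loop into num * max,
   as the assembly above shows. */
int testfunc(int num, int max) {
    int result = 0;
    for (int i = 0; i < max; ++i) {
        result += num;
    }
    return result;
}

/* Sketch: the accumulator is volatile, so each iteration has to read
   and write it, and the loop cannot be folded into one multiplication. */
int testfunc_volatile(int num, int max) {
    volatile int result = 0;
    for (int i = 0; i < max; ++i) {
        result += num;
    }
    return result;
}
```

Both functions return the same value, but the second one now measures a load and a store per iteration rather than a bare addition, which is yet another way a microbenchmark can end up measuring something other than what was intended.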

Now if you add NaCl on one side and JavaScript on the other, you get such a complex system that the results are literally unpredictable.

The problem here is that with a microbenchmark you are trying to isolate a piece of code and then evaluate its properties, but the compiler (no matter whether JIT or AOT) will thwart your efforts, because it tries to remove all the useless calculations from your program!

Microbenchmarks are useful, sure, but they are a FORENSIC ANALYSIS tool, not something you want to use to compare the speed of two different systems! For that you need some "real" workload (in some sense of the word: something which cannot be optimized to pieces by an over-eager compiler); sorting algorithms are a popular choice.
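As a rough illustration of such a workload, here is a minimal sketch that times the standard `qsort` on pseudo-random data and checks afterwards that the output really is sorted. All helper names (`fill_pseudo_random`, `compare_unsigned`, `is_sorted`, `time_sort`) and the seed are assumptions made for this example, not part of the original discussion.

```c
#include <stdlib.h>
#include <time.h>

/* Fills data with a deterministic pseudo-random sequence from a simple
   linear congruential generator, so every run sorts the same input. */
static void fill_pseudo_random(unsigned *data, size_t n, unsigned seed) {
    unsigned state = seed;
    for (size_t i = 0; i < n; ++i) {
        state = state * 1664525u + 1013904223u;
        data[i] = state;
    }
}

/* qsort comparator for unsigned values; avoids overflow from subtraction. */
static int compare_unsigned(const void *a, const void *b) {
    unsigned x = *(const unsigned *)a;
    unsigned y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Returns 1 if data is in non-decreasing order, 0 otherwise. */
static int is_sorted(const unsigned *data, size_t n) {
    for (size_t i = 1; i < n; ++i) {
        if (data[i - 1] > data[i]) {
            return 0;
        }
    }
    return 1;
}

/* Sorts n pseudo-random values and returns the CPU time in seconds,
   or -1.0 if allocation fails or the output is not sorted. */
static double time_sort(size_t n) {
    unsigned *data = malloc(n * sizeof *data);
    if (data == NULL) {
        return -1.0;
    }
    fill_pseudo_random(data, n, 12345u);
    clock_t start = clock();
    qsort(data, n, sizeof *data, compare_unsigned);
    clock_t end = clock();
    int ok = is_sorted(data, n);   /* the result is inspected, so it is "used" */
    free(data);
    return ok ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Because the sorted array is verified after the timed region, the sorting work cannot simply be dropped as unused, and the data-dependent comparisons leave little room for folding the whole thing into a constant.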

Benchmarks which use sqrt are especially nasty because, as we've seen, they usually spend over 90% of their time executing one single CPU instruction: sqrtsd (fsqrt in a 32-bit build), which is, of course, identical for JavaScript and NaCl. Such benchmarks (if properly implemented) may serve as a litmus test (if the speed of some implementation differs too much from what a simple native version exhibits, then you are doing something wrong), but they are useless for comparing the speed of NaCl, JavaScript, C# or Visual Basic.
