这有什么错用关联性由编译器? [英] What's wrong with using associativity by compilers?

查看:121
本文介绍了这有什么错用关联性由编译器?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有时相关性可以用来松散的数据依赖和我很好奇,多少它可以提供帮助。我非常惊讶地发现,我几乎可以得到一个加速的 4 通过手动展开一个平凡的循环,无论是在爪哇(建1.7.0_51-B13)因素,并在C(GCC 4.4 0.3)。

因此​​,无论我做什么pretty愚蠢或编译器忽略的有力工具。我开始用

  int类型的= 0;
的for(int i = 0; I< N ++Ⅰ)= M1 * A + T [I]

该计算一些接近 String.hash code()(设置 M1 = 31 和使用一个的char [] )。该计算是pretty琐碎和 t.length = 1000 发生在我i5-2400约1.2微秒@ 3.10GHz(包括Java和C)。

观察每两个步骤 A 得到由乘以M2 = M1 * M1 和补上一。这导致这块code的

  int类型的= 0;
的for(int i = 0; I< N,I + = 2){
    A = M2 * A +(M1 * T [I] + T [I + 1]); //< - 注意括号!
}
如果(I< LEN)A = M1 * A + T [I] //处理奇数长度。

这正是快两倍作为第一个片段。奇怪的是,在离开了括号吃高速化的20%。有趣的是,这可以重复并且可以实现3.8倍。

Java不同,的gcc -O3 选择不展开循环。这是明智的选择,因为它不会帮助无论如何(如 -funroll-全循环显示)。

所以我的问题 1 是:什么prevents这样的优化

谷歌搜索没有工作,我得到了关联数组和关联经营者而已。

更新

我擦亮了我的基准一点点并能提供现在有些结果的。有没有加速超越展开4次,可能是因为乘法和加法的,一起服用4个周期。

更新2

从Java已经展开循环,所有的辛勤工作已经完成。我们得到的是一样的东西。

  ... pre环
的for(int i = 0; I< N,I + = 2){
    A2 = M1 * A + T [I]
    A = M1 * A2 + T [I + 1];
}
...后环

其中最有趣的部分可以改写像

  A = M1 *((M1 * A)+ T [i])+ T [I + 1]; //延迟2mul + 2add

这表明,有2次乘法和加法2,他们都按顺序进行的,因此需要一个现代的x86处理器上的8个周期。现在我们需要的是一些小学数学(适用于 INT 工作件更溢出或什么的情况下,但并不适用于浮点)。

  A =((M1 *(M1 * A))+(M1 * T [i]))+ T [I + 1]; //延迟2mul + 2add

到目前为止,我们一无所获,但它使我们能够折叠的常量

  A =((M2 * A)+(M1 * T [i]))+ T [I + 1]; //延迟1mul + 2add

和重组的总和,从而获得更高

  A =(M2 * A)+((M1 * T [I])+ T [I + 1]); //延迟1mul + 1add


解决方案

这里是我理解你的两种情况:在第一种情况下,你有一个循环,需要N个步骤;在第二种情况下手动合并第一案连续两个迭代成一体,所以你只需要做在第二种情况下N / 2的步骤。你的第二个情况下运行速度更快,你想知道为什么是喑哑的编译器不能自动完成。

没有什么会prevent从做这样的优化编译器。但是,请注意,原始循环的这种重写导致的较大的可执行文件的大小:你有循环和附加内更多的指令如果在循环后结果。
如果N = 1或N = 3,原环可能会更快(少分支和更好的缓存/ prefetching /支prediction)。它在你的情况做事情更快的 的,但它可能使事情在其他情况下慢。它不是一个明确的是否值得这样做的优化,它可以为高度平凡的编译器来实现这样的优化

顺便说一句,你做了什么是非常相似的循环矢量但在你的情况,你做手工并行的步骤和插入的结果。埃里克Brumer的编译机密讲座将让您深入了解为什么在一般的重写循环是棘手的,什么缺点/缺点有(较大的可执行文件的大小,可能速度较慢在某些情况下)。所以编译器作者都非常清楚这一点的可能性优化,并正在积极努力,但它是在普通平凡的高度,也可以使事情更慢。


请尝试我的东西:

  int类型的= 0;
的for(int i = 0; I< N ++ I)
  一个=((A&所述;小于5) - 一)+ T [I];

假设 M1 = 31 。原则上,编译器应该足够聪明来重写 31 * A (A<< 5)-a 但我很好奇,如果它真的做到这一点。

Sometimes associativity can be used to loose data dependencies and I was curious how much it can help. I was rather surprised to find out that I can nearly get a speed-up factor of 4 by manually unrolling a trivial loop, both in Java (build 1.7.0_51-b13) and in C (gcc 4.4.3).

So either I'm doing something pretty stupid or the compilers ignore a powerful tool. I started with

int a = 0;
for (int i=0; i<N; ++i) a = M1 * a + t[i];

which computes something close to String.hashCode() (set M1=31 and use a char[]). The computation is pretty trivial and for t.length=1000 takes about 1.2 microsecond on my i5-2400 @ 3.10GHz (both in Java and C).

Observe that each two steps a gets multiplied by M2 = M1*M1 and added something. This leads to this piece of code

int a = 0;
for (int i=0; i<N; i+=2) {
    a = M2 * a + (M1 * t[i] + t[i+1]); // <-- note the parentheses!
}
if (i < len) a = M1 * a + t[i]; // Handle odd length.

This is exactly twice as fast as the first snippet. Strangely, leaving out the parentheses eats 20% of the speed-up. Funnily enough, this can be repeated and a factor of 3.8 can be achieved.

Unlike java, gcc -O3 chooses not to unroll the loop. It's wise choice since it wouldn't help anyway (as -funroll-all-loops shows).

So my question1 is: What prevents such an optimization?

Googling didn't work, I got "associative arrays" and "associative operators" only.

Update

I polished up my benchmark a little bit and can provide some results now. There's no speedup beyond unrolling 4 times, probably because of multiplication and addition together taking 4 cycles.

Update 2

As Java already unrolls the loop, all the hard work is done. What we get is something like

...pre-loop
for (int i=0; i<N; i+=2) {
    a2 = M1 * a + t[i];
    a = M1 * a2 + t[i+1];
}
...post-loop

where the interesting part can be rewritten like

    a = M1 * ((M1 * a) + t[i]) + t[i+1]; // latency 2mul + 2add

This reveals that there are 2 multiplications and 2 additions, all of them to be performed sequentially, thus needing 8 cycles on a modern x86 CPU. All we need now is some primary school math (working for ints even in case of overflow or whatever, but not applicable to floating point).

    a = ((M1 * (M1 * a)) + (M1 * t[i])) + t[i+1]; // latency 2mul + 2add

So far we gained nothing, but it allows us to fold the constants

    a = ((M2 * a) + (M1 * t[i])) + t[i+1]; // latency 1mul + 2add

and gain even more by regrouping the sum

    a = (M2 * a) + ((M1 * t[i]) + t[i+1]); // latency 1mul + 1add

解决方案

Here is how I understand your two cases: In the first case, you have a loop that takes N steps; in the second case you manually merged two consecutive iterations of the first case into one, so you only need to do N/2 steps in the second case. Your second case runs faster and you are wondering why the dumb compiler couldn't do it automatically.

There is nothing that would prevent the compiler from doing such an optimization. But please note that this re-write of the original loop leads to larger executable size: You have more instructions inside the for loop and the additional if after the loop.
If N=1 or N=3, the original loop is likely to be faster (less branching and better caching/prefetching/branch prediction). It made things faster in your case but it may make things slower in other cases. It is not a clear cut whether it is worth doing this optimization and it can be highly nontrivial to implement such an optimization in a compiler.

By the way, what you have done is very similar to loop vectorization but in your case, you did the parallel step manually and plugged-in the result. Eric Brumer's Compiler Confidential talk will give you insight why rewriting loops in general is tricky and what drawbacks / disadvantages there are (larger executable size, potentially slower in some cases). So compiler writers are very well aware of this optimization possibility and are actively working on it but it is highly nontrivial in general and can also make things slower.


Please try something for me:

int a = 0;
for (int i=0; i<N; ++i) 
  a = ((a<<5) - a) + t[i];

assuming M1=31. In principle, the compiler should be smart enough to rewrite 31*a into (a<<5)-a but I am curious if it really does that.

这篇关于这有什么错用关联性由编译器?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆