What's wrong with using associativity by compilers?


Sometimes associativity can be used to loosen data dependencies, and I was curious how much it can help. I was rather surprised to find that I can get a speed-up factor of nearly 4 by manually unrolling a trivial loop, both in Java (build 1.7.0_51-b13) and in C (gcc 4.4.3).

So either I'm doing something pretty stupid or the compilers ignore a powerful tool. I started with

int a = 0;
for (int i=0; i<N; ++i) a = M1 * a + t[i];

which computes something close to String.hashCode() (set M1=31 and use a char[]). The computation is pretty trivial, and for t.length=1000 it takes about 1.2 microseconds on my i5-2400 @ 3.10GHz (both in Java and C).
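As a concrete baseline, here is a minimal compilable sketch of that recurrence (the function name `hash31` is my own; `unsigned` is used so overflow is well-defined in C, which matches Java's wrapping int semantics mod 2^32):

```c
#include <stddef.h>

/* Baseline: a = M1*a + t[i], the String.hashCode() recurrence with M1 = 31.
 * unsigned keeps overflow well-defined in C; Java's int wraps identically. */
unsigned hash31(const char *t, size_t len) {
    unsigned a = 0;
    for (size_t i = 0; i < len; ++i)
        a = 31u * a + (unsigned char)t[i];
    return a;
}
```

For ASCII inputs this reproduces Java's String.hashCode() exactly, e.g. 96354 for "abc".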

Observe that every two steps, a gets multiplied by M2 = M1*M1 and something gets added. This leads to this piece of code:

int a = 0;
int i;
for (i = 0; i + 1 < N; i += 2) {
    a = M2 * a + (M1 * t[i] + t[i+1]); // <-- note the parentheses!
}
if (i < N) a = M1 * a + t[i]; // Handle odd length.

This is exactly twice as fast as the first snippet. Strangely, leaving out the parentheses eats 20% of the speed-up. Funnily enough, this can be repeated and a factor of 3.8 can be achieved.
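Under the same assumptions as before (unsigned arithmetic, function name `hash31_x2` mine), the two-way unrolled variant can be sketched as a complete function; it produces bit-identical results to the plain loop:

```c
#include <stddef.h>

/* Two-way unrolled variant: a = M2*a + (M1*t[i] + t[i+1]), with M2 = M1*M1.
 * The trailing if handles odd lengths with one plain step. */
unsigned hash31_x2(const char *t, size_t len) {
    const unsigned M1 = 31u, M2 = M1 * M1; /* 961 */
    unsigned a = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2)
        a = M2 * a + (M1 * (unsigned char)t[i] + (unsigned char)t[i+1]);
    if (i < len)
        a = M1 * a + (unsigned char)t[i];
    return a;
}
```

Because unsigned overflow is modular, the regrouping cannot change the result, only the dependency chain.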

Unlike Java, gcc -O3 chooses not to unroll the loop. That's a wise choice, since it wouldn't help anyway (as -funroll-all-loops shows).

So my first question is: What prevents such an optimization?

Googling didn't help; I got only "associative arrays" and "associative operators".

Update

I polished up my benchmark a bit and can now provide some results. There's no speedup beyond unrolling 4 times, probably because multiplication and addition together take 4 cycles.
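A four-way unroll follows the same pattern with precomputed powers of M1; here is a hedged sketch (names mine, unsigned to keep wraparound defined, leftover elements handled by the plain recurrence):

```c
#include <stddef.h>

/* Four-way unroll: a = M4*a + M3*t[i] + M2*t[i+1] + M1*t[i+2] + t[i+3],
 * where Mk = 31^k. Leftover 1-3 elements fall back to the plain loop. */
unsigned hash31_x4(const char *t, size_t len) {
    const unsigned M1 = 31u, M2 = M1*M1, M3 = M2*M1, M4 = M3*M1;
    unsigned a = 0;
    size_t i = 0;
    for (; i + 3 < len; i += 4)
        a = M4 * a + M3 * (unsigned char)t[i]   + M2 * (unsigned char)t[i+1]
                   + M1 * (unsigned char)t[i+2] +      (unsigned char)t[i+3];
    for (; i < len; ++i)
        a = M1 * a + (unsigned char)t[i];
    return a;
}
```

The four products in the body are independent of each other; only the multiply by M4 sits on the loop-carried dependency chain.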

Update 2

As Java already unrolls the loop, all the hard work is done. What we get is something like

...pre-loop
for (int i=0; i<N; i+=2) {
    a2 = M1 * a + t[i];
    a = M1 * a2 + t[i+1];
}
...post-loop

where the interesting part can be rewritten like

    a = M1 * ((M1 * a) + t[i]) + t[i+1]; // latency 2mul + 2add

This reveals that there are 2 multiplications and 2 additions, all of them performed sequentially, thus needing 8 cycles on a modern x86 CPU. All we need now is some primary-school math (valid for ints even in case of overflow, but not applicable to floating point).

    a = ((M1 * (M1 * a)) + (M1 * t[i])) + t[i+1]; // latency 2mul + 2add

So far we've gained nothing, but it allows us to fold the constants

    a = ((M2 * a) + (M1 * t[i])) + t[i+1]; // latency 1mul + 2add

and gain even more by regrouping the sum

    a = (M2 * a) + ((M1 * t[i]) + t[i+1]); // latency 1mul + 1add
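The first and last forms of this chain can be compared directly; a small sketch with unsigned arithmetic (so wraparound matches Java's int semantics) and function names of my choosing:

```c
/* First and last forms of the rewrite chain, for one arbitrary state
 * (a, x = t[i], y = t[i+1]). Both compute the same value mod 2^32. */
unsigned form_original(unsigned a, unsigned x, unsigned y) {
    const unsigned M1 = 31u;
    return M1 * ((M1 * a) + x) + y;        /* latency 2mul + 2add */
}
unsigned form_regrouped(unsigned a, unsigned x, unsigned y) {
    const unsigned M1 = 31u, M2 = M1 * M1;
    return (M2 * a) + ((M1 * x) + y);      /* latency 1mul + 1add */
}
```

Distributivity and associativity of modular arithmetic guarantee the two agree for every input; only the critical path differs.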

Solution

Here is how I understand your two cases: In the first case, you have a loop that takes N steps; in the second case you manually merged two consecutive iterations of the first case into one, so you only need to do N/2 steps in the second case. Your second case runs faster and you are wondering why the dumb compiler couldn't do it automatically.

There is nothing that would prevent the compiler from doing such an optimization. But note that this rewrite of the original loop leads to a larger executable: you have more instructions inside the for loop, plus an additional if after the loop.
If N=1 or N=3, the original loop is likely to be faster (fewer branches and better caching/prefetching/branch prediction). The rewrite made things faster in your case, but it may make things slower in other cases. It is not clear-cut whether this optimization is worth doing, and implementing it in a compiler can be highly nontrivial.

By the way, what you have done is very similar to loop vectorization, but in your case you did the parallel step manually and plugged in the result. Eric Brumer's Compiler Confidential talk will give you insight into why rewriting loops is tricky in general and what the drawbacks are (larger executable size, potentially slower in some cases). So compiler writers are very well aware of this optimization opportunity and are actively working on it, but it is highly nontrivial in general and can also make things slower.


Please try something for me:

int a = 0;
for (int i=0; i<N; ++i) 
  a = ((a<<5) - a) + t[i];

assuming M1=31. In principle, the compiler should be smart enough to rewrite 31*a as (a<<5)-a, but I am curious whether it actually does.
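For the record, that identity holds exactly under wrapping integer arithmetic; a one-line sketch (unsigned, function name mine):

```c
/* Strength reduction: 31*a == (a << 5) - a, exactly, including on wraparound,
 * since both sides are computed mod 2^32. */
unsigned mul31_shift(unsigned a) {
    return (a << 5) - a;
}
```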
