Minimizing Java function call overhead

Question

I have a piece of code where it appears, in every test I've run, that function calls have a significant amount of overhead. The code is a tight loop, performing a very simple function on each element of an array (containing 4-8 million ints).

Running code consisting mainly of

for (int y = 1; y < h; ++y) {
    for (int x = 1; x < w; ++x) {
        final int p = y * s + x;
        n[p] = f.apply(d, s, x, y);
    }
}

performing something like

(final int[] d, final int s, final int x, final int y) -> {
    final int p = s * y + x;
    final int a = d[p] * 2
                + d[p - 1]
                + d[p + 1]
                + d[p - s]
                + d[p + s];
    return (1000 * (a + 500)) / 6000;
};

on various machines (my work laptop, a W530 with i7 3840QM, a server VM with one core of a Xeon E5-1620, and a Digital Ocean node with one core of an unknown CPU), I repeatedly get a statistically significant performance hit when calling a method vs inlining. All tests were performed on Java 1.8.0_11 (Java HotSpot(TM) 64-Bit Server VM).
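For readers who want to reproduce the comparison, here is a minimal sketch of the two variants being measured: the kernel called through an interface and the same arithmetic written directly in the loop. The `Shader` interface shape, the class and method names, and the padded buffer layout (`s = w + 1`, one extra row) are assumptions made for illustration, not the original benchmark code.

```java
import java.util.Arrays;

final class InlineVsShaderSketch {
    // Hypothetical functional interface; its shape mirrors the lambda in the question.
    @FunctionalInterface
    interface Shader {
        int apply(int[] d, int s, int x, int y);
    }

    // The blur kernel from the question, expressed as a lambda.
    static final Shader BLUR = (d, s, x, y) -> {
        final int p = s * y + x;
        final int a = d[p] * 2 + d[p - 1] + d[p + 1] + d[p - s] + d[p + s];
        return (1000 * (a + 500)) / 6000;
    };

    // Variant A: one interface call per element.
    static int[] processWithShader(int[] d, int s, int w, int h, Shader f) {
        final int[] n = new int[d.length];
        for (int y = 1; y < h; ++y) {
            for (int x = 1; x < w; ++x) {
                final int p = y * s + x;
                n[p] = f.apply(d, s, x, y);
            }
        }
        return n;
    }

    // Variant B: the same arithmetic written directly in the loop body.
    static int[] processInline(int[] d, int s, int w, int h) {
        final int[] n = new int[d.length];
        for (int y = 1; y < h; ++y) {
            for (int x = 1; x < w; ++x) {
                final int p = y * s + x;
                final int a = d[p] * 2 + d[p - 1] + d[p + 1] + d[p - s] + d[p + s];
                n[p] = (1000 * (a + 500)) / 6000;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Assumed layout: stride s = w + 1 and one extra row, so p + 1 and p + s stay in bounds.
        final int w = 4, h = 4, s = w + 1;
        final int[] d = new int[s * (h + 1)];
        for (int i = 0; i < d.length; i++) {
            d[i] = i % 7;
        }
        // Both variants compute the same values; only the call structure differs.
        System.out.println(Arrays.equals(processWithShader(d, s, w, h, BLUR),
                                         processInline(d, s, w, h)));
    }
}
```

The sketch only shows that the two variants are functionally identical; any timing comparison should still go through a harness such as JMH, as in the results below.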

Work machine:

Benchmark                               Mode   Samples        Score  Score error    Units
c.s.q.ShaderBench.testProcessInline    thrpt       200       40.860        0.184    ops/s
c.s.q.ShaderBench.testProcessLambda    thrpt       200       22.603        0.159    ops/s
c.s.q.ShaderBench.testProcessProc      thrpt       200       22.792        0.117    ops/s

Dedicated server, VM:

Benchmark                               Mode   Samples        Score  Score error    Units
c.s.q.ShaderBench.testProcessInline    thrpt       200       40.685        0.224    ops/s
c.s.q.ShaderBench.testProcessLambda    thrpt       200       16.077        0.113    ops/s
c.s.q.ShaderBench.testProcessProc      thrpt       200       23.827        0.088    ops/s

DO VPS:

Benchmark                               Mode   Samples        Score  Score error    Units
c.s.q.ShaderBench.testProcessInline    thrpt       200       24.425        0.506    ops/s
c.s.q.ShaderBench.testProcessLambda    thrpt       200        9.643        0.140    ops/s
c.s.q.ShaderBench.testProcessProc      thrpt       200       13.733        0.134    ops/s

All acceptable performance, but I am interested in figuring out why the call has such significant overhead and what can be done to optimize that. Currently experimenting with different sets of parameters.

Inlining all the potential operations would be difficult, but theoretically possible. For close to a 2x performance increase, potentially worth it, but maintenance would be a nightmare.

I'm not sure if there's a reasonable way to batch up a set of repetitions; most of the operations take multiple inputs (unknown to the caller) and produce a single output.

What other options do I have for minimizing the overhead and evening out performance?

Recommended answer

A method call is not a problem since hot methods are often inlined. A virtual call is an issue.

In your code the type profiler is fooled by the initialization method Image.random. When Image.process is JIT-compiled for the first time, it is optimized for calling random.nextInt(). Subsequent invocations of Image.process therefore hit an inline-cache miss, followed by a non-optimized virtual call to Shader.apply.
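The pattern described above might look roughly like the following sketch. The asker's actual Image class is not shown in this post, so everything here is a guess at its shape: `randomViaProcess` stands in for Image.random, `process` for Image.process, and the buffer layout (`s = w + 1`, one extra row) is assumed. The point is only that a single `f.apply(...)` call site receives the random-filling shader first and the real shader later.

```java
import java.util.Random;

final class ProfilePollutionSketch {
    // Hypothetical interface standing in for the Shader type named in the answer.
    interface Shader {
        int apply(int[] d, int s, int x, int y);
    }

    // The blur kernel from the question.
    static final Shader BLUR = (d, s, x, y) -> {
        final int p = s * y + x;
        final int a = d[p] * 2 + d[p - 1] + d[p + 1] + d[p - s] + d[p + s];
        return (1000 * (a + 500)) / 6000;
    };

    // One shared loop, hence one call site for f.apply(...).
    static void process(int[] src, int[] dst, int s, int w, int h, Shader f) {
        for (int y = 1; y < h; ++y) {
            for (int x = 1; x < w; ++x) {
                dst[y * s + x] = f.apply(src, s, x, y);
            }
        }
    }

    // Assumed shape of the initializer: it reuses process() with a shader that
    // only calls random.nextInt(), so the call site's type profile is collected
    // while this is the only receiver class it has seen.
    static int[] randomViaProcess(int s, int w, int h, long seed) {
        final Random random = new Random(seed);
        final int[] data = new int[s * (h + 1)];
        process(data, data, s, w, h, (d, stride, x, y) -> random.nextInt(256));
        return data;
    }

    public static void main(String[] args) {
        final int w = 4, h = 4, s = w + 1;
        final int[] src = randomViaProcess(s, w, h, 42L); // first receiver class at the call site
        final int[] dst = new int[src.length];
        process(src, dst, s, w, h, BLUR);                 // different receiver class, same call site
        System.out.println(dst[s + 1]);
    }
}
```

Whether HotSpot compiles `process` during initialization depends on how long that first loop runs, so this is an illustration of the mechanism rather than a guaranteed reproduction.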


  1. Remove the Image.process call from the initialization method; the JIT will then inline the useful calls to Shader.apply.

  2. After BlurShader.apply is inlined, you can help the JIT perform common subexpression elimination by replacing

final int p = s * y + x;

with

final int p = y * s + x;

The latter expression also appears in Image.process, so the JIT will not compute the same expression twice.
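Taken together, the two changes could be sketched as below. The class layout is hypothetical (the original Image and BlurShader sources are not shown here), as are the `random` signature and the padded buffer; the sketch fills the buffer directly instead of routing initialization through `process`, and writes the index as `y * s + x` in both places.

```java
import java.util.Random;

final class CleanProfileSketch {
    // Hypothetical interface standing in for the Shader type named in the answer.
    interface Shader {
        int apply(int[] d, int s, int x, int y);
    }

    // Named after the BlurShader mentioned in the answer; the body is the question's kernel.
    static final class BlurShader implements Shader {
        @Override
        public int apply(int[] d, int s, int x, int y) {
            final int p = y * s + x; // change 2: same operand order as in process()
            final int a = d[p] * 2 + d[p - 1] + d[p + 1] + d[p - s] + d[p + s];
            return (1000 * (a + 500)) / 6000;
        }
    }

    // The hot loop; with change 1 applied, its call site only sees real shaders.
    static void process(int[] src, int[] dst, int s, int w, int h, Shader f) {
        for (int y = 1; y < h; ++y) {
            for (int x = 1; x < w; ++x) {
                final int p = y * s + x;
                dst[p] = f.apply(src, s, x, y);
            }
        }
    }

    // Change 1: fill the buffer with a plain loop instead of calling process().
    static int[] random(int s, int h, long seed) {
        final Random random = new Random(seed);
        final int[] data = new int[s * (h + 1)]; // assumed layout: one extra row of padding
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextInt(256);
        }
        return data;
    }

    public static void main(String[] args) {
        final int w = 4, h = 4, s = w + 1;
        final int[] src = random(s, h, 42L);
        final int[] dst = new int[src.length];
        process(src, dst, s, w, h, new BlurShader());
        System.out.println(dst[s + 1]);
    }
}
```

Whether the JIT actually inlines `apply` and merges the two index computations can only be confirmed by measuring, which is what the benchmark below does.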

After applying these two changes I've achieved the ideal benchmark score:

Benchmark                           Mode   Samples         Mean   Mean error    Units
s.ShaderBench.testProcessInline    thrpt         5       36,483        1,255    ops/s
s.ShaderBench.testProcessLambda    thrpt         5       36,323        0,936    ops/s
s.ShaderBench.testProcessProc      thrpt         5       36,163        1,421    ops/s
