Why are floating point operations much faster with a warmup phase?


Question

I initially wanted to test something different with floating-point performance optimisation in Java, namely the performance difference between division by 5.0f and multiplication by 0.2f (multiplication seems to be slower without warm-up, but about 1.5 times faster with it).

After studying the results I noticed that I had forgotten to add a warm-up phase, as is so often suggested when doing performance optimisations, so I added it. And, to my utter surprise, the code turned out to be about 25 times faster on average over multiple test runs.

I tested it with the following code:

public static void main(String args[])
{
    float[] test = new float[10000];
    float[] test_copy;

    //warmup
    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);

        test_copy = test.clone();

        divideByTwo(test);
        multiplyWithOneHalf(test_copy);
    }

    long divisionTime = 0L;
    long multiplicationTime = 0L;

    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);

        test_copy = test.clone();

        divisionTime += divideByTwo(test);
        multiplicationTime += multiplyWithOneHalf(test_copy);
    }

    System.out.println("Divide by 5.0f: " + divisionTime);
    System.out.println("Multiply with 0.2f: " + multiplicationTime);
}

public static long divideByTwo(float[] data)
{
    long before = System.nanoTime();

    for (float f : data)
    {
        f /= 5.0f;
    }

    return System.nanoTime() - before;
}

public static long multiplyWithOneHalf(float[] data)
{
    long before = System.nanoTime();

    for (float f : data)
    {
        f *= 0.2f;
    }

    return System.nanoTime() - before;
}

public static void fillRandom(float[] data)
{
    Random random = new Random();

    for (float f : data)
    {
        f = random.nextInt() * random.nextFloat();
    }
}

Results without warm-up phase:

Divide by 5.0f: 382224
Multiply with 0.2f: 490765

Results with warm-up phase:

Divide by 5.0f: 22081
Multiply with 0.2f: 10885

Another interesting change that I cannot explain is the reversal in which operation is faster (division vs. multiplication). As mentioned earlier, without the warm-up the division seems to be a tad faster, while with the warm-up it seems to be twice as slow.

I tried adding an initialization block setting the values to something random, but it did not affect the results, and neither did adding multiple warm-up phases. The numbers on which the methods operate are the same, so that cannot be the reason.

What is the reason for this behaviour? What is this warm-up phase, and how does it influence the performance? Why are the operations so much faster with a warm-up phase, and why is there a reversal in which operation is faster?

Answer

Before the warm-up, Java will be running the byte codes via an interpreter; think how you would write a program that could execute Java byte codes in Java. After warm-up, Hotspot will have generated native assembler for the CPU that you are running on, making use of that CPU's feature set. There is a significant performance difference between the two: the interpreter will run many, many CPU instructions for a single byte code, whereas Hotspot generates native assembler code just as gcc does when compiling C code. That is, the difference between the time to divide and the time to multiply will ultimately be down to the CPU that one is running on, and each will be just a single CPU instruction.

The second part of the puzzle is that Hotspot also records statistics that measure the runtime behaviour of your code. When it decides to optimise the code, it uses those statistics to perform optimisations that are not necessarily possible at compilation time. For example, it can reduce the cost of null checks, branch mispredictions and polymorphic method invocation.

In short, one must discard the results pre-warmup.

Brian Goetz wrote a very good article here on this subject.
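
One way to see the effect described above for yourself is to time the same method in consecutive rounds within a single JVM run. The following is only a minimal sketch: the class name `WarmupRounds`, the helper names and the round sizes are my own choices for illustration and are not taken from the question. On a typical HotSpot JVM the first rounds tend to be noticeably slower than the later ones, but the actual figures depend on JVM version, flags and hardware, so none are shown here.

```java
import java.util.Arrays;

public class WarmupRounds {

    // Divides every element by 5.0f in place and returns the sum, so the
    // work has a visible result that cannot simply be discarded as dead code.
    static float divideAll(float[] data) {
        float sum = 0f;
        for (int i = 0; i < data.length; i++) {
            data[i] /= 5.0f;
            sum += data[i];
        }
        return sum;
    }

    // Times `rounds` consecutive batches of `callsPerRound` calls each and
    // returns the elapsed nanoseconds per batch.
    static long[] timeRounds(int rounds, int callsPerRound) {
        float[] data = new float[10_000];
        long[] nanos = new long[rounds];
        float sink = 0f;
        for (int r = 0; r < rounds; r++) {
            long before = System.nanoTime();
            for (int c = 0; c < callsPerRound; c++) {
                Arrays.fill(data, 1234.5f); // reset the input for every call
                sink += divideAll(data);
            }
            nanos[r] = System.nanoTime() - before;
        }
        // Printing the sink makes the accumulated results count as used.
        System.out.println("sink = " + sink);
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeRounds(10, 200);
        for (int r = 0; r < nanos.length; r++) {
            System.out.println("round " + r + ": " + nanos[r] + " ns");
        }
    }
}
```

Running this with the HotSpot flag `-XX:+PrintCompilation` should additionally log when `divideAll` gets compiled, which makes it easier to line up the drop in round times with the compiler's activity.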

========

APPENDED: overview of what 'JVM Warm-up' means

JVM 'warm up' is a loose phrase, and is no longer, strictly speaking, a single phase or stage of the JVM. People tend to use it to refer to the point where JVM performance stabilizes after the JVM byte codes have been compiled to native machine code. In truth, when one starts to scratch under the surface and delves deeper into the JVM internals, it is difficult not to be impressed by how much Hotspot is doing for us. My goal here is just to give you a better feel for what Hotspot can do in the name of performance; for more details I recommend reading articles by Brian Goetz, Doug Lea, John Rose, Cliff Click and Gil Tene (amongst many others).

As already mentioned, the JVM starts by running Java through its interpreter. While strictly speaking not 100% correct, one can think of an interpreter as a large switch statement and a loop that iterates over every JVM byte code (command). Each case within the switch statement is a JVM byte code, such as add two values together, invoke a method, invoke a constructor and so forth. The overhead of the iteration and of jumping around the commands is very large. Thus execution of a single command will typically use over 10x more assembly instructions, which means it runs more than 10x slower: the hardware has to execute so many more instructions, and the caches get polluted by interpreter code when ideally they would be focused on our actual program. Think back to the early days of Java, when Java earned its reputation for being very slow; this is because it was originally a purely interpreted language.
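
To make the "large switch statement in a loop" picture concrete, here is a toy stack-machine interpreter. It is a sketch only: the four opcodes (`PUSH`, `ADD`, `MUL`, `HALT`) and the class name `ToyInterpreter` are invented for illustration and are not real JVM byte codes, and the real interpreter is considerably more sophisticated. Note how many Java statements (fetch, dispatch, stack bookkeeping) run for what would be a single add or multiply instruction in compiled code.

```java
public class ToyInterpreter {

    // Hypothetical opcodes for illustration; these are not real JVM byte codes.
    static final int PUSH = 0; // push the next program word onto the stack
    static final int ADD = 1;  // pop two values, push their sum
    static final int MUL = 2;  // pop two values, push their product
    static final int HALT = 3; // stop and return the top of the stack

    static int run(int[] program) {
        int[] stack = new int[16];
        int sp = 0; // index of the next free stack slot
        int pc = 0; // index of the next program word to read

        while (true) {
            int opcode = program[pc++];       // fetch
            switch (opcode) {                 // dispatch
                case PUSH:
                    stack[sp++] = program[pc++];
                    break;
                case ADD: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a + b;
                    break;
                }
                case MUL: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a * b;
                    break;
                }
                case HALT:
                    return stack[sp - 1];
                default:
                    throw new IllegalStateException("unknown opcode " + opcode);
            }
        }
    }

    public static void main(String[] args) {
        // Encodes (2 + 3) * 4.
        int[] program = {PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT};
        System.out.println(run(program)); // prints 20
    }
}
```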

Later on, JIT compilers were added to Java. These compilers would compile Java methods to native CPU instructions just before the methods were invoked. This removed all of the overhead of the interpreter and allowed the code to be executed directly by the hardware. While execution on the hardware is much faster, this extra compilation created a stall on startup for Java, and this was partly where the terminology of 'warm up phase' took hold.

The introduction of Hotspot to the JVM was a game changer. Now the JVM would start up faster, because it would start life running Java programs with its interpreter, while individual Java methods were compiled in a background thread and swapped in on the fly during execution. The generation of native code could also be done at differing levels of optimisation, sometimes using very aggressive optimisations that are, strictly speaking, incorrect, and then de-optimising and re-optimising on the fly when necessary to ensure correct behaviour. For example, class hierarchies imply a large cost in figuring out which method will be called, as Hotspot has to search the hierarchy and locate the target method. Hotspot can become very clever here: if it notices that only one class has been loaded, it can assume that this will always be the case and optimise and inline methods accordingly. Should another class get loaded that tells Hotspot there is actually a decision between two methods to be made, it will discard its previous assumptions and recompile on the fly.

The full list of optimisations that can be made under different circumstances is very impressive, and is constantly changing. Hotspot's ability to record information and statistics about the environment that it is running in, and about the workload that it is currently experiencing, makes the optimisations that are performed very flexible and dynamic. In fact, it is quite possible that over the lifetime of a single Java process the code for that program will be regenerated many times over as the nature of its workload changes. This arguably gives Hotspot a large advantage over more traditional static compilation, and is largely why a lot of Java code can be considered to be just as fast as equivalent C code. It also makes understanding microbenchmarks a lot harder; in fact, it makes the JVM code itself much more difficult for the maintainers at Oracle to understand, work with and diagnose problems in.

Take a minute to raise a pint to those guys: Hotspot and the JVM as a whole are a fantastic engineering triumph that rose to the fore at a time when people were saying that it could not be done. It is worth remembering that, because after a decade or so it is quite a complex beast ;)
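
The class-hierarchy example above (a call site that has only ever seen one implementation, until a second one shows up) can be sketched like this. The names `CallSiteDemo`, `Shape`, `Square` and `Circle` are made up for illustration. Whether and how the JIT actually speculates, inlines and later de-optimises at this call site is an implementation detail that varies between JVM versions and flags, so treat this as a shape to experiment with rather than a guaranteed demonstration.

```java
public class CallSiteDemo {

    interface Shape {
        double area();
    }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Circle implements Shape {
        final double radius;
        Circle(double radius) { this.radius = radius; }
        public double area() { return Math.PI * radius * radius; }
    }

    // Contains the one virtual call site of interest: s.area().
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }

    public static void main(String[] args) {
        Shape[] squares = new Shape[1_000];
        for (int i = 0; i < squares.length; i++) {
            squares[i] = new Square(2.0);
        }

        double sink = 0;

        // Phase 1: the call site only ever sees Square, so an optimising
        // JIT is free to assume that and inline Square.area().
        for (int i = 0; i < 20_000; i++) {
            sink += totalArea(squares);
        }

        // Phase 2: a second implementation appears at the same call site,
        // which invalidates any "only Square" assumption made earlier.
        Shape[] mixed = squares.clone();
        mixed[0] = new Circle(1.0);
        for (int i = 0; i < 20_000; i++) {
            sink += totalArea(mixed);
        }

        System.out.println("sink = " + sink);
    }
}
```

If you run this with `-XX:+PrintCompilation`, look for `totalArea` being compiled during phase 1 and, possibly, being marked "made not entrant" and recompiled once phase 2 starts.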

So given that context, in summary: warming up a JVM in microbenchmarks means running the target code over 10k times and throwing the results away, so as to give the JVM a chance to collect statistics and to optimise the 'hot regions' of the code. 10k is a magic number because the Server Hotspot implementation waits for that many method invocations or loop iterations before it starts to consider optimisations. I would also advise having method calls between the core test runs: while Hotspot can do 'on stack replacement' (OSR), it is not common in real applications and it does not behave exactly the same as swapping out whole implementations of methods.
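
Putting the summary into practice, a warmed-up version of the benchmark from the question might look roughly like the sketch below. Several things here are my own assumptions rather than part of the original code: the class name `WarmedUpBenchmark`, the 20,000 warm-up calls (chosen simply to sit above the 10k figure), the fixed random seed, and the `sink` variable. One further point worth knowing: in Java, `for (float f : data) { f /= 5.0f; }` only assigns to the loop variable `f` and never writes back to the array, so the loops in the question do no observable work and an optimising compiler may remove them. The sketch uses indexed loops and consumes the results to avoid that.

```java
import java.util.Random;

public class WarmedUpBenchmark {

    static final int WARMUP_CALLS = 20_000;  // assumption: above the 10k threshold
    static final int MEASURED_CALLS = 1_000; // same count as in the question

    // Divides in place and returns a sum so the result is actually used.
    static float divide(float[] data) {
        float sum = 0f;
        for (int i = 0; i < data.length; i++) {
            data[i] /= 5.0f;
            sum += data[i];
        }
        return sum;
    }

    // Multiplies in place and returns a sum so the result is actually used.
    static float multiply(float[] data) {
        float sum = 0f;
        for (int i = 0; i < data.length; i++) {
            data[i] *= 0.2f;
            sum += data[i];
        }
        return sum;
    }

    static float[] randomData(int size, long seed) {
        Random random = new Random(seed);
        float[] data = new float[size];
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextInt() * random.nextFloat();
        }
        return data;
    }

    public static void main(String[] args) {
        float[] source = randomData(10_000, 42L);
        float[] forDivision = new float[source.length];
        float[] forMultiplication = new float[source.length];
        float sink = 0f;

        // Warm-up: same calls as the measured phase, timings thrown away.
        for (int i = 0; i < WARMUP_CALLS; i++) {
            System.arraycopy(source, 0, forDivision, 0, source.length);
            System.arraycopy(source, 0, forMultiplication, 0, source.length);
            sink += divide(forDivision);
            sink += multiply(forMultiplication);
        }

        long divisionTime = 0L;
        long multiplicationTime = 0L;

        // Measured phase: only the method calls themselves are timed.
        for (int i = 0; i < MEASURED_CALLS; i++) {
            System.arraycopy(source, 0, forDivision, 0, source.length);
            System.arraycopy(source, 0, forMultiplication, 0, source.length);

            long before = System.nanoTime();
            sink += divide(forDivision);
            divisionTime += System.nanoTime() - before;

            before = System.nanoTime();
            sink += multiply(forMultiplication);
            multiplicationTime += System.nanoTime() - before;
        }

        System.out.println("Divide by 5.0f: " + divisionTime);
        System.out.println("Multiply with 0.2f: " + multiplicationTime);
        System.out.println("sink = " + sink); // keeps the results observable
    }
}
```

Because each operation lives in its own method that is called many times, the compiled versions are swapped in as whole methods between calls, which is the situation recommended above in preference to relying on OSR inside one long-running loop.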
