Java: manually-unrolled loop is still faster than the original loop. Why?


Question


Consider the following two snippets of code on an array of length 2:

boolean isOK(int i) {
    for (int j = 0; j < filters.length; ++j) {
        if (!filters[j].isOK(i)) {
            return false;
        }
    }
    return true;
}

and

boolean isOK(int i) {
     return filters[0].isOK(i) && filters[1].isOK(i);
}

I would assume that the performance of these two pieces should be similar after sufficient warm-up.
I've checked this using the JMH micro-benchmarking framework and observed that the second snippet is more than 10% faster.

Question: why hasn't Java optimized my first snippet using the basic loop unrolling technique?
In particular, I'd like to understand the following:

  1. I can easily produce code that is optimal for the 2-filter case and still works for any other number of filters (imagine a simple builder):
    return filters.length == 2 ? new FilterChain2(filters) : new FilterChain1(filters). Can JITC do the same, and if not, why? (See the sketch after this list.)
  2. Can JITC detect that 'filters.length==2' is the most frequent case and produce code that is optimal for this case after some warm-up? This should be almost as optimal as the manually-unrolled version.
  3. Can JITC detect that a particular instance is used very frequently and then produce code specialized for this specific instance (for which it knows that the number of filters is always 2)?
    Update: got an answer that JITC works only on a class level. OK, got it.
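
For reference, the builder from point 1 might look like the following sketch. It assumes FilterChain1 and FilterChain2 variants that accept the filter array in their constructors (the benchmark versions below build their own array), so the shape is illustrative:

static Filter buildChain(Filter[] filters) {
    // Dispatch the common two-filter case to a class whose isOK()
    // is manually unrolled; fall back to the generic loop otherwise.
    return filters.length == 2
            ? new FilterChain2(filters)
            : new FilterChain1(filters);
}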

Ideally, I would like to receive an answer from someone with a deep understanding of how JITC works.

Benchmark run details:

  • Tried on the latest versions of Java 8 OpenJDK and Oracle HotSpot; the results are similar
  • Used Java flags: -Xmx4g -Xms4g -server -Xbatch -XX:CICompilerCount=2 (got similar results without the fancy flags as well)
  • By the way, I get a similar run-time ratio if I simply run it several billion times in a plain loop (not via JMH), i.e. the second snippet is always clearly faster; see the sketch below
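
The last bullet refers to a plain timing loop instead of JMH; a minimal sketch of such a harness, assuming the Filter classes from the benchmark below are visible, might look like this:

public class NaiveTiming {
    public static void main(String[] args) {
        Filter chain = new FilterChain1();   // or new FilterChain2()
        long start = System.nanoTime();
        long hits = 0;
        for (long i = 0; i < 3_000_000_000L; ++i) {
            if (chain.isOK((int) i)) {
                ++hits;   // consume the result so the JIT cannot
            }             // eliminate the call as dead code
        }
        long elapsedNanos = System.nanoTime() - start;
        System.out.println(hits + " hits in " + elapsedNanos / 1e9 + " s");
    }
}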

Typical benchmark output:

Benchmark                            (filterIndex)  Mode  Cnt   Score   Error  Units
LoopUnrollingBenchmark.runBenchmark              0  avgt  400  44.202 ± 0.224  ns/op
LoopUnrollingBenchmark.runBenchmark              1  avgt  400  38.347 ± 0.063  ns/op

(The first line corresponds to the first snippet, the second line to the second.)

Complete benchmark code:

import org.openjdk.jmh.annotations.*;

import java.util.Random;
import java.util.concurrent.TimeUnit;

public class LoopUnrollingBenchmark {

    @State(Scope.Benchmark)
    public static class BenchmarkData {
        public Filter[] filters;
        @Param({"0", "1"})
        public int filterIndex;
        public int num;

        @Setup(Level.Invocation) // similar ratio with Level.Trial
        public void setUp() {
            filters = new Filter[]{new FilterChain1(), new FilterChain2()};
            num = new Random().nextInt();
        }
    }

    @Benchmark
    @Fork(warmups = 5, value = 20)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public int runBenchmark(BenchmarkData data) {
        Filter filter = data.filters[data.filterIndex];
        int sum = 0;
        int num = data.num;
        if (filter.isOK(num)) {
            ++sum;
        }
        if (filter.isOK(num + 1)) {
            ++sum;
        }
        if (filter.isOK(num - 1)) {
            ++sum;
        }
        if (filter.isOK(num * 2)) {
            ++sum;
        }
        if (filter.isOK(num * 3)) {
            ++sum;
        }
        if (filter.isOK(num * 5)) {
            ++sum;
        }
        return sum;
    }


    interface Filter {
        boolean isOK(int i);
    }

    static class Filter1 implements Filter {
        @Override
        public boolean isOK(int i) {
            return i % 3 == 1;
        }
    }

    static class Filter2 implements Filter {
        @Override
        public boolean isOK(int i) {
            return i % 7 == 3;
        }
    }

    static class FilterChain1 implements Filter {
        final Filter[] filters = createLeafFilters();

        @Override
        public boolean isOK(int i) {
            for (int j = 0; j < filters.length; ++j) {
                if (!filters[j].isOK(i)) {
                    return false;
                }
            }
            return true;
        }
    }

    static class FilterChain2 implements Filter {
        final Filter[] filters = createLeafFilters();

        @Override
        public boolean isOK(int i) {
            return filters[0].isOK(i) && filters[1].isOK(i);
        }
    }

    private static Filter[] createLeafFilters() {
        Filter[] filters = new Filter[2];
        filters[0] = new Filter1();
        filters[1] = new Filter2();
        return filters;
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}

Solution

TL;DR The main reason for the performance difference here is not related to loop unrolling. It is rather type speculation and inline caches.

Unrolling strategies

In fact, in HotSpot terminology, such loops are treated as counted, and in certain cases the JVM can unroll them. Not in your case, though.

HotSpot has two loop unrolling strategies: 1) unroll maximally, i.e. remove the loop altogether; or 2) glue several consecutive iterations together.
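
Hand-written Java can approximate what the two strategies produce for a counted loop (the JIT performs these transformations on its internal representation, not on source code):

class UnrollingShapes {
    // A counted loop with an exactly known trip count (4).
    static int original(int[] a) {
        int sum = 0;
        for (int j = 0; j < 4; ++j) {
            sum += a[j];
        }
        return sum;
    }

    // 1) Maximal unrolling: the loop is removed altogether.
    static int maximal(int[] a) {
        return a[0] + a[1] + a[2] + a[3];
    }

    // 2) Partial unrolling: two consecutive iterations are glued
    //    together per trip (n is assumed even here for brevity).
    static int partial(int[] a, int n) {
        int sum = 0;
        for (int j = 0; j < n; j += 2) {
            sum += a[j];
            sum += a[j + 1];
        }
        return sum;
    }
}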

Maximal unrolling can be done only if the exact number of iterations is known:

  if (!cl->has_exact_trip_count()) {
    // Trip count is not exact.
    return false;
  }

In your case, however, the function may return early after the first iteration.

Partial unrolling could probably be applied, but the following condition breaks unrolling:

  // Don't unroll if the next round of unrolling would push us
  // over the expected trip count of the loop.  One is subtracted
  // from the expected trip count because the pre-loop normally
  // executes 1 iteration.
  if (UnrollLimitForProfileCheck > 0 &&
      cl->profile_trip_cnt() != COUNT_UNKNOWN &&
      future_unroll_ct        > UnrollLimitForProfileCheck &&
      (float)future_unroll_ct > cl->profile_trip_cnt() - 1.0) {
    return false;
  }

Since in your case the expected trip count is less than 2, HotSpot assumes it is not worthwhile to unroll even two iterations. Note that the first iteration is extracted into a pre-loop anyway (the loop peeling optimization), so unrolling is indeed not very beneficial here.
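
Expressed as hand-written Java, peeling the first iteration of the original loop produces roughly this shape:

boolean isOK(int i) {
    // Peeled first iteration (the "pre-loop"):
    if (filters.length > 0 && !filters[0].isOK(i)) {
        return false;
    }
    // The main loop now starts at the second element.
    for (int j = 1; j < filters.length; ++j) {
        if (!filters[j].isOK(i)) {
            return false;
        }
    }
    return true;
}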

Type speculation

In your unrolled version, there are two different invokeinterface bytecodes. These sites have two distinct type profiles. The first receiver is always Filter1, and the second receiver is always Filter2. So you basically have two monomorphic call sites, and HotSpot can perfectly inline both calls via the so-called "inline cache", which has a 100% hit ratio in this case.
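
In source-level terms, the compiled unrolled version behaves roughly like the following guarded inline; isOKSlowPath() is a made-up name standing in for the deoptimization/fallback path:

boolean isOK(int i) {
    Filter f0 = filters[0];
    if (f0.getClass() != Filter1.class) {
        return isOKSlowPath(i);  // hypothetical fallback: guard failed,
    }                            // take a real call or deoptimize
    if (i % 3 != 1) {            // inlined body of Filter1.isOK()
        return false;
    }
    Filter f1 = filters[1];
    if (f1.getClass() != Filter2.class) {
        return isOKSlowPath(i);  // hypothetical fallback
    }
    return i % 7 == 3;           // inlined body of Filter2.isOK()
}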

With the loop, there is just one invokeinterface bytecode, and only one type profile is collected. The HotSpot JVM sees that filters[j].isOK() is called 86% of the time with a Filter1 receiver and 14% of the time with a Filter2 receiver. This makes it a bimorphic call. Fortunately, HotSpot can speculatively inline bimorphic calls, too: it inlines both targets with a conditional branch. However, in this case the hit ratio will be at most 86%, and performance will suffer from the corresponding mispredicted branches at the architecture level.
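
The loop version, by contrast, compiles to roughly this shape, where the conditional branch chooses the wrong side about 14% of the time (again a hand-written approximation, with isOKSlowPath() as a hypothetical fallback):

boolean isOK(int i) {
    for (int j = 0; j < filters.length; ++j) {
        Filter f = filters[j];
        boolean ok;
        if (f.getClass() == Filter1.class) {         // taken ~86%
            ok = i % 3 == 1;                         // inlined Filter1.isOK()
        } else if (f.getClass() == Filter2.class) {  // taken ~14%
            ok = i % 7 == 3;                         // inlined Filter2.isOK()
        } else {
            return isOKSlowPath(i);                  // hypothetical deopt fallback
        }
        if (!ok) {
            return false;
        }
    }
    return true;
}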

Things get even worse if you have 3 or more different filters. In that case, isOK() will be a megamorphic call, which HotSpot cannot inline at all. The compiled code will then contain a true interface call, which has a larger performance impact.
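
For example, a hypothetical third implementation, mixed into the same chain, would tip that call site into megamorphism:

// A made-up third implementation. If instances of Filter1, Filter2
// and Filter3 all flow through the same invokeinterface bytecode,
// its type profile becomes megamorphic and HotSpot stops inlining there.
static class Filter3 implements Filter {
    @Override
    public boolean isOK(int i) {
        return i % 5 == 2;
    }
}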

More about speculative inlining can be found in the article The Black Magic of (Java) Method Dispatch.

Conclusion

In order to inline virtual and interface calls, the HotSpot JVM collects type profiles per invoke bytecode. If there is a virtual call in a loop, there will be just one type profile for that call, whether or not the loop gets unrolled.

To get the best from the virtual call optimizations, you need to split the loop manually, primarily for the purpose of splitting the type profiles. So far, HotSpot cannot do this automatically.
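
For a chain of arbitrary length, such a manual split could look like the following sketch: the first two calls get their own invokeinterface bytecodes, and thus their own monomorphic type profiles, while the rest stay in the loop:

boolean isOK(int i) {
    // Site 1: sees only filters[0], so its profile stays monomorphic.
    if (filters.length > 0 && !filters[0].isOK(i)) {
        return false;
    }
    // Site 2: sees only filters[1].
    if (filters.length > 1 && !filters[1].isOK(i)) {
        return false;
    }
    // Site 3: any remaining elements share one profile, as before.
    for (int j = 2; j < filters.length; ++j) {
        if (!filters[j].isOK(i)) {
            return false;
        }
    }
    return true;
}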
