Java: manually-unrolled loop is still faster than the original loop. Why?
Question
Consider the following two snippets of code on an array of length 2:
boolean isOK(int i) {
    for (int j = 0; j < filters.length; ++j) {
        if (!filters[j].isOK(i)) {
            return false;
        }
    }
    return true;
}
and
boolean isOK(int i) {
    return filters[0].isOK(i) && filters[1].isOK(i);
}
I would assume that the performance of these two pieces should be similar after sufficient warm-up.
Question: why hasn't Java optimized my first snippet using the basic loop unrolling technique? Ideally, I would like to receive an answer from someone with a deep understanding of how JITC works.
I've checked this using JMH micro-benchmarking framework as described e.g. here and here and observed that the second snippet is more than 10% faster.
In particular, I'd like to understand the following:

    return filters.length == 2 ? new FilterChain2(filters) : new FilterChain1(filters);

Can JITC do the same and if not, why?
Update: got an answer that JITC works only on a class level. OK, got it.
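The dispatch from the snippet above can be sketched as a small factory. Note that the benchmark's actual `FilterChain1`/`FilterChain2` build their own filter arrays, so the constructor-taking variants and the `FilterChains` class below are hypothetical, shaped to match the asker's one-liner:

```java
interface Filter {
    boolean isOK(int i);
}

// Generic chain: loops over the filters (the first snippet's shape).
class FilterChain1 implements Filter {
    private final Filter[] filters;
    FilterChain1(Filter[] filters) { this.filters = filters; }
    public boolean isOK(int i) {
        for (int j = 0; j < filters.length; ++j) {
            if (!filters[j].isOK(i)) return false;
        }
        return true;
    }
}

// Hand-unrolled chain for exactly two filters (the second snippet's shape).
class FilterChain2 implements Filter {
    private final Filter[] filters;
    FilterChain2(Filter[] filters) { this.filters = filters; }
    public boolean isOK(int i) {
        return filters[0].isOK(i) && filters[1].isOK(i);
    }
}

class FilterChains {
    // The manual dispatch from the question: pick the unrolled variant
    // when the chain length is known to be 2 at construction time.
    static Filter create(Filter[] filters) {
        return filters.length == 2 ? new FilterChain2(filters)
                                   : new FilterChain1(filters);
    }
}
```

The length check happens once, at construction time, so every later `isOK` call goes straight to the specialized class.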
Benchmark run details. Typical benchmark output:

Benchmark                            (filterIndex)  Mode  Cnt   Score   Error  Units
LoopUnrollingBenchmark.runBenchmark              0  avgt  400  44.202 ± 0.224  ns/op
LoopUnrollingBenchmark.runBenchmark              1  avgt  400  38.347 ± 0.063  ns/op

(The first line corresponds to the first snippet, the second line to the second.) Complete benchmark code:

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

public class LoopUnrollingBenchmark {
    @State(Scope.Benchmark)
    public static class BenchmarkData {
        public Filter[] filters;

        @Param({"0", "1"})
        public int filterIndex;
        public int num;

        @Setup(Level.Invocation) //similar ratio with Level.TRIAL
        public void setUp() {
            filters = new Filter[]{new FilterChain1(), new FilterChain2()};
            num = new Random().nextInt();
        }
    }

    @Benchmark
    @Fork(warmups = 5, value = 20)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public int runBenchmark(BenchmarkData data) {
        Filter filter = data.filters[data.filterIndex];
        int sum = 0;
        int num = data.num;
        if (filter.isOK(num)) {
            ++sum;
        }
        if (filter.isOK(num + 1)) {
            ++sum;
        }
        if (filter.isOK(num - 1)) {
            ++sum;
        }
        if (filter.isOK(num * 2)) {
            ++sum;
        }
        if (filter.isOK(num * 3)) {
            ++sum;
        }
        if (filter.isOK(num * 5)) {
            ++sum;
        }
        return sum;
    }

    interface Filter {
        boolean isOK(int i);
    }

    static class Filter1 implements Filter {
        @Override
        public boolean isOK(int i) {
            return i % 3 == 1;
        }
    }

    static class Filter2 implements Filter {
        @Override
        public boolean isOK(int i) {
            return i % 7 == 3;
        }
    }

    static class FilterChain1 implements Filter {
        final Filter[] filters = createLeafFilters();

        @Override
        public boolean isOK(int i) {
            for (int j = 0; j < filters.length; ++j) {
                if (!filters[j].isOK(i)) {
                    return false;
                }
            }
            return true;
        }
    }

    static class FilterChain2 implements Filter {
        final Filter[] filters = createLeafFilters();

        @Override
        public boolean isOK(int i) {
            return filters[0].isOK(i) && filters[1].isOK(i);
        }
    }

    private static Filter[] createLeafFilters() {
        Filter[] filters = new Filter[2];
        filters[0] = new Filter1();
        filters[1] = new Filter2();
        return filters;
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}
TL;DR The main reason for the performance difference here is not loop unrolling. It is rather type speculation and inline caches.

Unrolling strategies

In fact, in HotSpot terminology, such loops are treated as counted, and in certain cases the JVM can unroll them. Not in your case though. HotSpot has two loop unrolling strategies: 1) unroll maximally, i.e. remove the loop altogether; or 2) glue several consecutive iterations together.

Maximal unrolling can be done only if the exact number of iterations is known:
if (!cl->has_exact_trip_count()) {
  // Trip count is not exact.
  return false;
}
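For contrast, a hypothetical loop that does satisfy this check has a compile-time-constant bound and no early exit, so its trip count is exact (unlike the `isOK` loop, which may return after the first iteration). The class and method names below are illustrative only:

```java
class ExactTripCount {
    // Constant bound, no early exit: the shape HotSpot can unroll
    // maximally, i.e. remove the loop altogether.
    static int sumFirstTwo(int[] a) {
        int s = 0;
        for (int j = 0; j < 2; ++j) {
            s += a[j];
        }
        return s;
    }
}
```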
In your case, however, the function may return early after the first iteration. Partial unrolling could probably be applied, but the following condition breaks unrolling:

// Don't unroll if the next round of unrolling would push us
// over the expected trip count of the loop. One is subtracted
// from the expected trip count because the pre-loop normally
// executes 1 iteration.
if (UnrollLimitForProfileCheck > 0 &&
    cl->profile_trip_cnt() != COUNT_UNKNOWN &&
    future_unroll_ct > UnrollLimitForProfileCheck &&
    (float)future_unroll_ct > cl->profile_trip_cnt() - 1.0) {
  return false;
}
Since in your case the expected trip count is less than 2, HotSpot assumes it is not worthwhile to unroll even two iterations. Note that the first iteration is extracted into the pre-loop anyway (loop peeling optimization), so unrolling is indeed not very beneficial here.

Type speculation

In your unrolled version, there are two different invokeinterface bytecodes. These sites have two distinct type profiles. The first receiver is always Filter1, and the second receiver is always Filter2. So, you basically have two monomorphic call sites, and HotSpot can perfectly inline both calls - the so-called "inline cache", which has a 100% hit ratio in this case.

With the loop, there is just one invokeinterface bytecode, and only one type profile is collected. The HotSpot JVM sees that filters[j].isOK() is called 86% of the time with a Filter1 receiver and 14% of the time with a Filter2 receiver. This is a bimorphic call. Fortunately, HotSpot can speculatively inline bimorphic calls, too: it inlines both targets with a conditional branch. However, in this case the hit ratio will be at most 86%, and performance will suffer from the corresponding mispredicted branches at the architecture level.

Things will be even worse if you have 3 or more different filters. In this case isOK() will be a megamorphic call which HotSpot cannot inline at all, so the compiled code will contain a true interface call, which has a larger performance impact.

More about speculative inlining in the article The Black Magic of (Java) Method Dispatch.

Conclusion

In order to inline virtual/interface calls, the HotSpot JVM collects type profiles per invoke bytecode. If there is a virtual call in a loop, there will be just one type profile for the call, no matter whether the loop is unrolled or not.

To get the best from the virtual call optimizations, you'd need to manually split the loop, primarily for the purpose of splitting type profiles. HotSpot cannot do this automatically so far.
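The manual split described above can be sketched as follows. `SplitFilterChain` is a hypothetical name, and `Filter1`/`Filter2` mirror the benchmark's filters; the point is that peeling the first call out of the loop gives each invokeinterface site its own, monomorphic type profile:

```java
interface Filter {
    boolean isOK(int i);
}

class Filter1 implements Filter {
    public boolean isOK(int i) { return i % 3 == 1; }
}

class Filter2 implements Filter {
    public boolean isOK(int i) { return i % 7 == 3; }
}

class SplitFilterChain implements Filter {
    final Filter[] filters = { new Filter1(), new Filter2() };

    public boolean isOK(int i) {
        // Peeled first iteration: this call site only ever sees the class
        // of filters[0], so its type profile stays monomorphic.
        if (!filters[0].isOK(i)) {
            return false;
        }
        // Separate bytecode site for the remaining filters, with its own
        // type profile (monomorphic here, since only Filter2 remains).
        for (int j = 1; j < filters.length; ++j) {
            if (!filters[j].isOK(i)) {
                return false;
            }
        }
        return true;
    }
}
```

Behaviorally this is identical to the original loop; the split only changes how many invoke bytecodes, and hence how many type profiles, the JVM sees.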