Erratic performance of Arrays.stream().map().sum()

Problem description

I have chanced upon an exceedingly erratic performance profile of a very simple map/reduce operation on primitive arrays. Here is my JMH benchmark code:

import static java.util.Arrays.setAll;
import static java.util.concurrent.TimeUnit.MILLISECONDS;

import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import java.util.function.IntUnaryOperator;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.ARRAY_SIZE)
@Warmup(iterations = 300, time = 200, timeUnit = MILLISECONDS)
@Measurement(iterations = 1, time = 1000, timeUnit = MILLISECONDS)
@State(Scope.Thread)
@Threads(1)
@Fork(1)
public class Measure
{
  static final int ARRAY_SIZE = 1<<20;
  final int[] ds = new int[ARRAY_SIZE];

  private IntUnaryOperator mapper;

  @Setup public void setup() {
    // Fill the array with small random ints and capture a random multiplier
    // so the JIT cannot constant-fold the mapping function.
    setAll(ds, i -> (int) (Math.random() * (1<<7)));
    final int multiplier = (int) (Math.random() * 10);
    mapper = d -> multiplier * d;
  }

  @Benchmark public double multiply() {
    return Arrays.stream(ds).map(mapper).sum();
  }
}

And here are snippets of the typical output:

# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 300 iterations, 200 ms each
# Measurement: 1 iterations, 1000 ms each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.sample.Measure.multiply

# Run progress: 0,00% complete, ETA 00:01:01
# Fork: 1 of 1
# Warmup Iteration   1: 0,779 ns/op
# Warmup Iteration   2: 0,684 ns/op
# Warmup Iteration   3: 0,608 ns/op
# Warmup Iteration   4: 0,619 ns/op
# Warmup Iteration   5: 0,642 ns/op
# Warmup Iteration   6: 0,638 ns/op
# Warmup Iteration   7: 0,660 ns/op
# Warmup Iteration   8: 0,611 ns/op
# Warmup Iteration   9: 0,636 ns/op
# Warmup Iteration  10: 0,692 ns/op
# Warmup Iteration  11: 0,632 ns/op
# Warmup Iteration  12: 0,612 ns/op
# Warmup Iteration  13: 1,280 ns/op
# Warmup Iteration  14: 7,261 ns/op
# Warmup Iteration  15: 7,379 ns/op
# Warmup Iteration  16: 7,376 ns/op
# Warmup Iteration  17: 7,379 ns/op
# Warmup Iteration  18: 7,195 ns/op
# Warmup Iteration  19: 7,351 ns/op
# Warmup Iteration  20: 7,761 ns/op
....
....
....
# Warmup Iteration 100: 7,300 ns/op
# Warmup Iteration 101: 7,384 ns/op
# Warmup Iteration 102: 7,132 ns/op
# Warmup Iteration 103: 7,278 ns/op
# Warmup Iteration 104: 7,331 ns/op
# Warmup Iteration 105: 7,335 ns/op
# Warmup Iteration 106: 7,450 ns/op
# Warmup Iteration 107: 7,346 ns/op
# Warmup Iteration 108: 7,826 ns/op
# Warmup Iteration 109: 7,221 ns/op
# Warmup Iteration 110: 8,017 ns/op
# Warmup Iteration 111: 7,611 ns/op
# Warmup Iteration 112: 7,376 ns/op
# Warmup Iteration 113: 0,707 ns/op
# Warmup Iteration 114: 0,828 ns/op
# Warmup Iteration 115: 0,608 ns/op
# Warmup Iteration 116: 0,634 ns/op
# Warmup Iteration 117: 0,633 ns/op
# Warmup Iteration 118: 0,660 ns/op
# Warmup Iteration 119: 0,635 ns/op
# Warmup Iteration 120: 0,566 ns/op

The key moments happen at iterations 13 and 113: first the performance is degraded by a factor of ten, then it is restored. The corresponding times are 2.5 and 22.5 seconds into the test run. The timing of these events is very sensitive to the array size, BTW.

What can possibly explain this sort of behavior? The JIT compiler had probably done its work within the first iteration; there is no GC action to speak of (confirmed by VisualVM)... I am at a complete loss as to any kind of explanation.

My version of Java (OS X):

$ java -version
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

Answer

The JIT will first compile the hot loop that is iterating over and operating (map/reduce) on the array elements. This happens quite early on, since the array contains 2^20 elements: the loop body runs over a million times per benchmark invocation, so it crosses HotSpot's compilation/OSR thresholds almost immediately.
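
For reference, that hot loop lives in java.util.Spliterators$IntArraySpliterator::forEachRemaining, which shows up OSR-compiled in the log below. Its core, paraphrased from the JDK 8 sources (treat this as a sketch, not an exact copy), is a single tight loop over the backing array:

// Paraphrased core of Spliterators.IntArraySpliterator.forEachRemaining
// (JDK 8). All sequential stream evaluation over an int[] funnels into
// this one counted loop, so it gets hot very quickly.
public void forEachRemaining(IntConsumer action) {
  int[] a = array; int hi = fence; int i = index;
  index = hi;                     // mark the spliterator as exhausted
  if (a.length >= hi && i >= 0 && i < hi) {
    do { action.accept(a[i]); }   // accept() is the whole map/sum chain
    while (++i < hi);
  }
}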

Later on, the JIT compiles the stream pipeline itself, most likely inlined into the compiled benchmark method, and due to inlining limits it fails to compile everything into one method. Those inlining limits happen to be hit inside the hot loop, so the calls to the map and sum lambdas are no longer inlined there, and the hot loop is unintentionally "de-optimized".
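
To see why full inlining matters so much, here is a minimal hand-rolled sketch of my own (not part of the original benchmark) of what the pipeline reduces to once every stage is inlined. With everything flattened, HotSpot sees a plain counted loop over an int[] that it can optimize aggressively; when the innermost lambda calls stay real calls, every one of the 2^20 elements pays call overhead, consistent with the roughly tenfold swing:

// A hand-rolled equivalent of Arrays.stream(ds).map(mapper).sum(),
// reusing the ds and mapper fields of the Measure class above.
// Illustrative only; not from the original post.
@Benchmark public double multiplyHandRolled() {
  int sum = 0;
  for (int d : ds) {
    sum += mapper.applyAsInt(d); // monomorphic call site: inlines to multiplier*d
  }
  return sum;
}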

Use the options -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining when running the benchmark, and early on you should see output like the following:

   1202  487 %     4       java.util.Spliterators$IntArraySpliterator::forEachRemaining @ 49 (68 bytes)
                              @ 53   java.util.stream.IntPipeline$3$1::accept (23 bytes)   inline (hot)
                               -> TypeProfile (1186714/1186714 counts) = java/util/stream/IntPipeline$3$1
                                @ 12   test.Measure$$Lambda$2/1745776415::applyAsInt (9 bytes)   inline (hot)
                                 -> TypeProfile (1048107/1048107 counts) = test/Measure$$Lambda$2
                                  @ 5   test.Measure::lambda$setup$1 (4 bytes)   inline (hot)
                                @ 17   java.util.stream.ReduceOps$5ReducingSink::accept (19 bytes)   inline (hot)
                                 -> TypeProfile (1048107/1048107 counts) = java/util/stream/ReduceOps$5ReducingSink
                                  @ 10   java.util.stream.IntPipeline$$Lambda$3/1779653790::applyAsInt (6 bytes)   inline (hot)
                                   -> TypeProfile (1048064/1048064 counts) = java/util/stream/IntPipeline$$Lambda$3
                                    @ 2   java.lang.Integer::sum (4 bytes)   inline (hot)

That is the hot loop getting compiled. (The % means it is On-Stack Replaced, or OSR'ed.)
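
If you would rather not pass the flags by hand, JMH can forward them to the forked benchmark JVM. A sketch (my own, replacing the @Fork(1) annotation in the benchmark above):

// Attach the diagnostic flags to the forked JVM via JMH's @Fork
// annotation instead of the command line.
@Fork(value = 1, jvmArgsAppend = {
    "-XX:+UnlockDiagnosticVMOptions",
    "-XX:+PrintCompilation",
    "-XX:+PrintInlining"
})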

Later on, further compilation of the stream pipeline occurs (at, I suspect, ~10,000 iterations of the benchmark method, but I have not verified this):

                          @ 16   java.util.stream.IntPipeline::sum (11 bytes)   inline (hot)
                           -> TypeProfile (5120/5120 counts) = java/util/stream/IntPipeline$3
                            @ 2   java.lang.invoke.LambdaForm$MH/1279902262::linkToTargetMethod (8 bytes)   force inline by annotation
                              @ 4   java.lang.invoke.LambdaForm$MH/1847865997::identity (18 bytes)   force inline by annotation
                                @ 14   java.lang.invoke.LambdaForm$DMH/2024969684::invokeStatic_L_L (14 bytes)   force inline by annotation
                                  @ 1   java.lang.invoke.DirectMethodHandle::internalMemberName (8 bytes)   force inline by annotation
                                  @ 10   sun.invoke.util.ValueConversions::identity (2 bytes)   inline (hot)
                            @ 7   java.util.stream.IntPipeline::reduce (16 bytes)   inline (hot)
                              @ 3   java.util.stream.ReduceOps::makeInt (18 bytes)   inline (hot)
                                @ 1   java.util.Objects::requireNonNull (14 bytes)   inline (hot)
                                @ 14   java.util.stream.ReduceOps$5::<init> (16 bytes)   inline (hot)
                                  @ 12   java.util.stream.ReduceOps$ReduceOp::<init> (10 bytes)   inline (hot)
                                    @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
                              @ 6   java.util.stream.AbstractPipeline::evaluate (94 bytes)   inline (hot)
                                @ 50   java.util.stream.AbstractPipeline::isParallel (8 bytes)   inline (hot)
                                @ 80   java.util.stream.TerminalOp::getOpFlags (2 bytes)   inline (hot)
                                 -> TypeProfile (5122/5122 counts) = java/util/stream/ReduceOps$5
                                @ 85   java.util.stream.AbstractPipeline::sourceSpliterator (163 bytes)   inline (hot)
                                  @ 79   java.util.stream.AbstractPipeline::isParallel (8 bytes)   inline (hot)
                                @ 88   java.util.stream.ReduceOps$ReduceOp::evaluateSequential (18 bytes)   inline (hot)
                                  @ 2   java.util.stream.ReduceOps$5::makeSink (5 bytes)   inline (hot)
                                    @ 1   java.util.stream.ReduceOps$5::makeSink (16 bytes)   inline (hot)
                                      @ 12   java.util.stream.ReduceOps$5ReducingSink::<init> (15 bytes)   inline (hot)
                                        @ 11   java.lang.Object::<init> (1 bytes)   inline (hot)
                                  @ 6   java.util.stream.AbstractPipeline::wrapAndCopyInto (18 bytes)   inline (hot)
                                    @ 3   java.util.Objects::requireNonNull (14 bytes)   inline (hot)
                                    @ 9   java.util.stream.AbstractPipeline::wrapSink (37 bytes)   inline (hot)
                                      @ 1   java.util.Objects::requireNonNull (14 bytes)   inline (hot)
                                      @ 23   java.util.stream.IntPipeline$3::opWrapSink (10 bytes)   inline (hot)
                                       -> TypeProfile (4868/4868 counts) = java/util/stream/IntPipeline$3
                                        @ 6   java.util.stream.IntPipeline$3$1::<init> (11 bytes)   inline (hot)
                                          @ 7   java.util.stream.Sink$ChainedInt::<init> (16 bytes)   inline (hot)
                                            @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
                                            @ 6   java.util.Objects::requireNonNull (14 bytes)   inline (hot)
                                    @ 13   java.util.stream.AbstractPipeline::copyInto (53 bytes)   inline (hot)
                                      @ 1   java.util.Objects::requireNonNull (14 bytes)   inline (hot)
                                      @ 9   java.util.stream.AbstractPipeline::getStreamAndOpFlags (5 bytes)   accessor
                                      @ 12   java.util.stream.StreamOpFlag::isKnown (19 bytes)   inline (hot)
                                      @ 20   java.util.Spliterator::getExactSizeIfKnown (25 bytes)   inline (hot)
                                       -> TypeProfile (4870/4870 counts) = java/util/Spliterators$IntArraySpliterator
                                        @ 1   java.util.Spliterators$IntArraySpliterator::characteristics (5 bytes)   accessor
                                        @ 19   java.util.Spliterators$IntArraySpliterator::estimateSize (11 bytes)   inline (hot)
                                      @ 25   java.util.stream.Sink$ChainedInt::begin (11 bytes)   inline (hot)
                                       -> TypeProfile (4870/4870 counts) = java/util/stream/IntPipeline$3$1
                                        @ 5   java.util.stream.ReduceOps$5ReducingSink::begin (9 bytes)   inline (hot)
                                         -> TypeProfile (4871/4871 counts) = java/util/stream/ReduceOps$5ReducingSink
                                      @ 32   java.util.Spliterator$OfInt::forEachRemaining (53 bytes)   inline (hot)
                                        @ 12   java.util.Spliterators$IntArraySpliterator::forEachRemaining (68 bytes)   inline (hot)
                                          @ 53   java.util.stream.IntPipeline$3$1::accept (23 bytes)   inline (hot)
                                            @ 12   test.Measure$$Lambda$2/1745776415::applyAsInt (9 bytes)   inline (hot)
                                             -> TypeProfile (1048107/1048107 counts) = test/Measure$$Lambda$2
                                              @ 5   test.Measure::lambda$setup$1 (4 bytes)   inlining too deep
                                            @ 17   java.util.stream.ReduceOps$5ReducingSink::accept (19 bytes)   inline (hot)
                                             -> TypeProfile (1048107/1048107 counts) = java/util/stream/ReduceOps$5ReducingSink
                                              @ 10   java.util.stream.IntPipeline$$Lambda$3/1779653790::applyAsInt (6 bytes)   inlining too deep
                                               -> TypeProfile (1048064/1048064 counts) = java/util/stream/IntPipeline$$Lambda$3
                                          @ 53   java.util.stream.IntPipeline$3$1::accept (23 bytes)   inline (hot)
                                            @ 12   test.Measure$$Lambda$2/1745776415::applyAsInt (9 bytes)   inline (hot)
                                             -> TypeProfile (1048107/1048107 counts) = test/Measure$$Lambda$2
                                              @ 5   test.Measure::lambda$setup$1 (4 bytes)   inlining too deep
                                            @ 17   java.util.stream.ReduceOps$5ReducingSink::accept (19 bytes)   inline (hot)
                                             -> TypeProfile (1048107/1048107 counts) = java/util/stream/ReduceOps$5ReducingSink
                                              @ 10   java.util.stream.IntPipeline$$Lambda$3/1779653790::applyAsInt (6 bytes)   inlining too deep
                                               -> TypeProfile (1048064/1048064 counts) = java/util/stream/IntPipeline$$Lambda$3
                                      @ 38   java.util.stream.Sink$ChainedInt::end (10 bytes)   inline (hot)
                                        @ 4   java.util.stream.Sink::end (1 bytes)   inline (hot)
                                         -> TypeProfile (5120/5120 counts) = java/util/stream/ReduceOps$5ReducingSink
                                  @ 12   java.util.stream.ReduceOps$5ReducingSink::get (5 bytes)   inline (hot)
                                    @ 1   java.util.stream.ReduceOps$5ReducingSink::get (8 bytes)   inline (hot)
                                      @ 4   java.lang.Integer::valueOf (32 bytes)   inline (hot)
                                        @ 28   java.lang.Integer::<init> (10 bytes)   inline (hot)
                                          @ 1   java.lang.Number::<init> (5 bytes)   inline (hot)
                                            @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
                              @ 12   java.lang.Integer::intValue (5 bytes)   accessor

Note the "inlining too deep" that occurs for methods in the hot loop.

Even later on, the generated JMH measurement loop is compiled:

  26857  685       3       test.generated.Measure_multiply::multiply_avgt_jmhLoop (55 bytes)
                              @ 7   java.lang.System::nanoTime (0 bytes)   intrinsic
                              @ 16   test.Measure::multiply (23 bytes)
                                @ 4   java.util.Arrays::stream (8 bytes)
                                  @ 4   java.util.Arrays::stream (11 bytes)
                                    @ 3   java.util.Arrays::spliterator (10 bytes)
                                      @ 6   java.util.Spliterators::spliterator (25 bytes)   callee is too large
                                    @ 7   java.util.stream.StreamSupport::intStream (14 bytes)
                                      @ 6   java.util.stream.StreamOpFlag::fromCharacteristics (37 bytes)   callee is too large
                                      @ 10   java.util.stream.IntPipeline$Head::<init> (8 bytes)
                                        @ 4   java.util.stream.IntPipeline::<init> (8 bytes)
                                          @ 4   java.util.stream.AbstractPipeline::<init> (55 bytes)   callee is too large
                                @ 11   java.util.stream.IntPipeline::map (26 bytes)
                                  @ 1   java.util.Objects::requireNonNull (14 bytes)
                                    @ 8   java.lang.NullPointerException::<init> (5 bytes)   don't inline Throwable constructors
                                  @ 22   java.util.stream.IntPipeline$3::<init> (20 bytes)
                                    @ 16   java.util.stream.IntPipeline$StatelessOp::<init> (29 bytes)   callee is too large
                                @ 16   java.util.stream.IntPipeline::sum (11 bytes)
                                  @ 2   java.lang.invoke.LambdaForm$MH/1279902262::linkToTargetMethod (8 bytes)   force inline by annotation
                                    @ 4   java.lang.invoke.LambdaForm$MH/1847865997::identity (18 bytes)   force inline by annotation
                                      @ 14   java.lang.invoke.LambdaForm$DMH/2024969684::invokeStatic_L_L (14 bytes)   force inline by annotation
                                        @ 1   java.lang.invoke.DirectMethodHandle::internalMemberName (8 bytes)   force inline by annotation
                                        @ 10   sun.invoke.util.ValueConversions::identity (2 bytes)
                                  @ 7   java.util.stream.IntPipeline::reduce (16 bytes)
                                    @ 3   java.util.stream.ReduceOps::makeInt (18 bytes)
                                      @ 1   java.util.Objects::requireNonNull (14 bytes)
                                      @ 14   java.util.stream.ReduceOps$5::<init> (16 bytes)
                                        @ 12   java.util.stream.ReduceOps$ReduceOp::<init> (10 bytes)
                                          @ 1   java.lang.Object::<init> (1 bytes)
                                    @ 6   java.util.stream.AbstractPipeline::evaluate (94 bytes)   callee is too large
                                    @ 12   java.lang.Integer::intValue (5 bytes)

Note that there is no attempt to inline the whole stream pipeline this time; compilation stops well before it reaches the hot loop (see "callee is too large"). The hot loop therefore stands alone again and gets re-optimized, which restores the original performance.

The inline limit can be increased to avoid such behaviour, for example with -XX:MaxInlineLevel=12.
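
As a sketch (my own), the flag can likewise be scoped to just this benchmark through JMH's @Fork annotation; if I recall correctly, HotSpot 8's default for MaxInlineLevel is 9, so 12 gives the pipeline a few extra levels of headroom:

// Raise the inline depth limit for this benchmark's forked JVM only.
// (HotSpot 8 defaults to -XX:MaxInlineLevel=9.)
@Fork(value = 1, jvmArgsAppend = {"-XX:MaxInlineLevel=12"})
@Benchmark public double multiply() {
  return Arrays.stream(ds).map(mapper).sum();
}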
