为什么lambda IntStream.anyMatch()比天真的实现慢10? [英] Why lambda IntStream.anyMatch() is 10 slower than naive implementation?

查看:498
本文介绍了为什么lambda IntStream.anyMatch()比天真的实现慢10?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近在分析我的代码并发现了一个有趣的瓶颈。以下是基准:

I was recently profiling my code and found one interesting bottleneck in it. Here is benchmark :

@BenchmarkMode(Mode.Throughput)
@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
public class Contains {

    private int[] ar = new int[] {1,2,3,4,5,6,7};

    private int val = 5;

    @Benchmark
    public boolean naive() {
        return contains(ar, val);
    }

    @Benchmark
    public boolean lambdaArrayStreamContains() {
        return Arrays.stream(ar).anyMatch(i -> i == val);
    }

    @Benchmark
    public boolean lambdaIntStreamContains() {
        return IntStream.of(ar).anyMatch(i -> i == val);
    }

    private static boolean contains(int[] ar, int value) {
        for (int arVal : ar) {
            if (arVal == value) {
                return true;
            }
        }
        return false;
    }

}

结果:

Benchmark                            Mode  Cnt       Score      Error  Units
Contains.lambdaArrayStreamContains  thrpt   10   22867.962 ± 1049.649  ops/s
Contains.lambdaIntStreamContains    thrpt   10   22983.800 ±  593.580  ops/s
Contains.naive                      thrpt   10  228002.406 ± 8591.186  ops/s

如果显示Array包含通过lambda的操作比使用简单循环的naive实现慢10倍。我知道lambdas应该慢一点。但是10次?我做错了lambda或者这是java的一些问题吗?

If shows that Array contains operation via lambda is 10 times slower than naive implementation with simple loop. I knew that lambdas should be a bit slower. But 10 times? Am I doing wrong lambda or this is some issue with java?

推荐答案

你的基准测试实际上没有测量 anyMatch 性能,而不是流开销。与一个非常简单的操作(如五元素数组查找)相比,这种开销可能会显得很大。

Your benchmark does not actually measure anyMatch performance, but rather a stream overhead. This overhead can appear significant when comparing to a very simple operation like a five-element array lookup.

如果我们从相对数字到绝对数字进行切换,那么减速看起来并不那么糟糕。让我们测量延迟而不是吞吐量,以获得更清晰的图像。我省略了 lambdaIntStream 基准,因为它与 lambdaArrayStream 完全相同。

The slowdown will not look so terrible if we swtich from relative to absolute numbers. Let's measure latency instead of throughput for a clearer picture. I've omitted lambdaIntStream benchmark since it works exactly the same way as lambdaArrayStream.

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  53,242 ± 2,034  ns/op
Contains.naive              avgt    5   5,876 ± 0,404  ns/op

5.8 ns大约是2.4 GHz CPU的14个周期。工作量非常小,任何额外的周期都会很明显。那么流操作的开销是多少?

5.8 ns is roughly 14 cycles of a 2.4 GHz CPU. The workload is so small that any extra cycle will be noticeable. So what is the overhead of stream operations?

现在用<$ c $重新运行基准测试c> -prof gc profiler。它将显示堆分配的数量:

Now rerun the benchmark with -prof gc profiler. It will show the amount of heap allocations:

Benchmark                                       Mode  Cnt     Score     Error   Units
Contains.lambdaArrayStream:·gc.alloc.rate.norm  avgt    5   152,000 ±   0,001    B/op
Contains.naive:·gc.alloc.rate.norm              avgt    5    ≈ 10⁻⁵              B/op

lambdaArrayStream 每次迭代分配152个字节,而天真基准没有任何分配。当然,分配不是免费的:至少有5个对象构造成支持 anyMatch ,每个对象需要几纳秒:

lambdaArrayStream allocates 152 bytes per iteration while naive benchmark allocates nothing. Of course, allocation is not free: there are at least 5 objects constructed to support anyMatch, and each takes several nanoseconds:


  • Lambda i - > i == val

  • IntPipeline.Head

  • Spliterators.IntArraySpliterator

  • MatchOps.MatchOp

  • MatchOps.MatchSink

  • Lambda i -> i == val
  • IntPipeline.Head
  • Spliterators.IntArraySpliterator
  • MatchOps.MatchOp
  • MatchOps.MatchSink

java.util.stream 实现有点复杂,因为它必须支持流源,中间和终端操作的所有组合。如果您在基准测试中查看 anyMatch 的调用堆栈,您会看到类似的内容:

java.util.stream implementation is a bit complicated since it must support all combinations of stream sources, intermediate and terminal operations. If you look at the call stack of anyMatch in your benchmark, you'll see something like that:

    at bench.Contains.lambda$lambdaArrayStream$0(Contains.java:24)
    at java.util.stream.MatchOps$2MatchSink.accept(MatchOps.java:119)
    at java.util.Spliterators$IntArraySpliterator.tryAdvance(Spliterators.java:1041)
    at java.util.stream.IntPipeline.forEachWithCancel(IntPipeline.java:162)
    at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:230)
    at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:196)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.IntPipeline.anyMatch(IntPipeline.java:477)
    at bench.Contains.lambdaArrayStream(Contains.java:23)

并非所有这些方法调用都可以内联。此外,JVM将内联限制为9个级别,但在这里我们看到更深层次的调用堆栈。如果我们用 -XX:MaxInlineLevel = 20 覆盖限制,则分数会变得更好:

Not all of these method calls can be inlined. Furthermore, JVM limits inlining to 9 levels, but here we see the deeper call stack. If we override the limit with -XX:MaxInlineLevel=20 the score will become a bit better:

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  33,294 ± 0,367  ns/op  (was 53,242)
Contains.naive              avgt    5   5,822 ± 0,207  ns/op



循环优化



for 对数组进行迭代是一个微不足道的计数循环。 JVM可以在这里应用各种循环优化:循环剥离,循环展开等。这不适用于 -kind循环在 forEachWithCancel 方法,用于遍历IntStream。循环优化的效果可以用 -XX来衡量:LoopUnrollLimit = 0 -XX:-UseLoopPredicate

Loop optimizations

for iteration over an array is a trivial counted loop. JVM can apply a wide range of loop optimizatons here: loop peeling, loop unrolling etc. This does not work for a while-kind loop in forEachWithCancel method, which is used to traverse an IntStream. The effect of loop optimizations can be measured with -XX:LoopUnrollLimit=0 -XX:-UseLoopPredicate:

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  33,153 ± 0,559  ns/op
Contains.naive              avgt    5   9,853 ± 0,150  ns/op  (was 5,876)



结论



构造和遍历流的一些开销,但这是完全理解的,不能被视为错误。我不会说开销很大(甚至50 ns / op也不是那么多);但是,在这个特定的例子中,由于工作量极小,开销占主导地位。

Conclusions

There is some overhead to construct and to traverse a stream, but this is completely understood and cannot be considered a bug. I would not say the overhead is big (even 50 ns/op is not that much); however, in this particular example the overhead is dominant because of an extremely tiny workload.

这篇关于为什么lambda IntStream.anyMatch()比天真的实现慢10?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆