Java 8 的流:为什么并行流更慢? [英] Java 8's streams: why parallel stream is slower?

查看:54
本文介绍了Java 8 的流:为什么并行流更慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 Java 8 的流,但无法理解我得到的性能结果.我有 2 核 CPU(英特尔 i73520M)、Windows 8 x64 和 64 位 Java 8 update 5.我正在对字符串的流/并行流进行简单映射,发现并行版本有点慢.

I am playing with Java 8's streams and cannot understand the performance results I am getting. I have 2 core CPU (Intel i73520M), Windows 8 x64, and 64-bit Java 8 update 5. I am doing simple map over stream/parallel stream of Strings and found that parallel version is somewhat slower.

Function<Stream<String>, Long> timeOperation = (Stream<String> stream) -> {
  long time1 = System.nanoTime();
  final List<String> list = 
     stream
       .map(String::toLowerCase)
       .collect(Collectors.toList());
  long time2 = System.nanoTime();
  return time2 - time1;
};

Consumer<Stream<String>> printTime = stream ->
  System.out.println(timeOperation.apply(stream) / 1000000f);

String[] array = new String[1000000];
Arrays.fill(array, "AbabagalamagA");

printTime.accept(Arrays.stream(array));            // prints around 600
printTime.accept(Arrays.stream(array).parallel()); // prints around 900

考虑到我有 2 个 CPU 内核,并行版本不应该更快吗?有人能告诉我为什么并行版本慢吗?

Shouldn't the parallel version be faster, considering the fact that I have 2 CPU cores? Could someone give me a hint why parallel version is slower?

推荐答案

这里有几个问题同时发生.

There are several issues going on here in parallel, as it were.

首先,并行解决问题总是需要执行比顺序执行更多的实际工作.开销涉及在多个线程之间拆分工作并加入或合并结果.像将短字符串转换为小写这样的问题非常小,以至于它们有被并行拆分开销淹没的危险.

The first is that solving a problem in parallel always involves performing more actual work than doing it sequentially. Overhead is involved in splitting the work among several threads and joining or merging the results. Problems like converting short strings to lower-case are small enough that they are in danger of being swamped by the parallel splitting overhead.

第二个问题是Java程序的基准测试非常微妙,很容易得到令人困惑的结果.两个常见问题是 JIT 编译和死代码消除.简短的基准测试通常在 JIT 编译之前或期间完成,因此它们不是在测量峰值吞吐量,实际上它们可能是在测量 JIT 本身.编译发生的时间有些不确定,因此也可能导致结果差异很大.

The second issue is that benchmarking Java program is very subtle, and it is very easy to get confusing results. Two common issues are JIT compilation and dead code elimination. Short benchmarks often finish before or during JIT compilation, so they're not measuring peak throughput, and indeed they might be measuring the JIT itself. When compilation occurs is somewhat non-deterministic, so it may cause results to vary wildly as well.

对于小型的综合基准测试,工作负载通常会计算被丢弃的结果.JIT 编译器非常擅长检测这一点并消除不会产生在任何地方使用的结果的代码.在这种情况下这可能不会发生,但如果您修补其他合成工作负载,它肯定会发生.当然,如果 JIT 消除了基准测试工作量,它会使基准测试变得毫无用处.

For small, synthetic benchmarks, the workload often computes results that are thrown away. JIT compilers are quite good at detecting this and eliminating code that doesn't produce results that are used anywhere. This probably isn't happening in this case, but if you tinker around with other synthetic workloads, it can certainly happen. Of course, if the JIT eliminates the benchmark workload, it renders the benchmark useless.

我强烈建议使用完善的基准测试框架,例如 JMH 而不是手动滚动你自己的一个.JMH 拥有帮助避免常见的基准测试陷阱的工具,包括这些,并且设置和运行非常容易.这是转换为使用 JMH 的基准测试:

I strongly recommend using a well-developed benchmarking framework such as JMH instead of hand-rolling one of your own. JMH has facilities to help avoid common benchmarking pitfalls, including these, and it's pretty easy to set up and run. Here's your benchmark converted to use JMH:

package com.stackoverflow.questions;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

public class SO23170832 {
    @State(Scope.Benchmark)
    public static class BenchmarkState {
        static String[] array;
        static {
            array = new String[1000000];
            Arrays.fill(array, "AbabagalamagA");
        }
    }

    @GenerateMicroBenchmark
    @OutputTimeUnit(TimeUnit.SECONDS)
    public List<String> sequential(BenchmarkState state) {
        return
            Arrays.stream(state.array)
                  .map(x -> x.toLowerCase())
                  .collect(Collectors.toList());
    }

    @GenerateMicroBenchmark
    @OutputTimeUnit(TimeUnit.SECONDS)
    public List<String> parallel(BenchmarkState state) {
        return
            Arrays.stream(state.array)
                  .parallel()
                  .map(x -> x.toLowerCase())
                  .collect(Collectors.toList());
    }
}

我使用以下命令运行它:

I ran this using the command:

java -jar dist/microbenchmarks.jar ".*SO23170832.*" -wi 5 -i 5 -f 1

(这些选项表示 5 次预热迭代、5 次基准测试和 1 次分叉 JVM.)在运行期间,JMH 会发出大量冗长的消息,我已经省略了这些消息.总结结果如下.

(The options indicate five warmup iterations, five benchmark iterations, and one forked JVM.) During its run, JMH emits lots of verbose messages, which I've elided. The summary results are as follows.

Benchmark                       Mode   Samples         Mean   Mean error    Units
c.s.q.SO23170832.parallel      thrpt         5        4.600        5.995    ops/s
c.s.q.SO23170832.sequential    thrpt         5        1.500        1.727    ops/s

请注意,结果以每秒操作数为单位,因此看起来并行运行比顺序运行快三倍.但是我的机器只有两个内核.嗯.而且每次运行的平均误差实际上大于平均运行时间!哇?这里发生了一些可疑的事情.

Note that results are in ops per second, so it looks like the parallel run was about three times faster than the sequential run. But my machine has only two cores. Hmmm. And the mean error per run is actually larger than the mean runtime! WAT? Something fishy is going on here.

这给我们带来了第三个问题.更仔细地观察工作负载,我们可以看到它为每个输入分配一个新的 String 对象,并将结果收集到一个列表中,这涉及大量的重新分配和复制.我猜这会导致大量的垃圾收集.我们可以通过在启用 GC 消息的情况下重新运行基准测试来看到这一点:

This brings us to a third issue. Looking more closely at the workload, we can see that it allocates a new String object for each input, and it also collects the results into a list, which involves lots of reallocation and copying. I'd guess that this will result in a fair amount of garbage collection. We can see this by rerunning the benchmark with GC messages enabled:

java -verbose:gc -jar dist/microbenchmarks.jar ".*SO23170832.*" -wi 5 -i 5 -f 1

结果如下:

[GC (Allocation Failure)  512K->432K(130560K), 0.0024130 secs]
[GC (Allocation Failure)  944K->520K(131072K), 0.0015740 secs]
[GC (Allocation Failure)  1544K->777K(131072K), 0.0032490 secs]
[GC (Allocation Failure)  1801K->1027K(132096K), 0.0023940 secs]
# Run progress: 0.00% complete, ETA 00:00:20
# VM invoker: /Users/src/jdk/jdk8-b132.jdk/Contents/Home/jre/bin/java
# VM options: -verbose:gc
# Fork: 1 of 1
[GC (Allocation Failure)  512K->424K(130560K), 0.0015460 secs]
[GC (Allocation Failure)  933K->552K(131072K), 0.0014050 secs]
[GC (Allocation Failure)  1576K->850K(131072K), 0.0023050 secs]
[GC (Allocation Failure)  3075K->1561K(132096K), 0.0045140 secs]
[GC (Allocation Failure)  1874K->1059K(132096K), 0.0062330 secs]
# Warmup: 5 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.stackoverflow.questions.SO23170832.parallel
# Warmup Iteration   1: [GC (Allocation Failure)  7014K->5445K(132096K), 0.0184680 secs]
[GC (Allocation Failure)  7493K->6346K(135168K), 0.0068380 secs]
[GC (Allocation Failure)  10442K->8663K(135168K), 0.0155600 secs]
[GC (Allocation Failure)  12759K->11051K(139776K), 0.0148190 secs]
[GC (Allocation Failure)  18219K->15067K(140800K), 0.0241780 secs]
[GC (Allocation Failure)  22167K->19214K(145920K), 0.0208510 secs]
[GC (Allocation Failure)  29454K->25065K(147456K), 0.0333080 secs]
[GC (Allocation Failure)  35305K->30729K(153600K), 0.0376610 secs]
[GC (Allocation Failure)  46089K->39406K(154624K), 0.0406060 secs]
[GC (Allocation Failure)  54766K->48299K(164352K), 0.0550140 secs]
[GC (Allocation Failure)  71851K->62725K(165376K), 0.0612780 secs]
[GC (Allocation Failure)  86277K->74864K(184320K), 0.0649210 secs]
[GC (Allocation Failure)  111216K->94203K(185856K), 0.0875710 secs]
[GC (Allocation Failure)  130555K->114932K(199680K), 0.1030540 secs]
[GC (Allocation Failure)  162548K->141952K(203264K), 0.1315720 secs]
[Full GC (Ergonomics)  141952K->59696K(159232K), 0.5150890 secs]
[GC (Allocation Failure)  105613K->85547K(184832K), 0.0738530 secs]
1.183 ops/s

注意:以 # 开头的行是正常的 JMH 输出行.其余的都是 GC 消息.这只是五次预热迭代中的第一次,在五次基准迭代之前.在其余的迭代期间,GC 消息以相同的方式继续.我认为可以肯定地说,测量的性能主要由 GC 开销决定,报告的结果不应被相信.

Note: the lines beginning with # are normal JMH output lines. All the rest are GC messages. This is just the first of the five warmup iterations, which precedes five benchmark iterations. The GC messages continued in the same vein during the rest of the iterations. I think it's safe to say that the measured performance is dominated by GC overhead and that the results reported should not be believed.

目前还不清楚该怎么做.这纯粹是一个合成工作负载.与分配和复制相比,它显然只需要很少的 CPU 时间来完成实际工作.很难说你真正想在这里衡量什么.一种方法是提出在某种意义上更真实"的不同工作负载.另一种方法是更改​​堆和 GC 参数以避免在基准测试期间发生 GC.

At this point it's unclear what to do. This is purely a synthetic workload. It clearly involves very little CPU time doing actual work compared to allocation and copying. It's hard to say what you really are trying to measure here. One approach would be to come up with a different workload that is in some sense more "real." Another approach would be to change the heap and GC parameters to avoid GC during the benchmark run.

这篇关于Java 8 的流:为什么并行流更慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆