Will inner parallel streams be processed fully in parallel before parallelizing the outer stream is considered?

Problem description

From this link, I only partially understood that, at least at some point, there was a problem with nested parallel streams in Java. However, I couldn't deduce the answer to the following question:

Let's say I have an outer stream and an inner stream, both of which are parallel streams. According to my calculations, it would be more performant (due to data locality, i.e. caching in the L1/L2/L3 CPU caches) if the inner stream were processed fully in parallel first, and only then (if and only if CPU cores are available) the outer stream. I think this is true for most people's situations. So my question is:

Would Java execute the inner stream fully in parallel first, and then work on the outer stream? If so, does it make that decision at compile time or at run time? If at run time, is the JIT even smart enough to realize that if the inner stream has more than enough elements (e.g. hundreds) compared with the number of cores (32), it should definitely use all 32 cores for the inner stream before moving on to the next element of the outer stream, but that if the number of elements is small (e.g. < 32), it's OK to also process the inner stream of the next outer element in parallel?

Recommended answer

Maybe the following example program sheds some light on the issue:

import java.util.stream.Collectors;
import java.util.stream.IntStream;

// For each outer element, list the threads that processed its inner stream
IntStream.range(0, 10).parallel().mapToObj(i -> "outer "+i)
         .map(outer -> outer+"\t"+IntStream.range(0, 10).parallel()
            .mapToObj(inner -> Thread.currentThread())
            .distinct() // using the identity of the threads
            .map(Thread::getName) // just to be paranoid, as names might not be unique
            .sorted()
            .collect(Collectors.toList()) )
         .collect(Collectors.toList())
         .forEach(System.out::println);

Of course, the results will vary, but the output on my machine looks similar to this:

outer 0 [ForkJoinPool.commonPool-worker-6]
outer 1 [ForkJoinPool.commonPool-worker-3]
outer 2 [ForkJoinPool.commonPool-worker-1]
outer 3 [ForkJoinPool.commonPool-worker-1, ForkJoinPool.commonPool-worker-4, ForkJoinPool.commonPool-worker-5]
outer 4 [ForkJoinPool.commonPool-worker-5]
outer 5 [ForkJoinPool.commonPool-worker-2, ForkJoinPool.commonPool-worker-4, ForkJoinPool.commonPool-worker-7, main]
outer 6 [main]
outer 7 [ForkJoinPool.commonPool-worker-4]
outer 8 [ForkJoinPool.commonPool-worker-2]
outer 9 [ForkJoinPool.commonPool-worker-7]

What we can see here is that on my machine, which has eight cores, seven worker threads contribute to the work. This still utilizes all cores because, with the common pool, the caller thread contributes to the work as well instead of just waiting for completion. You can clearly see the main thread in the output.
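The relation between core count and worker count described above can be checked directly. The following is a minimal sketch; the class name `CommonPoolInfo` is made up for illustration, and the assumption that the default parallelism is one less than the number of cores reflects typical JDK builds with no `java.util.concurrent.ForkJoinPool.common.parallelism` system property set.

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolInfo {

    // Number of processors the JVM reports as available.
    static int cores() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Target parallelism of the common pool used by parallel streams.
    static int commonPoolParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    public static void main(String[] args) {
        System.out.println("available processors:    " + cores());
        System.out.println("common pool parallelism: " + commonPoolParallelism());
    }
}
```

On an eight-core machine with default settings, the second value would usually be seven, with the calling thread acting as the eighth participant.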

Also, you can see that the outer stream gets the full parallelism, while some of the inner streams are processed entirely by a single thread. Each of the worker threads contributes to at least one of the outer stream's elements. If you reduce the size of the outer stream to the number of cores, you are very likely to see exactly one worker thread processing each outer stream element, implying an entirely sequential execution of all inner streams.
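To try different outer sizes, the experiment can be parameterized. This is a sketch along the lines of the program above, not the answer's own code: the class and method names are hypothetical, and it collects thread names rather than thread identities, which is a simplification.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class NestedParallelDemo {

    // Sorted names of the threads that processed one inner parallel stream.
    static Set<String> innerThreadNames(int innerSize) {
        return IntStream.range(0, innerSize).parallel()
                .mapToObj(inner -> Thread.currentThread().getName())
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // One entry per outer element, in encounter order.
    static List<Set<String>> threadsPerOuter(int outerSize, int innerSize) {
        return IntStream.range(0, outerSize).parallel()
                .mapToObj(outer -> innerThreadNames(innerSize))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Outer size equal to the number of cores, as suggested in the text.
        int cores = Runtime.getRuntime().availableProcessors();
        List<Set<String>> result = threadsPerOuter(cores, 10);
        for (int i = 0; i < result.size(); i++) {
            System.out.println("outer " + i + "\t" + result.get(i));
        }
    }
}
```

Calling `threadsPerOuter` with an outer size that is not a multiple of the core count should make it more likely that some inner streams list several threads, though the exact distribution varies from run to run.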

But I used a number that does not match the number of cores, and is not even a multiple of it, to demonstrate another behavior. Since the workload of the outer stream processing is not even, i.e. some threads process only one item while others process two, the idle worker threads perform work-stealing, contributing to the inner stream processing of the remaining outer elements.

There is a simple rationale behind this behavior. When the processing of the outer stream starts, it doesn't know that it will be an "outer stream". It's just a parallel stream, and there is no way of finding out whether it is an outer stream other than processing it until one of the functions starts another stream operation. But there is no sense in deferring the parallel processing until that point, which might never come.

Besides that, I strongly object to your assumption "that it'll be more performant […] if the inner stream is done fully in parallel first". For typical use cases, I'd rather expect it the other way round, that is, an advantage from doing it exactly as it has been implemented. But, as explained in the previous paragraph, there is no reasonable way to implement a preference for processing inner streams in parallel anyway.
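If the inner-first order from the question is still wanted, one workaround (not part of the answer above) is to leave the outer stream sequential and make only the inner stream parallel, so that each outer element is finished before the next one starts. The sketch below assumes a made-up workload, the product `outer * inner`, and hypothetical names; whether this is actually faster would have to be measured.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class InnerFirst {

    // Inner stream: parallel; may use all threads of the common pool.
    static long innerSum(int outer, int innerSize) {
        return IntStream.range(0, innerSize).parallel()
                .mapToLong(inner -> (long) outer * inner)
                .sum();
    }

    // Outer stream: deliberately sequential, one outer element at a time.
    static List<Long> sumsPerOuter(int outerSize, int innerSize) {
        return IntStream.range(0, outerSize)
                .mapToLong(outer -> innerSum(outer, innerSize))
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sumsPerOuter(3, 4)); // prints [0, 6, 12]
    }
}
```

The trade-off is that the outer stream no longer benefits from parallelism, so this only makes sense when each inner stream is large enough to keep all cores busy.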
