Why does Collection.parallelStream() exist when .stream().parallel() does the same thing?

Question

In Java 8, the Collection interface was extended with two methods that return Stream<E>: stream(), which returns a sequential stream, and parallelStream(), which returns a possibly-parallel stream. Stream itself also has a parallel() method that returns an equivalent parallel stream (either mutating the current stream to be parallel or creating a new stream).
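A minimal sketch of the two routes being compared. The class and method names (`ParallelFactoriesDemo`, `direct`, `viaParallel`) are made up for illustration; only `stream()`, `parallelStream()`, `parallel()` and `isParallel()` come from the JDK.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical demo class; the names are illustrative, not from the JDK.
final class ParallelFactoriesDemo {

    // Route 1: ask the collection directly for a (possibly) parallel stream.
    static <T> Stream<T> direct(List<T> source) {
        return source.parallelStream();
    }

    // Route 2: build a sequential stream first, then switch it to parallel.
    static <T> Stream<T> viaParallel(List<T> source) {
        return source.stream().parallel();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c");
        System.out.println(words.stream().isParallel());      // sequential stream
        System.out.println(direct(words).isParallel());       // parallel for this JDK list
        System.out.println(viaParallel(words).isParallel());  // parallel after parallel()
    }
}
```

For the standard JDK lists both routes report a parallel stream, which is what makes the duplication look odd. A custom Collection could still choose to return a sequential stream from `parallelStream()`, as the specification allows.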

The duplication has obvious disadvantages:

  • It's confusing. A question asks whether calling both parallelStream().parallel() is necessary to be sure the stream is parallel, given that parallelStream() may return a sequential stream. Why does parallelStream() exist if it can't make a guarantee? The other way around is also confusing -- if parallelStream() returns a sequential stream, there's probably a reason (e.g., an inherently sequential data structure for which parallel streams are a performance trap); what should Stream.parallel() do for such a stream? (UnsupportedOperationException is not allowed by parallel()'s specification.)

  • Adding methods to an interface risks conflicts if an existing implementation has a similarly-named method with an incompatible return type. Adding parallelStream() in addition to stream() doubles the risk for little gain. (Note that parallelStream() was at one point just named parallel(), though I don't know if it was renamed to avoid name clashes or for another reason.)

Why does Collection.parallelStream() exist when calling Collection.stream().parallel() does the same thing?

Recommended answer

The Javadocs for Collection.(parallelS|s)tream() and Stream itself don't answer the question, so it's off to the mailing lists for the rationale. I went through the lambda-libs-spec-observers archives and found one thread specifically about Collection.parallelStream() and another thread that touched on whether java.util.Arrays should provide parallelStream() to match (or actually, whether it should be removed). There was no once-and-for-all conclusion, so perhaps I've missed something from another list or the matter was settled in private discussion. (Perhaps Brian Goetz, one of the principals of this discussion, can fill in anything missing.)

The participants made their points well, so this answer is mostly just an organization of the relevant quotes, with a few clarifications in [brackets], presented in order of importance (as I interpret it).

Brian Goetz in the first thread, explaining why Collection.parallelStream() is valuable enough to keep even after other parallel stream factory methods have been removed:

We do not have explicit parallel versions of each of these [stream factories]; we did originally, and to prune down the API surface area, we cut them on the theory that dropping 20+ methods from the API was worth the tradeoff of the surface yuckiness and performance cost of .intRange(...).parallel(). But we did not make that choice with Collection.

We could either remove the Collection.parallelStream(), or we could add the parallel versions of all the generators, or we could do nothing and leave it as is. I think all are justifiable on API design grounds.

I kind of like the status quo, despite its inconsistency. Instead of having 2N stream construction methods, we have N+1 -- but that extra 1 covers a huge number of cases, because it is inherited by every Collection. So I can justify to myself why having that extra 1 method is worth it, and why accepting the inconsistency of going no further is acceptable.

Do others disagree? Is N+1 [Collection.parallelStream() only] the practical choice here? Or should we go for the purity of N [rely on Stream.parallel()]? Or the convenience and consistency of 2N [parallel versions of all factories]? Or is there some even better N+3 [Collection.parallelStream() plus other special cases], for some other specially chosen cases we want to give special support to?
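To make the N versus N+1 counting concrete, here is a small sketch of what the trade-off looks like to a caller in the API as it eventually shipped: range and array sources go through parallel(), while a Collection gets the one extra dedicated factory. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical demo class; only the stream calls themselves are JDK API.
final class FactoryShapesDemo {

    // IntStream.range has no parallel twin, so the caller goes through parallel().
    static int sumOfRange(int endExclusive) {
        return IntStream.range(0, endExclusive).parallel().sum();
    }

    // Arrays.stream likewise has no parallel counterpart.
    static int sumOfArray(int[] values) {
        return Arrays.stream(values).parallel().sum();
    }

    // Collection is the "+1": it has a dedicated parallel factory.
    static int sumOfCollection(List<Integer> values) {
        return values.parallelStream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfRange(5));
        System.out.println(sumOfArray(new int[] {1, 2, 3}));
        System.out.println(sumOfCollection(Arrays.asList(1, 2, 3)));
    }
}
```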

Brian Goetz stands by this position in the later discussion about Arrays.parallelStream():

I still really like Collection.parallelStream; it has huge discoverability advantages, and offers a pretty big return on API surface area -- one more method, but provides value in a lot of places, since Collection will be a really common case of a stream source.

parallelStream() is more performant

Brian Goetz:

Direct version [parallelStream()] is more performant, in that it requires less wrapping (to turn a stream into a parallel stream, you have to first create the sequential stream, then transfer ownership of its state into a new Stream.)

In response to Kevin Bourrillion's skepticism about whether the effect is significant, Brian again:

Depends how seriously you are counting. Doug counts individual object creations and virtual invocations on the way to a parallel operation, because until you start forking, you're on the wrong side of Amdahl's law -- this is all "serial fraction" that happens before you can fork any work, which pushes your breakeven threshold further out. So getting the setup path for parallel ops fast is valuable.
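The Amdahl's law point can be illustrated with the textbook formula speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n the number of processors. The sketch below is only that formula; the class name, the sample fractions and the core count of 64 are assumptions chosen for illustration, not measurements of the stream library.

```java
// Hypothetical helper illustrating Amdahl's law; not part of any library.
final class AmdahlSketch {

    // Upper bound on speedup when serialFraction of the work cannot be parallelized.
    static double speedup(double serialFraction, int processors) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }

    public static void main(String[] args) {
        int processors = 64; // assumed core count, for illustration only
        for (double serial : new double[] {0.0, 0.01, 0.05, 0.10}) {
            System.out.printf("serial=%.2f -> speedup bound=%.1f%n",
                    serial, speedup(serial, processors));
        }
    }
}
```

Even a small serial setup cost lowers the achievable speedup noticeably at high core counts, which is the sense in which sequential setup work before the first fork matters.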

Doug Lea follows up, but hedges his position:

People dealing with parallel library support need some attitude adjustment about such things. On a soon-to-be-typical machine, every cycle you waste setting up parallelism costs you say 64 cycles. You would probably have had a different reaction if it required 64 object creations to start a parallel computation.

That said, I'm always completely supportive of forcing implementors to work harder for the sake of better APIs, so long as the APIs do not rule out efficient implementation. So if killing parallelStream is really important, we'll find some way to turn stream().parallel() into a bit-flip or somesuch.

Indeed, the later discussion about Arrays.parallelStream() takes notice of lower Stream.parallel() cost.

At the time of the discussion, switching a stream from sequential to parallel and back could be interleaved with other stream operations. Brian Goetz, on behalf of Doug Lea, explains why sequential/parallel mode switching may complicate future development of the Java platform:

I'll take my best stab at explaining why: because it (like the stateful methods (sort, distinct, limit)) which you also don't like, move us incrementally farther from being able to express stream pipelines in terms of traditional data-parallel constructs, which further constrains our ability to map them directly to tomorrow's computing substrate, whether that be vector processors, FPGAs, GPUs, or whatever we cook up.

Filter-map-reduce map[s] very cleanly to all sorts of parallel computing substrates; filter-parallel-map-sequential-sorted-limit-parallel-map-uniq-reduce does not.

So the whole API design here embodies many tensions between making it easy to express things the user is likely to want to express, and doing so in a manner that we can predictably make fast with transparent cost models.

This mode switching was removed after further discussion. In the current version of the library, a stream pipeline is either sequential or parallel; last call to sequential()/parallel() wins. Besides side-stepping the statefulness problem, this change also improved the performance of using parallel() to set up a parallel pipeline from a sequential stream factory.
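The "last call wins" behaviour can be observed with isParallel(). A minimal sketch, with hypothetical class and method names:

```java
import java.util.stream.Stream;

// Hypothetical demo class showing that the mode is a property of the whole pipeline.
final class LastCallWinsDemo {

    // parallel() first, sequential() last: the pipeline ends up sequential.
    static boolean parallelThenSequential() {
        return Stream.of(1, 2, 3)
                .parallel()
                .map(x -> x * 2)
                .sequential()
                .isParallel();
    }

    // sequential() first, parallel() last: the pipeline ends up parallel.
    static boolean sequentialThenParallel() {
        return Stream.of(1, 2, 3)
                .sequential()
                .map(x -> x * 2)
                .parallel()
                .isParallel();
    }

    public static void main(String[] args) {
        System.out.println(parallelThenSequential());  // false
        System.out.println(sequentialThenParallel());  // true
    }
}
```

The map() stage in the middle does not run in a different mode from the rest; the whole pipeline executes in whichever mode was requested last.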

Brian Goetz again, in response to Tim Peierls's argument that Stream.parallel() allows programmers to understand streams sequentially before going parallel:

I have a slightly different viewpoint about the value of this sequential intuition -- I view the pervasive "sequential expectation" as one of the biggest challenges of this entire effort; people are constantly bringing their incorrect sequential bias, which leads them to do stupid things like using a one-element array as a way to "trick" the "stupid" compiler into letting them capture a mutable local, or using lambdas as arguments to map that mutate state that will be used during the computation (in a non-thread-safe way), and then, when it's pointed out what they're doing, shrug it off and say "yeah, but I'm not doing it in parallel."

We've made a lot of design tradeoffs to merge sequential and parallel streams. The result, I believe, is a clean one and will add to the library's chances of still being useful in 10+ years, but I don't particularly like the idea of encouraging people to think this is a sequential library with some parallel bags nailed on the side.
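The one-element-array trick mentioned in the quote looks roughly like the first method below. This is an illustrative sketch with made-up names, shown next to a reduction that does not depend on shared mutable state.

```java
import java.util.stream.IntStream;

// Hypothetical demo class contrasting the "sequential bias" hack with a reduction.
final class SequentialBiasDemo {

    // Anti-pattern: a one-element array smuggles mutable state into the lambda.
    // It gives the right answer only because this pipeline is sequential; adding
    // parallel() here would turn total[0] += i into a data race.
    static int sumWithArrayHack(int endExclusive) {
        int[] total = {0};
        IntStream.range(0, endExclusive).forEach(i -> total[0] += i);
        return total[0];
    }

    // Reduction without shared mutable state: correct in either mode.
    static int sumWithReduction(int endExclusive, boolean parallel) {
        IntStream numbers = IntStream.range(0, endExclusive);
        return (parallel ? numbers.parallel() : numbers).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumWithArrayHack(1000));
        System.out.println(sumWithReduction(1000, false));
        System.out.println(sumWithReduction(1000, true));
    }
}
```

The second form keeps working when the pipeline is switched to parallel, which is the habit the quote argues the library should encourage.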
