Why does Collection.parallelStream() exist when .stream().parallel() does the same thing?

Question

In Java 8, the Collection interface was extended with two methods that return Stream<E>: stream(), which returns a sequential stream, and parallelStream(), which returns a possibly-parallel stream. Stream itself also has a parallel() method that returns an equivalent parallel stream (either mutating the current stream to be parallel or creating a new stream).
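As a minimal sketch of the three entry points described above, the following compares them with `isParallel()`. The class and method names are hypothetical, chosen only for illustration. The `parallelStream()` specification only promises a "possibly parallel" stream, so the result for that call is what I would expect from the default implementation on a list from `Arrays.asList`, not a guarantee for every `Collection`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; names and sample data are illustrative only.
class ParallelFactoryDemo {
    // Collection.stream() returns a sequential stream.
    static boolean sequentialFactory(List<Integer> xs) {
        return xs.stream().isParallel();
    }

    // Collection.parallelStream() returns a possibly-parallel stream.
    static boolean directParallelFactory(List<Integer> xs) {
        return xs.parallelStream().isParallel();
    }

    // Stream.parallel() converts an existing sequential stream.
    static boolean convertedToParallel(List<Integer> xs) {
        return xs.stream().parallel().isParallel();
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4);
        System.out.println("stream():            " + sequentialFactory(xs));
        System.out.println("parallelStream():    " + directParallelFactory(xs));
        System.out.println("stream().parallel(): " + convertedToParallel(xs));
    }
}
```

Both parallel variants produce the same results for a well-behaved pipeline; the question is only about why the API offers two spellings.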

The duplication has obvious downsides:

  • It's confusing. A question asks whether calling both parallelStream().parallel() is necessary to be sure the stream is parallel, given that parallelStream() may return a sequential stream. Why does parallelStream() exist if it can't make a guarantee? The other way around is also confusing -- if parallelStream() returns a sequential stream, there's probably a reason (e.g., an inherently sequential data structure for which parallel streams are a performance trap); what should Stream.parallel() do for such a stream? (UnsupportedOperationException is not allowed by parallel()'s specification.)

  • Adding methods to an interface risks conflicts if an existing implementation has a similarly-named method with an incompatible return type. Adding parallelStream() in addition to stream() doubles the risk for little gain. (Note that parallelStream() was at one point just named parallel(), though I don't know if it was renamed to avoid name clashes or for another reason.)

Why does Collection.parallelStream() exist when calling Collection.stream().parallel() does the same thing?

Recommended answer

The Javadocs for Collection.(parallelS|s)tream() and Stream itself don't answer the question, so it's off to the mailing lists for the rationale. I went through the lambda-libs-spec-observers archives and found one thread specifically about Collection.parallelStream() and another thread that touched on whether java.util.Arrays should provide parallelStream() to match (or actually, whether it should be removed). There was no once-and-for-all conclusion, so perhaps I've missed something from another list or the matter was settled in private discussion. (Perhaps Brian Goetz, one of the principals of this discussion, can fill in anything missing.)

The participants made their points well, so this answer is mostly just an organization of the relevant quotes, with a few clarifications in [brackets], presented in order of importance (as I interpret it).

Brian Goetz in the first thread, explaining why Collection.parallelStream() is valuable enough to keep even after other parallel stream factory methods have been removed:

We do not have explicit parallel versions of each of these [stream factories]; we did originally, and to prune down the API surface area, we cut them on the theory that dropping 20+ methods from the API was worth the tradeoff of the surface yuckiness and performance cost of .intRange(...).parallel(). But we did not make that choice with Collection.

We could either remove the Collection.parallelStream(), or we could add the parallel versions of all the generators, or we could do nothing and leave it as is. I think all are justifiable on API design grounds.

I kind of like the status quo, despite its inconsistency. Instead of having 2N stream construction methods, we have N+1 -- but that extra 1 covers a huge number of cases, because it is inherited by every Collection. So I can justify to myself why having that extra 1 method is worth it, and why accepting the inconsistency of going no further is acceptable.

Do others disagree? Is N+1 [Collection.parallelStream() only] the practical choice here? Or should we go for the purity of N [rely on Stream.parallel()]? Or the convenience and consistency of 2N [parallel versions of all factories]? Or is there some even better N+3 [Collection.parallelStream() plus other special cases], for some other specially chosen cases we want to give special support to?
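To illustrate what "N+1" looks like in the API as it shipped in Java 8, here is a small sketch (class and method names are hypothetical). Ranges and arrays have only a sequential factory, so a parallel pipeline goes through `parallel()`, while every `Collection` inherits the one extra direct factory.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical demo class; names are illustrative only.
class StreamSourcesDemo {
    // No parallel range factory exists: convert with parallel().
    static int sumRange(int endExclusive) {
        return IntStream.range(0, endExclusive).parallel().sum();
    }

    // java.util.Arrays offers stream(...) but no parallelStream(...).
    static int sumArray(int[] values) {
        return Arrays.stream(values).parallel().sum();
    }

    // The "+1": the direct parallel factory inherited by every Collection.
    static int sumList(List<Integer> values) {
        return values.parallelStream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumRange(5));
        System.out.println(sumArray(new int[] {1, 2, 3}));
        System.out.println(sumList(Arrays.asList(1, 2, 3)));
    }
}
```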

Brian Goetz stands by this position in the later discussion about Arrays.parallelStream():

I still really like Collection.parallelStream; it has huge discoverability advantages, and offers a pretty big return on API surface area -- one more method, but provides value in a lot of places, since Collection will be a really common case of a stream source.

parallelStream() is more performant

Brian Goetz:

Direct version [parallelStream()] is more performant, in that it requires less wrapping (to turn a stream into a parallel stream, you have to first create the sequential stream, then transfer ownership of its state into a new Stream.)

In response to Kevin Bourrillion's skepticism about whether the effect is significant, Brian again:

Depends how seriously you are counting. Doug counts individual object creations and virtual invocations on the way to a parallel operation, because until you start forking, you're on the wrong side of Amdahl's law -- this is all "serial fraction" that happens before you can fork any work, which pushes your breakeven threshold further out. So getting the setup path for parallel ops fast is valuable.

Doug Lea follows up, but hedges his position:

People dealing with parallel library support need some attitude adjustment about such things. On a soon-to-be-typical machine, every cycle you waste setting up parallelism costs you say 64 cycles. You would probably have had a different reaction if it required 64 object creations to start a parallel computation.

That said, I'm always completely supportive of forcing implementors to work harder for the sake of better APIs, so long as the APIs do not rule out efficient implementation. So if killing parallelStream is really important, we'll find some way to turn stream().parallel() into a bit-flip or somesuch.

Indeed, the later discussion about Arrays.parallelStream() takes notice of lower Stream.parallel() cost.

At the time of the discussion, switching a stream from sequential to parallel and back could be interleaved with other stream operations. Brian Goetz, on behalf of Doug Lea, explains why sequential/parallel mode switching may complicate future development of the Java platform:

I'll take my best stab at explaining why: because it (like the stateful methods (sort, distinct, limit)) which you also don't like, move us incrementally farther from being able to express stream pipelines in terms of traditional data-parallel constructs, which further constrains our ability to map them directly to tomorrow's computing substrate, whether that be vector processors, FPGAs, GPUs, or whatever we cook up.

Filter-map-reduce map[s] very cleanly to all sorts of parallel computing substrates; filter-parallel-map-sequential-sorted-limit-parallel-map-uniq-reduce does not.

So the whole API design here embodies many tensions between making it easy to express things the user is likely to want to express, and doing so in a manner that we can predictably make fast with transparent cost models.

This mode switching was removed after further discussion. In the current version of the library, a stream pipeline is either sequential or parallel; the last call to sequential()/parallel() wins. Besides side-stepping the statefulness problem, this change also improved the performance of using parallel() to set up a parallel pipeline from a sequential stream factory.
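The "last call wins" behavior can be observed with `isParallel()`. This is a small sketch with hypothetical class and method names; it inspects the pipeline's mode only and does not run a terminal operation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical demo class; names are illustrative only.
class LastCallWinsDemo {
    // parallel() comes first, but the later sequential() decides the mode
    // of the whole pipeline, including the stages that precede it.
    static boolean endsSequential(List<Integer> xs) {
        Stream<Integer> s = xs.stream()
                .parallel()
                .map(x -> x + 1)
                .sequential()
                .filter(x -> x % 2 == 0);
        return !s.isParallel();
    }

    // The reverse order: the later parallel() decides.
    static boolean endsParallel(List<Integer> xs) {
        Stream<Integer> s = xs.stream()
                .sequential()
                .map(x -> x + 1)
                .parallel();
        return s.isParallel();
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4);
        System.out.println(endsSequential(xs));
        System.out.println(endsParallel(xs));
    }
}
```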

Brian Goetz again, in response to Tim Peierls's argument that Stream.parallel() allows programmers to understand streams sequentially before going parallel:

I have a slightly different viewpoint about the value of this sequential intuition -- I view the pervasive "sequential expectation" as one of the biggest challenges of this entire effort; people are constantly bringing their incorrect sequential bias, which leads them to do stupid things like using a one-element array as a way to "trick" the "stupid" compiler into letting them capture a mutable local, or using lambdas as arguments to map that mutate state that will be used during the computation (in a non-thread-safe way), and then, when it's pointed out what they're doing, shrug it off and say "yeah, but I'm not doing it in parallel."

We've made a lot of design tradeoffs to merge sequential and parallel streams. The result, I believe, is a clean one and will add to the library's chances of still being useful in 10+ years, but I don't particularly like the idea of encouraging people to think this is a sequential library with some parallel bags nailed on the side.
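To make the "one-element array" pattern from the quote concrete, here is a hedged sketch with hypothetical names. The first method is the kind of code being criticized: it happens to work on a sequential stream, but the unsynchronized write to `total[0]` would be a data race if the stream were made parallel. The second method expresses the same sum as a reduction, which is safe in either mode.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical demo class; names and data are illustrative only.
class SequentialBiasDemo {
    // The "trick": a one-element array lets the lambda mutate captured state.
    // Correct only while the stream stays sequential.
    static int sumWithArrayTrick(List<Integer> xs) {
        int[] total = {0};
        xs.stream().forEach(x -> total[0] += x);
        return total[0];
    }

    // The same computation as a reduction; no shared mutable state,
    // so it gives the same answer sequentially or in parallel.
    static int sumWithReduction(List<Integer> xs) {
        return xs.parallelStream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> xs = IntStream.rangeClosed(1, 1000)
                .boxed()
                .collect(Collectors.toList());
        System.out.println(sumWithArrayTrick(xs));
        System.out.println(sumWithReduction(xs));
    }
}
```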
