为什么 Java Streams 是一次性的? [英] Why are Java Streams once-off?

查看:30
本文介绍了为什么 Java Streams 是一次性的?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

与 C# 的 IEnumerable 不同,在 C# 中,执行管道可以根据需要执行多次,而在 Java 中,流只能迭代"一次.

Unlike C#'s IEnumerable, where an execution pipeline can be executed as many times as we want, in Java a stream can be 'iterated' only once.

对终端操作的任何调用都会关闭流,使其无法使用.这个功能"带走了很多力量.

Any call to a terminal operation closes the stream, rendering it unusable. This 'feature' takes away a lot of power.

我想这是不是技术的原因.这个奇怪的限制背后的设计考虑是什么?

I imagine the reason for this is not technical. What were the design considerations behind this strange restriction?

为了演示我在说什么,请考虑以下 C# 中快速排序的实现:

in order to demonstrate what I am talking about, consider the following implementation of Quick-Sort in C#:

IEnumerable<int> QuickSort(IEnumerable<int> ints)
{
  if (!ints.Any()) {
    return Enumerable.Empty<int>();
  }

  int pivot = ints.First();

  IEnumerable<int> lt = ints.Where(i => i < pivot);
  IEnumerable<int> gt = ints.Where(i => i > pivot);

  return QuickSort(lt).Concat(new int[] { pivot }).Concat(QuickSort(gt));
}

现在可以肯定的是,我并不是在提倡这是快速排序的一个很好的实现!然而,它是 lambda 表达式与流操作相结合的表现力的一个很好的例子.

Now to be sure, I am not advocating that this is a good implementation of quick sort! It is however great example of the expressive power of lambda expression combined with stream operation.

而且它不能在 Java 中完成!我什至不能在不使其无法使用的情况下询问流是否为空.

And it can't be done in Java! I can't even ask a stream whether it is empty without rendering it unusable.

推荐答案

我对 Streams API 的早期设计有一些回忆,这可能会阐明设计原理.

I have some recollections from the early design of the Streams API that might shed some light on the design rationale.

早在 2012 年,我们就在语言中添加了 lambda,我们想要一个面向集合或批量数据"的操作集,使用 lambda 进行编程,以促进并行性.在这一点上,懒惰地将操作链接在一起的想法已经很好地建立起来了.我们也不希望中间操作存储结果.

Back in 2012, we were adding lambdas to the language, and we wanted a collections-oriented or "bulk data" set of operations, programmed using lambdas, that would facilitate parallelism. The idea of lazily chaining operations together was well established by this point. We also didn't want the intermediate operations to store results.

我们需要决定的主要问题是链中的对象在 API 中的样子以及它们如何连接到数据源.来源通常是集合,但我们也希望支持来自文件或网络的数据,或者动态生成的数据,例如来自随机数生成器.

The main issues we needed to decide were what the objects in the chain looked like in the API and how they hooked up to data sources. The sources were often collections, but we also wanted to support data coming from a file or the network, or data generated on-the-fly, e.g., from a random number generator.

现有工作对设计有很多影响.其中影响较大的是 Google 的 Guava 库和 Scala 集合库.(如果有人对 Guava 的影响感到惊讶,请注意 Kevin Bourrillion,Guava 首席开发人员,在<一个 href="https://jcp.org/en/jsr/detail?id=335" rel="noreferrer">JSR-335 Lambda 专家组.)关于 Scala 集合,我们发现了 Martin 的这个演讲Odersky 特别感兴趣:面向未来的 Scala 集合:从可变到持久再到并行.(斯坦福 EE380,2011 年 6 月 1 日.)

There were many influences of existing work on the design. Among the more influential were Google's Guava library and the Scala collections library. (If anybody is surprised about the influence from Guava, note that Kevin Bourrillion, Guava lead developer, was on the JSR-335 Lambda expert group.) On Scala collections, we found this talk by Martin Odersky to be of particular interest: Future-Proofing Scala Collections: from Mutable to Persistent to Parallel. (Stanford EE380, 2011 June 1.)

我们当时的原型设计基于 Iterable.熟悉的操作filtermap 等是Iterable 上的扩展(默认)方法.调用一个向链中添加一个操作并返回另一个Iterable.像 count 这样的终端操作会在链上调用 iterator() 到源代码,并且这些操作在每个阶段的迭代器中实现.

Our prototype design at the time was based around Iterable. The familiar operations filter, map, and so forth were extension (default) methods on Iterable. Calling one added an operation to the chain and returned another Iterable. A terminal operation like count would call iterator() up the chain to the source, and the operations were implemented within each stage's Iterator.

由于这些是可迭代对象,您可以多次调用 iterator() 方法.那应该怎么办?

Since these are Iterables, you can call the iterator() method more than once. What should happen then?

如果源是一个集合,这通常可以正常工作.集合是可迭代的,每次调用 iterator() 都会产生一个独立于任何其他活动实例的独特 Iterator 实例,并且每个实例都独立地遍历集合.太好了.

If the source is a collection, this mostly works fine. Collections are Iterable, and each call to iterator() produces a distinct Iterator instance that is independent of any other active instances, and each traverses the collection independently. Great.

现在如果源是一次性的,比如从文件中读取行怎么办?也许第一个迭代器应该得到所有的值,但第二个和后续的应该是空的.也许这些值应该在迭代器之间交错.或者也许每个迭代器都应该获得所有相同的值.那么,如果您有两个迭代器并且一个比另一个更领先呢?有人将不得不缓冲第二个迭代器中的值,直到它们被读取.更糟糕的是,如果您获得一个 Iterator 并读取所有值,然后只然后获得第二个 Iterator,该怎么办?价值从何而来?是否需要将它们全部缓冲以防有人想要第二个迭代器?

Now what if the source is one-shot, like reading lines from a file? Maybe the first Iterator should get all the values but the second and subsequent ones should be empty. Maybe the values should be interleaved among the Iterators. Or maybe each Iterator should get all the same values. Then, what if you have two iterators and one gets farther ahead of the other? Somebody will have to buffer up the values in the second Iterator until they're read. Worse, what if you get one Iterator and read all the values, and only then get a second Iterator. Where do the values come from now? Is there a requirement for them all to be buffered up just in case somebody wants a second Iterator?

显然,在一次性源上允许多个迭代器会引发很多问题.我们没有给他们很好的答案.如果您调用 iterator() 两次,我们想要一致的、可预测的行为.这促使我们禁止多次遍历,使管道一次性.

Clearly, allowing multiple Iterators over a one-shot source raises a lot of questions. We didn't have good answers for them. We wanted consistent, predictable behavior for what happens if you call iterator() twice. This pushed us toward disallowing multiple traversals, making the pipelines one-shot.

我们还观察到其他人遇到了这些问题.在 JDK 中,大多数 Iterable 都是集合或类集合对象,允许多次遍历.它没有在任何地方指定,但似乎有一个不成文的期望,即 Iterables 允许多次遍历.一个值得注意的例外是 NIO DirectoryStream界面.它的规范包括这个有趣的警告:

We also observed others bumping into these issues. In the JDK, most Iterables are collections or collection-like objects, which allow multiple traversal. It isn't specified anywhere, but there seemed to be an unwritten expectation that Iterables allow multiple traversal. A notable exception is the NIO DirectoryStream interface. Its specification includes this interesting warning:

虽然 DirectoryStream 扩展了 Iterable,但它不是通用的 Iterable,因为它只支持单个 Iterator;调用迭代器方法以获取第二个或后续迭代器会抛出 IllegalStateException.

[原文粗体]

这看起来不寻常且令人不快,以至于我们不想创建一大堆可能只有一次的新 Iterable.这促使我们放弃使用 Iterable.

This seemed unusual and unpleasant enough that we didn't want to create a whole bunch of new Iterables that might be once-only. This pushed us away from using Iterable.

大约在这个时候,出现了一篇 Bruce Eckel 的文章,描述了他在使用 Scala 时遇到了麻烦.他写了这段代码:

About this time, an article by Bruce Eckel appeared that described a spot of trouble he'd had with Scala. He'd written this code:

// Scala
val lines = fromString(data).getLines
val registrants = lines.map(Registrant)
registrants.foreach(println)
registrants.foreach(println)

这很简单.它将文本行解析为 Registrant 对象并打印两次.除了它实际上只打印一次.事实证明,他认为 registrants 是一个集合,而实际上它是一个迭代器.对 foreach 的第二次调用遇到一个空迭代器,其中所有值都已耗尽,因此它什么也不打印.

It's pretty straightforward. It parses lines of text into Registrant objects and prints them out twice. Except that it actually only prints them out once. It turns out that he thought that registrants was a collection, when in fact it's an iterator. The second call to foreach encounters an empty iterator, from which all values have been exhausted, so it prints nothing.

这种经验使我们确信,如果尝试多次遍历,获得清晰可预测的结果是非常重要的.它还强调了将类似惰性管道的结构与存储数据的实际集合区分开来的重要性.这反过来又推动了将惰性管道操作分离到新的 Stream 接口中,并仅在集合上直接保留急切的、可变的操作.Brian Goetz 解释了这样做的理由.

This kind of experience convinced us that it was very important to have clearly predictable results if multiple traversal is attempted. It also highlighted the importance of distinguishing between lazy pipeline-like structures from actual collections that store data. This in turn drove the separation of the lazy pipeline operations into the new Stream interface and keeping only eager, mutative operations directly on Collections. Brian Goetz has explained the rationale for that.

对于基于集合的管道允许多次遍历,但对于非基于集合的管道不允许它如何?这是矛盾的,但它是明智的.如果您正在从网络读取值,当然您不能再次遍历它们.如果你想多次遍历它们,你必须明确地将它们拉入一个集合中.

What about allowing multiple traversal for collection-based pipelines but disallowing it for non-collection-based pipelines? It's inconsistent, but it's sensible. If you're reading values from the network, of course you can't traverse them again. If you want to traverse them multiple times, you have to pull them into a collection explicitly.

但是让我们探索允许从基于集合的管道进行多次遍历.假设您这样做了:

But let's explore allowing multiple traversal from collections-based pipelines. Let's say you did this:

Iterable<?> it = source.filter(...).map(...).filter(...).map(...);
it.into(dest1);
it.into(dest2);

(into 操作现在拼写为 collect(toList()).)

(The into operation is now spelled collect(toList()).)

如果源是一个集合,那么第一个 into() 调用将创建一个返回源的迭代器链,执行管道操作,并将结果发送到目标.对 into() 的第二次调用将创建另一个迭代器链,并再次执行管道操作.这显然没有错,但它确实具有为每个元素第二次执行所有过滤器和映射操作的效果.我想很多程序员都会对这种行为感到惊讶.

If source is a collection, then the first into() call will create a chain of Iterators back to the source, execute the pipeline operations, and send the results into the destination. The second call to into() will create another chain of Iterators, and execute the pipeline operations again. This isn't obviously wrong but it does have the effect of performing all the filter and map operations a second time for each element. I think many programmers would have been surprised by this behavior.

正如我上面提到的,我们一直在与 Guava 开发人员交谈.他们拥有的一项很酷的东西是 Idea Graveyard,他们在其中描述了他们决定的功能<强>不执行的原因.惰性集合的想法听起来很酷,但这是他们不得不说的.考虑一个返回 ListList.filter() 操作:

As I mentioned above, we had been talking to the Guava developers. One of the cool things they have is an Idea Graveyard where they describe features that they decided not to implement along with the reasons. The idea of lazy collections sounds pretty cool, but here's what they have to say about it. Consider a List.filter() operation that returns a List:

这里最大的问题是太多的操作变成了昂贵的线性时间命题.如果你想过滤一个列表并返回一个列表,而不仅仅是一个集合或一个可迭代的,你可以使用 ImmutableList.copyOf(Iterables.filter(list, predicate)),它说明前面"它在做什么以及它有多贵.

The biggest concern here is that too many operations become expensive, linear-time propositions. If you want to filter a list and get a list back, and not just a Collection or an Iterable, you can use ImmutableList.copyOf(Iterables.filter(list, predicate)), which "states up front" what it's doing and how expensive it is.

举个具体的例子,get(0)size() 在 List 上的成本是多少?对于像 ArrayList 这样常用的类,它们是 O(1).但是如果你在一个延迟过滤的列表上调用其中一个,它必须在支持列表上运行过滤器,突然间这些操作是 O(n).更糟糕的是,它必须在每次操作时遍历支持列表.

To take a specific example, what's the cost of get(0) or size() on a List? For commonly used classes like ArrayList, they're O(1). But if you call one of these on a lazily-filtered list, it has to run the filter over the backing list, and all of a sudden these operations are O(n). Worse, it has to traverse the backing list on every operation.

这在我们看来是太多懒惰.设置一些操作并推迟实际执行直到你开始"是一回事.以隐藏潜在大量重新计算的方式进行设置则是另一回事.

This seemed to us to be too much laziness. It's one thing to set up some operations and defer actual execution until you so "Go". It's another to set things up in such a way that hides a potentially large amount of recomputation.

在提议禁止非线性或不可重用"流时,Paul Sandoz 描述了潜在后果允许他们给予上升到意想不到或令人困惑的结果".他还提到并行执行会让事情变得更加棘手.最后,我要补充一点,如果操作意外执行多次,或者至少与程序员预期的次数不同,那么带有副作用的管道操作将导致难以理解的错误.(但 Java 程序员不会编写带有副作用的 lambda 表达式,是吗?他们会吗??)

In proposing to disallow non-linear or "no-reuse" streams, Paul Sandoz described the potential consequences of allowing them as giving rise to "unexpected or confusing results." He also mentioned that parallel execution would make things even trickier. Finally, I'd add that a pipeline operation with side effects would lead to difficult and obscure bugs if the operation were unexpectedly executed multiple times, or at least a different number of times than the programmer expected. (But Java programmers don't write lambda expressions with side effects, do they? DO THEY??)

这就是 Java 8 Streams API 设计的基本原理,它允许一次性遍历并且需要严格的线性(无分支)管道.它在多个不同的流源之间提供一致的行为,它清楚地将惰性操作与急切操作区分开来,并提供了一个简单的执行模型.

So that's the basic rationale for the Java 8 Streams API design that allows one-shot traversal and that requires a strictly linear (no branching) pipeline. It provides consistent behavior across multiple different stream sources, it clearly separates lazy from eager operations, and it provides a straightforward execution model.

关于 IEnumerable,我远非 C# 和 .NET 专家,因此如果我得出任何错误结论,我希望得到纠正(温和地).然而,看起来 IEnumerable 允许多次遍历对不同的源有不同的表现;它允许嵌套 IEnumerable 操作的分支结构,这可能会导致一些重要的重新计算.虽然我理解不同的系统会做出不同的权衡,但这是我们在 Java 8 Streams API 设计中试图避免的两个特征.

With regard to IEnumerable, I am far from an expert on C# and .NET, so I would appreciate being corrected (gently) if I draw any incorrect conclusions. It does appear, however, that IEnumerable permits multiple traversal to behave differently with different sources; and it permits a branching structure of nested IEnumerable operations, which may result in some significant recomputation. While I appreciate that different systems make different tradeoffs, these are two characteristics that we sought to avoid in the design of the Java 8 Streams API.

OP 给出的快速排序示例很有趣,令人费解,而且很抱歉,有点吓人.调用 QuickSort 需要一个 IEnumerable 并返回一个 IEnumerable,因此在遍历最终的 IEnumerable 之前不会真正进行排序.然而,这个调用似乎建立了一个 IEnumerables 的树结构,它反映了快速排序会做的分区,但实际上并没有这样做.(毕竟这是惰性计算.)如果源有 N 个元素,则树的最宽处将是 N 个元素,并且深度为 lg(N) 级.

The quicksort example given by the OP is interesting, puzzling, and I'm sorry to say, somewhat horrifying. Calling QuickSort takes an IEnumerable and returns an IEnumerable, so no sorting is actually done until the final IEnumerable is traversed. What the call seems to do, though, is build up a tree structure of IEnumerables that reflects the partitioning that quicksort would do, without actually doing it. (This is lazy computation, after all.) If the source has N elements, the tree will be N elements wide at its widest, and it will be lg(N) levels deep.

在我看来——再一次,我不是 C# 或 .NET 专家——这将导致某些看起来无害的调用,例如通过 ints.First()ints.First(),比他们看起来更贵.在第一层,当然是 O(1).但是考虑在树深处的右侧边缘的分区.要计算此分区的第一个元素,必须遍历整个源,这是一个 O(N) 操作.但是由于上面的分区是惰性的,它们必须重新计算,需要 O(lg N) 次比较.因此选择主元将是一个 O(N lg N) 操作,这与整个排序一样昂贵.

It seems to me -- and once again, I'm not a C# or .NET expert -- that this will cause certain innocuous-looking calls, such as pivot selection via ints.First(), to be more expensive than they look. At the first level, of course, it's O(1). But consider a partition deep in the tree, at the right-hand edge. To compute the first element of this partition, the entire source has to be traversed, an O(N) operation. But since the partitions above are lazy, they must be recomputed, requiring O(lg N) comparisons. So selecting the pivot would be an O(N lg N) operation, which is as expensive as an entire sort.

但在遍历返回的 IEnumerable 之前,我们实际上不会进行排序.在标准的快速排序算法中,每一级分区都会使分区数加倍.每个分区只有一半的大小,因此每个级别保持 O(N) 复杂度.分区树的高度为 O(lg N),所以总工作量是 O(N lg N).

But we don't actually sort until we traverse the returned IEnumerable. In the standard quicksort algorithm, each level of partitioning doubles the number of partitions. Each partition is only half the size, so each level remains at O(N) complexity. The tree of partitions is O(lg N) high, so the total work is O(N lg N).

懒惰IEnumerables树,在树的底部有N个分区.计算每个分区需要遍历 N 个元素,每个元素都需要在树上进行 lg(N) 次比较.为了计算树底部的所有分区,需要 O(N^2 lg N) 次比较.

With the tree of lazy IEnumerables, at the bottom of the tree there are N partitions. Computing each partition requires a traversal of N elements, each of which requires lg(N) comparisons up the tree. To compute all the partitions at the bottom of the tree, then, requires O(N^2 lg N) comparisons.

(这是真的吗?我简直不敢相信.请有人帮我检查一下.)

(Is this right? I can hardly believe this. Somebody please check this for me.)

无论如何,IEnumerable 可以用这种方式构建复杂的计算结构确实很酷.但是,如果它确实像我认为的那样增加了计算复杂性,那么除非非常小心,否则似乎应该避免以这种方式编程.

In any case, it is indeed cool that IEnumerable can be used this way to build up complicated structures of computation. But if it does increase the computational complexity as much as I think it does, it would seem that programming this way is something that should be avoided unless one is extremely careful.

这篇关于为什么 Java Streams 是一次性的?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆