为什么Java Streams一次性关闭? [英] Why are Java Streams once-off?

查看:102
本文介绍了为什么Java Streams一次性关闭?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

与C#的 IEnumerable 不同,执行管道可以根据需要执行多次,在Java中,流只能迭代一次。

Unlike C#'s IEnumerable, where an execution pipeline can be executed as many times as we want, in Java a stream can be 'iterated' only once.

对终端操作的任何调用都会关闭流,使其无法使用。
这个'功能'带走了很多力量。

Any call to a terminal operation closes the stream, rendering it unusable. This 'feature' takes away a lot of power.

我想这是技术的原因。这个奇怪限制背后的设计考虑是什么?

I imagine the reason for this is not technical. What were the design considerations behind this strange restriction?

编辑:为了演示我在说什么,请考虑以下C#中快速排序的实现:

in order to demonstrate what I am talking about, consider the following implementation of Quick-Sort in C#:

IEnumerable<int> QuickSort(IEnumerable<int> ints)
{
  if (!ints.Any()) {
    return Enumerable.Empty<int>();
  }

  int pivot = ints.First();

  IEnumerable<int> lt = ints.Where(i => i < pivot);
  IEnumerable<int> gt = ints.Where(i => i > pivot);

  return QuickSort(lt).Concat(new int[] { pivot }).Concat(QuickSort(gt));
}

现在可以肯定的是,我并不是说这是一个很好的实现快速排序!然而,它是lambda表达式与流操作相结合的表现力的很好的例子。

Now to be sure, I am not advocating that this is a good implementation of quick sort! It is however great example of the expressive power of lambda expression combined with stream operation.

并且无法用Java完成!
我甚至无法询问流是否为空而不使其无法使用。

And it can't be done in Java! I can't even ask a stream whether it is empty without rendering it unusable.

推荐答案

我对Streams API的早期设计有一些回忆,可能会对设计原理有所了解。

I have some recollections from the early design of the Streams API that might shed some light on the design rationale.

早在2012年,我们在语言中添加了lambdas,我们想要使用lambdas编程的面向集合或批量数据的操作集,这将促进并行性。在这一点上,懒洋洋地将操作链接在一起的想法已经确立。我们也不希望中间操作存储结果。

Back in 2012, we were adding lambdas to the language, and we wanted a collections-oriented or "bulk data" set of operations, programmed using lambdas, that would facilitate parallelism. The idea of lazily chaining operations together was well established by this point. We also didn't want the intermediate operations to store results.

我们需要确定的主要问题是链中的对象在API中是什么样的以及它们是如何形成的连接到数据源。源通常是集合,但我们也希望支持来自文件或网络的数据,或者即时生成的数据,例如来自随机数生成器。

The main issues we needed to decide were what the objects in the chain looked like in the API and how they hooked up to data sources. The sources were often collections, but we also wanted to support data coming from a file or the network, or data generated on-the-fly, e.g., from a random number generator.

现有工作对设计有很多影响。其中影响力最大的是Google的 Guava 库和Scala馆藏库。 (如果有人对番石榴的影响感到惊讶,请注意,番石榴首席开发人员 Kevin Bourrillion 在< a href =https://jcp.org/en/jsr/detail?id=335\"rel =noreferrer> JSR-335 Lambda 专家组。)关于Scala集合,我们发现Martin的这个演讲Odersky特别感兴趣:未来的Scala集合:从Mutable到Persistent到Parallel 。 (Stanford EE380,2011年6月1日。)

There were many influences of existing work on the design. Among the more influential were Google's Guava library and the Scala collections library. (If anybody is surprised about the influence from Guava, note that Kevin Bourrillion, Guava lead developer, was on the JSR-335 Lambda expert group.) On Scala collections, we found this talk by Martin Odersky to be of particular interest: Future-Proofing Scala Collections: from Mutable to Persistent to Parallel. (Stanford EE380, 2011 June 1.)

我们当时的原型设计基于 Iterable 。熟悉的操作过滤器 map 等等是 Iterable上的扩展(默认)方法。调用一个操作添加了一个操作并返回另一个 Iterable 。像 count 这样的终端操作会将链上的 iterator()调用到源,并在每个操作中实现操作stage的Iterator。

Our prototype design at the time was based around Iterable. The familiar operations filter, map, and so forth were extension (default) methods on Iterable. Calling one added an operation to the chain and returned another Iterable. A terminal operation like count would call iterator() up the chain to the source, and the operations were implemented within each stage's Iterator.

由于这些是Iterables,你可以多次调用 iterator()方法。那么会发生什么?

Since these are Iterables, you can call the iterator() method more than once. What should happen then?

如果源是一个集合,这大部分工作正常。集合是可迭代的,每次调用 iterator()都会生成一个独立于任何其他活动实例的独特Iterator实例,并且每个实例都独立遍历集合。好的。

If the source is a collection, this mostly works fine. Collections are Iterable, and each call to iterator() produces a distinct Iterator instance that is independent of any other active instances, and each traverses the collection independently. Great.

现在如果源是一次性的,比如从文件中读取行,该怎么办?也许第一个Iterator应该获得所有值,但第二个和后续的Iterator应该是空的。也许这些值应该在迭代器之间交错。或者也许每个迭代器都应该得到所有相同的值。那么,如果你有两个迭代器而另一个比另一个更远呢?有人必须缓冲第二个迭代器中的值,直到它们被读取为止。更糟糕的是,如果你得到一个迭代器并读取所有值,只有然后得到第二个迭代器。这些价值从何而来?有没有要求所有人都被缓存以防万一有人想要第二个迭代器?

Now what if the source is one-shot, like reading lines from a file? Maybe the first Iterator should get all the values but the second and subsequent ones should be empty. Maybe the values should be interleaved among the Iterators. Or maybe each Iterator should get all the same values. Then, what if you have two iterators and one gets farther ahead of the other? Somebody will have to buffer up the values in the second Iterator until they're read. Worse, what if you get one Iterator and read all the values, and only then get a second Iterator. Where do the values come from now? Is there a requirement for them all to be buffered up just in case somebody wants a second Iterator?

显然,允许多个迭代器超过一个-shot来源提出了很多问题。我们没有他们的好答案。如果你两次调用 iterator(),我们需要一致的,可预测的行为。这促使我们禁止多次遍历,使管道一次性。

Clearly, allowing multiple Iterators over a one-shot source raises a lot of questions. We didn't have good answers for them. We wanted consistent, predictable behavior for what happens if you call iterator() twice. This pushed us toward disallowing multiple traversals, making the pipelines one-shot.

我们还观察到其他人遇到了这些问题。在JDK中,大多数Iterables是集合或类似集合的对象,它们允许多次遍历。它没有在任何地方指定,但似乎有一个不成文的期望Iterables允许多次遍历。一个值得注意的例外是NIO DirectoryStream 接口。它的规范包括这个有趣的警告:

We also observed others bumping into these issues. In the JDK, most Iterables are collections or collection-like objects, which allow multiple traversal. It isn't specified anywhere, but there seemed to be an unwritten expectation that Iterables allow multiple traversal. A notable exception is the NIO DirectoryStream interface. Its specification includes this interesting warning:


虽然DirectoryStream扩展了Iterable,但它不是通用的Iterable,因为它只支持一个单个迭代器;调用iterator方法获取第二个或后续迭代器会抛出IllegalStateException。

[原文加粗]

这似乎很不寻常和令人不快,我们不想创建一堆可能只有一次的新Iterables。这促使我们远离使用Iterable。

This seemed unusual and unpleasant enough that we didn't want to create a whole bunch of new Iterables that might be once-only. This pushed us away from using Iterable.

关于这一次,布鲁斯·埃克尔的文章出现了描述他与斯卡拉有过的麻烦。他写了这段代码:

About this time, an article by Bruce Eckel appeared that described a spot of trouble he'd had with Scala. He'd written this code:

// Scala
val lines = fromString(data).getLines
val registrants = lines.map(Registrant)
registrants.foreach(println)
registrants.foreach(println)

这很简单。它将文本行解析为注册人对象并将其打印两次。除了它实际上只打印出一次。事实证明,他认为注册人是一个集合,实际上它是一个迭代器。第二次调用 foreach 遇到一个空迭代器,所有值都已用完,所以它什么都没打印。

It's pretty straightforward. It parses lines of text into Registrant objects and prints them out twice. Except that it actually only prints them out once. It turns out that he thought that registrants was a collection, when in fact it's an iterator. The second call to foreach encounters an empty iterator, from which all values have been exhausted, so it prints nothing.

这种经验使我们确信,如果尝试多次遍历,那么获得明显可预测的结果非常重要。它还强调了区分类似管道的惰性结构与存储数据的实际集合的重要性。这反过来又将延迟管道操作分离到新的Stream接口,并且只在集合上直接保留急切的变异操作。 Brian Goetz已解释其理由。

This kind of experience convinced us that it was very important to have clearly predictable results if multiple traversal is attempted. It also highlighted the importance of distinguishing between lazy pipeline-like structures from actual collections that store data. This in turn drove the separation of the lazy pipeline operations into the new Stream interface and keeping only eager, mutative operations directly on Collections. Brian Goetz has explained the rationale for that.

怎么样?允许对基于集合的管道进行多次遍历,但是对于非基于集合的管道不允许这样做?这是不一致的,但这是明智的。如果您正在从网络中读取值,当然,则无法再次遍历它们。如果你想多次遍历它们,你必须明确地将它们拉到一个集合中。

What about allowing multiple traversal for collection-based pipelines but disallowing it for non-collection-based pipelines? It's inconsistent, but it's sensible. If you're reading values from the network, of course you can't traverse them again. If you want to traverse them multiple times, you have to pull them into a collection explicitly.

但是让我们探索允许从基于集合的管道进行多次遍历。假设你这样做了:

But let's explore allowing multiple traversal from collections-based pipelines. Let's say you did this:

Iterable<?> it = source.filter(...).map(...).filter(...).map(...);
it.into(dest1);
it.into(dest2);

进入操作现在拼写为 collect(toList())。)

如果source是一个集合,那么第一个 into()调用将创建一个迭代链返回源,执行管道操作,并将结果发送到目标。第二次调用进入()将创建另一个迭代器链,并再次执行管道操作 。这显然不是错误的,但它确实具有为每个元素第二次执行所有过滤器和映射操作的效果。我想很多程序员会对这种行为感到惊讶。

If source is a collection, then the first into() call will create a chain of Iterators back to the source, execute the pipeline operations, and send the results into the destination. The second call to into() will create another chain of Iterators, and execute the pipeline operations again. This isn't obviously wrong but it does have the effect of performing all the filter and map operations a second time for each element. I think many programmers would have been surprised by this behavior.

正如我上面提到的,我们一直在与Guava开发人员交谈。他们有一个很酷的事情是 Idea Graveyard ,他们描述了他们决定的功能<强>不实施原因。懒惰收藏的想法听起来很酷,但这就是他们对它的看法。考虑一个 List.filter()操作,该操作返回 List

As I mentioned above, we had been talking to the Guava developers. One of the cool things they have is an Idea Graveyard where they describe features that they decided not to implement along with the reasons. The idea of lazy collections sounds pretty cool, but here's what they have to say about it. Consider a List.filter() operation that returns a List:


这里最大的担忧是太多的操作成为昂贵的线性时间命题。如果要过滤列表并获取列表,而不仅仅是Collection或Iterable,则可以使用 ImmutableList.copyOf(Iterables.filter(list,predicate)),预先声明它正在做什么以及它有多贵。

The biggest concern here is that too many operations become expensive, linear-time propositions. If you want to filter a list and get a list back, and not just a Collection or an Iterable, you can use ImmutableList.copyOf(Iterables.filter(list, predicate)), which "states up front" what it's doing and how expensive it is.

举一个具体的例子,费用是多少列表中的 get(0) size()?对于常用的类,如 ArrayList ,它们是O(1)。但是如果你在一个延迟过滤的列表中调用其中一个,它必须在后备列表上运行过滤器,并且突然这些操作是O(n)。更糟糕的是,它必须遍历每次操作的支持列表。

To take a specific example, what's the cost of get(0) or size() on a List? For commonly used classes like ArrayList, they're O(1). But if you call one of these on a lazily-filtered list, it has to run the filter over the backing list, and all of a sudden these operations are O(n). Worse, it has to traverse the backing list on every operation.

这似乎是我们太多懒惰。设置一些操作并推迟实际执行是一回事,直到你Go为止。另一种方法是以隐藏大量重新计算的方式进行设置。

This seemed to us to be too much laziness. It's one thing to set up some operations and defer actual execution until you so "Go". It's another to set things up in such a way that hides a potentially large amount of recomputation.

在建议禁止非线性或无重用流时,< a shref =https://stackoverflow.com/users/4042945/paul-sandoz> Paul Sandoz 描述了潜在后果,允许它们产生意外或混乱的结果。他还提到并行执行会使事情更棘手。最后,我补充说,如果操作意外地执行了多次,或者至少与程序员预期的次数不同,那么带有副作用的管道操作将导致困难和模糊的错误。 (但Java程序员不会编写带副作用的lambda表达式,是吗?做他们吗?)

In proposing to disallow non-linear or "no-reuse" streams, Paul Sandoz described the potential consequences of allowing them as giving rise to "unexpected or confusing results." He also mentioned that parallel execution would make things even trickier. Finally, I'd add that a pipeline operation with side effects would lead to difficult and obscure bugs if the operation were unexpectedly executed multiple times, or at least a different number of times than the programmer expected. (But Java programmers don't write lambda expressions with side effects, do they? DO THEY??)

这就是Java 8 Streams API设计的基本原理允许单次遍历,这需要严格的线性(无分支)管道。它提供跨多个不同流源的一致行为,它清楚地将懒惰与急切操作分开,并且它提供了一种简单的执行模型。

So that's the basic rationale for the Java 8 Streams API design that allows one-shot traversal and that requires a strictly linear (no branching) pipeline. It provides consistent behavior across multiple different stream sources, it clearly separates lazy from eager operations, and it provides a straightforward execution model.

关于 IEnumerable ,我远不是C#和.NET的专家,所以如果我得出任何不正确的结论,我将不胜感激(温和地)。但是,确实会出现 IEnumerable 允许多个遍历在不同的源上表现不同;它允许嵌套 IEnumerable 操作的分支结构,这可能会导致一些重要的重新计算。虽然我理解不同的系统做出不同的权衡,但这些是我们在设计Java 8 Streams API时要避免的两个特征。

With regard to IEnumerable, I am far from an expert on C# and .NET, so I would appreciate being corrected (gently) if I draw any incorrect conclusions. It does appear, however, that IEnumerable permits multiple traversal to behave differently with different sources; and it permits a branching structure of nested IEnumerable operations, which may result in some significant recomputation. While I appreciate that different systems make different tradeoffs, these are two characteristics that we sought to avoid in the design of the Java 8 Streams API.

快速排序示例由OP很有意思,令人费解,我很遗憾地说,有点可怕。调用 QuickSort 需要 IEnumerable 并返回 IEnumerable ,所以在遍历最终的 IEnumerable 之前,实际上没有进行排序。但是,调用似乎是建立了一个 IEnumerables 的树结构,它反映了quicksort将要执行的分区,而不是实际执行它。 (毕竟这是懒惰的计算。)如果源有N个元素,那么树的最宽处是N个元素,并且它将是lg(N)个深度。

The quicksort example given by the OP is interesting, puzzling, and I'm sorry to say, somewhat horrifying. Calling QuickSort takes an IEnumerable and returns an IEnumerable, so no sorting is actually done until the final IEnumerable is traversed. What the call seems to do, though, is build up a tree structure of IEnumerables that reflects the partitioning that quicksort would do, without actually doing it. (This is lazy computation, after all.) If the source has N elements, the tree will be N elements wide at its widest, and it will be lg(N) levels deep.

在我看来 - 再次,我不是C#或.NET专家 - 这将导致某些看似无害的调用,例如通过 ints进行数据透视选择。 (),比他们看起来更贵。当然,在第一级,它是O(1)。但是考虑在树的深处,在右边缘。要计算此分区的第一个元素,必须遍历整个源,即O(N)操作。但由于上面的分区是惰性的,因此必须重新计算它们,需要进行O(lg N)比较。因此选择枢轴将是一个O(N lg N)操作,这与整个分类一样昂贵。

It seems to me -- and once again, I'm not a C# or .NET expert -- that this will cause certain innocuous-looking calls, such as pivot selection via ints.First(), to be more expensive than they look. At the first level, of course, it's O(1). But consider a partition deep in the tree, at the right-hand edge. To compute the first element of this partition, the entire source has to be traversed, an O(N) operation. But since the partitions above are lazy, they must be recomputed, requiring O(lg N) comparisons. So selecting the pivot would be an O(N lg N) operation, which is as expensive as an entire sort.

但是我们实际上并没有排序,直到我们遍历返回 IEnumerable 。在标准快速排序算法中,每个分区级别使分区数量加倍。每个分区只有一半大小,因此每个级别保持O(N)复杂度。分区树是O(lg N)高,因此总工作量是O(N lg N)。

But we don't actually sort until we traverse the returned IEnumerable. In the standard quicksort algorithm, each level of partitioning doubles the number of partitions. Each partition is only half the size, so each level remains at O(N) complexity. The tree of partitions is O(lg N) high, so the total work is O(N lg N).

使用懒惰的IEnumerables树,在底部树有N个分区。计算每个分区需要遍历N个元素,每个元素都需要在树上进行lg(N)比较。要计算树底部的所有分区,那么需要进行O(N ^ 2 lg N)比较。

With the tree of lazy IEnumerables, at the bottom of the tree there are N partitions. Computing each partition requires a traversal of N elements, each of which requires lg(N) comparisons up the tree. To compute all the partitions at the bottom of the tree, then, requires O(N^2 lg N) comparisons.

(这是对的吗?我简直不敢相信有人请帮我查一下。)

(Is this right? I can hardly believe this. Somebody please check this for me.)

无论如何,可以使用 IEnumerable 确实很酷这种方式可以构建复杂的计算结构。但是,如果它确实增加了我认为的计算复杂度,那么看起来这样的编程是应该避免的,除非一个人非常小心。

In any case, it is indeed cool that IEnumerable can be used this way to build up complicated structures of computation. But if it does increase the computational complexity as much as I think it does, it would seem that programming this way is something that should be avoided unless one is extremely careful.

这篇关于为什么Java Streams一次性关闭?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆