Do sorted and distinct immediately process the stream?


Question

Imagine I have something that looks like this:

Stream<Integer> stream = Stream.of(2,1,3,5,6,7,9,11,10)
    .distinct()
    .sorted();

The javadocs for both distinct() and sorted() say that each is a "stateful intermediate operation". Does that mean that internally the stream will do something like create a hash set, add all the stream values, and then, on seeing sorted(), throw those values into a sorted list or sorted set? Or is it smarter than that?

In other words, does .distinct().sorted() cause Java to traverse the stream twice, or does Java delay that work until a terminal operation (such as .collect) is performed?

Answer

You have asked a loaded question, implying that there had to be a choice between two alternatives.

The stateful intermediate operations have to store data, in some cases up to the point of storing all elements before being able to pass an element downstream, but that doesn’t change the fact that this work is deferred until a terminal operation has been commenced.

It's also not correct to say that it has to "traverse the stream twice". There are entirely different traversals going on. In the case of sorted(), there is first the traversal of the source, which fills an internal buffer that is then sorted, and second the traversal of that buffer. In the case of distinct(), no second traversal happens in sequential processing; the internal HashSet is only used to determine whether to pass an element downstream.

So when you run

Stream<Integer> stream = Stream.of(2,1,3,5,3)
    .peek(i -> System.out.println("source: "+i))
    .distinct()
    .peek(i -> System.out.println("distinct: "+i))
    .sorted()
    .peek(i -> System.out.println("sorted: "+i));
System.out.println("commencing terminal operation");
stream.forEachOrdered(i -> System.out.println("terminal: "+i));

it prints

commencing terminal operation
source: 2
distinct: 2
source: 1
distinct: 1
source: 3
distinct: 3
source: 5
distinct: 5
source: 3
sorted: 1
terminal: 1
sorted: 2
terminal: 2
sorted: 3
terminal: 3
sorted: 5
terminal: 5

showing that nothing happens before the terminal operation has been commenced and that elements from the source immediately pass the distinct() operation (unless being duplicates), whereas all elements are buffered in the sorted() operation before being passed downstream.
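The behavior seen in that output can be mimicked with ordinary stream building blocks. The following is only a rough sketch for sequential streams, not how OpenJDK implements these operations: distinctSketch, sortedSketch and the class name are hypothetical helpers introduced for illustration. The first forwards each element as soon as the set reports it as new; the second fills a buffer, sorts it, and then traverses the buffer, and neither does any work until a terminal operation starts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StatefulOpsSketch {

    // Hypothetical stand-in for distinct() on a sequential stream:
    // the set only decides whether an element is forwarded, one element at a time.
    static <T> Stream<T> distinctSketch(Stream<T> source) {
        Set<T> seen = new HashSet<>();
        return source.filter(seen::add); // add() returns false for a duplicate
    }

    // Hypothetical stand-in for sorted(): the flatMap function runs only once
    // the terminal operation pulls elements, so the buffering stays deferred.
    static <T extends Comparable<? super T>> Stream<T> sortedSketch(Stream<T> source) {
        return Stream.of(source).flatMap(s -> {
            // first traversal: drain the upstream into a buffer
            List<T> buffer = s.collect(Collectors.toCollection(ArrayList::new));
            Collections.sort(buffer);
            // second traversal: over the sorted buffer, not over the source
            return buffer.stream();
        });
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Stream<Integer> source = Stream.of(2, 1, 3, 5, 3)
            .peek(i -> log.add("source: " + i));
        Stream<Integer> deduplicated = distinctSketch(source)
            .peek(i -> log.add("distinct: " + i));
        Stream<Integer> pipeline = sortedSketch(deduplicated);

        System.out.println("log before terminal operation: " + log); // prints []
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println(result); // prints [1, 2, 3, 5]
        log.forEach(System.out::println); // source/distinct lines interleave as above
    }
}
```

The stateful filter in distinctSketch is only safe because the stream is sequential; a parallel stream would need a thread-safe set and would lose the ordering guarantees of the real distinct().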

It can further be shown that distinct() does not need to traverse the entire stream:

Stream.of(2,1,1,3,5,6,7,9,2,1,3,5,11,10)
    .peek(i -> System.out.println("source: "+i))
    .distinct()
    .peek(i -> System.out.println("distinct: "+i))
    .filter(i -> i>2)
    .findFirst().ifPresent(i -> System.out.println("found: "+i));

prints

source: 2
distinct: 2
source: 1
distinct: 1
source: 1
source: 3
distinct: 3
found: 3

As explained and demonstrated by Jose Da Silva’s answer, the amount of buffering may change with ordered parallel streams, as partial results must be adjusted before they can get passed to downstream operations.
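A minimal way to try this out, as a sketch rather than a reproduction of that other answer (which is not quoted here): run the same pipeline with .parallel() and record the peek() calls in a thread-safe queue. The class and method names below are made up for illustration. The order of the logged lines depends on thread scheduling and can differ between runs, so no expected log is shown; the collected result keeps the encounter order of the source.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ParallelDistinctSketch {

    // Runs distinct() on an ordered parallel stream and records what each
    // peek() stage sees; the queue must be thread-safe because several
    // worker threads may log concurrently.
    static List<Integer> distinctInParallel(Queue<String> log) {
        return Stream.of(2, 1, 1, 3, 5, 6, 7, 9, 2, 1, 3, 5, 11, 10)
            .parallel()
            .peek(i -> log.add("source: " + i))
            .distinct()
            .peek(i -> log.add("distinct: " + i))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Queue<String> log = new ConcurrentLinkedQueue<>();
        List<Integer> result = distinctInParallel(log);
        log.forEach(System.out::println); // order varies from run to run
        System.out.println(result);       // prints [2, 1, 3, 5, 6, 7, 9, 11, 10]
    }
}
```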

Since these operations do not happen before the actual terminal operation is known, more optimizations are possible than currently happen in OpenJDK (they may happen in different implementations or future versions). For example, sorted().toArray() may use and return the same array, or sorted().findFirst() may turn into a min().
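To make the last point concrete, here is a small sketch showing that the two formulations are interchangeable from the caller's point of view, which is what would make such a rewrite legal for an implementation. The class and method names are hypothetical, and nothing here claims that any JDK performs this rewrite today.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class SortedFindFirstVsMin {

    // Sorts the whole input, then takes the first element.
    static Optional<Integer> smallestViaSort(List<Integer> values) {
        return values.stream().sorted().findFirst();
    }

    // Single pass that only tracks the smallest element seen so far.
    static Optional<Integer> smallestViaMin(List<Integer> values) {
        return values.stream().min(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(2, 1, 3, 5, 3);
        System.out.println(smallestViaSort(values)); // prints Optional[1]
        System.out.println(smallestViaMin(values));  // prints Optional[1]
    }
}
```

If only the smallest element is needed, writing min() directly avoids relying on an optimization that the implementation may not perform.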
