使用 Streams API 对集合中的 n 个随机不同元素执行操作 [英] Perform operation on n random distinct elements from Collection using Streams API

查看:24
本文介绍了使用 Streams API 对集合中的 n 个随机不同元素执行操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用 Java 8 中的 Streams API 从集合中检索 n 个唯一的随机元素以进行进一步处理,但没有多少运气.

I'm attempting to retrieve n unique random elements for further processing from a Collection using the Streams API in Java 8, however, without much or any luck.

更准确地说,我想要这样的东西:

More precisely I'd want something like this:

Set<Integer> subList = new HashSet<>();
Queue<Integer> collection = new PriorityQueue<>();
collection.addAll(Arrays.asList(1,2,3,4,5,6,7,8,9));
Random random = new Random();
int n = 4;
while (subList.size() < n) {
  subList.add(collection.get(random.nextInt()));
}
sublist.forEach(v -> v.doSomethingFancy());

我想尽可能高效地完成它.

I want to do it as efficiently as possible.

这能做到吗?

我的第二次尝试——虽然不完全是我的目标:

edit: My second attempt -- although not exactly what I was aiming for:

List<Integer> sublist = new ArrayList<>(collection);
Collections.shuffle(sublist);
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());

edit:第三次尝试(灵感来自 Holger),如果 coll.size() 很大,n 很小:

edit: Third attempt (inspired by Holger), which will remove a lot of the overhead of shuffle if coll.size() is huge and n is small:

int n = // unique element count
List<Integer> sublist = new ArrayList<>(collection);   
Random r = new Random();
for(int i = 0; i < n; i++)
    Collections.swap(sublist, i, i + r.nextInt(source.size() - i));
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());

推荐答案

改组方法的效果相当好,正如 fge 所建议的a> 在 评论走走在另一个答案.这是改组方法的通用版本:

The shuffling approach works reasonably well, as suggested by fge in a comment and by ZouZou in another answer. Here's a generified version of the shuffling approach:

static <E> List<E> shuffleSelectN(Collection<? extends E> coll, int n) {
    assert n <= coll.size();
    List<E> list = new ArrayList<>(coll);
    Collections.shuffle(list);
    return list.subList(0, n);
}

我会注意到使用 subList 比获取流然后调用 limit(n) 更可取,如其他一些答案所示,因为结果流具有一个已知的大小,可以更有效地分割.

I'll note that using subList is preferable to getting a stream and then calling limit(n), as shown in some other answers, because the resulting stream has a known size and can be split more efficiently.

改组方法有几个缺点.它需要复制出所有元素,然后需要将所有元素打乱.如果元素总数很大而要选择的元素数量很少,这可能会非常昂贵.

The shuffling approach has a couple disadvantages. It needs to copy out all the elements, and then it needs to shuffle all the elements. This can be quite expensive if the total number of elements is large and the number of elements to be chosen is small.

OP 和其他几个答案建议的一种方法是随机选择元素,同时拒绝重复元素,直到选择了所需数量的唯一元素.如果要选择的元素数量相对于总数来说很小,这很有效,但随着要选择的数量增加,这会减慢很多,因为选择重复项的可能性也会增加.

An approach suggested by the OP and by a couple other answers is to choose elements at random, while rejecting duplicates, until the desired number of unique elements has been chosen. This works well if the number of elements to choose is small relative to the total, but as the number to choose rises, this slows down quite a bit because of the likelihood of choosing duplicates rises as well.

如果有一种方法可以在输入元素的空间上进行一次遍历并准确选择所需的数字,并且随机均匀地做出选择,那不是很好吗?事实证明,和往常一样,答案可以在 Knuth 中找到.参见 TAOCP 第 2 卷,第 3.4.2 节,随机采样和洗牌,算法 S.

Wouldn't it be nice if there were a way to make a single pass over the space of input elements and choose exactly the number wanted, with the choices made uniformly at random? It turns out that there is, and as usual, the answer can be found in Knuth. See TAOCP Vol 2, sec 3.4.2, Random Sampling and Shuffling, Algorithm S.

简单来说,算法就是访问每个元素,根据访问的元素个数和选择的元素个数来决定是否选择.在 Knuth 的表示法中,假设您有 N 个元素,并且您想随机选择 n 个元素.应该以概率选择下一个元素

Briefly, the algorithm is to visit each element and decide whether to choose it based on the number of elements visited and the number of elements chosen. In Knuth's notation, suppose you have N elements and you want to choose n of them at random. The next element should be chosen with probability

(n - m)/(N - t)

其中 t 是到目前为止访问的元素数,m 是迄今为止选择的元素数.

where t is the number of elements visited so far, and m is the number of elements chosen so far.

这将给出所选元素的均匀分布一点都不明显,但显然确实如此.证明留给读者作为练习;请参阅本节的练习 3.

It's not at all obvious that this will give a uniform distribution of chosen elements, but apparently it does. The proof is left as an exercise to the reader; see Exercise 3 of this section.

鉴于此算法,通过循环遍历集合并基于随机测试添加到结果列表中,在传统"Java 中实现它是非常简单的.OP 询问了使用流的问题,所以这是一个尝试.

Given this algorithm, it's pretty straightforward to implement it in "conventional" Java by looping over the collection and adding to the result list based on the random test. The OP asked about using streams, so here's a shot at that.

算法 S 显然不适合 Java 流操作.它完全按顺序描述,关于是否选择当前元素的决定取决于一个随机决定加上从所有先前决定派生的状态.这可能使它看起来本质上是连续的,但我之前一直错了.我只想说,如何让这个算法并行运行并不是很明显.

Algorithm S doesn't lend itself obviously to Java stream operations. It's described entirely sequentially, and the decision about whether to select the current element depends on a random decision plus state derived from all previous decisions. That might make it seem inherently sequential, but I've been wrong about that before. I'll just say that it's not immediately obvious how to make this algorithm run in parallel.

不过,有一种方法可以使此算法适应流.我们需要的是一个状态谓词.这个谓词将根据当前状态确定的概率返回一个随机结果,并且状态将被更新——是的,变异——基于这个随机结果.这似乎很难并行运行,但至少在它从并行流运行的情况下很容易使线程安全:只需使其同步即可.但是,如果流是并行的,它将降级为顺序运行.

There is a way to adapt this algorithm to streams, though. What we need is a stateful predicate. This predicate will return a random result based on a probability determined by the current state, and the state will be updated -- yes, mutated -- based on this random result. This seems hard to run in parallel, but at least it's easy to make thread-safe in case it's run from a parallel stream: just make it synchronized. It'll degrade to running sequentially if the stream is parallel, though.

实现非常简单.Knuth 的描述使用了 0 到 1 之间的随机数,但是 Java Random 类让我们可以在半开区间内选择一个随机整数.因此,我们需要做的就是保持计数器有多少元素可以访问,还有多少元素可以选择,等等:

The implementation is pretty straightforward. Knuth's description uses random numbers between 0 and 1, but the Java Random class lets us choose a random integer within a half-open interval. Thus all we need to do is keep counters of how many elements are left to visit and how many are left to choose, et voila:

/**
 * A stateful predicate that, given a total number
 * of items and the number to choose, will return 'true'
 * the chosen number of times distributed randomly
 * across the total number of calls to its test() method.
 */
static class Selector implements Predicate<Object> {
    int total;  // total number items remaining
    int remain; // number of items remaining to select
    Random random = new Random();

    Selector(int total, int remain) {
        this.total = total;
        this.remain = remain;
    }

    @Override
    public synchronized boolean test(Object o) {
        assert total > 0;
        if (random.nextInt(total--) < remain) {
            remain--;
            return true;
        } else {
            return false;
        }
    }
}

现在我们有了我们的谓词,它很容易在流中使用:

Now that we have our predicate, it's easy to use in a stream:

static <E> List<E> randomSelectN(Collection<? extends E> coll, int n) {
    assert n <= coll.size();
    return coll.stream()
        .filter(new Selector(coll.size(), n))
        .collect(toList());
}

在 Knuth 的同一部分中也提到的另一种选择建议以 n/N 的恒定概率随机选择一个元素.如果您不需要正好选择 n 个元素,这很有用.它会平均选择 n 个元素,但当然会有一些变化.如果这是可以接受的,状态谓词就会变得简单得多.我们可以简单地创建随机状态并从局部变量中捕获它,而不是编写整个类:

An alternative also mentioned in the same section of Knuth suggests choosing an element at random with a constant probability of n / N. This is useful if you don't need to choose exactly n elements. It'll choose n elements on average, but of course there will be some variation. If this is acceptable, the stateful predicate becomes much simpler. Instead of writing a whole class, we can simply create the random state and capture it from a local variable:

/**
 * Returns a predicate that evaluates to true with a probability
 * of toChoose/total.
 */
static Predicate<Object> randomPredicate(int total, int toChoose) {
    Random random = new Random();
    return obj -> random.nextInt(total) < toChoose;
}

要使用它,请将上面流管道中的 filter 行替换为

To use this, replace the filter line in the stream pipeline above with

        .filter(randomPredicate(coll.size(), n))

最后,为了比较,这里是使用传统 Java 编写的选择算法的实现,即使用 for 循环并添加到集合中:

Finally, for comparison purposes, here's an implementation of the selection algorithm written using conventional Java, that is, using a for-loop and adding to a collection:

static <E> List<E> conventionalSelectN(Collection<? extends E> coll, int remain) {
    assert remain <= coll.size();
    int total = coll.size();
    List<E> result = new ArrayList<>(remain);
    Random random = new Random();

    for (E e : coll) {
        if (random.nextInt(total--) < remain) {
            remain--;
            result.add(e);
        }
    }            

    return result;
}

这很简单,这并没有什么问题.它比流方法更简单、更独立.尽管如此,流方法展示了一些有趣的技术,这些技术可能在其他上下文中有用.

This is quite straightforward, and there's nothing really wrong with this. It's simpler and more self-contained than the stream approach. Still, the streams approach illustrates some interesting techniques that might be useful in other contexts.

参考:

Knuth, Donald E. 计算机编程的艺术:第 2 卷,半数值算法,第 2 版. 版权所有 1981,1969 Addison-Wesley.

Knuth, Donald E. The Art of Computer Programming: Volume 2, Seminumerical Algorithms, 2nd edition. Copyright 1981, 1969 Addison-Wesley.

这篇关于使用 Streams API 对集合中的 n 个随机不同元素执行操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆