Online algorithm for random permutation of N integers


Problem Description

Imagine a standard permute function that takes an integer and returns a vector of the first N natural numbers in a random permutation. If you only need k (<= N) of them, but don't know k beforehand, do you still have to perform an O(N) generation of the permutation? Is there a better algorithm than:

for x in permute(N):
    if f(x):
        break

I'm imagining an API such as:

p = permuter(N)
while True:
    x = next(p)   # lazily draw the next value from the permuter
    if f(x):
        break

where the initialization is O(1) (including memory allocation).

Recommended Answer

This question is often viewed as a choice between two competing algorithms:

  • Strategy FY: A variation on the Fisher-Yates shuffle where one shuffle step is performed for each desired number, and

  • Strategy HT: Keep all generated numbers in a hash table. At each step, random numbers are produced until a number which is not in the hash table is found (a minimal sketch of this strategy follows the list).
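
To make strategy HT concrete, here is a minimal sketch, assuming values are drawn uniformly from [1, N]; the generator name ht_sampler is purely illustrative:

from random import randint

def ht_sampler(N):
    # Strategy HT: remember everything produced so far and reject repeats.
    seen = set()
    while len(seen) < N:
        r = randint(1, N)
        while r in seen:        # rejection loop; slows down as len(seen) approaches N
            r = randint(1, N)
        seen.add(r)
        yield r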

The choice is made based on the relationship between k and N: if k is sufficiently large, strategy FY is used; otherwise, strategy HT. The argument is that if k is small relative to N, maintaining an array of size N wastes space and incurs a large initialization cost. On the other hand, as k approaches N, more and more random numbers need to be discarded, and towards the end producing new values becomes extremely slow.

Of course, you might not know in advance the number of samples which will be requested. In that case, you might pessimistically opt for FY, or optimistically opt for HT, and hope for the best.

In fact, there is no real need for a trade-off, because the FY algorithm can be implemented efficiently with a hash table. There is no need to initialize an array of N integers. Instead, the hash table is used to store only the elements of the array whose values do not correspond with their indices.

(The following description uses 1-based indexing; that seemed to be what the question was looking for. Hopefully it is not full of off-by-one errors. So it generates numbers in the range [1, N]. From here on, I use k for the number of samples which have been requested to date, rather than the number which will eventually be requested.)

At each point in the incremental FY algorithm a single index r is chosen at random from the range [k, N]. Then the values at indices k and r are swapped, after which k is incremented for the next iteration.
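
For concreteness, here is that incremental step applied to an explicit array (the name fy_sampler is illustrative); it pays exactly the O(N) initialization cost that the question wants to avoid:

from random import randint

def fy_sampler(N):
    # Strategy FY on a real array: O(N) setup, then O(1) work per sample.
    a = list(range(1, N + 1))        # a[i-1] holds the value at 1-based index i
    for k in range(1, N + 1):
        r = randint(k, N)            # random index in [k, N]
        a[k - 1], a[r - 1] = a[r - 1], a[k - 1]
        yield a[k - 1]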

As an efficiency point, note that we don't really need to do the swap: we simply yield the value at r and then set the value at r to be the value at k. We'll never again look at the value at index k so there is no point updating it.

Initially, we simulate the array with a hash table. To look up the value at index i in the (virtual) array, we see if i is present in the hash table: if so, that's the value at index i. Otherwise the value at index i is i itself. We start with an empty hash table (which saves initialization costs), which represents an array whose value at every index is the index itself.
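
In Python terms the virtual lookup is a single dictionary access with the index itself as the default (the helper name is illustrative):

def virtual_get(diffs, i):
    # Value at index i of the virtual array; untouched indices hold themselves.
    return diffs.get(i, i)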

To do the FY iteration, for each sample index k we generate a random index r as above, yield the value at that index, and then set the value at index r to the value at index k. That's exactly the procedure described above for FY, except for the way we look up values.

This requires exactly two hash-table lookups, one insertion (at an already looked-up index, which in theory can be done more quickly), and one random number generation for each iteration. That's one more lookup than strategy HT's best case, but we have a bit of a saving because we never need to loop to produce a value. (There is another small potential saving when we rehash because we can drop any keys smaller than the current value of k.)

As the algorithm proceeds, the hash table will grow; a standard exponential rehashing strategy is used. At some point, the hash table will reach the size of a vector of N-k integers. (Because of hash table overhead, this point will be reached at a value of k much less than N, but even if there were no overhead this threshold would be reached at N/2.) At that point, instead of rehashing, the hash is used to create the tail of the now non-virtual array, a procedure which takes less time than a rehash and never needs to be repeated; remaining samples will be selected using the standard incremental FY algorithm.

This solution is slightly slower than FY if k eventually reaches the threshold point, and slightly slower than HT if k never gets big enough for random numbers to be rejected. But it is not much slower in either case, and it never suffers from a pathological slowdown when k has an awkward value.

In case that was not clear, here is a rough Python implementation:

from random import randint
def sampler(N):
  k = 1
  # First phase: Use the hash
  diffs = {}

  # Only do this until the hash table is smallish (See note)
  while k < N // 4:
    r = randint(k, N)
    yield diffs[r] if r in diffs else r
    diffs[r] = diffs[k] if k in diffs else k
    k += 1
  # Second phase: Create the vector, ignoring keys less than k
  vbase = k
  v = list(range(vbase, N+1))
  for i, s in diffs.items():
    if i >= vbase:
      v[i - vbase] = s
  del diffs
  # Now we can generate samples until we hit N
  while k <= N:
    r = randint(k, N)
    rv = v[r - vbase]
    v[r - vbase] = v[k - vbase]
    yield rv
    k += 1

Note: N // 4 is probably pessimistic; computing the correct value would require knowing too much about hash-table implementation. If I really cared about speed, I'd write my own hash table implementation in a compiled language, and then I'd know :)
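
As a usage sketch, the generator can be consumed lazily, so only as much work is done as there are samples actually requested; exhausting it yields a full permutation of 1..N:

from itertools import islice

first_ten = list(islice(sampler(1000), 10))   # roughly 10 steps of work, not 1000
print(first_ten)

assert sorted(sampler(1000)) == list(range(1, 1001))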
