返回很大范围内的非重复随机值 [英] Return non-duplicate random values from a very large range

查看:67
本文介绍了返回很大范围内的非重复随机值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想要一个函数,该函数将从n个整数(从0到n-1)的集合中产生k个伪随机值,而无需重复任何先前的结果. k小于或等于n. O(n)内存是不可接受的,因为n的大小很大,而且我需要重新洗牌的频率.

I would like a function that will produce k pseudo-random values from a set of n integers, zero to n-1, without repeating any previous result. k is less than or equal to n. O(n) memory is unacceptable because of the large size of n and the frequency with which I'll need to re-shuffle.

这些是我到目前为止考虑过的方法:

These are the methods I've considered so far:

阵列: 通常,如果我想要无重复的随机值,则可以对数组进行混洗,但这是O(n)内存. n可能太大而无法工作.

Array: Normally if I wanted duplicate-free random values I'd shuffle an array, but that's O(n) memory. n is likely to be too large for that to work.

long nextvalue(void) {
  static long array[4000000000];
  static int s = 0;
  if (s == 0) {
    for (int i = 0; i < 4000000000; i++) array[i] = i;
    shuffle(array, 4000000000);
  }
  return array[s++];
}

n状态PRNG : 可以设计多种随机数生成器,使其具有n的周期,并在该周期内访问n的唯一状态.最简单的示例是:

n-state PRNG: There are a variety of random number generators that can be designed so as to have a period of n and to visit n unique states over that period. The simplest example would be:

long nextvalue(void) {
static long s = 0;
static const long i = 1009; // assumed co-prime to n
  s = (s + i) % n;
  return s;
}

问题是,对于给定的n,动态设计良好的PRNG不一定容易,并且如果没有很多可变参数,PRNG也不太可能近似随机播放(甚至更难设计).但是也许有一个我不知道的好东西.

The problem with this is that it's not necessarily easy to design a good PRNG on the fly for a given n, and it's unlikely that that PRNG will approximate a fair shuffle if it doesn't have a lot of variable parameters (even harder to design). But maybe there's a good one I don't know about.

m位哈希: 如果集合的大小是2的幂,则可以设计一个完美的哈希函数f(),该函数执行范围内任何值到范围内其他值的1:1映射,其中每个输入都会产生一个唯一的输出.使用此功能,我可以简单地维护一个静态计数器s,并将生成器实现为:

m-bit hash: If the size of the set is a power of two, then it's possible to devise a perfect hash function f() which performs a 1:1 mapping from any value in the range to some other value in the range, where every input produces a unique output. Using this function I could simply maintain a static counter s, and implement a generator as:

long nextvalue(void) {
  static long s = 0;
  return f(s++);
}

这不是理想的,因为结果的顺序是由f()确定的,而不是由随机值确定的,因此它会遇到与上述相同的所有问题.

This isn't ideal because the order of the results is determined by f(), rather than random values, so it's subject to all the same problems as above.

NPOT哈希: 原则上,我可以使用与上述相同的设计原理来定义f()的版本,该版本可以在任意基础或什至是复合基础上工作,并且与所需范围兼容;但这可能很困难,而且我很可能会弄错.相反,可以为大于或等于n的2的下一个幂定义一个函数,并在此构造中使用该函数:

NPOT hash: In principle I can use the same design principles as above to define a version of f() which works in an arbitrary base, or even a composite, that is compatible with the range needed; but that's potentially difficult, and I'm likely to get it wrong. Instead a function can be defined for the next power of two greater than or equal to n, and used in this construction:

long nextvalue(void) {
  static long s = 0;
  long x = s++;
  do { x = f(x); } while (x >= n);
}

但是这个 still 也有相同的问题(不太可能给出近似的随机洗牌).

But this still have the same problem (unlikely to give a good approximation of a fair shuffle).

是否有更好的方法来处理这种情况?或者,也许我只需要f()的一个好函数,该函数高度可参数化并且易于设计以精确地访问n离散状态.

Is there a better way to handle this situation? Or perhaps I just need a good function for f() that is highly parameterisable and easy to design to visit exactly n discrete states.

我要考虑的是类似哈希的操作,我试图通过精心设计的映射使第一个j结果完全随机,然后jk之间的任何结果都可以简单地推断出来这种模式(尽管以可预测的方式).然后可以选择值j来找到合理的改组和可容忍的内存占用之间的折衷.

One thing I'm thinking of is a hash-like operation where I contrive to have the first j results perfectly random through carefully designed mapping, and then any results between j and k would simply extrapolate on that pattern (albeit in a predictable way). The value j could then be chosen to find a compromise between a fair shuffle and a tolerable memory footprint.

推荐答案

首先,对使用O(n)内存的任何东西都打折扣,然后讨论引用底层数组的解决方案似乎是不合理的.您有一个数组.随机播放.如果那行不通或不够快,请向我们提出有关此问题的信息.

First of all, it seems unreasonable to discount anything that uses O(n) memory and then discuss a solution that refers to an underlying array. You have an array. Shuffle it. If that doesn't work or isn't fast enough, come back to us with a question about it.

您只需要执行一次完整的改组.之后,从索引n中绘制,将该元素与随机位于其前面的元素交换,并增加n(以模数为单位).例如,对于如此大的数据集,我将使用类似的.

You only need to perform a complete shuffle once. After that, draw from index n, swap that element with an element located randomly before it and increase n, modulo element count. For example, with such a large dataset I'd use something like this.

素数是散列的一种选择,但可能与您的想法不同.使用两个Mersenne质数(lowhigh,也许是0xefff0xefffffff),您应该能够提出一种更加通用的哈希算法.

Prime numbers are an option for hashes, but probably not the same way you think. Using two Mersenne primes (low and high, perhaps 0xefff and 0xefffffff) you should be able to come up with a much more general-purpose hashing algorithm.

size_t hash(unsigned char *value, size_t value_size, size_t low, size_t high) {
    size_t x = 0;
    while (value_size--) {
        x += *value++;
        x *= low;
    }
    return x % high;
}
#define hash(value, value_size, low, high) (hash((void *) value, value_size, low, high))

例如,对于大于约两个八位字节的所有输入,这应该会产生一些分布良好的东西,但零字节前缀会出现一些麻烦的异常情况.您可能想以不同的方式对待它们.

This should produce something fairly well distributed for all inputs larger than about two octets for example, with the minor troublesome exception for zero byte prefixes. You might want to treat those differently.

这篇关于返回很大范围内的非重复随机值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆