随机根本不是随机的吗? [英] Random is barely random at all?

查看:108
本文介绍了随机根本不是随机的吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我这样做是为了测试randint的随机性:

I did this to test the randomness of randint:

>>> from random import randint
>>>
>>> uniques = []
>>> for i in range(4500):  # You can see I was optimistic.
...     x = randint(500, 5000)
...     if x in uniques:
...         raise Exception('We duped %d at iteration number %d' % (x, i))
...     uniques.append(x)
...
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
Exception: We duped 887 at iteration number 7

我尝试了大约10倍以上,并且获得的最佳结果是在中继器之前进行了121次迭代.这是您可以从标准库中获得的最佳结果吗?

I tried about 10 times more and the best result I got was 121 iterations before a repeater. Is this the best sort of result you can get from the standard library?

推荐答案

生日悖论,或为什么PRNG产生重复的次数超出您的想象.


OP的问题有几个问题在起作用.第一个是如上所述的生日悖论,第二个是您所生成内容的本质,即本身并不能保证不会重复给定的数字.

The Birthday Paradox, or why PRNGs produce duplicates more often than you might think.


There are a couple of issues at play in the OP's problem. One is the birthday paradox as mentioned above and the second is the nature of what you are generating, which does not inherently guarantee that a given number will not be repeated.

生日悖论适用于给定值在生成器周期内多次出现的情况,因此在值样本中可能出现重复项.生日悖论的影响在于,获得此类重复的真实可能性非常大,并且两次重复之间的平均时间比原本可能认为的要短.感知概率与实际概率之间的这种不一致使生日悖论"成为认知偏向的一个很好的例子,其中天真的直观估计很可能是错误的.

The Birthday Paradox applies where given value can occur more than once during the period of the generator - and therefore duplicates can happen within a sample of values. The effect of the Birthday Paradox is that the real likelihood of getting such duplicates is quite significant and the average period between them is smaller than one might otherwise have thought. This dissonance between the perceived and actual probabilities makes the Birthday Paradox a good example of a cognitive bias, where a naive intuitive estimate is likely to be wildly wrong.

伪随机数生成器(PRNG)快速入门

问题的第一部分是您要获取随机数生成器的公开值并将其转换为更小的数字,因此可能值的空间减少了.尽管某些伪随机数生成器在其期间不重复值,但这种转换将域更改为小得多.较小的域会使不重复"条件无效,因此可以预期出现重复的可能性很大.

The first part of your problem is that you are taking the exposed value of a random number generator and converting it to a much smaller number, so the space of possible values is reduced. Although some pseudo-random number generators do not repeat values during their period this transformation changes the domain to a much smaller one. The smaller domain invalidates the 'no repeats' condition so you can expect a significant likelihood of repeats.

某些算法,例如线性同余PRNG (A'=AX|M) do 保证整个期间的唯一性.在LCG中,生成的值包含累加器的整个状态,并且不保留任何其他状态.生成器是确定性的,不能在周期内重复一个数字-任何给定的累加器值都只能表示一个可能的连续值.因此,每个值只能在生成器周期内出现一次.但是,这种PRNG的周期相对较小-对于LCG算法的典型实现,约为2 ^ 30-并且不可能大于不同值的数量.

Some algorithms, such as the linear congruential PRNG (A'=AX|M) do guarantee uniqueness for the entire period. In an LCG the generated value contains the entire state of the accumulator and no additional state is held. The generator is deterministic and cannot repeat a number within the period - any given accumulator value can imply only one possible successive value. Therefore, each value can only occur once within the period of the generator. However, the period of such a PRNG is relatively small - about 2^30 for typical implementations of the LCG algorithm - and cannot possibly be larger than the number of distinct values.

并非所有PRNG算法都具有此特征;有些可以在一段时间内重复给定的值.在OP的问题中, Mersenne Twister 算法(用于Python的

Not all PRNG algorithms share this characteristic; some can repeat a given value within the period. In the OP's problem, the Mersenne Twister algorithm (used in Python's random module) has a very long period - much greater than 2^32. Unlike a Linear Congruential PRNG, the result is not purely a function of the previous output value as the accumulator contains additional state. With 32-bit integer output and a period of ~2^19937, it cannot possibly provide a such a guarantee.

Mersenne Twister是PRNG的一种流行算法,因为它具有良好的统计和几何特性,并且具有很长的周期-这是用于仿真模型的PRNG的理想特性.

The Mersenne Twister is a popular algorithm for PRNGs because it has good statistical and geometric properties and a very long period - desirable characteristics for a PRNG used on simulation models.

  • 良好的统计属性表示算法生成的数字是均匀分布,没有数字出现的可能性比其他数字高得多.统计属性差可能会导致结果出现不希望的偏斜.

  • Good statistical properties mean that the numbers generated by the algorithm are evenly distributed with no numbers having a significantly higher probability of appearing than others. Poor statistical properties could produce unwanted skew in the results.

好的几何属性意味着可以N个数中的N个不在N维空间的超平面上.不良的几何特性会在仿真模型中生成虚假的相关性,并使结果失真.

Good geometric properies mean that sets of N numbers do not lie on a hyperplane in N-dimensional space. Poor geometric properties can generate spurious correlations in a simulation model and distort the results.

较长的时间意味着您可以在序列重新开始之前生成大量数字.如果模型需要大量迭代或必须从多个种子运行,则典型LCG实现中可用的2 ^ 30左右离散数可能不足. MT19337算法的周期非常长-2 ^ 19337-1,或大约10 ^ 5821.相比之下,宇宙中的原子总数估计约为10 ^ 80.

A long period means that you can generate a lot of numbers before the sequence wraps around to the start. If a model needs a large number of iterations or has to be run from several seeds then the 2^30 or so discrete numbers available from a typical LCG implementation may not be sufficient. The MT19337 algorithm has a very long period - 2^19337-1, or about 10^5821. By comparison, the total number of atoms in the universe is estimated at about 10^80.

MT19337 PRNG生成的32位整数可能无法表示足够的离散值,以免在如此大的时间段内重复.在这种情况下,足够大的样本可能会出现重复值,并且不可避免.

The 32-bit integer produced by an MT19337 PRNG cannot possibly represent enough discrete values to avoid repeating during such a large period. In this case, duplicate values are likely to occur and inevitable with a large enough sample.

简而言之生日悖论

此问题最初定义为房间中任何两个人共享同一生日的概率.关键是房间中的 任何两个 人可以分享生日.人们倾向于天真地将问题解释为房间中某人与特定个人分享生日的概率,这是认知偏见,通常会使人们低估这种可能性.这是不正确的假设-不需要将匹配匹配到特定的个人,并且任何两个个人都可以匹配.

This problem is originally defined as the probability of any two people in the room sharing the same birthday. The key point is that any two people in the room could share a birthday. People tend to naively misinterpret the problem as the probability of someone in the room sharing a birthday with a specific individual, which is the source of the cognitive bias that often causes people to underestimate the probability. This is the incorrect assumption - there is no requirement for the match to be to a specific individual and any two individuals could match.

在任何两个人之间发生匹配的可能性要比与某个特定个体发生匹配的可能性高得多,因为该匹配不一定要到特定的日期.而是,您只需找到两个共享相同生日的人.从该图(可以在该主题的Wikipedia页面上找到)中,我们可以看到房间中只需要23个人,就有50%的机会找到以这种方式匹配的两个人.

The probability of a match occurring between any two individuals is much higher than the probability of a match to a specific individual as the match does not have to be to a specific date. Rather, you only have to find two individuals that share the same birthday. From this graph (which can be found on the Wikipedia page on the subject), we can see that we only need 23 people in the room for there to be a 50% chance of finding two that match in this way.

关于该主题的维基百科条目中,我们可以获得

From the Wikipedia entry on the subject we can get a nice summary. In the OP's problem, we have 4,500 possible 'birthdays', rather than 365. For a given number of random values generated (equating to 'people') we want to know the probability of any two identical values appearing within the sequence.

计算生日悖论"对OP问题的可能影响

对于100个数字的序列,我们有 可能匹配的对(请参见了解问题).第二,第三等,第二个可以匹配第三,第四等,依此类推),因此可能匹配的组合的数量不仅仅是100.

For a sequence of 100 numbers, we have pairs (see Understanding the Problem) that could potentially match (i.e. the first could match with the second, third etc., the second could match the third, fourth etc. and so on), so the number of combinations that could potentially match is rather more than just 100.

从 .下面的以下Python代码片段对匹配对的发生概率进行了简单的评估.

From . The following snippet of Python code below does a naive evaluation of the probability of a matching pair occurring.

# === birthday.py ===========================================
#
from math import log10, factorial

PV=4500          # Number of possible values
SS=100           # Sample size

# These intermediate results are exceedingly large numbers;
# Python automatically starts using bignums behind the scenes.
#
numerator = factorial (PV)          
denominator = (PV ** SS) * factorial (PV - SS)

# Now we need to get from bignums to floats without intermediate
# values too large to cast into a double.  Taking the logs and 
# subtracting them is equivalent to division.
#  
log_prob_no_pair = log10 (numerator) - log10 (denominator)

# We've just calculated the log of the probability that *NO*
# two matching pairs occur in the sample.  The probability
# of at least one collision is 1.0 - the probability that no 
# matching pairs exist.
#
print 1.0 - (10 ** log_prob_no_pair)

对于从4500个可能值的总体中采样的100个数字中发生的匹配,这会产生合理的外观结果 p = 0.669 . (也许有人可以验证这一点并在错误的地方发表评论).由此可见,OP观察到的匹配数字之间的游程长度似乎很合理.

This produces a sensible looking result of p=0.669 for a match occurring within 100 numbers sampled from a population of 4500 possible values. (Maybe someone could verify this and post a comment if it's wrong). From this we can see that the lengths of runs between matching numbers observed by the OP seem to be quite reasonable.

脚注:使用改组获得唯一的伪随机数序列

请参见下面来自S. Mark的答案获得一种保证唯一的随机数集的方法.张贴者所指的技术采用一组数字(您提供这些数字,以便使它们唯一),然后将它们随机排列.从改组后的数组中依次提取数字,可以得到一系列保证不会重复的伪随机数.

See this answer below from S. Mark for a means of getting a guaranteed unique set of random numbers. The technique the poster refers to takes an array of numbers (which you supply, so you can make them unique) and shuffles them into a random order. Drawing the numbers in sequence from the shuffled array will give you a sequence of pseudo-random numbers that are guaranteed not to repeat.

脚注:加密安全的PRNG

MT算法不是在密码学上安全,因为它相对容易推断出通过观察数字序列来生成发生器.其他算法,例如 Blum Blum Shub 用于加密应用程序,但可能不适用于模拟或一般随机算法号码申请.加密安全的PRNG可能很昂贵(也许需要大数运算),或者可能没有良好的几何特性.对于这种算法,主要要求是通过观察值序列来推断生成器的内部状态在计算上是不可行的.

The MT algorithm is not cryptographically secure as it is relatively easy to infer the internal state of the generator by observing a sequence of numbers. Other algorithms such as Blum Blum Shub are used for cryptographic applications but may be unsuitable for simulation or general random number applications. Cryptographically secure PRNGs may be expensive (perhaps requiring bignum calculations) or may not have good geometric properties. In the case of this type of algorithm, the primary requirement is that it should be computationally infeasible to infer the internal state of the generator by observing a sequence of values.

这篇关于随机根本不是随机的吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆