模行为背后的数学 [英] mathematics behind modulo behavor

查看:32
本文介绍了模行为背后的数学的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

序言

这个问题与 (P)RNG 和 rand() 的行为无关.这是关于使用对模均匀分布的两个值的幂.

This question is not about the behavior of (P)RNG and rand(). It's about using power of two values uniformly distributed against modulo.

简介

我知道不应该使用模 % 将值从一个范围转换为另一个值,例如从 rand() 功能:会有偏差.这里解释了 https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default 和在这个答案 为什么人们说使用随机数生成器时存在模偏差?

I knew that one should not use modulo % to convert a value from a range to another, for example to get a value between 0 and 5 from the rand() function: there will be a bias. It's explained here https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default and in this answer Why do people say there is modulo bias when using a random number generator?

但是今天在调查了一些看起来有问题的代码之后,我制作了一个工具来演示模的行为:https://gitorious.org/modulo-test/modulo-test/trees/master 发现还不够清楚.

But today after investigating some code which was looking wrong, I've made a tool to demonstrate the behavor of modulo: https://gitorious.org/modulo-test/modulo-test/trees/master and found that's not clear enough.

骰子只有 3 位

我检查了 0..5 范围内的 6 个值.只需要 3 位就可以对这些值进行编码.

I checked with 6 values in range 0..5. Only 3 bits are needed to code those values.

$ ./modulo-test 10000 6 3
interations = 10000, range = 6, bits = 3 (0x00000007)
  [0..7] => [0..5]

theorical occurences    1666.67 probability 0.16666667

   [   0] occurences    2446    probability 0.24460000 ( +46.76%)
   [   1] occurences    2535    probability 0.25350000 ( +52.10%)
   [   2] occurences    1275    probability 0.12750000 ( -23.50%)
   [   3] occurences    1297    probability 0.12970000 ( -22.18%)
   [   4] occurences    1216    probability 0.12160000 ( -27.04%)
   [   5] occurences    1231    probability 0.12310000 ( -26.14%)

  minimum occurences    1216.00 probability 0.12160000 ( -27.04%)
  maximum occurences    2535.00 probability 0.25350000 ( +52.10%)
     mean occurences    1666.67 probability 0.16666667 (  +0.00%)
   stddev occurences     639.43 probability 0.06394256 (  38.37%)

使用 3 位输入,结果确实很糟糕,但表现符合预期.查看答案 https://stackoverflow.com/a/14614899/611560

With 3 bits of input, the results are indeed awful, but behave as expected. See answer https://stackoverflow.com/a/14614899/611560

增加输入位数

令我困惑的是,增加输入位数会使结果不同.您不应忘记增加迭代次数,例如样本数,否则结果可能是错误的(请参阅错误统计).

What puzzled me, was increasing the number of input bits made the results different. You should not forgot to increase the number of iterations, eg the number of sample otherwise the results are likely wrong (see Wrong Statistics).

让我们尝试 4 位:

$ ./modulo-test 20000 6 4
interations = 20000, range = 6, bits = 4 (0x0000000f)
  [0..15] => [0..5]

theorical occurences    3333.33 probability 0.16666667

   [   0] occurences    3728    probability 0.18640000 ( +11.84%)
   [   1] occurences    3763    probability 0.18815000 ( +12.89%)
   [   2] occurences    3675    probability 0.18375000 ( +10.25%)
   [   3] occurences    3721    probability 0.18605000 ( +11.63%)
   [   4] occurences    2573    probability 0.12865000 ( -22.81%)
   [   5] occurences    2540    probability 0.12700000 ( -23.80%)

  minimum occurences    2540.00 probability 0.12700000 ( -23.80%)
  maximum occurences    3763.00 probability 0.18815000 ( +12.89%)
     mean occurences    3333.33 probability 0.16666667 (  +0.00%)
   stddev occurences     602.48 probability 0.03012376 (  18.07%)

让我们尝试 5 位:

$ ./modulo-test 40000 6 5
interations = 40000, range = 6, bits = 5 (0x0000001f)
  [0..31] => [0..5]

theorical occurences    6666.67 probability 0.16666667

   [   0] occurences    7462    probability 0.18655000 ( +11.93%)
   [   1] occurences    7444    probability 0.18610000 ( +11.66%)
   [   2] occurences    6318    probability 0.15795000 (  -5.23%)
   [   3] occurences    6265    probability 0.15662500 (  -6.03%)
   [   4] occurences    6334    probability 0.15835000 (  -4.99%)
   [   5] occurences    6177    probability 0.15442500 (  -7.34%)

  minimum occurences    6177.00 probability 0.15442500 (  -7.34%)
  maximum occurences    7462.00 probability 0.18655000 ( +11.93%)
     mean occurences    6666.67 probability 0.16666667 (  +0.00%)
   stddev occurences     611.58 probability 0.01528949 (   9.17%)

让我们试试 6 位:

$ ./modulo-test 80000 6 6
interations = 80000, range = 6, bits = 6 (0x0000003f)
  [0..63] => [0..5]

theorical occurences   13333.33 probability 0.16666667

   [   0] occurences   13741    probability 0.17176250 (  +3.06%)
   [   1] occurences   13610    probability 0.17012500 (  +2.08%)
   [   2] occurences   13890    probability 0.17362500 (  +4.18%)
   [   3] occurences   13702    probability 0.17127500 (  +2.77%)
   [   4] occurences   12492    probability 0.15615000 (  -6.31%)
   [   5] occurences   12565    probability 0.15706250 (  -5.76%)

  minimum occurences   12492.00 probability 0.15615000 (  -6.31%)
  maximum occurences   13890.00 probability 0.17362500 (  +4.18%)
     mean occurences   13333.33 probability 0.16666667 (  +0.00%)
   stddev occurences     630.35 probability 0.00787938 (   4.73%)

问题

请解释为什么更改输入位(并相应地增加样本计数)时结果不同?这些背后的数学推理是什么?

Please explain me why the results are different when changing the input bits (and increasing the sample count accordingly) ? What is the mathematical reasoning behind these ?

错误的统计数据

在之前版本的问题中,我展示了一个输入为 32bits 并且只有 1000000 次迭代的测试,例如 10^6 个样本,并说我很惊讶得到正确的结果.我很惭愧这是错误的:必须有 N 倍的样本才能有信心获得生成器的所有 2^32 个值.这里 10^6 是与 2^32 相比较小的方法.能够用数学/统计语言解释这一点的人的奖励..

In the previous version of the question, I showed a test with 32bits of input and only 1000000 iterations, eg 10^6 samples, and said I was surprised to get correct results. It was so wrong I'm ashamed of: there must be N times more samples to have confidence to get all 2^32 values of the generator. Here 10^6 is way to small compaired to 2^32. Bonus for people able to explain this in mathematical/statistical language..

这里是错误的结果:

$ ./modulo-test 1000000 6 32
interations = 1000000, range = 6, bits = 32 (0xffffffff)
  [0..4294967295] => [0..5]

theorical occurences  166666.67 probability 0.16666667

   [   0] occurences  166881    probability 0.16688100 (  +0.13%)
   [   1] occurences  166881    probability 0.16688100 (  +0.13%)
   [   2] occurences  166487    probability 0.16648700 (  -0.11%)
   [   3] occurences  166484    probability 0.16648400 (  -0.11%)
   [   4] occurences  166750    probability 0.16675000 (  +0.05%)
   [   5] occurences  166517    probability 0.16651700 (  -0.09%)

  minimum occurences  166484.00 probability 0.16648400 (  -0.11%)
  maximum occurences  166881.00 probability 0.16688100 (  +0.13%)
     mean occurences  166666.67 probability 0.16666667 (  +0.00%)
   stddev occurences     193.32 probability 0.00019332 (   0.12%)

我还是要反复阅读Zed Shaw的优秀文章 "程序员需要学习统计数据,否则我将杀光他们".

推荐答案

本质上,你在做:

(rand() & 7) % 6

假设 rand() 均匀分布在 [0;RAND_MAX],并且 RAND_MAX+1 是 2 的幂.很明显 rand() &7 可以计算为 0, 1, ..., 7,并且结果是等概率的.

Let's assume that rand() is uniformly distributed on [0; RAND_MAX], and that RAND_MAX+1 is a power of two. It is clear that rand() & 7 can evaluate to 0, 1, ..., 7, and that the outcomes are equiprobable.

现在让我们看看当你对结果取模 6 时会发生什么.

Now let's look at what happens when you take the result modulo 6.

  • 0 和 6 映射到 0;
  • 1 和 7 映射到 1;
  • 2 映射到 2;
  • 3 映射到 3;
  • 4 映射到 4;
  • 5 映射到 5.

这就解释了为什么你得到的零和一是得到其他数字的两倍.

This explains why you get twice as many zeroes and ones as you get the other numbers.

同样的事情发生在第二种情况.然而,额外"数字的值要小得多,使得它们的贡献与噪声无法区分.

The same thing is happening in the second case. However, the value of the "extra" numbers is much smaller, making their contribution indistinguishable from noise.

总而言之,如果你有一个在 [0 上均匀分布的整数;M-1],然后对 N 取模,结果将偏向于零,除非 M 可以被 N<整除/代码>.

To summarize, if you have an integer uniformly distributed on [0; M-1], and you take it modulo N, the result will be biased towards zero unless M is divisible by N.

这篇关于模行为背后的数学的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆