Why does numpy.random.choice not use arithmetic coding?


Problem description

If I evaluate something like:

numpy.random.choice(2, size=100000, p=[0.01, 0.99])

using one uniformly-distributed random float, say r, and deciding whether r < 0.01, will presumably waste many of the random bits (entropy) generated. I've heard (second-hand) that generating pseudo-random numbers is computationally expensive, so I assumed that numpy would not be doing that, and would instead use a scheme like arithmetic coding in this case.
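
To put a number on the wasted entropy, here is a rough back-of-the-envelope estimate (my own illustration, not anything numpy reports):

import math

p = [0.01, 0.99]
# Shannon entropy of one sample, in bits: about 0.0808
entropy_per_sample = -sum(q * math.log2(q) for q in p)
print(entropy_per_sample)
# A double has a 53-bit mantissa, so drawing one uniform float per sample
# consumes roughly 53 random bits, several hundred times the information
# actually contained in the sample.
print(53 / entropy_per_sample)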

However, at first glance it appears that choice does indeed generate a float for every sample it is asked for. Further, a quick timeit experiment shows that generating n uniform floats is actually quicker than drawing n samples from p=[0.01, 0.99].

>>> timeit.timeit(lambda : numpy.random.choice(2, size=100000, p=[0.01, 0.99]), number=1000)
1.74494537999999
>>> timeit.timeit(lambda : numpy.random.random(size=100000), number=1000)
0.8165735180009506

Does choice really generate a float for each sample, as it would appear? Would it not significantly improve performance to use some compression algorithm in some cases (particularly if size is large and p is distributed unevenly)? If not, why not?

Solution

As of NumPy 1.17, the reason is largely backward compatibility. See also this question and this question.

As of NumPy 1.17, numpy.random.* functions, including numpy.random.choice, are legacy functions and "SHALL remain the same as they currently are", according to NumPy's new RNG policy, which also introduced a new random generation system for NumPy. The reasons for making them legacy functions include the recommendation to avoid global state. Even so, NumPy did not deprecate any numpy.random.* functions in version 1.17, although a future version of NumPy might.
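
For reference, here is a minimal sketch of the newer Generator-based API that the policy introduced (the seed is arbitrary; whether Generator.choice spends random bits more frugally than the legacy function is a separate question):

import numpy as np

rng = np.random.default_rng(12345)   # new random generation system, NumPy 1.17+
samples = rng.choice(2, size=100000, p=[0.01, 0.99])
print(samples.mean())                # close to 0.99; no module-level global state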

Recall that in your example, numpy.random.choice takes an array of floats as weights. An array of integer weights would lead to more exact random number generation. And although any float could be converted to a rational number (leading to rational-valued weights and thus integer weights), the legacy NumPy version appears not to do this. These and other implementation decisions in numpy.random.choice can't be changed without breaking backward compatibility.
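
As a sketch of that conversion (my own illustration; the legacy implementation does not do this), the exact rational value of each float weight can be recovered and scaled to a common denominator, yielding integer weights (math.lcm needs Python 3.9+):

from fractions import Fraction
import math

floats = [0.01, 0.99]
rationals = [Fraction(x) for x in floats]          # exact binary value of each float
denom = math.lcm(*(r.denominator for r in rationals))
int_weights = [int(r * denom) for r in rationals]  # integer weights, no rounding error
print(int_weights)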

By the way, arithmetic coding is not the only algorithm that seeks to avoid wasting bits. Perhaps the canonical algorithm for sampling from a discrete distribution is the Knuth and Yao algorithm (1976), which exactly chooses a random integer based on the binary expansion of the probabilities involved, and treats the problem as a random walk on a binary tree. (This algorithm uses, on average, at most 2 bits more than the theoretical lower bound.) Any other integer-generating algorithm can ultimately be described in the same way, namely as a random walk on a binary tree. For example, the Fast Loaded Dice Roller is a recent algorithm with a guaranteed bound on the average number of bits it uses (in this case, no more than 6 bits above the theoretical lower bound). The Han and Hoshi algorithm (from 1997) is another of this kind, but uses cumulative probabilities. See also my section, "Weighted Choice With Replacement".
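
To make the random-walk picture concrete, here is a minimal sketch in that style (my own illustration, not the Knuth and Yao or Fast Loaded Dice Roller code itself): it consumes one random bit per step to draw a uniform integer below the total integer weight, then maps it to an outcome through cumulative weights.

import random

def uniform_int_from_bits(n, rand_bit=lambda: random.getrandbits(1)):
    # Uniform integer in [0, n), drawn one random bit at a time:
    # a random walk on a binary tree, using log2(n) + O(1) bits on average.
    v, c = 1, 0
    while True:
        v, c = 2 * v, 2 * c + rand_bit()
        if v >= n:
            if c < n:
                return c
            v, c = v - n, c - n

def weighted_choice_bits(weights):
    # Exact sampling from non-negative integer weights, driven by single
    # random bits rather than one uniform float per sample.
    r = uniform_int_from_bits(sum(weights))
    for i, w in enumerate(weights):
        if r < w:
            return i
        r -= w

print(weighted_choice_bits([1, 99]))   # index 0 with probability exactly 1/100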
