从泊松分布中提取数字的性能较低 [英] Performance for drawing numbers from Poisson distribution with low mean

查看:109
本文介绍了从泊松分布中提取数字的性能较低的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

为了从C ++中的泊松分布中提取随机数,通常建议使用

In order to draw random number from a Poisson distribution in C++, it is generally advised to use

RNG_type rng;
std::poisson_distribution<size_t> d(1e-6);
auto r = d(rng);

每次调用std::poisson_distribution对象时,都会消耗整个随机位序列(例如32位,其中 std :: mt19937_64 ).令我惊讶的是,在这么低的平均值(mean = 1e-6)下,绝大多数情况下,只有几位足以确定要返回的值是0.然后可以将其他位缓存起来以备后用.

At each call of the std::poisson_distribution object, an entire sequence of random bits is consumed (e.g. 32 bits with std::mt19937, 64 bits for std::mt19937_64). It strikes me that with such low mean (mean = 1e-6), the vast majority of times, only a few bits are enough to determine that the value to return is 0. The other bits could then be cached for later use.

假设将设置为true的位序列与泊松分布的高返回值相关联,则使用均值1e-6时,任何不以19 true开头的序列都必须返回零!确实,

Assuming that a sequence of bits set to true is associated to a high returned value from the Poisson distribution, when using a mean of 1e-6, any sequence not starting with 19 trues necessarily returns a zero! Indeed,

1 - 1/2^19 < P(0, 1e-6) < 1 - 1/2^20

,其中P(n, r)表示从均值r的泊松分布得出n的概率.一种不浪费比特的算法将使用一半的时间,四分之一的时间,三分之八的时间....

, where P(n, r) denotes the probability of drawing n from a Poisson distribution with mean r. An algorithm that does not waste bits would use one bit half of the time, two bits a quarter of the times, three bits an eighth of the times, ....

是否存在一种可以通过绘制泊松数时消耗尽可能少的位来提高性能的算法?当我们认为平均值较低时,与std::poisson_distribution相比,还有其他方法可以改善性能吗?

Is there an algorithm out there that can improve performance by consuming as few bits as possible when drawing Poisson numbers? Is there another way to improve performance compared to std::poisson_distribution when we consider a low mean?

回应@ Jarod42的评论谁说

In response to @Jarod42's comment who said

想知道如果使用更少的位不会破坏等概率...

Wonder if using fewer bits don't break equiprobability...

我认为这不会破坏等概率.在模糊测试中,我考虑了一个具有简单bernoulli分布的相同问题.我以概率1/2^4采样为true,以概率1 - 1/2^4采样为false.函数drawWithoutWastingBits一旦在缓存中看到一个真值,就会停止,并且函数drawWastingBits会消耗4位,无论这些位是什么.

I don't think it would break equiprobability. In a vague attempt to test it, I consider the same question with a simple bernoulli distribution. I am sampling true with a probability 1/2^4 and sampling false with a probability 1 - 1/2^4. The function drawWithoutWastingBits stops as soon as it sees a true in the cache and the function drawWastingBits consumes 4 bits whatever these bits are.

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <random>

bool drawWithoutWastingBits(std::vector<bool>& cache, size_t& cache_index)
{
    /* 
        Get a true with probability 1/2^4 (=1/16=0.0625) and a false otherwise
    */

    size_t nbTrues = 0;
    while (cache[cache_index])
    {
        ++nbTrues;
        ++cache_index;
        if (nbTrues == 4)
        {
            return true;
        }
    }
    ++cache_index;
    return false;
}


bool drawWastingBits(std::vector<bool>& cache, size_t& cache_index)
{
    /* 
        Get a true with probability 1/2^4 (=1/16=0.0625) and a false otherwise
    */

    bool isAnyTrue = false;
    for (size_t i = 0 ; i < 4; ++i)
    {
        if (cache[cache_index])
        {
            isAnyTrue = true;
        }
        ++cache_index;
    }
    return !isAnyTrue;
}

int main()
{
    /*
        Just cache a lot of bits in advance in `cache`. The same sequence of bits will be used by both function.
        I am just caching way enough bits to make sure they don't run out of bits below
        I made sure to have the same number of zeros and ones so that any deviation is caused by the methodology and not by the RNG
    */

    // Produce cache
    std::vector<bool> cache;
    size_t nbBitsToCache = 1e7;
    cache.reserve(nbBitsToCache);
    for (size_t i = 0 ; i < nbBitsToCache/2 ; ++i)
    {
        cache.push_back(false);
        cache.push_back(true);
    }
    // Shuffle cache
    {
        std::mt19937 mt(std::random_device{}());
        std::shuffle(cache.begin(), cache.end(), mt);
    }


    // Draw without wasting bits
    {
        size_t nbDraws = 1e6;
        size_t cache_index = 0;
        std::pair<size_t, size_t> outcomes = {0,0};
        for (size_t r = 0 ; r < nbDraws ; ++r)
        {
            drawWithoutWastingBits(cache, cache_index) ? ++outcomes.first : ++outcomes.second;
            assert(cache_index <= cache.size());
        }   

        assert(outcomes.first + outcomes.second == nbDraws);
        std::cout << "Draw Without Wasting Bits: prob true = " << (double)outcomes.first / nbDraws << "\n";
    }


    // Draw wasting bits
    {
        size_t nbDraws = 1e6;
        size_t cache_index = 0;
        std::pair<size_t, size_t> outcomes = {0,0};
        for (size_t r = 0 ; r < nbDraws ; ++r)
        {
            drawWastingBits(cache, cache_index) ? ++outcomes.first : ++outcomes.second;
            assert(cache_index <= cache.size());
        }   

        assert(outcomes.first + outcomes.second == nbDraws);
        std::cout << "Draw Wit Wasting Bits: prob true = " << (double)outcomes.first / nbDraws << "\n";
    }
}

可能的输出

Draw Without Wasting Bits: prob true = 0.062832
Draw Wit Wasting Bits: prob true = 0.062363

推荐答案

Devroye的 非统一随机变量生成 (第505页和第86页)提到了按顺序搜索算法进行的反演.

Devroye's Non-Uniform Random Variate Generation, pp. 505 and 86, mentions an inversion by sequential search algorithm.

基于该算法,如果您知道mean远小于1,则如果您在[0,1]中生成统一的随机数u,则如果u <= exp(-mean),则Poisson变量将为0,否则大于0.

Based on that algorithm, if you know the mean is considerably less than 1, then if you generate a uniform random number u in [0, 1], the Poisson variable will be 0 if u <= exp(-mean), and greater than 0 otherwise.

如果均值较低并且可以容忍近似分布,则可以使用以下方法(请参阅"

If the mean is low and you can tolerate an approximate distribution, then you can use the following approach (see Appendix A of "The Discrete Gaussian for Differential Privacy"):

  1. 以有理数的形式表达mean,形式为numer/denom.例如,如果mean是固定值,则可以相应地预先计算numerdenom,例如在编译时.
  2. 随机生成伯努利(numer / denom)数(以概率numer / denom生成1,否则生成0).如果以这种方式生成1,则对Bernoulli(numer / (denom * 2)),Bernoulli(numer / (denom * 3))重复此步骤,依此类推,直到以此方式生成0.使用最小化位浪费的算法来生成这些数字,例如Lumbroso的Fast Dice Roller论文(2013)附录B中提到的算法或从那里修改并在我的此问题.
  3. 如果步骤2产生偶数个,则Poisson变量正好为0.
  4. 如果步骤2产生的奇数个数等于1,则泊松变量大于0,并且有必要使用更慢"的算法来仅对大于0的泊松变量进行采样.
  1. Express mean in the form of a rational number, in the form numer/denom. For example, if mean is a fixed value, then numer and denom can be precalculated accordingly, such as at compile time.
  2. Randomly generate a Bernoulli(numer / denom) number (generate 1 with probability numer / denom or 0 otherwise). If 1 was generated this way, repeat this step with Bernoulli(numer / (denom * 2)), Bernoulli(numer / (denom * 3)), and so on until 0 is generated this way. Generate these numbers using an algorithm that minimizes waste of bits, such as the one mentioned in Appendix B of Lumbroso's Fast Dice Roller paper (2013) or the "ZeroToOne" method modified from there and given in my section on Boolean conditions. See also this question.
  3. If step 2 produced an even number of ones, the Poisson variable is exactly 0.
  4. If step 2 produced an odd number of ones, the Poisson variable is greater than 0, and a "slower" algorithm is necessary that samples only Poisson variables greater than 0.

例如,假设平均值为1e-6(1/1000000),先生成伯努利(1/1000000)数,然后生成伯努利(1/2000000),依此类推,直到您以这种方式生成0.如果生成的偶数个偶数,则Poisson变量正好为0.否则,Poisson变量为1或更大,并且需要更慢"的算法.

For example, say the mean is 1e-6 (1/1000000), Generate a Bernoulli(1/1000000) number, then Bernoulli(1/2000000), etc. until you generate 0 this way. If an even number of ones were generated, then the Poisson variable is exactly 0. Otherwise, the Poisson variable is 1 or greater and a "slower" algorithm is necessary.

下面的算法是一个示例,该算法基于第505页和第86页的算法,但仅对Poisson变量1或更大的样本进行采样:

One example is the algorithm below, which is based on the one from pages 505 and 86, but only samples Poisson variables 1 or greater:

METHOD Poisson1OrGreater(mean)
 sum=Math.exp(-mean)
 prod=sum
 u=RNDRANGE(sum, 1)
 i=0
 while i==0 or u>sum
   prod*=mean/(i+1)
   sum+=prod
   i=i+1
 end
 return i
END METHOD

但是,此方法不是很健壮,特别是因为它使用接近1(浮点空间更稀疏)的数字而不是接近0的数字.

This method, though, is not very robust, especially since it uses numbers close to 1 (where the floating-point space is more sparse) rather than numbers close to 0.

编辑(5月7日):

请注意,n个独立的Poisson(mean)随机数的总和是分布的Poisson(mean*n)(第501页).因此,此答案中的上述讨论适用于n泊松随机数之和,只要它们的均值的n倍较小即可.例如,要生成平均值为1e-6的1000个泊松随机数之和,只需生成一个平均值为0.001的单个泊松随机数.这样可以节省调用随机数生成器的费用.

Note that the sum of n independent Poisson(mean) random numbers is Poisson(mean*n) distributed (p. 501). Thus, the discussion above in this answer applies to a sum of n Poisson random numbers as long as n times their mean remains small. For example, to generate a sum of 1000 Poisson random numbers with a mean of 1e-6, simply generate a single Poisson random number with a mean of 0.001. This will save considerably on calls to the random number generator.

编辑(5月13日):进行了常规编辑.

EDIT (May 13): Edited generally.

这篇关于从泊松分布中提取数字的性能较低的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆