Python Custom Zipf Number Generator Performing Poorly

Problem description

I needed a custom Zipf-like number generator because numpy.random.zipf function doesn't achieve what I need. Firstly, its alpha must be greater than 1.0 and I need an alpha of 0.5. Secondly, its cardinality is directly related to the sample size and I need to make more samples than the cardinality, e.g. make a list of 1000 elements from a Zipfian distribution of only 6 unique values.
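The first limitation is easy to see in practice: numpy's built-in sampler rejects any `a <= 1` outright (a quick check, relying on numpy's documented requirement that the parameter be greater than 1):

```python
import numpy as np

# numpy.random.zipf requires its distribution parameter a > 1,
# so the alpha = 0.5 needed here is rejected.
try:
    np.random.zipf(0.5, size=10)
    raised = False
except ValueError:
    raised = True

print(raised)  # True: the call fails for a <= 1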

@stanga posted a great solution to this.

import random 
import bisect 
import math 

class ZipfGenerator: 

    def __init__(self, n, alpha): 
        # Calculate Zeta values from 1 to n: 
        tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] 
        zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0]) 

        # Store the translation map: 
        self.distMap = [x / zeta[-1] for x in zeta] 

    def next(self): 
        # Take a uniform 0-1 pseudo-random value: 
        u = random.random()  

        # Translate the Zipf variable: 
        return bisect.bisect(self.distMap, u) - 1
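For reference, a quick sanity check of the class above (restated here so the snippet runs standalone, with the `functools` import that Python 3 needs for `reduce` — the original question targets Python 2.6.6, where `reduce` is a builtin):

```python
import bisect
import math
import random
from functools import reduce  # builtin on Python 2; lives in functools on Python 3

class ZipfGenerator:

    def __init__(self, n, alpha):
        # Unnormalized Zipf weights 1/i**alpha for i = 1..n, then their running sums:
        tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n + 1)]
        zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])

        # Normalize the running sums into a CDF-like translation map:
        self.distMap = [x / zeta[-1] for x in zeta]

    def next(self):
        # Invert the CDF at a uniform 0-1 draw:
        u = random.random()
        return bisect.bisect(self.distMap, u) - 1

gen = ZipfGenerator(6, 0.5)
samples = [gen.next() for _ in range(1000)]
print(min(samples), max(samples))  # all 1000 samples fall in 0..5
```

As promised, the cardinality (6) is independent of the sample count (1000), and alpha below 1.0 is accepted.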

The alpha can be less than 1.0 and the sampling can be infinite for a fixed cardinality n. The problem is that it runs too slow.

# Calculate Zeta values from 1 to n: 
tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] 
zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])

These two lines are the culprits. When I choose n=50000 I can generate my list in ~10 seconds. I need to execute this when n=5000000 but it's not feasible. I don't fully understand why this is performing so slow because (I think) it has linear complexity and the floating point operations seem simple. I am using Python 2.6.6 on a good server.
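A plausible explanation (my reading of the code, not stated in the original answer): each step of the `reduce` builds a brand-new list via `sums + [sums[-1] + x]`, copying every partial sum computed so far, so constructing `zeta` costs O(n²) list copies rather than linear work. An in-place append produces the identical sums in linear time:

```python
import math
from functools import reduce  # builtin on Python 2; from functools on Python 3

def zeta_quadratic(n, alpha):
    # Original approach: each reduce step copies the whole accumulator list -> O(n**2).
    tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n + 1)]
    return reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])

def zeta_linear(n, alpha):
    # Same cumulative sums, built by in-place append -> O(n).
    zeta = [0.0]
    for i in range(1, n + 1):
        zeta.append(zeta[-1] + 1. / math.pow(float(i), alpha))
    return zeta

print(zeta_quadratic(1000, 0.5) == zeta_linear(1000, 0.5))  # True: identical sums
```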

Is there an optimization I can make or a different solution altogether that meet my requirements?

EDIT: I'm updating my question with a possible solution using the modifications recommended by @ev-br. I've simplified it into a subroutine that returns the entire list. @ev-br was correct to suggest replacing bisect with searchsorted, as the former proved to be a bottleneck as well.

def randZipf(n, alpha, numSamples): 
    # Calculate Zeta values from 1 to n: 
    tmp = numpy.power( numpy.arange(1, n+1), -alpha )
    zeta = numpy.r_[0.0, numpy.cumsum(tmp)]
    # Store the translation map: 
    distMap = [x / zeta[-1] for x in zeta]
    # Generate an array of uniform 0-1 pseudo-random values: 
    u = numpy.random.random(numSamples)
    # bisect them with distMap
    v = numpy.searchsorted(distMap, u)
    samples = [t-1 for t in v]
    return samples
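A quick empirical check of this version (the function is restated so the snippet runs standalone): with n=6 and alpha=0.5, every draw should land in 0..5 and rank 0 should be the most frequent value.

```python
import numpy

def randZipf(n, alpha, numSamples):
    # Normalized cumulative Zipf weights, then vectorized CDF inversion:
    tmp = numpy.power(numpy.arange(1, n + 1), -alpha)
    zeta = numpy.r_[0.0, numpy.cumsum(tmp)]
    distMap = [x / zeta[-1] for x in zeta]
    u = numpy.random.random(numSamples)
    v = numpy.searchsorted(distMap, u)
    samples = [t - 1 for t in v]
    return samples

samples = randZipf(6, 0.5, 100000)
counts = [samples.count(k) for k in range(6)]
print(counts[0] > counts[5])  # True: rank 0 is drawn more often than rank 5
```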

Answer

Let me take a small example first.

In [1]: import numpy as np

In [2]: import math

In [3]: alpha = 0.1

In [4]: n = 5

In [5]: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]

In [6]: zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])

In [7]: tmp
Out[7]: 
[1.0,
 0.9330329915368074,
 0.8959584598407623,
 0.8705505632961241,
 0.8513399225207846]

In [8]: zeta
Out[8]: 
[0,
 1.0,
 1.9330329915368074,
 2.82899145137757,
 3.699542014673694,
 4.550881937194479]

Now, let's try to vectorize it, starting from innermost operations. The reduce call is essentially a cumulative sum:

In [9]: np.cumsum(tmp)
Out[9]: array([ 1.        ,  1.93303299,  2.82899145,  3.69954201,  4.55088194])

You want a leading zero, so let's prepend it:

In [11]: np.r_[0., np.cumsum(tmp)]
Out[11]: 
array([ 0.        ,  1.        ,  1.93303299,  2.82899145,  3.69954201,
        4.55088194])

Your tmp array can be constructed in one go as well:

In [12]: tmp_vec = np.power(np.arange(1, n+1) , -alpha)

In [13]: tmp_vec
Out[13]: array([ 1.        ,  0.93303299,  0.89595846,  0.87055056,  0.85133992])
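Putting the two replacements together, the vectorized arrays can be checked against the original list-based construction (for a small n, using `numpy.allclose`):

```python
import math
from functools import reduce  # builtin on Python 2; from functools on Python 3

import numpy as np

alpha, n = 0.1, 5

# Original list-based construction:
tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n + 1)]
zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])

# Vectorized equivalents:
tmp_vec = np.power(np.arange(1, n + 1), -alpha)
zeta_vec = np.r_[0., np.cumsum(tmp_vec)]

print(np.allclose(tmp, tmp_vec), np.allclose(zeta, zeta_vec))  # True True
```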

Now, the dirty timings:

In [14]: %%timeit 
   ....: n = 1000
   ....: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
   ....: zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])
   ....: 
100 loops, best of 3: 3.16 ms per loop

In [15]: %%timeit
   ....: n = 1000
   ....: tmp_vec = np.power(np.arange(1, n+1) , -alpha)
   ....: zeta_vec = np.r_[0., np.cumsum(tmp_vec)]
   ....: 
10000 loops, best of 3: 101 µs per loop

Now, it gets better with increasing n:

In [18]: %%timeit
n = 50000
tmp_vec = np.power(np.arange(1, n+1) , -alpha)
zeta_vec = np.r_[0., np.cumsum(tmp_vec)]
   ....: 
100 loops, best of 3: 3.26 ms per loop

In [19]: %%timeit 
n = 50000
tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])
   ....: 
1 loops, best of 3: 7.01 s per loop

Down the line, the call to bisect can be replaced by np.searchsorted.
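The two lookups agree as long as the same tie-breaking side is used: `bisect.bisect` is a right bisection, which corresponds to `np.searchsorted(..., side='right')` (for continuous uniform draws the side is immaterial, since exact ties occur with probability zero):

```python
import bisect
import numpy as np

# A small normalized translation map (hypothetical values for illustration):
distMap = [0.0, 0.22, 0.41, 0.58, 0.73, 0.87, 1.0]

for u in [0.05, 0.41, 0.6, 0.999]:
    # Both return the insertion index to the right of any equal entries.
    assert bisect.bisect(distMap, u) == np.searchsorted(distMap, u, side='right')
print("bisect and searchsorted agree")
```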

EDIT: A couple of comments which are not directly relevant to the original question, but are based on my guesses about what might trip you up down the line:

  • a random generator should accept a seed. You can rely on numpy's global np.random.seed, but better make it an explicit argument defaulting to None (meaning do not seed it.)
  • samples = [t-1 for t in v] is not needed, just return v-1.
  • best avoid mixing camelCase and pep8_lower_case_with_underscores.
  • note that this is very similar to what scipy.stats.rv_discrete is doing. If you only need sampling, you're fine. If you need a full-fledged distribution, you may look into using it.
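Putting those comments together, a hedged sketch of a seedable, pep8-style version (the name `rand_zipf` with a `seed` argument and the use of `numpy.random.RandomState` are my additions, not from the answer):

```python
import numpy as np

def rand_zipf(n, alpha, num_samples, seed=None):
    # An explicit seed gives reproducible draws; seed=None leaves the RNG unseeded.
    rng = np.random.RandomState(seed)
    tmp = np.power(np.arange(1, n + 1), -alpha)
    zeta = np.r_[0.0, np.cumsum(tmp)]
    dist_map = zeta / zeta[-1]               # normalize in one vectorized step
    u = rng.random_sample(num_samples)
    return np.searchsorted(dist_map, u) - 1  # return v - 1 directly, no list comprehension

a = rand_zipf(6, 0.5, 1000, seed=42)
b = rand_zipf(6, 0.5, 1000, seed=42)
print((a == b).all())  # True: identical seeds reproduce the same samples
```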
