Generate random variables from a probability distribution

Question

I have extracted some variables from my Python data set and I want to generate a larger data set from the distributions I have. The problem is that I want to introduce some variability into the new data set while keeping its behaviour similar. This is an example of my extracted data, which consists of 400 observations:

Value    Observation Count     Ratio of Entries
1        352                    0.88
2        28                     0.07
3        8                      0.02
4        4                      0.01
7        4                      0.01
13       4                      0.01

Now I am trying to use this information to generate a similar data set with 2,000 observations. I am aware of the numpy.random.choice and random.choice functions, but I do not want to reuse the exact same distribution. Instead, I would like to generate random variables (the Value column) based on the distribution but with more variability. Here is an example of how I want the larger data set to look:

Value         Observation Count        Ratio of Entries
1             1763                     0.8815
2             151                      0.0755
3             32                       0.0160
4             19                       0.0095
5             10                       0.0050
6             8                        0.0040
7             2                        0.0010
8             4                        0.0020
9             2                        0.0010
10            3                        0.0015
11            1                        0.0005
12            1                        0.0005
13            1                        0.0005
14            2                        0.0010
15            1                        0.0005

So the new distribution is something that could be estimated if I fitted my original data with an exponential decay function. However, I am not interested in continuous variables. How do I get around this, and is there a particular mathematical method relevant to what I am trying to do?

Recommended answer

It sounds like you want to generate data based on the PDF described in the second table. The PDF is something like

0 for x <= B
A*exp(-A*(x-B)) for x > B

A is the decay rate, which sets the width of your distribution (a larger A gives a narrower distribution); the PDF is always normalized to have an area of 1. B is the horizontal offset, which is zero in your case. You can turn it into an integer distribution by binning with ceil.
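
To make the binning step concrete, here is the probability that ceil assigns to each integer n, worked out from the PDF above with B = 0. The symbol N for the binned variable is introduced here for illustration only.

```latex
\begin{aligned}
P(N = n) &= \int_{n-1}^{n} A e^{-A x}\,dx \\
         &= e^{-A(n-1)} - e^{-A n} \\
         &= \left(1 - e^{-A}\right) e^{-A(n-1)}, \qquad n = 1, 2, 3, \dots
\end{aligned}
```

In other words, the binned variable follows a geometric distribution with success probability 1 - exp(-A), and each bin is smaller than the previous one by the constant factor exp(-A).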

The CDF of a normalized decaying exponential is 1 - exp(-A*(x-B)) for x > B. Generally, a simple way to sample from a custom distribution is to generate uniform numbers on [0, 1) and map them through the inverse of the CDF.
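
As a rough illustration of that mapping, the sketch below inverts the exponential CDF by hand and then bins with ceil. It assumes B = 0 as in the question; the function name `sample_binned_exponential`, the `seed` argument, and the value a=2.0 in the usage line are made up for this example and are not fitted to the data.

```python
import numpy as np

def sample_binned_exponential(a, size, seed=None):
    """Inverse-transform sampling of an exponential (B = 0), binned with ceil.

    Hypothetical helper for illustration; `a` plays the role of A in the text.
    """
    rng = np.random.default_rng(seed)
    u = rng.random(size)             # uniform numbers on [0, 1)
    x = -np.log1p(-u) / a            # solve u = 1 - exp(-a * x) for x
    n = np.ceil(x).astype(int)       # bin the continuous draws to integers
    return np.maximum(n, 1)          # u == 0 would give 0; move it into bin 1

# Illustrative rate only, not a fitted value
samples = sample_binned_exponential(a=2.0, size=2000, seed=0)
```

With a = 2.0 roughly 1 - exp(-2), or about 86%, of the draws should land on the value 1, and the rest tail off geometrically.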

Fortunately, you won't have to do that, since scipy.stats.expon already provides the implementation you are looking for. All you have to do is fit to the data in your last column to get A (B is clearly zero). You can easily do this with curve_fit. Keep in mind that A maps to 1.0/scale in scipy PDF language.

Here is some sample code. I've added an extra layer of complexity here by computing the integral of the objective function from n-1 to n for integer inputs, taking the binning into account for you when doing the fit.

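import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import expon

def model(x, a):
    # Probability of integer bin x: integral of the exponential PDF from x-1 to x
    return np.exp(-a * (x - 1)) - np.exp(-a * x)
    # Alternative:
    # return -np.diff(np.exp(-a * np.concatenate(([x[0] - 1], x))))

x = np.arange(1, 16)
# Ratios from the second table; the elided entries must be filled in before running
p = np.array([0.8815, 0.0755, ..., 0.0010, 0.0005])
# curve_fit returns (popt, pcov); popt is a one-element array holding a
popt, pcov = curve_fit(model, x, p, p0=0.01)
a = popt[0]
samples = np.ceil(expon.rvs(scale=1/a, size=2000)).astype(int)
samples[samples == 0] = 1  # guard against a draw of exactly 0
data = np.bincount(samples)[1:]  # counts for the values 1, 2, 3, ...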
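
The sample code above elides most of the ratios, so here is a version that can be run end to end. It is only a sketch: the ratios are copied from the second table in the question, while the starting guess p0=[1.0] and the fixed seed are choices made for this example rather than part of the original answer.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import expon

def model(x, a):
    # Mass of integer bin x under an exponential with rate a (integral from x-1 to x)
    return np.exp(-a * (x - 1)) - np.exp(-a * x)

x = np.arange(1, 16)
# "Ratio of Entries" column of the 2,000-observation table in the question
p = np.array([0.8815, 0.0755, 0.0160, 0.0095, 0.0050, 0.0040, 0.0010, 0.0020,
              0.0010, 0.0015, 0.0005, 0.0005, 0.0005, 0.0010, 0.0005])

# Assumed starting guess; curve_fit returns (popt, pcov)
popt, pcov = curve_fit(model, x, p, p0=[1.0])
a = popt[0]

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
samples = np.ceil(expon.rvs(scale=1 / a, size=2000, random_state=rng)).astype(int)
samples[samples == 0] = 1       # guard against a draw of exactly 0

counts = np.bincount(samples)[1:]   # counts for the values 1, 2, 3, ...
ratios = counts / counts.sum()      # compare against p before relying on the fit
```

Comparing `ratios` with `p` shows how closely a single fitted rate reproduces the table. Because each bin of the fitted model shrinks by the constant factor exp(-a), it is worth checking in particular whether the larger values show up as often as you want.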
