Fitting a distribution given the histogram using scipy

Question

I would like to fit a distribution to my data using scipy (in my case, weibull_min). Is it possible to do this given the histogram, and not the data points? In my case, because the histogram has integer bins of size 1, I know that I can extrapolate my data in the following way:

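import numpy as np
from functools import reduce  # reduce is not a builtin in Python 3

orig_hist = np.array([10, 5, 3, 2, 1])

# repeat each bin index i as many times as its count x, then concatenate the lists
ext_data = reduce(lambda x,y: x+y, [[i]*x for i, x in enumerate(orig_hist)])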

In this case, ext_data would hold this:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

And building the histogram using:

np.histogram(ext_data, bins=5)

would give counts equivalent to orig_hist.

Yet, given that I already have the histogram built, I would like to avoid extrapolating the data and use orig_hist to fit the distribution, but I don't know if it is possible to use it directly in the fitting procedure. Additionally, is there a numpy function that can be used to perform something similar to the extrapolation I showed?
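On the last point: np.repeat performs this kind of expansion directly. A minimal sketch, assuming the counts are non-negative integers and that bin i stands for the value i, as in the example above (the names ext_data and counts are only illustrative):

```python
import numpy as np

orig_hist = np.array([10, 5, 3, 2, 1])

# Repeat each bin index i exactly orig_hist[i] times.
ext_data = np.repeat(np.arange(len(orig_hist)), orig_hist)

# Count the values again to confirm the round trip back to the histogram.
counts = np.bincount(ext_data, minlength=len(orig_hist))

print(ext_data)
print(counts)
```

If the counts are stored as floats, they would need to be cast to an integer dtype first (for example with astype(int)), since np.repeat expects integer repeat counts.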

Solution

I might be misunderstanding something, but I believe that fitting to the histogram is exactly what you should do: you're trying to approximate the probability density. And the histogram is as close as you can get to the underlying probability density. You just have to normalize it in order to have an integral of 1, or allow your fitted model to contain an arbitrary prefactor.

import numpy as np
import scipy.stats as stats
import scipy.optimize as opt
import matplotlib.pyplot as plt

orig_hist = np.array([10, 5, 3, 2, 1])
# normalize so that the bins (of width 1) sum to 1
norm_hist = orig_hist/float(sum(orig_hist))

# fit only the shape parameter c of weibull_min (loc and scale keep their defaults)
popt,pcov = opt.curve_fit(lambda x,c: stats.weibull_min.pdf(x,c), np.arange(len(norm_hist)),norm_hist)

# compare the normalized histogram with the fitted density
plt.figure()
plt.plot(norm_hist,'o-',label='norm_hist')
plt.plot(stats.weibull_min.pdf(np.arange(len(norm_hist)),popt),'s-',label='Weibull_min fit')
plt.legend()
plt.show()

Of course, for your given input the Weibull fit will be far from satisfactory.

Update

As I mentioned above, Weibull_min is a poor fit to your sample input. The bigger problem is that it is also a poor fit to your actual data:

orig_hist = np.array([ 23., 14., 13., 12., 12., 12., 11., 11., 11., 11., 10., 10., 10., 9., 9., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.], dtype=np.float32)

There are two main problems with this histogram. The first, as I said, is that it is unlikely to correspond to a Weibull_min distribution: it is maximal near zero and has a long tail, so it needs a non-trivial combination of Weibull parameters. Furthermore, your histogram clearly only contains a part of the distribution. This implies that my normalizing suggestion above is guaranteed to fail. You can't avoid using an arbitrary scale parameter in your fit.

I manually defined a scaled Weibull fitting function according to the formula on Wikipedia:

my_weibull = lambda x,l,c,A: A*float(c)/l*(x/float(l))**(c-1)*np.exp(-(x/float(l))**c)

In this function x is the independent variable, l is lambda (the scale parameter), c is k (the shape parameter) and A is a scaling prefactor. The faint upside of introducing A is that you don't have to normalize your histogram.
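As a sanity check on the hand-written formula, it can be compared with scipy's own density. This is a minimal sketch, assuming A=1 and using stats.weibull_min.pdf with its scale keyword; the values of l and c are arbitrary illustrations, not fitted results:

```python
import numpy as np
import scipy.stats as stats

# Same scaled Weibull as in the text: l = scale, c = shape, A = prefactor.
my_weibull = lambda x, l, c, A: A*float(c)/l*(x/float(l))**(c-1)*np.exp(-(x/float(l))**c)

x = np.arange(1, 50, dtype=float)  # x=0 left out on purpose
l, c = 110.0, 0.9                  # arbitrary illustrative values

manual = my_weibull(x, l, c, 1.0)
reference = stats.weibull_min.pdf(x, c, scale=l)

# With A=1 the two should agree to floating-point accuracy.
same = np.allclose(manual, reference)
print(same)
```

With A=1 the function is just weibull_min with loc=0, so the prefactor A is the only thing the manual definition adds.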

Now, when I dropped this function into scipy.optimize.curve_fit, I found what you did: it doesn't actually perform a fit, but sticks with the initial fitting parameters, whatever you set (using the p0 parameter; the default guess is 1 for every parameter). And curve_fit seems to think that the fit converged.

After more than an hour's wall-related head-banging, I realized that the problem is that the singular behaviour at x=0 throws off the nonlinear least-squares algorithm. By excluding your very first data point, you get an actual fit to your data. I suspect that if we set c=1 and don't allow that to fit, then this problem might go away, but it is probably more informative to allow that to be fitted (so I didn't check).
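The behaviour at x=0 can be seen by evaluating the model there for different shape parameters. A small sketch, with my_weibull written as a regular function and with arbitrary illustrative parameter values (the variable names are hypothetical):

```python
import numpy as np

def my_weibull(x, l, c, A):
    """Scaled Weibull density: l = scale, c = shape, A = prefactor."""
    x = np.asarray(x, dtype=float)
    return A * c / l * (x / l) ** (c - 1) * np.exp(-(x / l) ** c)

# For c < 1 the factor (x/l)**(c-1) is 0 raised to a negative power: infinite.
with np.errstate(divide="ignore"):
    at_zero_c_below_1 = my_weibull(0.0, 110.0, 0.88, 1.0)

# For c = 1 the density is finite at zero (A/l); for c > 1 it vanishes.
at_zero_c_equal_1 = my_weibull(0.0, 110.0, 1.0, 1.0)
at_zero_c_above_1 = my_weibull(0.0, 110.0, 1.5, 1.0)

print(at_zero_c_below_1, at_zero_c_equal_1, at_zero_c_above_1)
```

A single infinite model value makes the sum of squared residuals infinite, which is one plausible reason why leaving out the x=0 point helps the least-squares routine.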

Here's the code for the fit, with the first data point excluded:

import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt

orig_hist = np.array([ 23., 14., 13., 12., 12., 12., 11., 11., 11., 11., 10., 10., 10., 9., 9., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.], dtype=np.float32)

# scaled Weibull: l is the scale, c the shape, A an arbitrary prefactor
my_weibull = lambda x,l,c,A: A*float(c)/l*(x/float(l))**(c-1)*np.exp(-(x/float(l))**c)

# fit all three parameters, skipping the singular point at x=0
popt,pcov = opt.curve_fit(my_weibull,np.arange(len(orig_hist))[1:],orig_hist[1:]) #throw away x=0!

# plot the data and the fitted curve over the full range
plt.figure()
plt.plot(np.arange(len(orig_hist)),orig_hist,'o-',label='orig_hist')
plt.plot(np.arange(len(orig_hist)),my_weibull(np.arange(len(orig_hist)),*popt),'s-',label='Scaled Weibull fit')
plt.legend()
plt.show()

Result:

In [631]: popt
Out[631]: array([  1.10511850e+02,   8.82327822e-01,   1.05206207e+03])

The final fitted parameters are in the order (l,c,A), with a shape parameter of around 0.88. A shape parameter below 1 corresponds to a probability density that diverges at x=0, which explains why a few warnings pop up saying

RuntimeWarning: invalid value encountered in power

and why there isn't a data point from the fitting for x=0. But judging from the visual agreement between data and fit, you can assess whether the result is acceptable or not.
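Besides looking at the plot, the agreement can be put into numbers by computing the residuals. A minimal sketch, assuming the fitted parameters are the ones reported above and leaving out x=0 as in the fit; the names residuals, rms and max_abs are only illustrative:

```python
import numpy as np

orig_hist = np.array([ 23., 14., 13., 12., 12., 12., 11., 11., 11., 11., 10., 10., 10., 9., 9., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.], dtype=np.float32)

def my_weibull(x, l, c, A):
    """Scaled Weibull density: l = scale, c = shape, A = prefactor."""
    return A * c / l * (x / l) ** (c - 1) * np.exp(-(x / l) ** c)

# Parameters (l, c, A) copied from the fit result quoted in the text.
popt = np.array([1.10511850e+02, 8.82327822e-01, 1.05206207e+03])

x = np.arange(1, len(orig_hist))  # skip x=0, where the model diverges
residuals = orig_hist[1:] - my_weibull(x, *popt)

rms = float(np.sqrt(np.mean(residuals ** 2)))
max_abs = float(np.max(np.abs(residuals)))
print("RMS residual:", rms)
print("largest absolute residual:", max_abs)
```

Since the counts in the histogram are integers, residuals well below one count per bin suggest that the fit follows the data about as closely as the binning allows.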

If you want to overdo it, you can try generating points using np.random.weibull with these parameters (it takes only the shape parameter, so the samples have to be multiplied by the scale l), then comparing the resulting histogram with your own.
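Such a comparison could look roughly like the following. This is only a sketch under a few assumptions: it uses the Generator interface (np.random.default_rng) rather than the legacy np.random.weibull function, it treats bin i as the interval [i, i+1), and it takes the rounded prefactor A as the number of samples, which is reasonable only because the bins have width 1:

```python
import numpy as np

orig_hist = np.array([ 23., 14., 13., 12., 12., 12., 11., 11., 11., 11., 10., 10., 10., 9., 9., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.], dtype=np.float32)

# Fitted parameters (l, c, A) as reported in the text.
l, c, A = 1.10511850e+02, 8.82327822e-01, 1.05206207e+03

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_samples = int(round(A))       # assumption: A approximates the total count

# rng.weibull draws from the unit-scale Weibull; multiply by l to rescale.
samples = l * rng.weibull(c, size=n_samples)

# Histogram over the same unit-width bins as orig_hist; samples beyond the
# last edge are dropped, just as the original histogram is truncated.
edges = np.arange(len(orig_hist) + 1)
sim_hist, _ = np.histogram(samples, bins=edges)

for i in range(0, len(orig_hist), 10):
    print(i, orig_hist[i], sim_hist[i])
```

A single simulated histogram is noisy, so averaging over several seeds gives a fairer comparison than judging one run.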
