python scipy stats pareto fit: how does it work


Problem description

... the help and online documentation say that the function scipy.stats.pareto.fit takes as arguments the dataset to be fitted and, optionally, b (the exponent), loc and scale. The result comes as a triplet (exponent, loc, scale).

Generating data from the same distribution should result in the fit finding the parameters used for generating the data, e.g. (using the Python 3 console)

$  python
Python 3.3.0 (default, Dec 12 2012, 07:43:02) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

(the Python console prompt ">>>" is left out of the code lines below)

import scipy.stats
dataset = scipy.stats.pareto.rvs(1.5, size=10000)  # generate data with shape parameter 1.5
scipy.stats.pareto.fit(dataset)

However, this results in

(1.0, nan, 0.0)

(exponent 1, should be 1.5) and

dataset=scipy.stats.pareto.rvs(1.1,size=10000)  #generating data
scipy.stats.pareto.fit(dataset)

results in

(1.0, nan, 0.0)

(exponent 1, should be 1.1) and

dataset=scipy.stats.pareto.rvs(4,loc=2.0,scale=0.4,size=10000)    #generating data
scipy.stats.pareto.fit(dataset)

(the exponent should be 4, loc should be 2, scale should be 0.4) results in

(1.0, nan, 0.0)

etc. Giving another exponent when calling the fit function

scipy.stats.pareto.fit(dataset,1.4)

always returns exactly this exponent

(1.3999999999999999, nan, 0.0)

The obvious question would be: do I completely misunderstand the purpose of this fit function, is it meant to be used differently, or is it simply broken?

A remark: before someone mentions that dedicated functions like those given on Aaron Clauset's web pages (http://tuvalu.santafe.edu/~aaronc/powerlaws/) are more reliable than the scipy.stats methods and should be used instead: that may be true, but they are also very, very time consuming, and for datasets of 10000 points they take many hours (maybe days, weeks, years) on a normal PC.

edit: oh: the parameter of the fit function is not the exponent of the distribution but the exponent minus 1 (but this does not change the above issue)
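To make the "exponent minus 1" remark concrete, here is a small check, assuming the documented density of scipy.stats.pareto, b / x**(b + 1) for x >= 1. The sample points and the value of b are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

b = 1.5                                 # shape parameter passed to scipy.stats.pareto
x = np.array([1.5, 2.0, 5.0, 10.0])     # arbitrary points inside the support x >= 1

density = stats.pareto.pdf(x, b)        # density as computed by SciPy
by_hand = b / x ** (b + 1)              # power law with exponent b + 1

print(np.allclose(density, by_hand))
```

So a density that decays as a power law with exponent alpha corresponds to b = alpha - 1 in the scipy.stats parametrisation.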

Recommended answer

The fit method is a very general and simple method that runs optimize.fmin on the negative log-likelihood function (self.nnlf) of the distribution. For distributions like pareto, whose parameters can create undefined regions, this general method doesn't work.
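As a rough sketch of what the answer describes, the snippet below minimises pareto.nnlf with scipy.optimize.fmin from a hand-picked starting point. The seed, sample size and starting values are assumptions for illustration. SciPy's own fit chooses its starting values itself and its internals have changed between releases, so the numbers printed here need not match what fit returns.

```python
import numpy as np
from scipy import optimize, stats

# Illustrative sample; seed and size are arbitrary assumptions.
data = stats.pareto.rvs(1.5, size=1000, random_state=0)

def objective(theta):
    # theta = (b, loc, scale); nnlf is the negative log-likelihood of the data
    return stats.pareto.nnlf(theta, data)

start = (1.0, 0.0, 1.0)                 # hypothetical starting guess
result = optimize.fmin(objective, start, disp=False)

print(result)                           # whatever the unconstrained simplex search ends at
```

Nothing in this search keeps loc + scale below the smallest data point, which is the weakness the answer goes on to explain.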

In particular, the general nnlf method returns "inf" when a value of the random variable does not fall inside the domain of validity of the distribution. The "fmin" optimizer doesn't play well with this objective function unless your starting guess is already very close to the ultimate fit.
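The "inf" behaviour can be seen directly by evaluating nnlf at two parameter triples, one whose support covers the data and one whose support does not. The data values and parameters below are made up for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([1.2, 1.5, 2.0, 3.0, 10.0])   # toy sample, all values >= 1

# Support starts at loc + scale = 1.0, so every data point is inside it.
inside = stats.pareto.nnlf((1.5, 0.0, 1.0), data)

# Support starts at loc + scale = 2.0, so 1.2 and 1.5 fall outside it.
outside = stats.pareto.nnlf((1.5, 0.0, 2.0), data)

print(np.isfinite(inside), np.isinf(outside))  # True True
```

An optimizer that steps from the first triple toward the second sees the objective jump from a finite value to infinity with no gradient to guide it back.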

In general, the .fit method needs to use a constrained optimizer for distributions where the pdf is only defined on a limited domain.
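One way to sidestep the unconstrained search altogether, which is not what the answer proposes but respects the same constraint, is the closed-form maximum-likelihood estimate for a Pareto sample with loc fixed at 0: the scale estimate is the sample minimum and the shape estimate is n / sum(log(x / min(x))). The helper name pareto_mle and the seed are assumptions for this sketch.

```python
import numpy as np
from scipy import stats

def pareto_mle(x):
    """Closed-form MLE (b, scale) for a Pareto sample, assuming loc = 0.

    Hypothetical helper for illustration; it is not part of SciPy.
    """
    x = np.asarray(x, dtype=float)
    scale_hat = x.min()                            # largest scale that keeps all data in the support
    b_hat = x.size / np.log(x / scale_hat).sum()   # shape estimate given that scale
    return b_hat, scale_hat

data = stats.pareto.rvs(1.5, size=10000, random_state=0)
b_hat, scale_hat = pareto_mle(data)
print(b_hat, scale_hat)                            # should land near b = 1.5, scale = 1
```

Because the scale is tied to the sample minimum, the estimate never leaves the region where the likelihood is defined, and it runs in a single pass over the data.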
