Evaluate the goodness of a distributional fit


Question

I have fitted several distributions to sample data with the following code:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm

samp = norm.rvs(loc=0, scale=1, size=150)  # (example) sample values

figprops = dict(figsize=(8., 7. / 1.618), dpi=128)
adjustprops = dict(left=0.1, bottom=0.1, right=0.97, top=0.93, wspace=0.2, hspace=0.2)

fig = plt.figure(**figprops)
fig.subplots_adjust(**adjustprops)
ax = fig.add_subplot(1, 1, 1)
ax.hist(samp, bins=10, density=True, alpha=0.6, color='grey', label='Data')
xmin, xmax = ax.get_xlim()  # x-range of the histogram, reused for the PDF curves

# Distributions.
dist_names = ['beta', 'norm', 'gumbel_l']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)  # look up the distribution by name
    param = dist.fit(samp)                  # maximum-likelihood parameter estimates
    x = np.linspace(xmin, xmax, 100)
    ax.plot(x, dist(*param).pdf(x), linewidth=4, label=dist_name)

ax.legend(fontsize=14)
plt.savefig('example.png')

How do I automatically order the distribution names in the legend from best fit (top) to worst fit? I generate the random variables in a loop, so the best-fitting distribution may differ from one iteration to the next.

Recommended answer

Well, you could use the Kolmogorov-Smirnov (K-S) test to compute, for example, a p-value and sort by it.

Modifying the loop

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(samp)
    x = np.linspace(xmin, xmax, 100) # 
    ax.plot(x,dist(*param).pdf(x),linewidth=4,label=dist_name)

    ks = scipy.stats.kstest(samp, dist_name, args=param)
    print((dist_name, ks))

you might get something like

('beta', KstestResult(statistic=0.033975289251035434, pvalue=0.9951529119440156))
('norm', KstestResult(statistic=0.03164417055025992, pvalue=0.9982475331007705))
('gumbel_l', KstestResult(statistic=0.113229070386386, pvalue=0.039394595923043355))

which tells you that normal and beta are both pretty good, but gumbel should be last. Sorting by either the p-value or the K-S statistic should be easy to add.
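One way the sorting step might look is sketched below. It is only an illustration, not the answerer's code: the `results` list, the `ranked_names` variable, the fixed `random_state=0` seed, and the `Agg` backend are all assumptions introduced here. The idea is to collect the K-S result for every candidate first, sort by the statistic (smaller means a closer fit), and only then plot, since fitted curves appear in the legend in the order they were plotted.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import scipy.stats

# Example sample; the fixed seed is only here to make the sketch repeatable.
samp = scipy.stats.norm.rvs(loc=0, scale=1, size=150, random_state=0)
dist_names = ['beta', 'norm', 'gumbel_l']

# Fit every candidate first and remember its K-S result.
results = []
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(samp)
    ks = scipy.stats.kstest(samp, dist_name, args=param)
    results.append((dist_name, param, ks.statistic, ks.pvalue))

# Smaller K-S statistic = smaller maximum CDF gap = better fit.
# To rank by p-value instead: results.sort(key=lambda r: r[3], reverse=True)
results.sort(key=lambda r: r[2])
ranked_names = [r[0] for r in results]

fig, ax = plt.subplots()
ax.hist(samp, bins=10, density=True, alpha=0.6, color='grey', label='Data')
x = np.linspace(*ax.get_xlim(), 100)

# Plot in ranked order so the curves are listed best-to-worst in the legend.
for dist_name, param, stat, pvalue in results:
    dist = getattr(scipy.stats, dist_name)
    ax.plot(x, dist(*param).pdf(x), linewidth=4,
            label=f'{dist_name} (D={stat:.3f})')

ax.legend(fontsize=14)
```

Because the ranking is recomputed from `samp` each time, the same code can sit inside the loop that generates new samples, and the legend order will follow whichever distribution fits best in that iteration.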

Your results might be different, since they depend on the initial state of the RNG.
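If repeatable numbers are wanted while experimenting, the RNG state can be pinned. The snippet below is a small sketch using the `random_state` argument of `rvs`; the seed values are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import norm

# Same seed -> identical samples, so any K-S ranking computed from them repeats too.
samp_a = norm.rvs(loc=0, scale=1, size=150, random_state=12345)
samp_b = norm.rvs(loc=0, scale=1, size=150, random_state=12345)

# A different seed gives a different sample and possibly a different ranking.
samp_c = norm.rvs(loc=0, scale=1, size=150, random_state=54321)

same_seed_equal = np.array_equal(samp_a, samp_b)
other_seed_equal = np.array_equal(samp_a, samp_c)
```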

UPDATE

Concerning the claim that the K-S test is not applicable for estimating goodness of fit, I strongly disagree. I don't see a scientific reason NOT to use it, and I have used it myself with good results.

Typically, you have a black box generating your random data, say, some measurements of network delays.

In general, it could be described by a mixture of Gammas, and you do your fit using some kind of quadratic utility function and get back a set of parameters.

Then you use K-S or any other empirical-vs-theoretical distribution method to estimate how good the fit is. As long as the K-S method was not used to make the fit, using K-S to judge it is a perfectly good approach.

You basically have one black box generating data and another black box fitting the data, and you want to know how well the fit matches the data. K-S will do that job.

And the statement "it is commonly used as a test for normality to see if your data is normally distributed." is completely off, in my humble opinion. K-S is about the maximum CDF-vs-CDF discrepancy; it doesn't care about normality and is a lot more universal.
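To make the "maximum CDF discrepancy" point concrete, here is a rough sketch that computes the K-S statistic by hand for a non-normal case and compares it with `scipy.stats.kstest`. The exponential sample, the seed, and the variable names are assumptions chosen for illustration; nothing in the calculation refers to the normal distribution.

```python
import numpy as np
import scipy.stats

# Deliberately non-normal data: an exponential sample (illustrative choice).
samp = scipy.stats.expon.rvs(scale=2.0, size=200, random_state=1)
param = scipy.stats.expon.fit(samp)  # (loc, scale)

# Theoretical CDF evaluated at the sorted sample points.
xs = np.sort(samp)
n = len(xs)
cdf = scipy.stats.expon.cdf(xs, *param)

# The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted point,
# so the largest gap has to be checked on both sides of each jump.
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

# The same quantity as reported by SciPy's two-sided test.
d_scipy = scipy.stats.kstest(samp, 'expon', args=param).statistic
```

The two values should agree up to floating-point error, and swapping `expon` for any other continuous distribution in `scipy.stats` leaves the calculation unchanged.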
