How to perform a chi-squared goodness of fit test using scientific libraries in Python?


Problem description

Let's assume I have some data I obtained empirically:

import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)

It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?

I can fit a model with:

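import matplotlib.pyplot as plt
param = stats.expon.fit(x)  # fitted (loc, scale)
grid = np.linspace(0, 100, 10000)
# `normed` has been removed from Matplotlib; `density=True` is the replacement
plt.hist(x, density=True, color='white', hatch='/')
plt.plot(grid, stats.expon.pdf(grid, *param))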

Computing the Kolmogorov-Smirnov test is very elegant:

>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)

However, I can't find a good way of calculating the chi-squared test.

There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (and the exponential distribution is continuous).

The official scipy.stats tutorial only covers the case of a custom distribution, where the probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume that the expected values and the degrees of freedom (DoF) have already been obtained.

Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.

Solution

An approximate solution for equal probability bins:

  • Estimate the parameters of the distribution
  • Use the inverse cdf (ppf, if it's a scipy.stats distribution) to get the bin edges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args)
  • Then, use np.histogram to count the number of observations in each bin

Then use the chisquare test on the frequencies.
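The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions that are not part of the original answer: the number of bins (`n_bins = 20`), the fixed random seed, and the choice of `ddof=len(params)` to account for the two fitted parameters (loc and scale) are all arbitrary picks made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic data as in the question (the seed is an assumption, for reproducibility)
rng = np.random.default_rng(0)
size = 10000
x = 10 * stats.expon.rvs(size=size, random_state=rng) + 0.2 * rng.uniform(size=size)

# Step 1: estimate the parameters of the distribution (loc, scale)
params = stats.expon.fit(x)

# Step 2: bin edges on a regular probability grid via the inverse cdf (ppf).
# The first edge is the fitted loc and the last edge is +inf.
n_bins = 20  # assumption: any value giving reasonably large expected counts
edges = stats.expon.ppf(np.linspace(0, 1, n_bins + 1), *params)

# Step 3: count the observations in each bin
observed, _ = np.histogram(x, bins=edges)

# Step 4: under the fitted model every bin has probability 1 / n_bins
expected = np.full(n_bins, size / n_bins)

# ddof reduces the degrees of freedom by the number of fitted parameters
chi2, p_value = stats.chisquare(observed, expected, ddof=len(params))
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
```

Because `stats.expon.fit` places loc at the sample minimum, every observation falls inside the bin range here. With other distributions it may be worth checking that `observed.sum()` still equals the sample size.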

An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
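A minimal sketch of this percentile-based alternative, under the same assumptions as before (20 bins, fixed seed, `ddof` set to the number of fitted parameters). Treating the two outer bins as open-ended, so that the model probabilities sum to one, is an additional choice made here and is not stated in the original answer.

```python
import numpy as np
from scipy import stats

# Synthetic data as in the question (the seed is an assumption, for reproducibility)
rng = np.random.default_rng(0)
size = 10000
x = 10 * stats.expon.rvs(size=size, random_state=rng) + 0.2 * rng.uniform(size=size)

params = stats.expon.fit(x)  # fitted (loc, scale)

# Bin edges taken from the percentiles of the data, so the observed
# counts are (nearly) equal across bins
n_bins = 20  # assumption
edges = np.percentile(x, np.linspace(0, 100, n_bins + 1))
observed, _ = np.histogram(x, bins=edges)

# Actual bin probabilities under the fitted model, from the cdf.
# Assumption: the outer bins are treated as open-ended so the
# probabilities sum to 1 and the expected counts sum to `size`.
cdf_at_edges = stats.expon.cdf(edges, *params)
cdf_at_edges[0], cdf_at_edges[-1] = 0.0, 1.0
expected = size * np.diff(cdf_at_edges)

chi2, p_value = stats.chisquare(observed, expected, ddof=len(params))
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
```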

This is only approximate, since the theory for the chi-squared test assumes that the parameters are estimated by maximum likelihood on the binned data. I'm also not sure whether choosing the bin edges based on the data affects the asymptotic distribution.

I haven't looked into this in a long time. If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.
