Implementing a Kolmogorov-Smirnov test in Python SciPy


Question

I have a data set of N numbers that I want to test for normality. I know scipy.stats has a kstest function, but there are no examples of how to use it or how to interpret the results. Can anyone here who is familiar with it give me some advice?

According to the documentation, kstest returns two numbers: the KS test statistic D and the p-value. If the p-value is greater than the significance level (say 5%), then we cannot reject the hypothesis that the data come from the given distribution.

When I do a test run by drawing 10000 samples from a normal distribution and testing for Gaussianity:

import numpy as np
from scipy.stats import kstest

mu,sigma = 0.07, 0.89
kstest(np.random.normal(mu,sigma,10000),'norm')

I get the following output:

(0.04957880905196102, 8.9249710700788814e-22)


The p-value is less than 5%, which means that we can reject the hypothesis that the data are normally distributed. But the samples were drawn from a normal distribution!

Can someone explain the discrepancy here?

(Does testing for normality assume mu = 0 and sigma = 1? If so, how can I test whether my data are Gaussian but with a different mu and sigma?)

Recommended answer

Your data was generated with mu=0.07 and sigma=0.89. You are testing this data against a normal distribution with mean 0 and standard deviation 1.

The null hypothesis (H0) is that the distribution your data was sampled from is the standard normal distribution, with mean 0 and standard deviation 1.

The p-value is the probability, assuming H0 is true, of observing a test statistic at least as large as D.

In other words, with a p-value of ~8.9e-22, a statistic this large would be extremely unlikely if H0 were true, so H0 is rejected.

That is reasonable, since the means and standard deviations don't match.

Compare your result with:

In [22]: import numpy as np
In [23]: import scipy.stats as stats
In [24]: stats.kstest(np.random.normal(0,1,10000),'norm')
Out[24]: (0.007038739782416259, 0.70477679457831155)
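For intuition about what D measures: the two-sided KS statistic is the largest vertical gap between the empirical CDF of the sample and the hypothesized CDF. The following is a minimal sketch that recomputes D by hand and compares it with the value kstest reports. The seed, the sample size of 1000, and the variable names are illustrative choices, not part of the original answer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
x = np.sort(rng.normal(0.07, 0.89, 1000))
n = len(x)

# Hypothesized CDF (standard normal) evaluated at the sorted sample points.
cdf = stats.norm.cdf(x)

# The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted point,
# so the largest gap has to be checked on both sides of each jump.
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

d_scipy = stats.kstest(x, 'norm').statistic
print(d_manual, d_scipy)
```

The two printed values should agree, which confirms that the 'norm' argument on its own means the standard normal, not a normal fitted to the data.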

To test whether your data is Gaussian, you could shift and rescale it so that, under the hypothesis, it is normal with mean 0 and standard deviation 1:

data=np.random.normal(mu,sigma,10000)
normed_data=(data-mu)/sigma
print(stats.kstest(normed_data,'norm'))
# (0.0085805670733036798, 0.45316245879609179)
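When mu and sigma are known in advance, the same test can also be written without rescaling by hand: kstest accepts an args tuple that is forwarded to the named distribution, and for 'norm' those are the location and scale. A small sketch under that assumption, with an arbitrary seed for reproducibility:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
mu, sigma = 0.07, 0.89
data = rng.normal(mu, sigma, 10000)

# args supplies (loc, scale) to the 'norm' CDF, so H0 is N(mu, sigma**2).
d_args, p_args = stats.kstest(data, 'norm', args=(mu, sigma))

# Equivalent: standardize with the known parameters, then test against N(0, 1).
d_std, p_std = stats.kstest((data - mu) / sigma, 'norm')

print(d_args, p_args)
print(d_std, p_std)
```

Both calls should report the same statistic and p-value up to floating-point error, since they test the same hypothesis.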


Warning: (many thanks to user333700 (aka scipy developer Josef Perktold)) If you don't know mu and sigma, estimating the parameters from the data makes the p-value invalid:

import numpy as np
import scipy.stats as stats

mu = 0.3
sigma = 5

num_tests = 10**5
num_rejects = 0
alpha = 0.05
for i in range(num_tests):  # range replaces Python 2's xrange
    data = np.random.normal(mu, sigma, 10000)
    # normed_data = (data - mu) / sigma    # this is okay
    # 4915/100000 = 0.05 rejects at rejection level 0.05 (as expected)
    normed_data = (data - data.mean()) / data.std()    # this is NOT okay
    # 20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
    D, pval = stats.kstest(normed_data, 'norm')
    if pval < alpha:
        num_rejects += 1
ratio = float(num_rejects) / num_tests
print('{}/{} = {:.2f} rejects at rejection level {}'.format(
    num_rejects, num_tests, ratio, alpha))     

prints

20/100000 = 0.00 rejects at rejection level 0.05 (not expected)

which shows that stats.kstest may not reject the expected number of null hypotheses if the sample is normalized using the sample's own mean and standard deviation:

normed_data = (data - data.mean()) / data.std()    # this is NOT okay
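One possible workaround, which is not part of the original answer, is to calibrate D by simulation: apply the same estimate-then-standardize step to many samples drawn from a normal distribution and see where the observed D falls among the simulated values. This is the idea behind the Lilliefors test. The sketch below is an assumption-laden illustration; the helper names ks_stat_estimated and ks_normal_mc, the choice of 500 simulations, and the seed are all hypothetical.

```python
import numpy as np
from scipy import stats

def ks_stat_estimated(sample):
    """KS distance to N(0, 1) after standardizing with the sample's own mean/std."""
    z = (sample - sample.mean()) / sample.std(ddof=1)
    return stats.kstest(z, 'norm').statistic

def ks_normal_mc(sample, n_sim=500, rng=None):
    """Monte Carlo p-value for normality with estimated parameters (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    d_obs = ks_stat_estimated(sample)
    # After standardizing, D does not depend on the true mu and sigma,
    # so the null distribution can be simulated from N(0, 1).
    d_sim = np.array([ks_stat_estimated(rng.standard_normal(len(sample)))
                      for _ in range(n_sim)])
    # Add-one correction keeps the estimated p-value strictly positive.
    p = (1 + np.sum(d_sim >= d_obs)) / (n_sim + 1)
    return d_obs, p

rng = np.random.default_rng(2)  # seed chosen only for reproducibility
d_norm, p_norm = ks_normal_mc(rng.normal(0.3, 5, 500), rng=rng)
d_exp, p_exp = ks_normal_mc(rng.exponential(1.0, 500), rng=rng)
print(d_norm, p_norm)  # normal sample
print(d_exp, p_exp)    # clearly non-normal sample
```

The p-value from kstest is ignored here; only the statistic is used, and its reference distribution comes from the simulation. The resolution of the p-value is limited to 1/(n_sim + 1), so a larger n_sim is needed for small significance levels.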
