Kolmogorov-Smirnov Test in Spark (Python) not working?
I was doing a normality test in Python spark-ml and saw what I think is a bug.
Here is the setup: I have a dataset that is normalized (range -1 to 1).
When I plot a histogram, I can clearly see that the data is NOT normal:
>>> prices_norm.histogram(10)
([-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
[226, 269, 119, 95, 52, 26, 8, 2, 2, 5])
When I run the Kolmogorov-Smirnov test I get the following results:
>>> testResults = Statistics.kolmogorovSmirnovTest(prices_norm, "norm")
>>> print testResults
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.46231145770077375
pValue = 1.742039845709087E-11
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
The Kolmogorov-Smirnov test defines the null hypothesis (H0) as: the data follow a specified distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).
In this case the p-value is very low, so we should reject the null hypothesis. This makes sense, as the data is clearly not normal.
So why, then, does it say:
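To make the statistic in the summary less opaque, here is a minimal sketch of the one-sample, two-sided Kolmogorov-Smirnov statistic using only the Python standard library. It assumes the reference distribution is the standard normal (mean 0, standard deviation 1), which I believe is what "norm" means when no parameters are passed. The function names and the sample values are made up for illustration; this is not Spark's implementation.

```python
import math


def standard_normal_cdf(x):
    """CDF of N(0, 1), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf=standard_normal_cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = float(len(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so check the gap on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical values, skewed toward -1 like the histogram above.
example = [-0.9, -0.8, -0.7, -0.5, -0.2, 0.1]
print(ks_statistic(example))
```

A large statistic means the empirical distribution sits far from the reference one, which is what produces a small p-value.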
Sample follows theoretical distribution
Isn't this wrong? Shouldn't it say that the sample does NOT follow a theoretical distribution? Am I missing something?
This was driving me crazy, so I went to look at the source code directly:
git://git.apache.org/spark.git
spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala
The code is correct; the null hypothesis is set as:
object NullHypothesis extends Enumeration {
  type NullHypothesis = Value
  val OneSampleTwoSided = Value("Sample follows theoretical distribution")
}
The verbiage of the string message is just restating the null hypothesis:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
(The part after the colon, "Sample follows theoretical distribution", is H0.)
Arguably the verbiage is confusing as it could be interpreted both ways. But it is indeed correct.
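One way to sidestep the ambiguous sentence is to make the decision from the numbers rather than from the trailing text. The sketch below is plain Python and does not need Spark; the helper name `describe_ks_result` and the 0.05 significance level are my own assumptions. With PySpark, the two arguments would presumably come from `testResults.pValue` and `testResults.nullHypothesis`.

```python
def describe_ks_result(p_value, null_hypothesis, alpha=0.05):
    """Spell out the decision that the summary text leaves implicit."""
    if p_value < alpha:
        verdict = "Reject H0"
    else:
        verdict = "Fail to reject H0"
    return "%s (H0: %s); p-value = %.3g, alpha = %.2g" % (
        verdict, null_hypothesis, p_value, alpha)


# Values copied from the test summary in the question.
print(describe_ks_result(1.742039845709087e-11,
                         "Sample follows theoretical distribution"))
```

Written this way, it is explicit that the quoted string is the hypothesis being rejected, not the conclusion.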