How to determine what is the probability distribution function from a numpy array?


Question


I have searched around and to my surprise it seems that this question has not been answered.

I have a Numpy array containing 10000 values from measurements. I have plotted a histogram with Matplotlib, and by visual inspection the values seem to be normally distributed:

However, I would like to validate this. I have found a normality test implemented under scipy.stats.mstats.normaltest, but the result says otherwise. I get this output:

(masked_array(data = [1472.8855375088663],
         mask = [False],
   fill_value = 1e+20)
, masked_array(data = [ 0.],
         mask = False,
   fill_value = 1e+20)

)

which means that the chances that the dataset is normally distributed are 0. I have re-run the experiments and tested them again obtaining the same outcome, and in the "best" case the p value was 3.0e-290.

I have tested the function with the following code and it seems to do what I want:

import numpy
import scipy.stats as stats

mu, sigma = 0, 0.1
s = numpy.random.normal(mu, sigma, 10000)

print stats.normaltest(s)

(1.0491016699730547, 0.59182113002186942)

If I have understood and used the function correctly, it means that my measured values are not normally distributed. (And honestly I have no idea why there is a difference in the output, i.e. fewer details.)

I was pretty sure that it is a normal distribution (although my knowledge of statistics is basic), and I don't know what the alternative could be. How can I check what the probability distribution function in question is?

EDIT:

My Numpy array containing 10000 values is generated like this (I know that's not the best way to populate a Numpy array), and afterwards the normaltest is run:

values = numpy.empty(shape=(10000, 1))
for i in range(0, 10000):
    values[i] = measurement(...) # The function returns a float

print normaltest(values)
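
As a sketch, the array can also be filled without an explicit index loop, using `numpy.fromiter`; `measurement` below is a hypothetical stand-in, since the real function isn't shown:

```python
import numpy as np
from scipy import stats

def measurement():
    """Hypothetical stand-in for the poster's measurement function."""
    return float(np.random.normal(0.0, 0.1))

# numpy.fromiter fills the array directly from a generator,
# avoiding the manual preallocate-and-index loop.
values = np.fromiter((measurement() for _ in range(10000)), dtype=float)

print(stats.normaltest(values))
```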

EDIT 2:

I have just realised that the discrepancy between the outputs is because I inadvertently used two different functions (scipy.stats.normaltest() and scipy.stats.mstats.normaltest()). It makes no difference, though, since the relevant part of the output is the same regardless of which function is used.

EDIT 3:

Fitting the histogram with the suggestion from askewchan:

plt.plot(bin_edges, scipy.stats.norm.pdf(bin_edges, loc=values.mean(), scale=values.std()))

results in this:

EDIT 4:

Fitting the histogram with the suggestion from user user333700:

scipy.stats.t.fit(data)

results in this:
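
For context, `scipy.stats.t.fit` returns maximum-likelihood estimates of the t distribution's degrees of freedom, location, and scale, which can then be plugged into `t.pdf` for an overlay. A minimal sketch on synthetic data (the real `data` array isn't shown):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_t(df=5, size=10000)  # stand-in for the real data

# Maximum-likelihood fit: returns (degrees of freedom, loc, scale).
df, loc, scale = stats.t.fit(data)
print(df, loc, scale)
```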

Solution

Assuming you have used the test correctly, my guess is that you have a small deviation from a normal distribution and because your sample size is so large, even small deviations will lead to a rejection of the null hypothesis of a normal distribution.
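
To illustrate the point with synthetic data: a mixture that looks bell-shaped in a histogram but has slightly heavier tails than any single normal distribution. With 10000 points, `normaltest` rejects decisively:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10000
# 95% of the points from N(0, 0.1), 5% from a wider N(0, 0.3):
# bell-shaped, but with heavier tails than a single normal.
mix = np.where(rng.random(n) < 0.95,
               rng.normal(0.0, 0.1, n),
               rng.normal(0.0, 0.3, n))

stat, p = stats.normaltest(mix)
print(p)  # far below any conventional significance level
```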

One possibility is to visually inspect your data by plotting a normed histogram with a large number of bins and the pdf with loc=data.mean() and scale=data.std().
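
A sketch of that visual check, with synthetic data standing in for the measurements and matplotlib's non-interactive backend:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(0.0, 0.1, 10000)  # stand-in for the measurement array

fig, ax = plt.subplots()
# density=True is the current spelling of the old normed=True argument.
ax.hist(data, bins=100, density=True, alpha=0.5)
grid = np.linspace(data.min(), data.max(), 500)
ax.plot(grid, stats.norm.pdf(grid, loc=data.mean(), scale=data.std()))
fig.savefig("histogram_vs_pdf.png")
```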

There are alternative tests for normality; statsmodels has Anderson-Darling and Lilliefors (Kolmogorov-Smirnov) tests for the case when the distribution parameters are estimated.
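
For example, scipy also ships an Anderson-Darling normality test (statsmodels' `normal_ad` and `lilliefors` are the versions referred to above); a sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 0.1, 10000)  # stand-in data

# Anderson-Darling compares the statistic against tabulated critical
# values rather than returning a p-value.
result = stats.anderson(sample, dist='norm')
for crit, sig in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > crit else "cannot reject"
    print(f"{sig}% level: {verdict} normality")
```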

However, I expect that the results will not differ much given the large sample size.

The main question is whether you want to test whether your sample comes "exactly" from a normal distribution, or whether you are just interested in whether your sample comes from a distribution that is very close to the normal distribution, close in terms of practical usage.

To elaborate on the last point:

http://jpktd.blogspot.ca/2012/10/tost-statistically-significant.html http://www.graphpad.com/guides/prism/6/statistics/index.htm?testing_for_equivalence2.htm

As the sample size increases a hypothesis test gains more power, that means that the test will be able to reject the null hypothesis of equality even for smaller and smaller differences. If we keep our significance level fixed, then eventually we will reject tiny differences that we don't really care about.
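
A quick sketch of this effect, using t-distributed data (a small, fixed deviation from normality) and increasing sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
pvalues = []
# t with 10 degrees of freedom: close to normal, but slightly heavy-tailed.
for n in (100, 1000, 10000, 100000):
    sample = rng.standard_t(df=10, size=n)
    stat, p = stats.normaltest(sample)
    pvalues.append(p)
    print(f"n={n:6d}  p={p:.3g}")
```

As n grows, the p-value shrinks and the same small deviation is eventually rejected at any fixed significance level.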

An alternative type of hypothesis test is where we want to show that our sample is close to the given point hypothesis, for example two samples have almost the same mean. The problem is that we have to define what our equivalence region is.

In the case of goodness of fit tests we need to choose a distance measure and define a threshold for the distance measure between the sample and the hypothesized distribution. I have not found any explanation where intuition would help to choose this distance threshold.

stats.normaltest is based on deviations of skew and kurtosis from those of the normal distribution.

Anderson-Darling is based on an integral of the weighted squared differences between the CDFs.

Kolmogorov-Smirnov is based on the maximum absolute difference between the CDFs.

A chisquare test for binned data would be based on the weighted sum of squared differences between observed and expected bin probabilities.

and so on.
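
As a concrete note on the first of these: `stats.normaltest` (D'Agostino-Pearson) combines the z-scores of the skewness and kurtosis tests into a single K² statistic, which can be checked directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=10000)

k2, p = stats.normaltest(x)
zs, _ = stats.skewtest(x)
zk, _ = stats.kurtosistest(x)
# K^2 is the sum of the two squared z-scores.
print(k2, zs ** 2 + zk ** 2)
```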

I only ever tried equivalence testing with binned or discretized data, where I used a threshold from some reference cases which was still rather arbitrary.

In medical equivalence testing there are some predefined standards to specify when two treatments can be considered as equivalent, or similarly as inferior or superior in the one sided version.
