Perform a Shapiro-Wilk Normality Test


Problem Description


I want to perform a Shapiro-Wilk normality test. My data is in csv format. It looks like this:

 heisenberg
    HWWIchg
1    -15.60
2    -21.60
3    -19.50
4    -19.10
5    -20.90
6    -20.70
7    -19.30
8    -18.30
9    -15.10

However, when I perform the test, I get:

 shapiro.test(heisenberg)

Error in `[.data.frame`(x, complete.cases(x)) : undefined columns selected

Why isn't R selecting the right column, and how do I do that?

Solution

What does shapiro.test do?

shapiro.test tests the Null hypothesis that "the samples come from a Normal distribution" against the alternative hypothesis "the samples do not come from a Normal distribution".

How to perform shapiro.test in R?

The R help page for ?shapiro.test gives,

x - a numeric vector of data values. Missing values are allowed, 
    but the number of non-missing values must be between 3 and 5000.

That is, shapiro.test expects a numeric vector as input, corresponding to the sample you would like to test, and it is the only required input. Since you have a data.frame, you'll have to pass the desired column as input to the function, as follows:

> shapiro.test(heisenberg$HWWIchg)
#   Shapiro-Wilk normality test

# data:  heisenberg$HWWIchg 
# W = 0.9001, p-value = 0.2528
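
As a side note (not part of the original answer), there are equivalent ways to extract the column, and shapiro.test silently drops missing values as long as 3 to 5000 non-missing values remain. A minimal sketch, reconstructing the question's data frame by hand:

# Rebuild the data frame from the values shown in the question
heisenberg <- data.frame(
  HWWIchg = c(-15.6, -21.6, -19.5, -19.1, -20.9,
              -20.7, -19.3, -18.3, -15.1)
)

shapiro.test(heisenberg[["HWWIchg"]])    # same as heisenberg$HWWIchg
with(heisenberg, shapiro.test(HWWIchg))  # scoped alternative

# NAs are allowed and ignored by the test
shapiro.test(c(heisenberg$HWWIchg, NA))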

Interpreting results from shapiro.test:

First, I strongly suggest you read this excellent answer from Ian Fellows on testing for normality.

As shown above, shapiro.test tests the NULL hypothesis that the samples came from a Normal distribution. This means that if your p-value <= 0.05, then you would reject the NULL hypothesis that the samples came from a Normal distribution. As Ian Fellows nicely put it, you are testing "against the assumption of Normality". In other words (correct me if I am wrong), it would be much better if one tested the NULL hypothesis that the samples do not come from a Normal distribution. Why? Because rejecting a NULL hypothesis is not the same as accepting the alternative hypothesis.

In the case of the null hypothesis of shapiro.test, a p-value <= 0.05 would reject the null hypothesis that the samples come from a normal distribution. Put loosely, it would then be rare to see such a sample if the data really were normal. The side-effect of this kind of hypothesis testing is that this rare outcome happens very rarely, so most samples are simply not rejected. To illustrate, take for example:

set.seed(450)
x <- runif(50, min=2, max=4)
shapiro.test(x)
#   Shapiro-Wilk normality test
# data:  runif(50, min = 2, max = 4) 
# W = 0.9601, p-value = 0.08995

So, according to this test, this (particular) sample runif(50, min=2, max=4) comes from a normal distribution. What I am trying to say is that there are many, many cases under which the "extreme" requirement (p < 0.05) is not satisfied, which leads to acceptance of the NULL hypothesis most of the time, and that can be misleading.

Another issue, which I'd like to quote here from @PaulHiemstra's comments, concerns the effect of large sample sizes:

An additional issue with the Shapiro-Wilk test is that when you feed it more data, the chance of the null hypothesis being rejected becomes larger. So what happens is that for large amounts of data even very small deviations from normality can be detected, leading to rejection of the null hypothesis even though for practical purposes the data is more than normal enough.

Although he also points out that R's data size limit protects against this a bit:

Luckily shapiro.test protects the user from the above described effect by limiting the data size to 5000.
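
Both effects are easy to see in base R: the same uniform distribution that slipped past the test with n = 50 is decisively rejected at a large sample size, and anything above 5000 non-missing values is refused outright. A quick sketch:

set.seed(450)
x_big <- runif(4999, min = 2, max = 4)
shapiro.test(x_big)   # p-value far below 0.05: null hypothesis rejected

# more than 5000 non-missing values triggers an error
try(shapiro.test(runif(5001)))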

If the NULL hypothesis were the opposite, meaning that the samples do not come from a normal distribution, and you got a p-value < 0.05, then you would conclude that it is very rare that these samples do not come from a normal distribution (rejecting the NULL hypothesis). That loosely translates to: it is highly likely that the samples are normally distributed (although some statisticians may not like this way of interpreting it). I believe this is what Ian Fellows also tried to explain in his post. Please correct me if I've gotten something wrong!

@PaulHiemstra also comments about practical situations (for example, regression) where one comes across this problem of testing for normality:

In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilk test, but instead do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis were violated too much. For linear regression using lm this is done by looking at some of the diagnostic plots you get using plot(lm()). Statistics is not a series of steps that cough up a few numbers (hey, p < 0.05!) but requires a lot of experience and skill in judging how to analyse your data correctly.
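
For concreteness, a minimal sketch of that workflow on made-up data (the data and model here are hypothetical, just to show the mechanics):

# Simulate a toy dataset and fit a linear model
set.seed(1)
d <- data.frame(x = 1:50)
d$y <- 2 * d$x + rnorm(50, sd = 5)
fit <- lm(y ~ x, data = d)

# Inspect the standard diagnostic plots instead of pre-testing:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)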

Here, I find the reply from Ian Fellows to Ben Bolker's comment under the same question already linked above equally (if not more) informative:

For linear regression,

  1. Don't worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.

  2. Worry about unequal variances (heteroskedasticity). I worry about this to the point of (almost) using HCCM tests by default. A scale location plot will give some idea of whether this is broken, but not always. Also, there is no a priori reason to assume equal variances in most cases.

  3. Outliers. A Cook's distance of > 1 is reasonable cause for concern.

Those are my thoughts (FWIW).
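
Points 2 and 3 above translate directly into code. A minimal sketch, assuming the sandwich and lmtest packages for the heteroskedasticity-consistent (HCCM) standard errors (Fellows does not name a specific implementation):

# Toy model (hypothetical data, as in the sketch above)
set.seed(1)
d <- data.frame(x = 1:50)
d$y <- 2 * d$x + rnorm(50, sd = 5)
fit <- lm(y ~ x, data = d)

# Point 2: inference with heteroskedasticity-consistent standard errors
library(sandwich)
library(lmtest)
coeftest(fit, vcov = vcovHC(fit))

# Point 3: flag influential observations via Cook's distance
which(cooks.distance(fit) > 1)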

Hope this clears things up a bit.
