Using shapiro.test on multiple columns in a data frame


Question

It seems like a pretty simple question, but I can't find the answer.

I have a data frame (let's call it df) containing n = 100 columns (C1, C2, ..., C100) and 50 rows (R1, R2, ..., R50). I checked all the columns in the data frame to be sure they are numeric. I want to know whether the data in each column follow a normal distribution, using the shapiro.test() function.

I am able to do it column by column using the code:

> shapiro.test(df$Cn)

> shapiro.test(df[,c(Cn)])

However, when I try to do it on several columns at the same time, it doesn't work:

> shapiro.test(df[,c(C1:C100)])

It returns the error:

Error in `[.data.frame`(x, complete.cases(x)) : undefined columns selected

I would appreciate it if anyone could suggest a way to run all the tests at once and, ideally, store the results in a new data frame, matrix, list, or vector.

Thanks!

Seb

Recommended answer

Not that I think this is a sensible approach to data analysis, but the underlying issue, applying a function to each column of a data frame, is a general task that can easily be achieved using sapply() or lapply() (or even apply(), but for data frames one of the first two would be best).

Here is an example, using some dummy data:

set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2), 
                 Uniform = runif(50))

Now apply the shapiro.test() function to each column. Because the function returns an object of class "htest" (a list), we capture the output in a list, so we use lapply().

lshap <- lapply(df, shapiro.test)
lshap[[1]] ## look at the first column results

R> lshap[[1]]

    Shapiro-Wilk normality test

data:  X[[1L]]
W = 0.9802, p-value = 0.5611
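To test only some of the columns, as in the question, the same pattern works on a subset selected by name. The following is a minimal sketch; the data frame df_demo and its column names are hypothetical stand-ins for the asker's data, not taken from the original post.

```r
set.seed(1)

# Hypothetical stand-in for the asker's data: 50 rows, columns C1..C5
df_demo <- as.data.frame(matrix(rnorm(50 * 5), nrow = 50))
names(df_demo) <- paste0("C", 1:5)

# Build the column names as strings, then select them with single-bracket
# indexing so the result is still a data frame (a list of columns)
cols <- paste0("C", 1:3)
lshap_sub <- lapply(df_demo[cols], shapiro.test)

names(lshap_sub)  # one "htest" result per selected column
```

Selecting with a character vector avoids relying on unquoted names such as C1:C100, which R would try to evaluate as objects in the workspace.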

You will need to extract the things you want from these objects, which all have the following structure:

R> str(lshap[[1]])
List of 4
 $ statistic: Named num 0.98
  ..- attr(*, "names")= chr "W"
 $ p.value  : num 0.561
 $ method   : chr "Shapiro-Wilk normality test"
 $ data.name: chr "X[[1L]]"
 - attr(*, "class")= chr "htest"

If you want the statistic and p.value components for all elements of lshap, use sapply() this time, which arranges the results nicely:

lres <- sapply(lshap, `[`, c("statistic","p.value"))

R> lres
          Gaussian Poisson Uniform 
statistic 0.9802   0.9371  0.918   
p.value   0.5611   0.01034 0.001998

Given that you have 100 of these, I'd transpose lres so that each column of df becomes a row:

R> t(lres)
         statistic p.value 
Gaussian 0.9802    0.5611  
Poisson  0.9371    0.01034 
Uniform  0.918     0.001998
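Note that extracting with `[` returns list elements, so lres and t(lres) are matrices of mode list rather than numeric matrices. If a plain numeric data frame is more convenient for sorting or filtering, one possible variant uses vapply(). This is a sketch rather than part of the original answer, and the helper name shapiro_summary is made up for illustration.

```r
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))

# Hypothetical helper: run the test and return a plain numeric vector
shapiro_summary <- function(x) {
  res <- shapiro.test(x)
  c(statistic = unname(res$statistic), p.value = res$p.value)
}

# vapply() checks that every result is a numeric vector of length 2;
# the result has one column per variable, so transpose it
num_res <- t(vapply(df, shapiro_summary, c(statistic = 0, p.value = 0)))
res_df <- as.data.frame(num_res)

res_df[order(res_df$p.value), ]  # e.g. sort columns of df by p-value
```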

If you plan on doing anything with the p-values from this exercise, I suggest you start thinking about how to correct for multiple comparisons before you shoot yourself in the foot with a 30-cal.
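As a starting point for that correction, base R's p.adjust() can adjust a vector of p-values. The sketch below uses the dummy data from above; the choice of the Holm and Benjamini-Hochberg methods is only an illustration, not a recommendation from the original answer.

```r
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))

# One raw p-value per column
p_raw <- vapply(df, function(x) shapiro.test(x)$p.value, numeric(1))

# Holm controls the family-wise error rate;
# "BH" (Benjamini-Hochberg) controls the false discovery rate
p_holm <- p.adjust(p_raw, method = "holm")
p_bh   <- p.adjust(p_raw, method = "BH")

cbind(raw = p_raw, holm = p_holm, BH = p_bh)
```

Which method is appropriate depends on what you intend to conclude from the 100 tests.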
