Using shapiro.test on multiple columns in a data frame
Question
It seems like a pretty simple question, but I can't find the answer.
I have a data frame (let's call it df) containing n = 100 columns (C1, C2, ..., C100) and 50 rows (R1, R2, ..., R50). I checked all the columns in the data frame to be sure they are numeric. I want to know whether the data in each column are normally distributed, using the shapiro.test() function.
I am able to do it column by column using the code:
> shapiro.test(df$Cn)
or
> shapiro.test(df[,c(Cn)])
However, when I try to do it on several columns at the same time, it doesn't work:
> shapiro.test(df[,c(C1:C100)])
returns the error:
Error in `[.data.frame`(x, complete.cases(x)) : undefined columns selected
I would appreciate it if anyone could suggest a way to run all the tests at the same time, and possibly store the results in a new data frame/matrix/list/vector.
Thanks!
Seb
Recommended answer
Not that I think this is a sensible approach to data analysis, but the underlying issue of applying a function to the columns of a data frame is a general task that can easily be achieved using sapply() or lapply() (or even apply(), but for data frames one of the first two functions would be best).
Here is an example, using some dummy data:
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))
Now apply the shapiro.test() function. We capture the output in a list (given the object returned by this function), so we will use lapply().
lshap <- lapply(df, shapiro.test)
lshap[[1]] ## look at the first column results
R> lshap[[1]]
Shapiro-Wilk normality test
data: X[[1L]]
W = 0.9802, p-value = 0.5611
You will need to extract the things you want from these objects, which all have the structure:
R> str(lshap[[1]])
List of 4
$ statistic: Named num 0.98
..- attr(*, "names")= chr "W"
$ p.value : num 0.561
$ method : chr "Shapiro-Wilk normality test"
$ data.name: chr "X[[1L]]"
- attr(*, "class")= chr "htest"
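As a minimal sketch of what that structure allows (the variable names w_gauss and p_gauss are only illustrative), individual components of one result can be pulled out by name, and the list itself carries the column names of df:

```r
# Recreate the dummy data and the list of test results from above
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))
lshap <- lapply(df, shapiro.test)

# Each element is an "htest" object, i.e. a list, so $ and [[ ]] work
w_gauss <- lshap$Gaussian$statistic      # named numeric; the name is "W"
p_gauss <- lshap[["Gaussian"]]$p.value   # plain numeric of length 1

# The list elements are named after the columns of df
names(lshap)
```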
If you want the statistic and p.value components of this object for all elements of lshap, use sapply() this time, which arranges the results nicely for us:
lres <- sapply(lshap, `[`, c("statistic","p.value"))
R> lres
          Gaussian Poisson Uniform
statistic 0.9802   0.9371  0.918
p.value   0.5611   0.01034 0.001998
Given that you have 100 of these, I'd transpose lres:
R> t(lres)
         statistic p.value
Gaussian 0.9802    0.5611
Poisson  0.9371    0.01034
Uniform  0.918     0.001998
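One thing to be aware of: extracting with `[` returns small lists, so the matrix built by sapply() has mode list rather than numeric, which can be awkward for sorting or filtering. If a data frame with ordinary numeric columns is wanted, one possible alternative (a sketch, not part of the approach above; the names res and column are made up for illustration) is to extract each component with vapply():

```r
# Recreate the dummy data and the list of test results from above
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))
lshap <- lapply(df, shapiro.test)

# vapply() insists that every extracted value is a single number,
# so the resulting columns are plain numeric vectors
res <- data.frame(
  column    = names(lshap),
  statistic = vapply(lshap, function(x) unname(x$statistic), numeric(1)),
  p.value   = vapply(lshap, function(x) x$p.value, numeric(1)),
  row.names = NULL
)
res

# Numeric columns can be filtered directly, e.g. columns with p < 0.05
res[res$p.value < 0.05, ]
```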
If you plan on doing anything with the p-values from this exercise, I suggest you start thinking about how to correct for multiple comparisons before you shoot yourself in the foot with a 30-cal.
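For instance, base R's p.adjust() can adjust a vector of p-values. The sketch below uses the same dummy data; the choice of the Holm and Benjamini-Hochberg methods is only an illustration, not a recommendation for any particular analysis:

```r
# Recreate the dummy data from above
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))

# One raw Shapiro-Wilk p-value per column
pvals <- vapply(df, function(x) shapiro.test(x)$p.value, numeric(1))

# Adjust for the number of tests performed
p_holm <- p.adjust(pvals, method = "holm")  # controls family-wise error rate
p_bh   <- p.adjust(pvals, method = "BH")    # controls false discovery rate

# Raw and adjusted p-values side by side, one row per column of df
cbind(raw = pvals, holm = p_holm, BH = p_bh)
```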