Looping through t.tests for data frame subsets in R
Question
I have a data frame 'math.numeric' with 32 variables. Each row represents a student and each variable is an attribute. The students have been put into 5 groups based on their final grade.
The data looks like this:
head(math.numeric)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason ... group
1 1 18 2 1 1 4 4 1 5 1 2
1 1 17 2 1 2 1 1 1 3 1 2
1 1 15 2 2 2 1 1 1 3 3 3
1 1 15 2 1 2 4 2 2 4 2 4
1 1 16 2 1 2 3 3 3 3 2 3
1 2 16 2 2 2 4 3 4 3 4 4
I am performing t-tests on each variable for group 1 vs. all the other groups combined, to identify attributes that differ significantly for this group. I am looking to pull out the p-value of each test, for example:
t.test(subset(math.numeric$school, math.numeric$group == 1),
subset(math.numeric$school, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$sex, math.numeric$group == 1),
subset(math.numeric$sex, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$age, math.numeric$group == 1),
subset(math.numeric$age, math.numeric$group != 1))$p.value
I have been trying to figure out how I can create a loop to do this instead of writing out each test one at a time. I have tried a for loop, and lapply, but so far I have not had any luck.
I am fairly new to this, so any help would be appreciated.
Recommended answer
Your example data is not sufficient to actually carry out t-tests on all subgroups. For that reason, I use the iris dataset, which contains three species of plants: Setosa, Versicolor, and Virginica. These are my groups. You will have to adjust the code to your data accordingly. Below I show how to test one group versus all other groups combined, one group versus each other group, and all combinations of individual groups.
One group versus all other groups combined:
First, let's say I want to compare Versicolor and Virginica to Setosa, i.e. Setosa is my group 1, to which all other groups should be compared. An easy way to achieve what you want is the following:
sapply(names(iris)[-ncol(iris)], function(x){
t.test(iris[iris$Species=="setosa", x],
iris[iris$Species!="setosa", x])$p.value
})
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.709331e-32 1.035396e-13 1.746188e-69 1.347804e-60
Here, I have supplied the names of the variables in the dataset, names(iris), excluding the column that holds the grouping variable with [-ncol(iris)] (since it is the last column), as a vector to sapply, which passes each name in turn as the argument to the function that I have defined.
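For a data frame laid out like the one in the question (numeric attributes plus a numeric group column), the same idea might look roughly like the sketch below. The data frame toy and its columns are made up for illustration, since the real math.numeric is not available; selecting the variables by name with setdiff and using vapply instead of sapply are my own choices, not something the question requires.

```r
# Hypothetical stand-in for math.numeric: a few numeric attributes plus 'group'.
set.seed(1)
toy <- data.frame(
  age   = rnorm(100, mean = 16, sd = 1),
  Medu  = sample(0:4, 100, replace = TRUE),
  group = sample(1:5, 100, replace = TRUE)
)

# Pick the variables to test by name, so the position of 'group' does not matter.
vars <- setdiff(names(toy), "group")

# vapply checks that every test returns exactly one numeric p-value.
pvals <- vapply(vars, function(v) {
  t.test(toy[toy$group == 1, v], toy[toy$group != 1, v])$p.value
}, numeric(1))

pvals
```

The result is a named numeric vector with one p-value per attribute, which can be sorted or filtered like any other vector.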
One group versus each other group:
In case you want to make groupwise comparisons for all groups, the following may be helpful: First, create a dataframe of all group x variable combinations that you are going to do, excluding the grouping variable itself and the reference group, of course. This can be achieved by:
comps <- expand.grid(unique(iris$Species)[-1], # excluding Setosa as reference group
names(iris)[-ncol(iris)] # excluding group column
)
head(comps)
Var1 Var2
1 versicolor Sepal.Length
2 virginica Sepal.Length
3 versicolor Sepal.Width
4 virginica Sepal.Width
5 versicolor Petal.Length
6 virginica Petal.Length
Here, Var1 holds the different species and Var2 the different variables for which comparisons are to be made. The reference group 1, Setosa, is implicit in this case. Now I can use apply to run the tests. Each row of comps is passed to the function as a vector with two elements: the first indicates which group's turn it is, and the second indicates which variable should be compared. These are used to subset the original data frame.
comps$pval <- apply(comps, 1, function(x) {
t.test(iris[iris$Species=="setosa", x[2]], iris[iris$Species==x[1], x[2]])$p.value
} )
where group 1 aka Setosa is hard-coded in the function. This gives me a dataframe with p-values for all combinations (with Setosa as reference group) so that they are easy to look up:
head(comps)
Var1 Var2 pval
1 versicolor Sepal.Length 3.746743e-17
2 virginica Sepal.Length 3.966867e-25
3 versicolor Sepal.Width 2.484228e-15
4 virginica Sepal.Width 4.570771e-09
5 versicolor Petal.Length 9.934433e-46
6 virginica Petal.Length 9.269628e-50
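Note that apply converts each row of comps to a character vector before handing it to the function. That works here, but if you prefer to keep the columns as they are, a variant with mapply is possible. This is only a sketch; the column names group and variable are my own labels, not part of the code above.

```r
# Build the comparison grid with plain character columns and explicit names.
comps <- expand.grid(group    = levels(iris$Species)[-1],   # drop the reference group
                     variable = names(iris)[-ncol(iris)],   # drop the grouping column
                     stringsAsFactors = FALSE)

# mapply walks over both columns in parallel, one (group, variable) pair at a time.
comps$pval <- mapply(function(g, v) {
  t.test(iris[iris$Species == "setosa", v],
         iris[iris$Species == g, v])$p.value
}, comps$group, comps$variable, USE.NAMES = FALSE)

head(comps)
```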
All combinations of groups:
You can expand the above easily to produce a dataframe that contains p-values of t-tests for each combination of groups. One approach would be:
comps <- expand.grid(unique(iris$Species), unique(iris$Species), names(iris)[-ncol(iris)])
This now has three columns. The first two are the groups, and the third the variable:
head(comps)
Var1 Var2 Var3
1 setosa setosa Sepal.Length
2 versicolor setosa Sepal.Length
3 virginica setosa Sepal.Length
4 setosa versicolor Sepal.Length
5 versicolor versicolor Sepal.Length
6 virginica versicolor Sepal.Length
You can use this to carry out the tests:
comps$pval <- apply(comps, 1, function(x) {
t.test(iris[iris$Species==x[1], x[3]], iris[iris$Species==x[2], x[3]])$p.value
} )
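The grid built with expand.grid also contains each group compared with itself, and every pair twice (once in each order). If you only want each pair once, one possible alternative is to list the pairs with combn. The names grp_pairs, group1, group2 and variable below are just illustrative.

```r
# All unordered pairs of species, one pair per row.
grp_pairs <- t(combn(levels(iris$Species), 2))
vars <- names(iris)[-ncol(iris)]

# Repeat every pair once per variable.
comps <- data.frame(
  group1   = rep(grp_pairs[, 1], times = length(vars)),
  group2   = rep(grp_pairs[, 2], times = length(vars)),
  variable = rep(vars, each = nrow(grp_pairs)),
  stringsAsFactors = FALSE
)

comps$pval <- mapply(function(g1, g2, v) {
  t.test(iris[iris$Species == g1, v],
         iris[iris$Species == g2, v])$p.value
}, comps$group1, comps$group2, comps$variable, USE.NAMES = FALSE)

head(comps)
```

With three species and four variables this gives 12 rows instead of 36.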
I get an error message: what should I do?
t.test may throw an error if the sample size is too small or if the values are constant for one group. This is problematic since it might only occur for specific groups, and you may not know in advance which ones. Yet the error will abort the entire call to apply, and you will not see any results.
A way to circumvent this and to identify the problematic groups is to wrap t.test in dplyr::failwith (see also ?tryCatch). To show how this works, consider the following:
smalln <- data.frame(a=1, b=2)
t.test(smalln$a, smalln$b)
# Error in t.test.default(smalln$a, smalln$b) : not enough 'x' observations
failproof.t <- dplyr::failwith(default="Some default of your liking", t.test, quiet = TRUE)
failproof.t(smalln$a, smalln$b)
[1] "Some default of your liking"
That way, whenever t.test would throw an error, you get a character string as the result instead, and the computation continues with the other groups. Needless to say, you could also set default to a number or anything else; it does not have to be a character string.
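If you would rather not depend on dplyr for this (as far as I know, newer dplyr releases deprecate failwith in favour of purrr::possibly), the same effect can be had with base R's tryCatch. The helper name safe_pvalue is made up for this sketch, and returning NA on failure is just one reasonable default.

```r
# Return the p-value, or NA if t.test fails for this pair of samples.
safe_pvalue <- function(x, y) {
  tryCatch(t.test(x, y)$p.value,
           error = function(e) NA_real_)
}

safe_pvalue(1, 2)                     # too few observations: NA instead of an error
safe_pvalue(c(1, 2, 3), c(2, 4, 7))   # enough data: an ordinary p-value
```

Using NA as the default keeps the result column numeric, and the problematic groups can afterwards be found with is.na.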
Statistical disclaimer: Having said all of this, note that conducting several t-tests is not necessarily good statistical practice. You may want to adjust your p-values to account for multiple testing, or you may want to use alternative test procedures that conduct joint tests.
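As a rough illustration of the adjustment step, base R's p.adjust can be applied to the vector of p-values from the first example. Which correction is appropriate depends on your analysis; Holm and Benjamini-Hochberg are shown here only as two common options.

```r
# p-values for setosa versus all other species, one per variable.
pvals <- sapply(names(iris)[-ncol(iris)], function(x) {
  t.test(iris[iris$Species == "setosa", x],
         iris[iris$Species != "setosa", x])$p.value
})

adj_holm <- p.adjust(pvals, method = "holm")  # controls the family-wise error rate
adj_bh   <- p.adjust(pvals, method = "BH")    # controls the false discovery rate

cbind(raw = pvals, holm = adj_holm, BH = adj_bh)
```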