Looping through t.tests for data frame subsets in R

Question

I have a data frame 'math.numeric' with 32 variables. Each row represents a student and each variable is an attribute. The students have been put into 5 groups based on their final grade.

The data looks like this:

head(math.numeric)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason ... group
     1   1  18       2       1       1    4    4    1    5      1 ...     2
     1   1  17       2       1       2    1    1    1    3      1 ...     2
     1   1  15       2       2       2    1    1    1    3      3 ...     3
     1   1  15       2       1       2    4    2    2    4      2 ...     4
     1   1  16       2       1       2    3    3    3    3      2 ...     3
     1   2  16       2       2       2    4    3    4    3      4 ...     4

I am performing t-tests on each variable for group 1 vs. all the other groups combined, to identify the attributes on which this group differs significantly. I am looking to pull out the p-value for each test, such as:

t.test(subset(math.numeric$school, math.numeric$group == 1),
       subset(math.numeric$school, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$sex, math.numeric$group == 1),
       subset(math.numeric$sex, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$age, math.numeric$group == 1),
       subset(math.numeric$age, math.numeric$group != 1))$p.value

I have been trying to figure out how I can create a loop to do this instead of writing out each test one at a time. I have tried a for loop and lapply, but so far I have not had any luck.

I am fairly new to this, so any help would be appreciated.

Recommended answer

Your example data is not sufficient to actually carry out t-tests on all subgroups. For that reason, I use the iris dataset, which contains 3 species of plants: Setosa, Versicolor, and Virginica. These are my groups. You will have to adjust the code to your data accordingly. Below I show how to test one group versus all other groups combined, one group versus each other group, and all combinations of individual groups.

One group vs. all other groups combined:

First, let's say I want to compare Versicolor and Virginica to Setosa, i.e. Setosa is my group 1, to which all other groups should be compared. An easy way to achieve what you want is the following:

sapply(names(iris)[-ncol(iris)], function(x) {
  t.test(iris[iris$Species == "setosa", x],
         iris[iris$Species != "setosa", x])$p.value
})
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
7.709331e-32 1.035396e-13 1.746188e-69 1.347804e-60 

Here, I have supplied the names of the variables in the dataset, names(iris), excluding the column that holds the grouping variable with [-ncol(iris)] (since it is the last column), as a vector to sapply, which passes each name in turn as the argument to the function that I have defined.
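
The same pattern can be adapted to a data frame shaped like math.numeric. What follows is only a minimal sketch, assuming the grouping column is called group and every other column is numeric. The helper name pvals_vs_rest and the simulated data frame toy are hypothetical and exist purely for illustration.

```r
# Hypothetical helper: p-values of t-tests comparing one reference group
# against all other rows, for every column except the grouping column.
pvals_vs_rest <- function(df, group_col, ref) {
  vars   <- setdiff(names(df), group_col)
  in_ref <- df[[group_col]] == ref
  sapply(vars, function(v) {
    t.test(df[[v]][in_ref], df[[v]][!in_ref])$p.value
  })
}

# Simulated stand-in for math.numeric (made-up values, illustration only)
set.seed(1)
toy <- data.frame(
  age   = sample(15:19, 100, replace = TRUE),
  Medu  = sample(1:4, 100, replace = TRUE),
  group = sample(1:5, 100, replace = TRUE)
)
p_toy <- pvals_vs_rest(toy, "group", ref = 1)

# Applied to iris, this performs the same tests as the sapply call above
p_iris <- pvals_vs_rest(iris, "Species", ref = "setosa")
```

Passing the grouping column and reference group as arguments keeps the test logic in one place, so switching the reference group only means changing ref.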

One group vs. each other group:

In case you want to compare the reference group with each other group separately, the following may be helpful. First, create a data frame of all group x variable combinations that you are going to test, excluding the grouping variable itself and the reference group, of course. This can be achieved by:

comps <- expand.grid(unique(iris$Species)[-1], # excluding Setosa as reference group
                     names(iris)[-ncol(iris)] # excluding group column
                     )
head(comps)
        Var1         Var2
1 versicolor Sepal.Length
2  virginica Sepal.Length
3 versicolor  Sepal.Width
4  virginica  Sepal.Width
5 versicolor Petal.Length
6  virginica Petal.Length

Here, Var1 holds the different species and Var2 the different variables for which comparisons are to be made. The reference group 1, Setosa, is implicit in this case. Now I can use apply to run the tests. Each row of comps is passed to the function as a vector with two elements: the first indicates which group's turn it is, and the second indicates which variable should be compared. These are used to subset the original data frame.

comps$pval <- apply(comps, 1, function(x) {
    t.test(iris[iris$Species=="setosa", x[2]], iris[iris$Species==x[1], x[2]])$p.value 
    } )

where group 1, Setosa, is hard-coded in the function. This gives me a data frame with p-values for all combinations (with Setosa as reference group), so that they are easy to look up:

head(comps)
        Var1         Var2         pval
1 versicolor Sepal.Length 3.746743e-17
2  virginica Sepal.Length 3.966867e-25
3 versicolor  Sepal.Width 2.484228e-15
4  virginica  Sepal.Width 4.570771e-09
5 versicolor Petal.Length 9.934433e-46
6  virginica Petal.Length 9.269628e-50
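
Note that apply() converts the data frame to a character matrix before handing each row to the function, which is why x[1] and x[2] can be used as plain strings. An alternative sketch, assuming you would rather pass the two columns as separate arguments, uses mapply. The names comps2 and one_test are hypothetical.

```r
# Same grid as before, but with character columns and descriptive names
comps2 <- expand.grid(group    = levels(iris$Species)[-1],  # drop setosa
                      variable = names(iris)[-ncol(iris)],  # drop Species
                      stringsAsFactors = FALSE)

# Hypothetical helper: setosa vs. one other group on one variable
one_test <- function(group, variable) {
  t.test(iris[iris$Species == "setosa", variable],
         iris[iris$Species == group, variable])$p.value
}

# mapply walks both columns in parallel, one row of the grid at a time
comps2$pval <- mapply(one_test, comps2$group, comps2$variable,
                      USE.NAMES = FALSE)
```

This avoids relying on positional indexing such as x[1] and x[2], which can be easy to mix up once more columns are added to the grid.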

All combinations of groups:

You can easily extend the above to produce a data frame that contains the p-values of t-tests for each combination of groups. One approach would be:

comps <- expand.grid(unique(iris$Species), unique(iris$Species), names(iris)[-ncol(iris)])

This now has three columns. The first two are the groups, and the third is the variable:

head(comps)
        Var1       Var2         Var3
1     setosa     setosa Sepal.Length
2 versicolor     setosa Sepal.Length
3  virginica     setosa Sepal.Length
4     setosa versicolor Sepal.Length
5 versicolor versicolor Sepal.Length
6  virginica versicolor Sepal.Length

You can use this to carry out the tests:

comps$pval <- apply(comps, 1, function(x) {
  t.test(iris[iris$Species==x[1], x[3]], iris[iris$Species==x[2], x[3]])$p.value 
} )
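
The grid built with expand.grid also contains rows that compare a group with itself, and every pair appears twice (for example setosa vs. versicolor and versicolor vs. setosa). If that redundancy is unwanted, one option is to build the pairs with combn() instead. This is a sketch rather than part of the original answer; the names group_pairs, grid, and pair_id are hypothetical.

```r
groups <- levels(iris$Species)
vars   <- names(iris)[-ncol(iris)]

# Each column of group_pairs holds one unordered pair of distinct groups
group_pairs <- combn(groups, 2)

# One row per (pair, variable) combination
grid <- expand.grid(pair_id  = seq_len(ncol(group_pairs)),
                    variable = vars,
                    stringsAsFactors = FALSE)
grid$group1 <- group_pairs[1, grid$pair_id]
grid$group2 <- group_pairs[2, grid$pair_id]

grid$pval <- mapply(function(g1, g2, v) {
  t.test(iris[iris$Species == g1, v],
         iris[iris$Species == g2, v])$p.value
}, grid$group1, grid$group2, grid$variable, USE.NAMES = FALSE)
```

With three species and four variables this yields 12 tests instead of the 36 rows in the full grid.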

I get an error message: what should I do?

t.test may throw an error if the sample size is too small or if the values are constant within a group. This is problematic since it might only occur for specific groups, and you may not know in advance which ones. Yet the error will abort the entire call to apply, and you will not be able to see any results.

A way to circumvent this and to identify the problematic groups is to wrap t.test in dplyr::failwith (see also ?tryCatch). Note that failwith has since been deprecated in dplyr in favour of purrr::possibly(). To show how this works, consider the following:

library(dplyr)  # provides failwith()

smalln <- data.frame(a=1, b=2)
t.test(smalln$a, smalln$b)
> Error in t.test.default(smalln$a, smalln$b) : not enough 'x' observations

failproof.t <- failwith(default="Some default of your liking", t.test, quiet = TRUE)
failproof.t(smalln$a, smalln$b)
[1] "Some default of your liking"

That way, whenever t.test would throw an error, you get a character string as the result instead, and the computation continues with the other groups. Needless to say, you could also set default to a number, or anything else. It does not have to be a character string.
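
The same idea can be sketched in base R with tryCatch(), without depending on dplyr. The wrapper name safe_pvalue is hypothetical, and returning NA on failure is just one possible choice of default.

```r
# Hypothetical wrapper: return the p-value, or NA if t.test() raises an error
safe_pvalue <- function(x, y) {
  tryCatch(
    t.test(x, y)$p.value,
    error = function(e) NA_real_
  )
}

p_small    <- safe_pvalue(1, 2)                    # too few observations
p_constant <- safe_pvalue(c(1, 1, 1), c(2, 2, 2))  # constant data
p_ok       <- safe_pvalue(iris$Sepal.Length[1:50],
                          iris$Sepal.Length[51:100])
```

Because the failures come back as NA in a numeric vector, the problematic groups can be located afterwards with is.na().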

Statistical disclaimer: Having said all of this, note that conducting several t-tests is not necessarily good statistical practice. You may want to adjust your p-values to account for multiple testing, or you may want to use alternative test procedures that conduct joint tests.
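
As a minimal sketch of the first suggestion, base R's p.adjust() can correct a vector of p-values for multiple testing. The choice of the Holm and Bonferroni methods here is only an example, not a recommendation from the original answer.

```r
# Unadjusted p-values: setosa vs. all other species, one test per variable
pvals <- sapply(names(iris)[-ncol(iris)], function(x) {
  t.test(iris[iris$Species == "setosa", x],
         iris[iris$Species != "setosa", x])$p.value
})

# Adjusted p-values; names are carried over from pvals
p_holm <- p.adjust(pvals, method = "holm")
p_bonf <- p.adjust(pvals, method = "bonferroni")
```

See ?p.adjust for the other available methods and their assumptions.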
