Looping through t.tests for data frame subsets in R

Problem description

I have a data frame 'math.numeric' with 32 variables. Each row represents a student and each variable is an attribute. The students have been put into 5 groups based on their final grade.

The data look like this:

head(math.numeric)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason ... group
     1   1  18       2       1       1    4    4    1    5      1         2
     1   1  17       2       1       2    1    1    1    3      1         2
     1   1  15       2       2       2    1    1    1    3      3         3
     1   1  15       2       1       2    4    2    2    4      2         4
     1   1  16       2       1       2    3    3    3    3      2         3
     1   2  16       2       2       2    4    3    4    3      4         4

I am performing t-tests on each variable for group 1 vs. all the other groups combined, to identify the attributes on which this group differs significantly. I would like to pull out the p-value for each test, for example:

t.test(subset(math.numeric$school, math.numeric$group == 1),
       subset(math.numeric$school, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$sex, math.numeric$group == 1),
       subset(math.numeric$sex, math.numeric$group != 1))$p.value
t.test(subset(math.numeric$age, math.numeric$group == 1),
       subset(math.numeric$age, math.numeric$group != 1))$p.value

I have been trying to figure out how I can create a loop to do this instead of writing out each test one at a time. I have tried a for loop, and lapply, but so far I have not had any luck.

I am fairly new to this, so any help would be appreciated.

Courtney

Recommended answer

Your example data is not sufficient to actually carry out t-tests on all subgroups. For that reason, I use the iris dataset, which contains 3 species of plants: Setosa, Versicolor, and Virginica. These are my groups. You will have to adjust your code accordingly. Below I show how to test one group versus all other groups combined, one group versus each other group, and all combinations of individual groups.

One group versus all other groups combined:

First, let's say I want to compare Versicolor and Virginica to Setosa, i.e. Setosa is my group 1 to which all other groups should be compared. An easy way to achieve what you want is the following:

sapply(names(iris)[-ncol(iris)], function(x) {
  t.test(iris[iris$Species == "setosa", x],
         iris[iris$Species != "setosa", x])$p.value
})
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
7.709331e-32 1.035396e-13 1.746188e-69 1.347804e-60 

Here, I have supplied the names of the different variables in the dataset, names(iris), excluding the column that holds the grouping variable with [-ncol(iris)] (since it is the last column), as a vector to sapply, which passes each name in turn as the argument to the function that I have defined.
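
To adapt this to a data frame such as math.numeric, the same pattern can be wrapped in a small function. The following is only a sketch: the function name pvals_vs_rest and its arguments are made up for illustration, and it assumes a plain data.frame whose non-grouping columns are all numeric.

```r
# Hypothetical helper (not from the original answer): Welch t-test p-values
# for one reference group versus all remaining rows, one test per variable.
pvals_vs_rest <- function(data, group_col, ref) {
  vars   <- setdiff(names(data), group_col)   # every column except the grouping one
  in_ref <- data[[group_col]] == ref          # TRUE for rows in the reference group
  sapply(vars, function(v) {
    t.test(data[in_ref, v], data[!in_ref, v])$p.value
  })
}

# Same comparison as above: Setosa versus all other species
p_setosa <- pvals_vs_rest(iris, "Species", "setosa")
print(p_setosa)
```

With the data from the question, the call would presumably be pvals_vs_rest(math.numeric, "group", 1).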

One group versus each other group:

If you want to compare the reference group with each other group separately, the following may be helpful. First, create a data frame of all group x variable combinations that you are going to test, excluding the grouping variable itself and, of course, the reference group. This can be achieved by:

comps <- expand.grid(unique(iris$Species)[-1], # excluding Setosa as reference group
                     names(iris)[-ncol(iris)] # excluding group column
                     )
head(comps)
        Var1         Var2
1 versicolor Sepal.Length
2  virginica Sepal.Length
3 versicolor  Sepal.Width
4  virginica  Sepal.Width
5 versicolor Petal.Length
6  virginica Petal.Length

Here, Var1 holds the different species and Var2 the different variables for which comparisons are to be made. The reference group 1, or Setosa, is implicit in this case. Now, I can use apply to run the tests. Each row of comps is passed to the function as a vector x with two elements: the first indicates which group is being compared, and the second indicates which variable should be compared. These are used to subset the original data frame.

comps$pval <- apply(comps, 1, function(x) {
    t.test(iris[iris$Species=="setosa", x[2]], iris[iris$Species==x[1], x[2]])$p.value 
    } )

Here, group 1 (Setosa) is hard-coded in the function. This gives me a data frame with p-values for all combinations (with Setosa as the reference group), so that they are easy to look up:

head(comps)
        Var1         Var2         pval
1 versicolor Sepal.Length 3.746743e-17
2  virginica Sepal.Length 3.966867e-25
3 versicolor  Sepal.Width 2.484228e-15
4  virginica  Sepal.Width 4.570771e-09
5 versicolor Petal.Length 9.934433e-46
6  virginica Petal.Length 9.269628e-50
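
One detail worth knowing: apply converts comps to a character matrix before looping over its rows, which is why x[1] and x[2] arrive as plain strings. A variant that avoids this conversion is sketched below using mapply. The object name comps2 and the column names group and variable are my own choices for illustration, not part of the original answer.

```r
ref  <- "setosa"                        # reference group (hard-coded above)
vars <- names(iris)[-ncol(iris)]        # all columns except Species

# One row per (other group, variable) combination, kept as character columns
comps2 <- expand.grid(group    = setdiff(levels(iris$Species), ref),
                      variable = vars,
                      stringsAsFactors = FALSE)

# mapply walks over both columns in parallel, one test per row
comps2$pval <- mapply(function(g, v) {
  t.test(iris[iris$Species == ref, v],
         iris[iris$Species == g,   v])$p.value
}, comps2$group, comps2$variable, USE.NAMES = FALSE)

print(head(comps2))
```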

All combinations of groups:

You can expand the above easily to produce a dataframe that contains p-values of t-tests for each combination of groups. One approach would be:

comps <- expand.grid(unique(iris$Species), unique(iris$Species), names(iris)[-ncol(iris)])

This now has three columns. The first two are the groups, and the third the variable:

head(comps)
        Var1       Var2         Var3
1     setosa     setosa Sepal.Length
2 versicolor     setosa Sepal.Length
3  virginica     setosa Sepal.Length
4     setosa versicolor Sepal.Length
5 versicolor versicolor Sepal.Length
6  virginica versicolor Sepal.Length

You can use this to carry out the tests:

comps$pval <- apply(comps, 1, function(x) {
  t.test(iris[iris$Species==x[1], x[3]], iris[iris$Species==x[2], x[3]])$p.value 
} )
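
Note that expand.grid also produces rows that compare a group with itself, and it lists every pair twice (A vs. B and B vs. A). If only the distinct pairs are of interest, combn can generate them. A sketch, with object and column names chosen purely for illustration:

```r
vars      <- names(iris)[-ncol(iris)]
grp_pairs <- t(combn(levels(iris$Species), 2))   # one row per unordered pair of groups

# Repeat every pair once per variable
pairwise <- data.frame(
  group1   = rep(grp_pairs[, 1], times = length(vars)),
  group2   = rep(grp_pairs[, 2], times = length(vars)),
  variable = rep(vars, each = nrow(grp_pairs)),
  stringsAsFactors = FALSE
)

pairwise$pval <- mapply(function(g1, g2, v) {
  t.test(iris[iris$Species == g1, v],
         iris[iris$Species == g2, v])$p.value
}, pairwise$group1, pairwise$group2, pairwise$variable, USE.NAMES = FALSE)

print(pairwise)
```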

I get an error message: what should I do?

t.test may throw an error if the sample size is too small or if the values are constant within a group. This is problematic since it might only occur for specific groups, and you may not know in advance which ones. Yet the error will abort the entire call to apply, and you will not see any results.

One way to circumvent this and to identify the problematic groups is to wrap t.test in dplyr::failwith (see also ?tryCatch). To show how this works, consider the following:

smalln <- data.frame(a=1, b=2)
t.test(smalln$a, smalln$b)
> Error in t.test.default(smalln$a, smalln$b) : not enough 'x' observations

failproof.t <- dplyr::failwith(default="Some default of your liking", t.test, quiet = TRUE)
failproof.t(smalln$a, smalln$b)
[1] "Some default of your liking"

That way, whenever t.test would throw an error, you get a character string as the result instead, and the computation continues with the other groups. Needless to say, you could also set default to a number or anything else; it does not have to be a character string.
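
failwith appears to have been deprecated in later dplyr releases (in favour of purrr::possibly), so a dependency-free alternative may be preferable. The sketch below uses base R's tryCatch, which is also referred to above. The names safe_pval and default are hypothetical.

```r
# Hypothetical wrapper: return `default` instead of stopping when t.test fails.
safe_pval <- function(x, y, default = NA_real_) {
  tryCatch(t.test(x, y)$p.value,
           error = function(e) default)
}

# Too few observations: the default is returned instead of an error
print(safe_pval(1, 2))

# Enough data: an ordinary p-value is returned
print(safe_pval(iris$Sepal.Length[1:50], iris$Sepal.Length[51:100]))
```

Using NA as the default keeps the result numeric, so the rows where the test failed can be found afterwards with is.na.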

Statistical disclaimer: Having said all of this, note that conducting several t-tests is not necessarily good statistical practice. You may want to adjust your p-values to account for multiple testing, or you may want to use alternative test procedures that conduct joint tests.
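
For the p-value adjustment, base R provides p.adjust. A minimal sketch using the Setosa-versus-rest p-values; the choice of the Holm and Benjamini-Hochberg methods here is only illustrative, and which correction is appropriate depends on the analysis:

```r
# Raw p-values: Setosa versus all other species, one test per variable
pvals <- sapply(names(iris)[-ncol(iris)], function(x) {
  t.test(iris[iris$Species == "setosa", x],
         iris[iris$Species != "setosa", x])$p.value
})

p_holm <- p.adjust(pvals, method = "holm")  # controls the family-wise error rate
p_bh   <- p.adjust(pvals, method = "BH")    # controls the false discovery rate

print(cbind(raw = pvals, holm = p_holm, BH = p_bh))
```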
