Difference between tilde and "by" while using aggregate function in R


Problem description



Every time I aggregate a data.frame, I default to using the "by = list(...)" parameter. But I also see solutions on Stack Overflow and elsewhere where a tilde (~) is used in the "formula" parameter. I tend to think of the "by" parameter as the "pivot" around these variables.

In some cases, the output is exactly the same. For example:

aggregate(cbind(df$A, df$B, df$C), FUN = sum, by = list("x" = df$D, "y" = df$E))

and

aggregate(cbind(df$A, df$B, df$C) ~ df$E, FUN = sum)

What is the difference between the two and when do you use which?

Solution

I would not entirely disagree that it doesn't really matter which approach you use. However, it is important to note that they do behave differently.

I'll illustrate with a small example.

Here's some sample data:

set.seed(1)
mydf <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
                   B = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
                   matrix(sample(100, 36, replace = TRUE), nrow = 12))
mydf[3:5] <- lapply(mydf[3:5], function(x) {
  x[sample(nrow(mydf), 1)] <- NA
  x
})
mydf
#    A B X1  X2 X3
# 1  1 A 27  69 27
# 2  1 A 38  NA 39
# 3  1 A 58  77  2
# 4  2 A 91  50 39
# 5  2 A 21  72 87
# 6  3 B 90 100 35
# 7  3 B 95  39 49
# 8  3 B 67  78 60
# 9  3 B 63  94 NA
# 10 4 B NA  22 19
# 11 4 B 21  66 83
# 12 4 B 18  13 67

First, the formula interface. The following three commands will all yield the same output.

aggregate(cbind(X1, X2, X3) ~ A + B, mydf, sum)
aggregate(cbind(X1, X2, X3) ~ ., mydf, sum)
aggregate(. ~ A + B, mydf, sum)
#   A B  X1  X2  X3
# 1 1 A  85 146  29
# 2 2 A 112 122 126
# 3 3 B 252 217 144
# 4 4 B  39  79 150
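
To make the dot shorthand concrete, here is a minimal sketch. It assumes the sample data is typed in by hand from the printout above, since the output of sample() after set.seed(1) can differ between R versions. The names r_full, r_rhs_dot, and r_lhs_dot are illustrative only.

```r
# Sample data entered by hand from the printout above (an assumption, so the
# result does not depend on the random number generator).
mydf <- data.frame(
  A  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  B  = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
  X1 = c(27, 38, 58, 91, 21, 90, 95, 67, 63, NA, 21, 18),
  X2 = c(69, NA, 77, 50, 72, 100, 39, 78, 94, 22, 66, 13),
  X3 = c(27, 39, 2, 39, 87, 35, 49, 60, NA, 19, 83, 67)
)

# Everything spelled out: value columns on the left, grouping columns on the right.
r_full <- aggregate(cbind(X1, X2, X3) ~ A + B, data = mydf, FUN = sum)

# A dot on the right stands for "all columns not named on the left".
r_rhs_dot <- aggregate(cbind(X1, X2, X3) ~ ., data = mydf, FUN = sum)

# A dot on the left stands for "all columns not named on the right".
r_lhs_dot <- aggregate(. ~ A + B, data = mydf, FUN = sum)

# Compare the three results.
print(all.equal(r_full, r_rhs_dot))
print(all.equal(r_full, r_lhs_dot))
```

The dot form is convenient when the data frame holds only grouping and value columns; with extra columns present, spelling the variables out is the safer choice.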

Here's a related command for the "by" interface. It is pretty cumbersome to type (but that can be addressed by using with(), if required).

aggregate(cbind(mydf$X1, mydf$X2, mydf$X3), 
          by = list(mydf$A, mydf$B), sum)
#   Group.1 Group.2  V1  V2  V3
# 1       1       A 123  NA  68
# 2       2       A 112 122 126
# 3       3       B 315 311  NA
# 4       4       B  NA 101 169
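
As a rough sketch of the with() shortcut mentioned above, the following shows two ways to cut down the typing in the "by" interface. It assumes the sample data is entered by hand from the printout. The names res_with and res_sub are hypothetical.

```r
# Sample data entered by hand from the printout above (an assumption, so the
# result does not depend on the random number generator).
mydf <- data.frame(
  A  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  B  = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
  X1 = c(27, 38, 58, 91, 21, 90, 95, 67, 63, NA, 21, 18),
  X2 = c(69, NA, 77, 50, 72, 100, 39, 78, 94, 22, 66, 13),
  X3 = c(27, 39, 2, 39, 87, 35, 49, 60, NA, 19, 83, 67)
)

# with() evaluates the call inside mydf, so columns can be used without mydf$.
# cbind() receives plain column names, so the value columns keep those names,
# and naming the elements of `by` replaces Group.1 / Group.2.
res_with <- with(mydf, aggregate(cbind(X1, X2, X3),
                                 by = list(A = A, B = B), FUN = sum))
print(res_with)

# A data frame is a list, so column subsets can be passed directly as well.
res_sub <- aggregate(mydf[c("X1", "X2", "X3")],
                     by = mydf[c("A", "B")], FUN = sum)
print(res_sub)
```

Both calls still go through the data.frame method, so groups containing an NA still sum to NA.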

Now, stop and make note of any differences.

The two that pop into my mind are:

  1. The formula method does a nicer job of preserving names, but it doesn't let you control the names directly in your command, which you can do in the data.frame method:

    aggregate(cbind(NewX1 = mydf$X1, NewX2 = mydf$X2, NewX3 = mydf$X3), 
              by = list(NewA = mydf$A, NewB = mydf$B), sum)
    

  2. The formula method and the data.frame method treat NA values differently. To get the same result with the formula method as you do with the data.frame method, you need to use na.action = na.pass.

    aggregate(. ~ A + B, mydf, sum, na.action=na.pass)
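
To see where the NA difference comes from, here is a small sketch, assuming the sample data is entered by hand from the printout. The formula method defaults to na.action = na.omit, which drops whole rows before FUN runs. Passing na.rm = TRUE through to sum() is an extra variation not used in the answer above, and the names res_omit, res_pass, and res_narm are illustrative.

```r
# Sample data entered by hand from the printout above (an assumption, so the
# result does not depend on the random number generator).
mydf <- data.frame(
  A  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  B  = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
  X1 = c(27, 38, 58, 91, 21, 90, 95, 67, 63, NA, 21, 18),
  X2 = c(69, NA, 77, 50, 72, 100, 39, 78, 94, 22, 66, 13),
  X3 = c(27, 39, 2, 39, 87, 35, 49, 60, NA, 19, 83, 67)
)

# Formula method default (na.action = na.omit): every row with an NA in any
# variable used is dropped before FUN is applied.
res_omit <- aggregate(. ~ A + B, data = mydf, FUN = sum)

# na.pass keeps all rows, so sum() sees the NAs and returns NA for those
# groups, matching the data.frame method.
res_pass <- aggregate(. ~ A + B, data = mydf, FUN = sum, na.action = na.pass)

# Variation: keep all rows and let sum() skip NAs column by column.
res_narm <- aggregate(. ~ A + B, data = mydf, FUN = sum,
                      na.action = na.pass, na.rm = TRUE)

print(res_omit)
print(res_pass)
print(res_narm)
```

In the first group, X1 is 85 with the default but 123 with na.rm = TRUE, because the default discards row 2 entirely (its X2 is NA) instead of skipping only the missing cell.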
    

Again, it is not entirely wrong to say "I don't think it really matters", and I'm not going to state my preference here since that's not really what Stack Overflow is about, but it is important to always read the function documentation carefully before making such decisions.
