Difference between tilde and "by" while using aggregate function in R
Every time I aggregate a data.frame, I default to using the by = list(...) argument. But I do see solutions on Stack Overflow and elsewhere where a tilde (~) is used in the formula argument instead. I tend to think of the by argument as the "pivot" around these variables.
In some cases, the output is exactly the same. For example:
aggregate(cbind(df$A, df$B, df$C), FUN = sum, by = list("x" = df$D, "y" = df$E))
and
aggregate(cbind(df$A, df$B, df$C) ~ df$E, FUN = sum)
What is the difference between the two and when do you use which?
I would not entirely disagree that it doesn't really matter which approach you use. However, it is important to note that the two do behave differently.
I'll illustrate with a small example.
Here's some sample data:
set.seed(1)
mydf <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
                   B = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
                   matrix(sample(100, 36, replace = TRUE), nrow = 12))
mydf[3:5] <- lapply(mydf[3:5], function(x) {
  x[sample(nrow(mydf), 1)] <- NA
  x
})
mydf
# A B X1 X2 X3
# 1 1 A 27 69 27
# 2 1 A 38 NA 39
# 3 1 A 58 77 2
# 4 2 A 91 50 39
# 5 2 A 21 72 87
# 6 3 B 90 100 35
# 7 3 B 95 39 49
# 8 3 B 67 78 60
# 9 3 B 63 94 NA
# 10 4 B NA 22 19
# 11 4 B 21 66 83
# 12 4 B 18 13 67
First, the formula interface. The following three commands will all yield the same output.
aggregate(cbind(X1, X2, X3) ~ A + B, mydf, sum)
aggregate(cbind(X1, X2, X3) ~ ., mydf, sum)
aggregate(. ~ A + B, mydf, sum)
# A B X1 X2 X3
# 1 1 A 85 146 29
# 2 2 A 112 122 126
# 3 3 B 252 217 144
# 4 4 B 39 79 150
Here's a related command for the "by" interface. It is pretty cumbersome to type (but that can be addressed by using with(), if required).
aggregate(cbind(mydf$X1, mydf$X2, mydf$X3),
          by = list(mydf$A, mydf$B), sum)
# Group.1 Group.2 V1 V2 V3
# 1 1 A 123 NA 68
# 2 2 A 112 122 126
# 3 3 B 315 311 NA
# 4 4 B NA 101 169
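As a rough sketch of the with() shortcut mentioned above (not part of the original answer), the same "by" aggregation can be written without the repeated mydf$ prefixes. The data frame here is hard-coded from the printed mydf so the example does not depend on the random number generator, and the names res_with and res_cols are made up for illustration. Note that both variants also keep the column names, unlike the command above.

```r
# Sample data copied from the printed mydf above (hard-coded on purpose,
# so the result does not depend on set.seed() or the R version).
mydf <- data.frame(
  A  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  B  = rep(c("A", "B"), times = c(5, 7)),
  X1 = c(27, 38, 58, 91, 21, 90, 95, 67, 63, NA, 21, 18),
  X2 = c(69, NA, 77, 50, 72, 100, 39, 78, 94, 22, 66, 13),
  X3 = c(27, 39, 2, 39, 87, 35, 49, 60, NA, 19, 83, 67)
)

# with() evaluates the call inside mydf, so columns can be named directly.
# cbind(X1, X2, X3) picks up the column names from the symbols.
res_with <- with(mydf,
                 aggregate(cbind(X1, X2, X3), by = list(A = A, B = B), FUN = sum))

# Alternative: pass sub-data-frames. A data.frame is a list, so it is a valid `by`.
res_cols <- aggregate(mydf[c("X1", "X2", "X3")], by = mydf[c("A", "B")], FUN = sum)

res_with
res_cols
```

Both calls still go through the "by" (data.frame) method, so the NA handling is the same as in the command above; only the typing and the resulting names change.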
Now, stop and make note of any differences.
The two that pop into my mind are:
- The formula method does a nicer job of preserving names, but it doesn't let you control the names directly in your command, which you can do in the data.frame method:
aggregate(cbind(NewX1 = mydf$X1, NewX2 = mydf$X2, NewX3 = mydf$X3),
          by = list(NewA = mydf$A, NewB = mydf$B), sum)
- The formula method and the data.frame method treat NA values differently. To get the same result with the formula method as you do with the data.frame method, you need to use na.action = na.pass:
aggregate(. ~ A + B, mydf, sum, na.action = na.pass)
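To make the NA difference concrete, here is a minimal sketch (an addition, not from the original answer). It assumes the same values as the printed mydf, hard-coded for reproducibility. The formula method's default na.action = na.omit drops every row that contains an NA before summing, whereas forwarding na.rm = TRUE to sum() skips only the individual NA cells. The use of na.rm = TRUE and the object names are illustrative choices.

```r
# Sample data copied from the printed mydf above (hard-coded on purpose).
mydf <- data.frame(
  A  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  B  = rep(c("A", "B"), times = c(5, 7)),
  X1 = c(27, 38, 58, 91, 21, 90, 95, 67, 63, NA, 21, 18),
  X2 = c(69, NA, 77, 50, 72, 100, 39, 78, 94, 22, 66, 13),
  X3 = c(27, 39, 2, 39, 87, 35, 49, 60, NA, 19, 83, 67)
)

# Formula method, default na.action = na.omit:
# rows 2, 9 and 10 are removed entirely before sum() is applied.
f_default <- aggregate(cbind(X1, X2, X3) ~ A + B, mydf, sum)

# "by" method with na.rm = TRUE forwarded to sum():
# all rows are kept and only the NA cells are skipped.
by_narm <- aggregate(mydf[c("X1", "X2", "X3")], by = mydf[c("A", "B")],
                     FUN = sum, na.rm = TRUE)

# Formula method made to behave the same way:
# na.pass keeps the rows, na.rm = TRUE lets sum() ignore the NA cells.
f_narm <- aggregate(cbind(X1, X2, X3) ~ A + B, mydf, sum,
                    na.rm = TRUE, na.action = na.pass)

f_default$X1  # 85 112 252 39
by_narm$X1    # 123 112 315 39
f_narm$X1     # 123 112 315 39
```

For group A = 1, the default formula call reports X1 = 85 because row 2 is discarded for its missing X2, even though its X1 value (38) is present. Whether that row-wise removal is what you want depends on the analysis.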
Again, it is not entirely wrong to say "I don't think it really matters", and I'm not going to state my preference here since that's not really what Stack Overflow is about, but it is important to always read the function documentation carefully before making such decisions.