Group by multiple columns in dplyr, using string vector input
Question
I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
# plyr - works
ddply(data, columns, summarize, value=mean(value))
# dplyr - raises error
data %.%
group_by(columns) %.%
summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds
What am I missing to translate the plyr example into a dplyr-esque syntax?
Edit 2017: dplyr has been updated, so a simpler solution is available. See the currently selected answer.
Recommended answer
Since this question was posted, dplyr has added scoped versions of group_by, such as group_by_at (see the dplyr documentation). These let you use the same selection helpers you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
# compare with plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped by the remaining column (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
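The leftover grouping described above can be checked directly. The following is a minimal sketch, not part of the original answer: it rebuilds the example data (the set.seed call is an added assumption, used only to make the sketch reproducible) and inspects the grouping variables before and after ungroup().

```r
library(dplyr)

set.seed(1)  # assumption: not in the original, added for reproducibility

data <- data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace = TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)

# columns to group by, held as a character vector
columns <- names(data)[-3]

# summarize drops only the last grouping level,
# so the result is still grouped by the first column
grouped <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

group_vars(grouped)

# ungroup() removes the remaining grouping
flat <- grouped %>% ungroup()

group_vars(flat)
```

Calling group_vars() on each result shows which columns, if any, later verbs such as mutate or summarize would still operate within.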