Group by multiple columns in dplyr, using string vector input

Question

I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.

# make data with weird column names that can't be hard coded
data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

# plyr - works
ddply(data, columns, summarize, value=mean(value))

# dplyr - raises error
data %.%
  group_by(columns) %.%
  summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds

What am I missing to translate the plyr example into a dplyr-esque syntax?

Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.

Recommended answer

Since this question was posted, dplyr has added scoped versions of group_by, such as group_by_at (see the dplyr documentation for the scoped variants). These let you pick grouping columns with the same selection helpers you would use with select, like so:

data = data.frame(
    asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

# compare with plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE 
##  27 

The output from your example question is as expected (see comparison to plyr above and output below):

# A tibble: 9 x 3
# Groups:   asihckhdoydkhxiydfgfTgdsx [?]
  asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja       Value
                     <fctr>                    <fctr>       <dbl>
1                         A                         A  0.04095002
2                         A                         B  0.24943935
3                         A                         C -0.25783892
4                         B                         A  0.15161805
5                         B                         B  0.27189974
6                         B                         C  0.20858897
7                         C                         A  0.19502221
8                         C                         B  0.56837548
9                         C                         C -0.22682998
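
For readers on a newer dplyr: the scoped verbs such as group_by_at were later marked as superseded, and the same selection from a character vector can be written with across() and all_of(). The following is a minimal sketch, assuming dplyr 1.0.0 or later is installed; set.seed(1) and the names df_across and df_at are illustrative additions that do not appear in the original post.

```r
library(dplyr)

set.seed(1)  # hypothetical seed, only so the sketch is reproducible
data <- data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace = TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)

# character vector holding the names of the grouping columns
columns <- names(data)[-3]

# all_of() selects the columns named in the character vector;
# across() passes that selection to group_by() as grouping columns
df_across <- data %>%
  group_by(across(all_of(columns))) %>%
  summarise(Value = mean(value))

# the scoped form from the answer, for comparison
df_at <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarise(Value = mean(value))

# compare the two results after dropping the tibble/grouping attributes
all.equal(as.data.frame(df_across), as.data.frame(df_at))
```

Unlike one_of(), all_of() raises an error when a name in the vector is missing from the data, which may be preferable when the column names are generated programmatically.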

Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped by the first column (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
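
The leftover grouping can be inspected directly with group_vars(). This is a small sketch on a hypothetical toy data frame; the names toy, g1, g2 and summarised are illustrative and not part of the original question.

```r
library(dplyr)

# hypothetical toy data; toy, g1 and g2 are illustrative names
toy <- data.frame(
  g1 = c("A", "A", "B", "B"),
  g2 = c("x", "y", "x", "y"),
  value = c(1, 2, 3, 4)
)
columns <- c("g1", "g2")

# group by both columns, then summarise once
summarised <- toy %>%
  group_by_at(vars(one_of(columns))) %>%
  summarise(Value = mean(value))

# summarise() removed only the last grouping layer (g2), so g1 remains
group_vars(summarised)

# after ungroup() no grouping variables remain
group_vars(ungroup(summarised))
```

Calling ungroup() explicitly keeps later steps such as mutate() or another summarise() from silently operating per group.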
