R dplyr's group_by: consider empty groups as well


Problem Description

Let's consider the following data frame:

set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

The contingency table is as follows:

cont_tab <- table(data$col1, data$col2, dnn = c("col1", "col2"))

cont_tab

    col2
col1 A B C
   A 4 0 0
   B 1 3 0
   C 1 0 3

As you can see, some pairs didn't occur: (A,B), (A,C), (B,C), (C,B). The end goal of my analysis is to list all of the pairs (9 in this case) and show a statistic for each of them. While using the dplyr::group_by() function I hit a limitation: dplyr::group_by() considers only existing pairs (pairs that occurred at least once):

data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1))

# A tibble: 5 x 3
# Groups:   col1 [?]
  col1  col2   stat
  <fct> <fct> <dbl>
1 A     A      58.1
2 B     A     -16.4
3 B     B      17.0
4 C     A     -12.9
5 C     C     -41.9

The output I have in mind has 9 rows (4 of which have stat equal to 0). Is this doable in dplyr?

EDIT: Sorry for being too vague at the beginning. The real problem is more complex than counting how many times a particular pair occurs. I added the new data to make the real problem more visible.

Solution

It is much easier to get the same result as table by adding spread from tidyr:

library(dplyr)
library(tidyr)
count(data, col1, col2) %>% 
      spread(col2, n, fill = 0)
# A tibble: 3 x 4
# Groups:   col1 [3]
#  col1      A     B     C
#  <fct> <dbl> <dbl> <dbl>
#1 A         4     0     0
#2 B         1     3     0
#3 C         1     0     3

NOTE: The group_by/summarise step is replaced by count here.
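
For reference, a minimal sketch of that equivalence, using the data frame defined above: count(data, col1, col2) is shorthand for grouping by the two columns and tallying, so both pipelines below should give the same counts per observed pair (grouping metadata may differ between dplyr versions).

library(dplyr)

# count() does the grouping and tallying in one call
counted <- count(data, col1, col2)

# equivalent explicit version with group_by/summarise
tallied <- data %>%
   group_by(col1, col2) %>%
   summarise(n = n())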

As @divibisan suggested, if the OP wants the long format, add gather at the end:

data %>%
   group_by(col1, col2) %>%
   summarize(stat = n()) %>%
   spread(col2, stat, fill = 0) %>%
   gather(col2, stat, A:C)
# A tibble: 9 x 3
# Groups:   col1 [3]
#  col1  col2   stat
#  <fct> <chr> <dbl>
#1 A     A         4
#2 B     A         1
#3 C     A         1
#4 A     B         0
#5 B     B         3
#6 C     B         0
#7 A     C         0
#8 B     C         0
#9 C     C         3
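
One small side effect visible in the output above: gather returns the key column as character, so col2 comes back as <chr> rather than <fct>. If the factor type matters downstream, a mutate step (a sketch, reusing the levels of the original column) restores it:

data %>%
   group_by(col1, col2) %>%
   summarize(stat = n()) %>%
   spread(col2, stat, fill = 0) %>%
   gather(col2, stat, A:C) %>%
   # restore col2 as a factor with the original levels
   mutate(col2 = factor(col2, levels = levels(data$col2)))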

Update

With the updated data in the OP's post:

data %>%
   group_by(col1, col2) %>%
   summarize(stat = sum(val2) - sum(val1)) %>% 
   spread(col2, stat, fill = 0)  %>% 
   gather(col2, stat, -1)
# A tibble: 9 x 3
# Groups:   col1 [3]
#  col1  col2    stat
#  <fct> <chr>  <dbl>
#1 A     A       7.76
#2 B     A     -20.8 
#3 C     A       6.97
#4 A     B       0   
#5 B     B      28.8 
#6 C     B       0   
#7 A     C       0   
#8 B     C       0   
#9 C     C       9.56
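
In more recent tidyr releases (1.0.0 and later) spread and gather are superseded by pivot_wider and pivot_longer. A rough equivalent of the update above, assuming tidyr >= 1.1 so that values_fill accepts a single scalar, might look like this:

library(dplyr)
library(tidyr)

data %>%
   group_by(col1, col2) %>%
   summarize(stat = sum(val2) - sum(val1)) %>%
   # widen so the missing (col1, col2) pairs get an explicit 0 ...
   pivot_wider(names_from = col2, values_from = stat, values_fill = 0) %>%
   # ... then lengthen back to one row per pair
   pivot_longer(-col1, names_to = "col2", values_to = "stat")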
