R dplyr's group_by: consider empty groups as well
Let's consider the following data frame:
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
val1 = 1:12,
val2 = rnorm(12, 10, 15))
The contingency table is as follows:
cont_tab <- table(data$col1, data$col2, dnn = c("col1", "col2"))
cont_tab
col2
col1 A B C
A 4 0 0
B 1 3 0
C 1 0 3
As you can see, some pairs didn't occur: (A,B), (A,C), (B,C), (C,B). The end goal of my analysis is to list all of the pairs (in this case 9) and show a statistic for each of them. While using the dplyr::group_by() function I hit a limitation: dplyr::group_by() considers only existing pairs (pairs that occurred at least once):
data %>%
group_by(col1, col2) %>%
summarize(stat = sum(val2) - sum(val1))
# A tibble: 5 x 3
# Groups: col1 [?]
col1 col2 stat
<fct> <fct> <dbl>
1 A A 58.1
2 B A -16.4
3 B B 17.0
4 C A -12.9
5 C C -41.9
The output I have in mind has 9 rows (4 of which have stat equal to 0). Is it doable in dplyr?
EDIT: Sorry for being too vague at the beginning. The real problem is more complex than counting the number of times a particular pair occurs. I added the new data in order to make the real problem more visible.
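For completeness: since dplyr 0.8, group_by() has a .drop argument, and .drop = FALSE keeps empty factor-level combinations as groups, which addresses the question directly. A minimal sketch, assuming dplyr >= 0.8 is installed:

```r
library(dplyr)

# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# .drop = FALSE keeps all 9 factor-level combinations, even the empty ones;
# sum() over zero elements is 0, so empty groups get stat = 0
data %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarize(stat = sum(val2) - sum(val1))
```

This avoids the spread/gather round-trip entirely, but it only works when both grouping columns are factors carrying all the levels of interest.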
It is much easier to add spread from tidyr to get the same result as with table:
library(dplyr)
library(tidyr)
count(data, col1, col2) %>%
spread(col2, n, fill = 0)
# A tibble: 3 x 4
# Groups: col1 [3]
# col1 A B C
# <fct> <dbl> <dbl> <dbl>
#1 A 4 0 0
#2 B 1 3 0
#3 C 1 0 3
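Note that spread() has since been superseded in tidyr; with a recent tidyr (>= 1.0, and >= 1.1 for a scalar values_fill) the same wide result can be written with pivot_wider(). A sketch under that assumption:

```r
library(dplyr)
library(tidyr)

# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# pivot_wider() is the modern replacement for spread();
# values_fill fills the missing pairs with 0
count(data, col1, col2) %>%
  pivot_wider(names_from = col2, values_from = n, values_fill = 0)
```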
NOTE: The group_by/summarise step is changed to count here.
As @divibisan suggested, if the OP wants long format, then add gather at the end:
data %>%
group_by(col1, col2) %>%
summarize(stat = n()) %>%
spread(col2, stat, fill = 0) %>%
gather(col2, stat, A:C)
# A tibble: 9 x 3
# Groups: col1 [3]
# col1 col2 stat
# <fct> <chr> <dbl>
#1 A A 4
#2 B A 1
#3 C A 1
#4 A B 0
#5 B B 3
#6 C B 0
#7 A C 0
#8 B C 0
#9 C C 3
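The spread/gather round-trip above can also be replaced by tidyr::complete(), which expands a data frame to every combination of the given columns' factor levels and fills the missing rows, yielding the long format in one step. A sketch, assuming tidyr is available:

```r
library(dplyr)
library(tidyr)

# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# complete() adds the missing (col1, col2) pairs,
# filling their count n with 0
data %>%
  count(col1, col2) %>%
  complete(col1, col2, fill = list(n = 0))
```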
Update
With the updated data in OP's post
data %>%
group_by(col1, col2) %>%
summarize(stat = sum(val2) - sum(val1)) %>%
spread(col2, stat, fill = 0) %>%
gather(col2, stat, -1)
# A tibble: 9 x 3
# Groups: col1 [3]
# col1 col2 stat
# <fct> <chr> <dbl>
#1 A A 7.76
#2 B A -20.8
#3 C A 6.97
#4 A B 0
#5 B B 28.8
#6 C B 0
#7 A C 0
#8 B C 0
#9 C C 9.56
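The same complete() idea applies to the updated data: summarize first, then ungroup and expand to all pairs, filling stat with 0 for the combinations that never occurred. A sketch, assuming tidyr is available:

```r
library(dplyr)
library(tidyr)

# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# Summarize, then complete() adds the 4 missing pairs with stat = 0;
# ungroup() first so complete() expands over both columns, not within groups
data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1)) %>%
  ungroup() %>%
  complete(col1, col2, fill = list(stat = 0))
```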