Counting the result of a left join using dplyr
What is the proper way to count the result of a left outer join using dplyr?
Consider the two data frames:
a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )
a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:
a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]
id refs
(dbl) (int)
1 1 2
2 2 1
3 3 3
4 4 1
However, the result is misleading because it indicates that ID 2 was referenced once when, in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.
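To make the problem concrete, here is a small sketch that inspects the intermediate join result. It reuses the a and b data frames from the question; the variable name joined is only an illustrative choice.

```r
library(dplyr)

a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c("a", "b", "c", "d", "e", "f"))

# The left join keeps every row of a, so ID 2 survives as a single row
# whose ref_id is NA because b has no matching record.
joined <- left_join(a, b, by = "id")
print(joined)

# n() counts rows per group, so this NA placeholder row is what gets
# reported as one reference for ID 2.
print(filter(joined, id == 2))
```

The join result has seven rows: the six matches from b plus the one unmatched row for ID 2.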
With data.table, you can do
library(data.table)
setDT(a); setDT(b)
b[a, .N, on="id", by=.EACHI]
id N
1: 1 2
2: 2 0
3: 3 3
4: 4 1
Here, the syntax is x[i, j, on, by=.EACHI].
- .EACHI refers to each row of i=a.
- j=.N uses the special variable .N, which holds the number of rows in each group.
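For readers who would rather stay within dplyr, as the question asks, one possible approach is to count the non-missing ref_id values instead of counting rows. This is a sketch rather than the answer's method, and it assumes that ref_id is never legitimately NA in b, so an NA after the join can only mean "no match". The result name refs is an illustrative choice.

```r
library(dplyr)

a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c("a", "b", "c", "d", "e", "f"))

# Assumption: an NA in ref_id after the join always marks an unmatched ID.
# Summing the logical vector counts only rows that came from a real match.
refs <- a %>%
  left_join(b, by = "id") %>%
  group_by(id) %>%
  summarise(refs = sum(!is.na(ref_id)))

print(refs)  # refs is 2, 0, 3, 1 for ids 1 to 4
```

If b could itself contain missing ref_id values, counting a column that is guaranteed to be non-missing in b would be a safer choice.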