Counting the result of a left join using dplyr


Question

What is the proper way to count the result of a left outer join using dplyr?

Consider the two data frames:

a <- data.frame( id=c( 1, 2, 3, 4 ) )
b <- data.frame( id=c( 1, 1, 3, 3, 3, 4 ), ref_id=c( 'a', 'b', 'c', 'd', 'e', 'f' ) )

a specifies four different IDs. b specifies six records that reference IDs in a. If I want to see how many times each ID is referenced, I might try this:

library(dplyr)

a %>% left_join( b, by='id' ) %>% group_by( id ) %>% summarise( refs=n() )
Source: local data frame [4 x 2]

     id  refs
  (dbl) (int)
1     1     2
2     2     1
3     3     3
4     4     1

However, the result is misleading because it indicates that ID 2 was referenced once when in reality, it was never referenced (in the intermediate data frame, ref_id was NA for ID 2). I would like to avoid introducing a separate library such as sqldf.
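To make the problem concrete, here is a minimal sketch that inspects the intermediate data frame produced by the join. The variable names `joined` and `unmatched` are introduced here for illustration only and are not part of the original question.

```r
library(dplyr)

a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c("a", "b", "c", "d", "e", "f"))

# The left join keeps every row of a; ids with no match in b
# still get one row, with ref_id filled in as NA.
joined <- left_join(a, b, by = "id")
print(joined)

# Rows that exist only because of the left join (no match in b).
unmatched <- joined[is.na(joined$ref_id), ]
print(unmatched)
```

Because the unmatched ID still occupies one row in `joined`, `n()` counts it as one reference.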

Solution

With data.table, you can do

library(data.table)
setDT(a); setDT(b)

b[a, .N, on="id", by=.EACHI]


   id N
1:  1 2
2:  2 0
3:  3 3
4:  4 1

Here, the syntax is x[i, j, on, by=.EACHI].

  • .EACHI refers to each row of i=a.
  • j=.N uses the special variable .N, the number of rows of x=b matched by each row of i (0 when there is no match).
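Since the question asked for dplyr specifically, one possible dplyr-only alternative is to count the non-missing `ref_id` values per group instead of counting rows. This is a sketch, not part of the original answer, and it assumes that `ref_id` is never `NA` in `b` itself, so that an `NA` after the join can only mean "no match". The name `counts` is introduced here for illustration.

```r
library(dplyr)

a <- data.frame(id = c(1, 2, 3, 4))
b <- data.frame(id = c(1, 1, 3, 3, 3, 4),
                ref_id = c("a", "b", "c", "d", "e", "f"))

# Count only rows where the join actually found a match in b.
# Assumption: ref_id has no NA values in b, so NA marks an unmatched id.
counts <- a %>%
  left_join(b, by = "id") %>%
  group_by(id) %>%
  summarise(refs = sum(!is.na(ref_id)))

print(counts)
```

With the example data, ID 2 ends up with `refs` equal to 0 rather than 1, matching the data.table result above.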
