dplyr 0.3 cannot inner_join data.table?


Problem description


I have the following setup, with dplyr (0.3) and data.table (1.9.3) loaded.

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.3 dplyr_0.3       

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      magrittr_1.0.1 parallel_3.1.1 plyr_1.8.1     Rcpp_0.11.2   
[7] reshape2_1.4   stringr_0.6.2  tools_3.1.1 

Here are the datasets: two data.tables and two data.frames. Both sets have the same contents.

DT_1 = data.table(x = rep(c("a","b","c"), each = 3), y = c(1,3,6), v = 1:9)
DT_2 = data.table(V1 = c("b","c"),foo = c(4,2))

DT_1_df = data.frame(x = rep(c("a","b","c"), each = 3), y = c(1,3,6), v = 1:9)
DT_2_df = data.frame(V1 = c("b","c"),foo = c(4,2))

data.table way

Doing an inner join on the two data.tables the data.table way gives the following result:

setkey(DT_1, x); setkey(DT_2, V1)
DT_1[DT_2]
  x y v foo
1: b 1 4   4
2: b 3 5   4
3: b 6 6   4
4: c 1 7   2
5: c 3 8   2
6: c 6 9   2

dplyr0.3 inner_join on data.tables

Using dplyr's inner_join on the two data.tables gives an error:

inner_join(DT_1, DT_2, by=("x"="V1"))
Error in setkeyv(x, by$x) : some columns are not in the data.table: V1
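
A likely reason this call reports a different error from the others: without `c`, `("x"="V1")` is not a named vector. R appears to treat it as a parenthesised assignment, so `by` would receive the plain string "V1", which matches the column named in the setkeyv message. A minimal base R sketch of the difference (the names `by_plain` and `by_named` are illustrative only, not from the original post):

```r
# Without c(): a parenthesised assignment. It creates a variable x holding "V1"
# and returns that value, so the result is an unnamed string.
by_plain <- ("x" = "V1")

# With c(): a named character vector, the form used in the other calls.
by_named <- c("x" = "V1")

print(by_plain)         # unnamed character value
print(names(by_named))  # name of the single element
```

If that reading is right, the first call was effectively asking for a join on a column called V1 in both tables, which DT_1 does not have.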

dplyr0.3 inner_join on data.frame & data.table

A different error occurs when joining a data.table with a data.frame:

inner_join(DT_1, DT_2_df, by = c("x" = "V1"))
Error: Data table joins must be on same key

dplyr0.3 inner_join on data.frames

inner_join, however, works fine on data.frames:

inner_join(DT_1_df, DT_2_df, by = c("x" = "V1"))
  x y v foo
1 b 1 4   4
2 b 3 5   4
3 b 6 6   4
4 c 1 7   2
5 c 3 8   2
6 c 6 9   2

Can anyone explain why this happens?

Solution

For completeness, I am posting my findings here.

After checking https://github.com/hadley/dplyr, it seems that dplyr's join functions are limited at the moment. To quote: "Currently join variables must be the same in both the left-hand and right-hand sides." The test below seems to confirm this:

library(dplyr); library(data.table)
DT_1 = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT_2 = data.table(V1=c("b","c"),foo=c(4,2)) # note the variable name assigned to first column
DT_2b = data.table(x=c("b","c"),foo=c(4,2)) # note the variable name assigned to first column

inner_join(DT_1, DT_2b, by= "x")
Source: local data table [6 x 4]
  x y v foo
1 b 1 4   4
2 b 3 5   4
3 b 6 6   4
4 c 1 7   2
5 c 3 8   2
6 c 6 9   2

inner_join(DT_1, DT_2, by = c("x" = "V1"))
Error: Data table joins must be on same key
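
Given that restriction, one possible workaround is to rename the key column on a copy of the second table so that both sides share the name, then join on it. This is only a sketch and not something taken from the dplyr documentation. `DT_2x` is a hypothetical name, and the code assumes data.table's `copy()` and `setnames()` and dplyr's `inner_join()` are available.

```r
library(data.table)
library(dplyr)

DT_1 <- data.table(x = rep(c("a", "b", "c"), each = 3),
                   y = rep(c(1, 3, 6), 3),
                   v = 1:9)
DT_2 <- data.table(V1 = c("b", "c"), foo = c(4, 2))

# setnames() modifies a data.table by reference, so work on a copy
# to leave DT_2 itself unchanged.
DT_2x <- copy(DT_2)
setnames(DT_2x, "V1", "x")

# Both tables now have a column named x, so a plain by = "x" is enough.
res <- inner_join(DT_1, DT_2x, by = "x")
print(res)
```

The copy costs some memory on large tables. If changing DT_2 in place is acceptable, calling `setnames()` on it directly would avoid that.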
