Why is allow.cartesian required at times when joining data.tables with duplicate keys?


Question




I am trying to understand the logic of J() lookup when there're duplicate keys in a data.table in R.

Here's a little experiment I have tried:

library(data.table)
options(stringsAsFactors = FALSE)

x <- data.table(keyVar = c("a", "b", "c", "c"),
            value  = c(  1,   2,   3,   4))
setkey(x, keyVar)

y1 <- data.frame(name = c("d", "c", "a"))
x[J(y1$name), ]
## OK

y2 <- data.frame(name = c("d", "c", "a", "b"))
x[J(y2$name), ]
## Error: see below

x2 <- data.table(keyVar = c("a", "b", "c"),
                 value  = c(  1,   2,   3))
setkey(x2, keyVar)
x2[J(y2$name), ]
## OK

The error message I am getting is:

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  :
Join results in 5 rows; more than 4 = max(nrow(x),nrow(i)). Check for duplicate key
values in i, each of which join to the same group in x over and over again. If that's
ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group
to avoid the large allocation. If you are sure you wish to proceed, rerun with 
allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, 
Stack Overflow and datatable-help for advice.

I don't really understand this. I know I should avoid duplicate keys in a lookup function, I just want to gain some insight so I won't make any error in the future.

Thanks a ton for help. This is a great tool.

Solution

You don't have to avoid duplicate keys. As long as the result does not get bigger than max(nrow(x), nrow(i)), you won't get this error, even if you have duplicates. It is basically a precautionary measure.
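To make the threshold concrete, here is a small sketch using the question's data. It passes allow.cartesian = TRUE only so that the rows can be counted whether or not the check would fire; the names n1, n2, limit1 and limit2 are mine, not from the original post. The limit used is the one quoted in the error message above, and other data.table versions may apply a different one.

```r
library(data.table)

x <- data.table(keyVar = c("a", "b", "c", "c"),
                value  = c(  1,   2,   3,   4))
setkey(x, keyVar)

keys1 <- c("d", "c", "a")        # y1$name in the question
keys2 <- c("d", "c", "a", "b")   # y2$name in the question

# allow.cartesian = TRUE disables the size check, so we can simply count rows.
n1 <- nrow(x[J(keys1), allow.cartesian = TRUE])  # d (NA row) + c, c + a
n2 <- nrow(x[J(keys2), allow.cartesian = TRUE])  # d (NA row) + c, c + a + b

# Limit quoted in the error message: max(nrow(x), nrow(i))
limit1 <- max(nrow(x), length(keys1))
limit2 <- max(nrow(x), length(keys2))

n1 <= limit1   # TRUE: 4 rows, limit 4, so the first lookup is accepted
n2 >  limit2   # TRUE: 5 rows, limit 4, so the second lookup is rejected
```

The unmatched key "d" still contributes one row (filled with NA), which is why the second lookup ends up one row over the limit.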

When you have duplicate keys, the resulting join can sometimes get much bigger. Since data.table knows the total number of rows that will result from the join before it allocates the result, it raises this error and asks you to pass allow.cartesian=TRUE if you're really sure.
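The row count can be worked out from group sizes alone, without building the result. The sketch below illustrates that idea in plain R; it is not data.table's internal code, and the names lookup, grp, n_match and predicted are hypothetical.

```r
library(data.table)

x <- data.table(keyVar = c("a", "b", "c", "c"),
                value  = c(  1,   2,   3,   4))
setkey(x, keyVar)

lookup <- c("c", "c", "z")   # note the duplicated key "c"

# Size of each key group in x
grp <- x[, .N, by = keyVar]

# Number of x rows matched by each lookup value (NA where there is no match)
n_match <- grp$N[match(lookup, grp$keyVar)]

# Each unmatched lookup value still yields one NA row in the default join
predicted <- sum(ifelse(is.na(n_match), 1L, n_match))

# Compare with the size of the join actually performed
actual <- nrow(x[J(lookup), allow.cartesian = TRUE])
predicted == actual   # TRUE
```

Each duplicated lookup value pulls in its whole group again, so the total grows with both the number of duplicates and the group size.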

Here's an (exaggerated) example that illustrates the idea behind this error message:

require(data.table)
DT1 <- data.table(x=rep(letters[1:2], c(1e2, 1e7)), 
                  y=1L, key="x")
DT2 <- data.table(x=rep("b", 3), key="x")

# not run
# DT1[DT2] ## error

dim(DT1[DT2, allow.cartesian=TRUE])
# [1] 30000000        2

The duplicates in DT2 resulted in 3 times the total number of "b" rows in DT1 (1e7), that is, 3e7 rows. Imagine if you performed the join with 1e4 duplicated values in DT2: the result would explode! To guard against this, there's the allow.cartesian argument, which is FALSE by default.
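The error message also points to an alternative: supply j and let it run once per row of i, so the full cross product is never allocated. The message calls this "by-without-by"; in data.table 1.9.4 and later it is written by = .EACHI. The sketch below assumes such a version and uses a scaled-down copy of the example above; the column name total is mine.

```r
library(data.table)

# Scaled-down version of the example: 2 "a" rows and 5 "b" rows
DT1 <- data.table(x = rep(c("a", "b"), c(2, 5)), y = 1L, key = "x")
DT2 <- data.table(x = rep("b", 3), key = "x")

# The full join would materialise 3 * 5 = 15 rows
full <- DT1[DT2, allow.cartesian = TRUE]
nrow(full)

# Aggregating per row of DT2 returns one row per lookup value instead
res <- DT1[DT2, .(total = sum(y)), by = .EACHI]
res
```

This only helps when an aggregate per lookup row is what you want; if you really need every combined row, allow.cartesian = TRUE is still the way to ask for it.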

That being said, I think Matt once mentioned that it may be possible to raise the error only for "large" joins (that is, joins that result in a huge number of rows, with a threshold that would probably be set arbitrarily). If and when that is implemented, joins that don't combinatorially explode will run without this error message.
