Why is allow.cartesian required at times when joining data.tables with duplicate keys?
Question
I am trying to understand the logic of J() lookup when there are duplicate keys in a data.table in R.
Here's a little experiment I have tried:
    library(data.table)
    options(stringsAsFactors = FALSE)

    x <- data.table(keyVar = c("a", "b", "c", "c"),
                    value  = c(  1,   2,   3,   4))
    setkey(x, keyVar)

    y1 <- data.frame(name = c("d", "c", "a"))
    x[J(y1$name), ]
    ## OK

    y2 <- data.frame(name = c("d", "c", "a", "b"))
    x[J(y2$name), ]
    ## Error: see below

    x2 <- data.table(keyVar = c("a", "b", "c"),
                     value  = c(  1,   2,   3))
    setkey(x2, keyVar)
    x2[J(y2$name), ]
    ## OK
The error message I am getting is :
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in 5 rows; more than 4 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
I don't really understand this. I know I should avoid duplicate keys in a lookup function; I just want to gain some insight so I won't make this mistake in the future.
Thanks a ton for the help. This is a great tool.
Solution

You don't have to avoid duplicate keys. As long as the result does not get bigger than max(nrow(x), nrow(i)), you won't get this error, even if you have duplicates. It is basically a precautionary measure.

When you have duplicate keys, the resulting join can sometimes get much bigger. Since data.table knows the total number of rows that will result from this join early enough, it raises this error and asks you to pass the argument allow.cartesian=TRUE if you're really sure.

Here's an (exaggerated) example that illustrates the idea behind this error message:
    require(data.table)
    DT1 <- data.table(x = rep(letters[1:2], c(1e2, 1e7)),
                      y = 1L, key = "x")
    DT2 <- data.table(x = rep("b", 3), key = "x")

    # not run
    # DT1[DT2] ## error

    dim(DT1[DT2, allow.cartesian = TRUE])
    # [1] 30000000 2
The duplicates in DT2 produced 3 times the total number of "b" rows in DT1 (= 1e7), i.e. 3e7 rows. Imagine performing the join with 1e4 values in DT2: the result would explode! To guard against this, there's the allow.cartesian argument, which is FALSE by default.

That being said, I think Matt once mentioned that it may be possible to raise the error only for "large" joins (joins that result in a huge number of rows, with a threshold that might be set arbitrarily, I guess). This, when/if implemented, would let joins that don't combinatorially explode run without this error message.
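
Applied to the tables in the question, the row count behind the error can be tallied by hand. The sketch below is only an illustration: the names `lookup`, `matches`, `rows_per_value`, `total_rows` and `limit` are made up for this example, and the limit formula is taken from the error message quoted in the question (other data.table versions may apply a different threshold). The join is run with allow.cartesian = TRUE so that it completes regardless of the size check.

```r
library(data.table)

x <- data.table(keyVar = c("a", "b", "c", "c"),
                value  = c(  1,   2,   3,   4))
setkey(x, keyVar)

# The lookup values from y2$name in the question
lookup <- c("d", "c", "a", "b")

# Number of rows in x that match each lookup value
matches <- vapply(lookup, function(k) sum(x$keyVar == k), integer(1))

# With the default nomatch = NA, an unmatched value still yields one (NA) row
rows_per_value <- pmax(matches, 1L)
total_rows <- sum(rows_per_value)

# Limit as stated in the quoted error message (assumption: that version's rule)
limit <- max(nrow(x), length(lookup))

# Run the join with the size check switched off and compare
res <- x[J(lookup), allow.cartesian = TRUE]

total_rows   # "d" gives 1 NA row, "c" gives 2, "a" and "b" give 1 each
nrow(res)    # same count as total_rows
limit
```

Here the tally is 1 + 2 + 1 + 1 = 5 rows against a limit of 4, which matches the "5 rows; more than 4" in the message. The same tally for y1 (d, c, a) gives 1 + 2 + 1 = 4 rows, which does not exceed max(4, 3) = 4, so that join passes.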