Why is allow.cartesian required at times when joining data.tables with duplicate keys?

Question

I am trying to understand the logic of the J() lookup when there are duplicate keys in a data.table in R.

Here's a little experiment I tried:

library(data.table)
options(stringsAsFactors = FALSE)

x <- data.table(keyVar = c("a", "b", "c", "c"),
            value  = c(  1,   2,   3,   4))
setkey(x, keyVar)

y1 <- data.frame(name = c("d", "c", "a"))
x[J(y1$name), ]
## OK

y2 <- data.frame(name = c("d", "c", "a", "b"))
x[J(y2$name), ]
## Error: see below

x2 <- data.table(keyVar = c("a", "b", "c"),
                 value  = c(  1,   2,   3))
setkey(x2, keyVar)
x2[J(y2$name), ]
## OK

The error message I get is:

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  :
Join results in 5 rows; more than 4 = max(nrow(x),nrow(i)). Check for duplicate key
values in i, each of which join to the same group in x over and over again. If that's
ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group
to avoid the large allocation. If you are sure you wish to proceed, rerun with 
allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, 
Stack Overflow and datatable-help for advice.

I don't really understand this. I know I should avoid duplicate keys in a lookup function; I just want to gain some insight so I won't make any errors in the future.

Thanks a ton for the help. This is a great tool.

Recommended answer

You don't have to avoid duplicate keys. As long as the result does not get bigger than max(nrow(x), nrow(i)), you won't get this error, even if you have duplicates. It is basically a precautionary measure.
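As a rough sketch of that rule, the size of the join can be worked out by hand for the question's data and compared with the limit quoted in the error message. The names `i`, `counts`, `matches`, `joinRows` and `limit` are illustrative only, the `on=` argument assumes data.table 1.9.6 or later, and other data.table versions may apply a different threshold than the one quoted here.

```r
library(data.table)

# Same data as x and y2 in the question
x <- data.table(keyVar = c("a", "b", "c", "c"),
                value  = c(1, 2, 3, 4))
setkey(x, keyVar)
i <- data.table(keyVar = c("d", "c", "a", "b"))

# Number of rows in x for each key value
counts <- x[, .N, by = keyVar]

# Matches in x for each row of i; an unmatched row ("d") still
# produces one NA row in the join, so it counts as 1
matches <- counts[i, on = "keyVar"]$N
matches[is.na(matches)] <- 1L

joinRows <- sum(matches)             # 1 + 2 + 1 + 1 = 5
limit    <- max(nrow(x), nrow(i))    # max(4, 4)     = 4
joinRows > limit                     # TRUE, so the check is triggered

# With the check switched off, the join returns exactly joinRows rows
res <- x[J(i$keyVar), allow.cartesian = TRUE]
nrow(res)                            # 5
```

With y1 (three names) the same arithmetic gives 4 rows against a limit of 4, which is why that lookup goes through.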

When you have duplicate keys, the resulting join can sometimes get much bigger. Since data.table knows early enough the total number of rows the join will produce, it raises this error and asks you to pass allow.cartesian=TRUE if you're really sure.

Here's an (exaggerated) example that illustrates the idea behind this error message:

require(data.table)
DT1 <- data.table(x=rep(letters[1:2], c(1e2, 1e7)), 
                  y=1L, key="x")
DT2 <- data.table(x=rep("b", 3), key="x")

# not run
# DT1[DT2] ## error

dim(DT1[DT2, allow.cartesian=TRUE])
# [1] 30000000        2

The duplicates in DT2 result in 3 times the number of "b" rows in DT1 (1e7 of them), i.e. 3e7 rows. Imagine performing the join with 1e4 such values in DT2: the result would explode! To guard against this, there is the allow.cartesian argument, which is FALSE by default.
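The error message also suggests running `j` once per group instead of materialising the whole join. Here is a minimal sketch of that idea on a scaled-down copy of DT1 and DT2. The sizes and the names `full` and `perRow` are illustrative, and `by = .EACHI` assumes data.table 1.9.4 or later (older versions did this implicitly as "by-without-by").

```r
library(data.table)

# Scaled-down DT1/DT2: 2 "a" rows and 5 "b" rows instead of 1e2 and 1e7
DT1 <- data.table(x = rep(letters[1:2], c(2, 5)),
                  y = 1L, key = "x")
DT2 <- data.table(x = rep("b", 3), key = "x")

# Full join: each of the 3 "b" rows in DT2 matches all 5 "b" rows in DT1
full <- DT1[DT2, allow.cartesian = TRUE]
nrow(full)     # 15

# Aggregating per row of DT2 keeps the result at nrow(DT2) rows,
# so the large intermediate table is never allocated
perRow <- DT1[DT2, .(total = sum(y)), by = .EACHI]
nrow(perRow)   # 3
```

Whether this fits depends on the goal: it only helps when a per-group summary is what you want, not the expanded rows themselves.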

That being said, I think Matt once mentioned that it may be possible to raise the error only for "large" joins, meaning joins that produce a huge number of rows (the cutoff would presumably be set arbitrarily). If and when that is implemented, joins that don't combinatorially explode will go through without this error message.
