笛卡尔积与过滤器数据表 [英] Cartesian product with filter data.table
问题描述
我想用data.table调用替换由SQL生成的笛卡尔积。
我有很大的历史资产和值,我需要所有组合的一个子集。
我们假设我有一个T = [date,contract,value]的表a。在SQL中,它看起来像
I'm trying to replace Cartesian product produced by SQL by data.table call. I have large history with assets and values, and I need a subset of all combinations. Let's say that I have table a with T = [date, contract, value]. In SQL it looks like
SELECT a.date, a.contract, a.value, b.contract. b.value
FROM T a, T b
WHERE a.date = b.date AND a.contract <> b.contract AND a.value + b.value < 4
在RI中现在有以下
library(data.table)
n <- 1500
dt <- data.table(date = rep(seq(Sys.Date() - n+1, Sys.Date(), by = "1 day"), 3),
contract = c(rep("a", n), rep("b", n), rep("c", n)),
value = c(rep(1, n), rep(2, n), rep(3, n)))
setkey(dt, date)
dt[dt, allow.cartesian = TRUE][(contract != i.contract) & (value + i.value < 4)]
我相信我的解决方案首先创建所有组合在这种情况下为13,500行),然后过滤(至3000)。 SQL但是(我可能是错误的)加入子集,更重要的是不要将所有组合加载到RAM。任何想法如何使用data.table更高效?
I believe that my solution creates all combinations first (in this case 13,500 rows) and then filter (to 3000). SQL however (and I might be wrong) joining subset, and what is more important don't load all combinations into RAM. Any ideas how to use data.table more efficient?
推荐答案
使用 by = .EACHI 功能。 data.table
连接和子集即,加入只是另一个子集 - 使用 data.table
- 而不是通常的整数/逻辑/行名称。他们是以这种方式设计的,考虑到这些情况。
Use by=.EACHI feature. In data.table
joins and subsets are very closely linked; i.e., a join is just another subset - using data.table
- instead of the usual integer / logical / row names. They are designed this way with these cases in mind.
基于子集的连接允许在加入时合并 j-表达式
和分组操作。
Subset based joins allow to incorporate j-expressions
and grouping operations together while joining.
require(data.table)
dt[dt, .SD[contract != i.contract & value + i.value < 4L], by=.EACHI, allow=TRUE]
这是惯用的方式'd喜欢使用 i。*
cols只是为了条件,但不返回它们),但是 .SD
尚未优化,并且对 .SD
为每个组评估j表达式是昂贵的。
This is the idiomatic way (in case you'd like to use i.*
cols just for condition, but not return them as well), however, .SD
has not yet been optimised, and evaluating the j-expression on .SD
for each group is costly.
system.time(dt[dt, .SD[contract != i.contract & value + i.value < 4L], by=.EACHI, allow=TRUE])
# user system elapsed
# 2.874 0.020 2.983
某些情况下使用 .SD code>已经过优化
。在这些情况被处理之前,您可以这样解决:
Some cases using .SD
have already been optimised. Until these cases are taken care of, you can workaround it this way:
dt[dt, {
idx = contract != i.contract & value + i.value < 4L
list(contract = contract[idx],
value = value[idx],
i.contract = i.contract[any(idx)],
i.value = i.value[any(idx)]
)
}, by=.EACHI, allow=TRUE]
这需要 0.045 秒,而不是 0.005 秒。但 by = .EACHI
每次评估 j-expression
(因此内存效率高)。这是你必须接受的折衷。
And this takes 0.045 seconds, as opposed to 0.005 seconds from your method. But by=.EACHI
evaluates the j-expression
each time (and therefore memory efficient). That's the trade-off you'll have to accept.
这篇关于笛卡尔积与过滤器数据表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!