笛卡尔积与过滤器数据表 [英] Cartesian product with filter data.table

查看:111
本文介绍了笛卡尔积与过滤器数据表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想用data.table调用替换由SQL生成的笛卡尔积。
我有很大的历史资产和值,我需要所有组合的一个子集。
我们假设我有一个T = [date,contract,value]的表a。在SQL中,它看起来像

I'm trying to replace Cartesian product produced by SQL by data.table call. I have large history with assets and values, and I need a subset of all combinations. Let's say that I have table a with T = [date, contract, value]. In SQL it looks like

SELECT a.date, a.contract, a.value, b.contract. b.value 
FROM T a, T b
WHERE a.date = b.date AND a.contract <> b.contract AND a.value + b.value < 4

在RI中现在有以下

library(data.table)

n <- 1500
dt <- data.table(date     = rep(seq(Sys.Date() - n+1, Sys.Date(), by = "1 day"), 3),
                 contract = c(rep("a", n), rep("b", n), rep("c", n)),
                 value    = c(rep(1, n), rep(2, n), rep(3, n)))
setkey(dt, date)

dt[dt, allow.cartesian = TRUE][(contract != i.contract) & (value + i.value < 4)]



我相信我的解决方案首先创建所有组合在这种情况下为13,500行),然后过滤(至3000)。 SQL但是(我可能是错误的)加入子集,更重要的是不要将所有组合加载到RAM。任何想法如何使用data.table更高效?

I believe that my solution creates all combinations first (in this case 13,500 rows) and then filter (to 3000). SQL however (and I might be wrong) joining subset, and what is more important don't load all combinations into RAM. Any ideas how to use data.table more efficient?

推荐答案

使用 by = .EACHI 功能。 data.table 连接子集即,加入只是另一个子集 - 使用 data.table - 而不是通常的整数/逻辑/行名称。他们是以这种方式设计的,考虑到这些情况。

Use by=.EACHI feature. In data.table joins and subsets are very closely linked; i.e., a join is just another subset - using data.table - instead of the usual integer / logical / row names. They are designed this way with these cases in mind.

基于子集的连接允许在加入时合并 j-表达式和分组操作。

Subset based joins allow to incorporate j-expressions and grouping operations together while joining.

require(data.table)
dt[dt, .SD[contract != i.contract & value + i.value < 4L], by=.EACHI, allow=TRUE]

这是惯用的方式'd喜欢使用 i。* cols只是为了条件,但不返回它们),但是 .SD 尚未优化,并且对 .SD 为每个组评估j表达式是昂贵的。

This is the idiomatic way (in case you'd like to use i.* cols just for condition, but not return them as well), however, .SD has not yet been optimised, and evaluating the j-expression on .SD for each group is costly.

system.time(dt[dt, .SD[contract != i.contract & value + i.value < 4L], by=.EACHI, allow=TRUE])
#    user  system elapsed 
#   2.874   0.020   2.983 

某些情况下使用 .SD code>已经过优化。在这些情况被处理之前,您可以这样解决:

Some cases using .SD have already been optimised. Until these cases are taken care of, you can workaround it this way:

dt[dt, {
        idx = contract != i.contract & value + i.value < 4L
        list(contract = contract[idx],
             value = value[idx], 
             i.contract = i.contract[any(idx)],
             i.value = i.value[any(idx)]
        )
       }, by=.EACHI, allow=TRUE]

这需要 0.045 秒,而不是 0.005 秒。但 by = .EACHI 每次评估 j-expression (因此内存效率高)。这是你必须接受的折衷。

And this takes 0.045 seconds, as opposed to 0.005 seconds from your method. But by=.EACHI evaluates the j-expression each time (and therefore memory efficient). That's the trade-off you'll have to accept.

这篇关于笛卡尔积与过滤器数据表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆