条件(不相等)联接到data.table中 [英] Conditional (inequality) join in data.table

查看:40
本文介绍了条件(不相等)联接到data.table中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我只是想弄清楚如何对两个data.tables进行条件连接.

I'm just trying to figure out how to do a conditional join on two data.tables.

我写了一个 sqldf 条件连接,给我的电路的开始或结束时间在对方的开始/完成时间之内.

I've written a sqldf conditional join to give me the circuits whose start or finish times are within the other's start/finish times.

sqldf("select dt2.start, dt2.finish, dt2.counts, dt1.id, dt1.circuit 
      from dt2 
        left join dt1 on (
          (dt2.start  >= dt1.start and dt2.start  < dt1.finish) or 
          (dt2.finish >= dt1.start and dt2.finish < dt1.finish)
        )")

这给了我正确的结果,但是对于我庞大的数据集来说太慢了.

This gives me the correct result, but it's too slow for my large-ish data set.

没有向量扫描的 data.table 方法是什么?

What's the data.table way to do this without a vector scan?

这是我的数据:

dt1 <- data.table(structure(list(circuit = structure(c(2L, 1L, 2L, 1L, 2L, 3L, 
1L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), start = structure(c(1393621200, 
1393627920, 1393628400, 1393631520, 1393650300, 1393646400, 1393656000, 
1393668000, 1393666200), class = c("POSIXct", "POSIXt"), tzone = ""), 
    end = structure(c(1393626600, 1393631519, 1393639200, 1393632000, 
    1393660500, 1393673400, 1393667999, 1393671600, 1393677000
    ), class = c("POSIXct", "POSIXt"), tzone = ""), id = structure(1:9, .Label = c("1001", 
    "1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009"
    ), class = "factor")), .Names = c("circuit", "start", "end", 
"id"), class = "data.frame", row.names = c(NA, -9L)))


dt2 <- data.table(structure(list(start = structure(c(1393621200, 1393624800, 1393626600, 
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299, 
1393655999, 1393660500, 1393666199, 1393671600, 1393673400), class = c("POSIXct", 
"POSIXt"), tzone = ""), end = structure(c(1393624799, 1393626600, 
1393627919, 1393628399, 1393632000, 1393639200, 1393646399, 1393650299, 
1393655999, 1393660500, 1393666199, 1393671600, 1393673400, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), seconds = c(3599L, 
1800L, 1319L, 480L, 3601L, 7200L, 7199L, 3900L, 5700L, 4501L, 
5699L, 5401L, 1800L, 3600L), counts = c(1L, 1L, 0L, 1L, 2L, 1L, 
0L, 1L, 2L, 3L, 2L, 3L, 2L, 1L)), .Names = c("start", "end", 
"seconds", "counts"), row.names = c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 11L, 12L, 13L, 14L, 15L), class = "data.frame"))

推荐答案

使用非装备联接:

ans = dt1[dt2, on=.(start <= end, end > start), 
           .(i.start, i.end, counts, id, circuit, cndn = i.start < x.start & i.end >= x.end), 
           allow.cartesian=TRUE
        ][!cndn %in% TRUE]

条件 start< = end,end> = start (请注意在两种情况下均为> = )会检查是否两个间隔以任何方式重叠.一侧的打开间隔由 end>完成.开始部分(> 而不是> = ).但它仍然可以选择以下类型的间隔:

The condition start <= end, end >= start (note the >= on both cases) would check if two intervals overlap by any means. The open interval on one side is accomplished by end > start part (> instead of >=). But still it also picks up the intervals of type:

         dt1: start=================end
   dt2: start--------------------------------end ## start < start, end > end

         dt1: start=================end
                dt2: start----------end          ## end == end

cndn 列用于检查并删除这些情况.希望这些情况不多,这样我们就不会不必要地实现不需要的行.

The cndn column is to check and remove these cases. Hopefully, those cases aren't a lot so that we don't materialise unwanted rows unnecessarily.

PS:在这种情况下,解决方案并不像我想的那样简单,这是因为解决方案需要 OR 操作.可以执行两个条件连接,然后将它们绑定在一起.

PS: the solution in this case is not as straightforward as I'd like to still, and that's because the solution requires an OR operation. It is possible to do two conditional joins, and then bind them together though.

也许在某个时候,我们将不得不考虑以更直接的方式将联接扩展到此类操作的可行性.

Perhaps at some point, we'll have to think about the feasibility of extending joins to these kinds of operations in a more straightforward manner.

这篇关于条件(不相等)联接到data.table中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆