Rolling join on data.table with duplicate keys

Views: 105 | Posted: 2017/3/12 9:55:58 | Tags: r, join, data.table

Problem description

I'm trying to understand rolling joins in data.table. The data to reproduce this is given at the end.

Given a data.table of transactions at an airport, at a given time:
    > dt
       t_id airport thisTime
    1:    1       a      5.1
    2:    3       a      5.1
    3:    2       a      6.2

(note t_ids 1 & 3 have the same airport and time)

and a lookup table of flights departing from airports:
    > dt_lookup
       f_id airport thisTime
    1:    1       a        6
    2:    2       a        6
    3:    1       b        7
    4:    1       c        8
    5:    2       d        7
    6:    1       d        9
    7:    2       e        8

    > tables()
         NAME      NROW NCOL MB COLS                  KEY
    [1,] dt           3    3  1 t_id,airport,thisTime airport,thisTime
    [2,] dt_lookup    7    3  1 f_id,airport,thisTime airport,thisTime

I would like to match all the transactions to all the next possible flights departing from that airport, to give:
    t_id airport thisTime f_id
       1       a        6    1
       1       a        6    2
       3       a        6    1
       3       a        6    2

So I thought this would work:
    > dt[dt_lookup, nomatch=0, roll=Inf]
       t_id airport thisTime f_id
    1:    3       a        6    1
    2:    3       a        6    2

But it hasn't returned the transaction with t_id == 1.

The documentation says:

Usually, there should be no duplicates in x’s key,...

However, I do have duplicates in my 'x key' (namely airport & thisTime), and can't quite see/understand what's going on to mean t_id = 1 gets removed from the output.

Can anyone shed some light as to why t_id = 1 is not returned, and how can I get the join to work for when I have duplicates?

Data
    library(data.table)

    dt <- data.table(t_id = seq(1:3),
                     airport = c("a","a","a"),
                     thisTime = c(5.1, 6.2, 5.1),
                     key = c("airport","thisTime"))

    dt_lookup <- data.table(f_id = c(rep(1,4), rep(2,3)),
                            airport = c("a","b","c","d",
                                        "a","d","e"),
                            thisTime = c(6,7,8,9,
                                         6,7,8),
                            key = c("airport","thisTime"))

Solution
t_id = 1 doesn't show up in the output because a rolling join takes only the row where the key combination occurs last. From the documentation (emphasis mine):

Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF).
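
To make the quoted behaviour concrete, here is a minimal sketch of last observation carried forward on a table without duplicate keys. The tables `prices` and `queries` and their values are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Hypothetical observations: a price recorded at irregular times
prices <- data.table(time = c(1, 5, 10),
                     price = c(100, 110, 120),
                     key = "time")

# Hypothetical query times: before the first observation, inside a gap,
# exactly on an observation, and after the last observation
queries <- data.table(time = c(0, 3, 5, 12), key = "time")

# roll = TRUE carries the prevailing value in x (prices) forward to each query time
res <- prices[queries, roll = TRUE]
print(res)
```

The query at time 0 has no earlier observation to roll forward, so its price is NA. The queries at 3, 5 and 12 pick up the prices recorded at times 1, 5 and 10 respectively.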

Let's consider somewhat larger datasets:
    > DT
       t_id airport thisTime
    1:    1       a      5.1
    2:    4       a      5.1
    3:    3       a      5.1
    4:    2       d      6.2
    5:    5       d      6.2

    > DT_LU
       f_id airport thisTime
    1:    1       a        6
    2:    2       a        6
    3:    2       a        8
    4:    1       b        7
    5:    1       c        8
    6:    2       d        7
    7:    1       d        9

When you perform a rolling join just like in your question:
    DT[DT_LU, nomatch=0, roll=Inf]

you get:

       t_id airport thisTime f_id
    1:    3       a        6    1
    2:    3       a        6    2
    3:    3       a        8    2
    4:    5       d        7    2
    5:    5       d        9    1

As you can see, for both key combinations (a, 5.1 and d, 6.2) only the last row is used in the joined data.table. Because you use Inf as the roll value, all future values are incorporated in the resulting data.table. When you use:
    DT[DT_LU, nomatch=0, roll=1]

you see that only the future values that lie at most 1 time unit ahead are included (here, the first flight after each transaction):

       t_id airport thisTime f_id
    1:    3       a        6    1
    2:    3       a        6    2
    3:    5       d        7    2
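
The two roll values can be compared side by side with a short sketch. It assumes the tables are typed in directly from the printed DT and DT_LU shown earlier; the names `roll_inf` and `roll_1` are introduced here only for illustration.

```r
library(data.table)

# Transactions, typed from the printed DT table
DT <- data.table(t_id = c(1, 4, 3, 2, 5),
                 airport = c("a", "a", "a", "d", "d"),
                 thisTime = c(5.1, 5.1, 5.1, 6.2, 6.2),
                 key = c("airport", "thisTime"))

# Flights, typed from the printed DT_LU table
DT_LU <- data.table(f_id = c(1, 2, 2, 1, 1, 2, 1),
                    airport = c("a", "a", "a", "b", "c", "d", "d"),
                    thisTime = c(6, 6, 8, 7, 8, 7, 9),
                    key = c("airport", "thisTime"))

# Unlimited roll: every later flight picks up the last transaction of its airport
roll_inf <- DT[DT_LU, nomatch = 0, roll = Inf]

# Limited roll: a transaction is rolled forward by at most 1 time unit
roll_1 <- DT[DT_LU, nomatch = 0, roll = 1]

print(roll_inf)
print(roll_1)

# In both results only the last duplicate per key (t_id 3 and 5) appears
print(unique(roll_inf$t_id))
```

In both cases the duplicate transactions with t_id 1, 4 and 2 never appear, which is the behaviour the question ran into.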

If you want the f_id's for all combinations of airport & thisTime where DT$thisTime is lower than DT_LU$thisTime, you can achieve that by creating a new variable (or replacing the existing thisTime) by means of the ceiling function. An example where I create a new variable thisTime2 and then do a normal join with DT_LU:
    DT[, thisTime2 := ceiling(thisTime)]
    setkey(DT, airport, thisTime2)[DT_LU, nomatch=0]

which gives:

       t_id airport thisTime thisTime2 f_id
    1:    1       a      5.1         6    1
    2:    4       a      5.1         6    1
    3:    3       a      5.1         6    1
    4:    1       a      5.1         6    2
    5:    4       a      5.1         6    2
    6:    3       a      5.1         6    2
    7:    2       d      6.2         7    2
    8:    5       d      6.2         7    2

Applied to the data you provided:

    > dt[, thisTime2 := ceiling(thisTime)]
    > setkey(dt, airport, thisTime2)[dt_lookup, nomatch=0]
       t_id airport thisTime thisTime2 f_id
    1:    1       a      5.1         6    1
    2:    3       a      5.1         6    1
    3:    1       a      5.1         6    2
    4:    3       a      5.1         6    2

When you want to include all the future values instead of only the first one, you need a somewhat different approach, for which you will need the i.col functionality (not yet documented at the time of writing):

1: First set the key to only the airport column:

    setkey(DT, airport)
    setkey(DT_LU, airport)

2: Use the i.col functionality in j to get what you want as follows:
    DT1 <- DT_LU[DT,
                 .(tid = i.t_id,
                   tTime = i.thisTime,
                   fTime = thisTime[i.thisTime < thisTime],
                   fid = f_id[i.thisTime < thisTime]),
                 by = .EACHI]

this gives you:

    > DT1
        airport tid tTime fTime fid
     1:       a   1   5.1     6   1
     2:       a   1   5.1     6   2
     3:       a   1   5.1     8   2
     4:       a   4   5.1     6   1
     5:       a   4   5.1     6   2
     6:       a   4   5.1     8   2
     7:       a   3   5.1     6   1
     8:       a   3   5.1     6   2
     9:       a   3   5.1     8   2
    10:       d   2   6.2     7   2
    11:       d   2   6.2     9   1
    12:       d   5   6.2     7   2
    13:       d   5   6.2     9   1

Some explanation: when you join two data.tables that use the same column names, you can refer to the columns of the data.table in i by prefixing the column names with i. (for example, i.thisTime). This makes it possible to compare thisTime from DT with thisTime from DT_LU. With by = .EACHI, j is evaluated for each row of i, so all combinations for which the condition holds are included in the resulting data.table.
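
The i. prefix and by = .EACHI can be seen in isolation in a smaller sketch. The tables `X` and `Y` and the column `n_bigger` are hypothetical and only serve to show the mechanics.

```r
library(data.table)

# Two hypothetical tables that share the column names id and val
X <- data.table(id = c("a", "a", "b"), val = c(1, 2, 3), key = "id")
Y <- data.table(id = c("a", "b"), val = c(1.5, 10), key = "id")

# For each row of Y (the i table), count the matching rows of X whose val
# exceeds that row's val. Plain `val` is X's column, `i.val` is Y's column.
out <- X[Y, .(n_bigger = sum(val > i.val)), by = .EACHI]
print(out)
```

For id "a" only the X value 2 exceeds 1.5, so n_bigger is 1. For id "b" the X value 3 does not exceed 10, so n_bigger is 0.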

Alternatively, you can achieve the same with:

    DT2 <- DT_LU[DT,
                 .(airport = i.airport,
                   tid = i.t_id,
                   tTime = i.thisTime,
                   fTime = thisTime[i.thisTime < thisTime],
                   fid = f_id[i.thisTime < thisTime]),
                 allow.cartesian = TRUE]

which gives the same result:

    > identical(DT1, DT2)
    [1] TRUE

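
As a further option that is not part of the original answer, a non-equi join can state the "flight is later than transaction" condition directly in `on`. This is a sketch that assumes data.table 1.9.8 or later (where non-equi joins and the x. prefix are available); the tables are typed in from the printed DT and DT_LU, and the name `fut` is introduced here for illustration.

```r
library(data.table)

# Transactions and flights, typed from the printed DT and DT_LU tables
DT <- data.table(t_id = c(1, 4, 3, 2, 5),
                 airport = c("a", "a", "a", "d", "d"),
                 thisTime = c(5.1, 5.1, 5.1, 6.2, 6.2))

DT_LU <- data.table(f_id = c(1, 2, 2, 1, 1, 2, 1),
                    airport = c("a", "a", "a", "b", "c", "d", "d"),
                    thisTime = c(6, 6, 8, 7, 8, 7, 9))

# In `on`, the left-hand side names a column of x (DT_LU) and the right-hand
# side a column of i (DT): same airport, flight time later than transaction time.
# x. and i. prefixes pick the flight's and the transaction's columns explicitly.
fut <- DT_LU[DT,
             on = .(airport, thisTime > thisTime),
             .(airport = i.airport,
               tid = i.t_id,
               tTime = i.thisTime,
               fTime = x.thisTime,
               fid = x.f_id),
             nomatch = 0L,
             allow.cartesian = TRUE]
print(fut)
```

Each transaction is paired with every later flight from the same airport, so duplicate transaction keys are all retained. The row order within each transaction may differ from DT1.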
When you only want to include future values within a certain boundary, you can use:

    DT1 <- DT_LU[DT,
                 {
                   idx = i.thisTime < thisTime & thisTime - i.thisTime < 2
                   .(tid = i.t_id,
                     tTime = i.thisTime,
                     fTime = thisTime[idx],
                     fid = f_id[idx])
                 },
                 by = .EACHI]

which gives:

    > DT1
       airport tid tTime fTime fid
    1:       a   1   5.1     6   1
    2:       a   1   5.1     6   2
    3:       a   4   5.1     6   1
    4:       a   4   5.1     6   2
    5:       a   3   5.1     6   1
    6:       a   3   5.1     6   2
    7:       d   2   6.2     7   2
    8:       d   5   6.2     7   2

When you compare that to the previous result, you see that rows 3, 6, 9, 11 and 13 (the flights departing 2 or more time units after the transaction) have been removed.

Data:
    DT <- data.table(t_id = c(1,4,2,3,5),
                     airport = c("a","a","d","a","d"),
                     thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
                     key = c("airport","thisTime"))

    # the last airport is "a" (not "e") so that DT_LU matches the table printed above
    DT_LU <- data.table(f_id = c(rep(1,4), rep(2,3)),
                        airport = c("a","b","c","d","a","d","a"),
                        thisTime = c(6,7,8,9,6,7,8),
                        key = c("airport","thisTime"))
