Rolling join on data.table with duplicate keys
Problem description
I'm trying to understand rolling joins in data.table. The data to reproduce this is given at the end.

Given a data.table of transactions at an airport, at a given time:

    > dt
       t_id airport thisTime
    1:    1       a      5.1
    2:    3       a      5.1
    3:    2       a      6.2

(note that t_ids 1 & 3 have the same airport and time)

and a lookup table of flights departing from airports:

    > dt_lookup
       f_id airport thisTime
    1:    1       a        6
    2:    2       a        6
    3:    1       b        7
    4:    1       c        8
    5:    2       d        7
    6:    1       d        9
    7:    2       e        8

    > tables()
         NAME      NROW NCOL MB COLS                  KEY
    [1,] dt           3    3  1 t_id,airport,thisTime airport,thisTime
    [2,] dt_lookup    7    3  1 f_id,airport,thisTime airport,thisTime
I would like to match all the transactions to all the next possible flights departing from that airport, to give:

    t_id airport thisTime f_id
       1       a        6    1
       1       a        6    2
       3       a        6    1
       3       a        6    2

So I thought this would work:

    > dt[dt_lookup, nomatch=0, roll=Inf]
       t_id airport thisTime f_id
    1:    3       a        6    1
    2:    3       a        6    2
But it hasn't returned the transaction with t_id == 1.

From the documentation it says:

    Usually, there should be no duplicates in x’s key, ...

However, I do have duplicates in my 'x key' (namely airport & thisTime), and can't quite see why that means t_id = 1 gets removed from the output.

Can anyone shed some light on why t_id = 1 is not returned, and how I can get the join to work when I have duplicates?

Data
    library(data.table)

    dt <- data.table(t_id = seq(1:3),
                     airport = c("a","a","a"),
                     thisTime = c(5.1, 6.2, 5.1),
                     key = c("airport","thisTime"))

    dt_lookup <- data.table(f_id = c(rep(1,4), rep(2,3)),
                            airport = c("a","b","c","d",
                                        "a","d","e"),
                            thisTime = c(6,7,8,9,
                                         6,7,8),
                            key = c("airport","thisTime"))
Solution

The reason that t_id = 1 doesn't show up in the output is that, when a key combination occurs more than once in x, a rolling join uses only the row where that key combination occurs last. From the documentation (emphasis mine):

    Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF).
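As a minimal sketch of what "rolled forward" means, consider a toy table with a unique key, so that the duplicate-key issue does not come into play. The names `x`, `time` and `val` are made up for this illustration and do not appear in the question.

```r
library(data.table)

# Toy table with a unique key: one observation at time 1 and one at time 5.
x <- data.table(time = c(1, 5), val = c("first", "second"), key = "time")

# Look up times 0, 3 and 7 with roll = TRUE (last observation carried forward).
# 0 lies before the first observation, 3 falls in the gap, 7 lies after the end.
res <- x[.(c(0, 3, 7)), roll = TRUE]
print(res)
```

The lookup at time 3 picks up the value observed at time 1, and the lookup at time 7 picks up the value observed at time 5. The lookup at time 0 has nothing before it to carry forward, so it comes back as NA.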
Let's consider somewhat larger datasets:

    > DT
       t_id airport thisTime
    1:    1       a      5.1
    2:    4       a      5.1
    3:    3       a      5.1
    4:    2       d      6.2
    5:    5       d      6.2

    > DT_LU
       f_id airport thisTime
    1:    1       a        6
    2:    2       a        6
    3:    2       a        8
    4:    1       b        7
    5:    1       c        8
    6:    2       d        7
    7:    1       d        9

When you perform a rolling join just like in your question:

    DT[DT_LU, nomatch=0, roll=Inf]

you get:

       t_id airport thisTime f_id
    1:    3       a        6    1
    2:    3       a        6    2
    3:    3       a        8    2
    4:    5       d        7    2
    5:    5       d        9    1
As you can see, for both key combinations a, 5.1 and d, 6.2 only the last row is used in the joined data.table. Because you use Inf as the roll value, the prevailing value is carried forward without limit, so all future values are incorporated in the resulting data.table. When you use:

    DT[DT_LU, nomatch=0, roll=1]

the value is carried forward by at most 1 time unit, so only the flights that depart within that window are included:

       t_id airport thisTime f_id
    1:    3       a        6    1
    2:    3       a        6    2
    3:    5       d        7    2
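The effect of the roll value can be sketched on a single transaction, which keeps duplicates out of the picture. The tables `trans` and `flights` below are hypothetical and only reuse the column names from the question.

```r
library(data.table)

# One transaction at airport "a" at time 5.1.
trans <- data.table(airport = "a", thisTime = 5.1, t_id = 1L,
                    key = c("airport", "thisTime"))

# Two later flights from the same airport, at times 6 and 8.
flights <- data.table(airport = "a", thisTime = c(6, 8), f_id = c(1L, 2L),
                      key = c("airport", "thisTime"))

# roll = Inf: the transaction is carried forward without limit.
unlimited <- trans[flights, roll = Inf, nomatch = NULL]

# roll = 1: the transaction is carried forward by at most 1 time unit.
within_one <- trans[flights, roll = 1, nomatch = NULL]

print(unlimited)
print(within_one)
```

With `roll = Inf` both flights match. With `roll = 1` only the flight at time 6 matches, because the flight at time 8 is 2.9 time units after the transaction.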
If you want the f_ids for all combinations of airport & thisTime where DT$thisTime is lower than DT_LU$thisTime, you can achieve that by creating a new variable (or replacing the existing thisTime) by means of the ceiling function. This works here because the flight times are whole numbers. An example where I create a new variable thisTime2 and then do a normal join with DT_LU:

    DT[, thisTime2 := ceiling(thisTime)]
    setkey(DT, airport, thisTime2)[DT_LU, nomatch=0]

which gives:

       t_id airport thisTime thisTime2 f_id
    1:    1       a      5.1         6    1
    2:    4       a      5.1         6    1
    3:    3       a      5.1         6    1
    4:    1       a      5.1         6    2
    5:    4       a      5.1         6    2
    6:    3       a      5.1         6    2
    7:    2       d      6.2         7    2
    8:    5       d      6.2         7    2

Applied to the data you provided:

    > dt[, thisTime2 := ceiling(thisTime)]
    > setkey(dt, airport, thisTime2)[dt_lookup, nomatch=0]
       t_id airport thisTime thisTime2 f_id
    1:    1       a      5.1         6    1
    2:    3       a      5.1         6    1
    3:    1       a      5.1         6    2
    4:    3       a      5.1         6    2
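If the flight times are not whole numbers, one possible alternative (a different technique from the ceiling approach, sketched here under assumptions) is to turn the join around. First roll each transaction to the next departure time using a lookup of unique departure times, then do an ordinary join back to the flights. Duplicates on the i side of a rolling join are not a problem. The helper names `flight_times`, `flightTime`, `joinTime` and `next_flight` are made up for this sketch.

```r
library(data.table)

dt <- data.table(t_id = 1:3,
                 airport = c("a", "a", "a"),
                 thisTime = c(5.1, 6.2, 5.1),
                 key = c("airport", "thisTime"))
dt_lookup <- data.table(f_id = c(rep(1, 4), rep(2, 3)),
                        airport = c("a", "b", "c", "d", "a", "d", "e"),
                        thisTime = c(6, 7, 8, 9, 6, 7, 8),
                        key = c("airport", "thisTime"))

# Step 1: one row per airport and departure time, so x has no duplicate keys.
# joinTime is a copy used for matching; flightTime is kept in the result.
flight_times <- unique(dt_lookup[, .(airport, flightTime = thisTime)])
flight_times[, joinTime := flightTime]

# Step 2: roll = -Inf carries the next observation backward, so each
# transaction picks up the first departure time at or after its own time.
# Transactions with no later departure are dropped by nomatch = NULL.
next_flight <- flight_times[dt, on = .(airport, joinTime = thisTime),
                            roll = -Inf, nomatch = NULL]

# Step 3: ordinary join back to the flights to get every f_id at that time.
res <- dt_lookup[next_flight, on = .(airport, thisTime = flightTime)]
print(res[, .(t_id, airport, thisTime, f_id)])
```

On the question's data this returns t_id 1 and t_id 3, each paired with both flights that depart at time 6. Note that a transaction at exactly a departure time would match that departure, which may or may not be what you want.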
When you want to include all the future values instead of only the first one, you need a somewhat different approach, for which you will need the i.col functionality (which is not documented yet):

1: First set the key to only the airport columns:

    setkey(DT, airport)
    setkey(DT_LU, airport)

2: Use the i.col functionality in j to get what you want as follows:

    DT1 <- DT_LU[DT,
                 .(tid = i.t_id,
                   tTime = i.thisTime,
                   fTime = thisTime[i.thisTime < thisTime],
                   fid = f_id[i.thisTime < thisTime]),
                 by = .EACHI]

this gives you:

    > DT1
        airport tid tTime fTime fid
     1:       a   1   5.1     6   1
     2:       a   1   5.1     6   2
     3:       a   1   5.1     8   2
     4:       a   4   5.1     6   1
     5:       a   4   5.1     6   2
     6:       a   4   5.1     8   2
     7:       a   3   5.1     6   1
     8:       a   3   5.1     6   2
     9:       a   3   5.1     8   2
    10:       d   2   6.2     7   2
    11:       d   2   6.2     9   1
    12:       d   5   6.2     7   2
    13:       d   5   6.2     9   1
Some explanation: when you join two data.tables that use the same column names, you can refer to the columns of the data.table in i by prefixing the column names with i. (for example i.thisTime). That makes it possible to compare thisTime from DT with thisTime from DT_LU. With by = .EACHI you ensure that all combinations for which the condition holds are included in the resulting data.table.

Alternatively, you can achieve the same with:

    DT2 <- DT_LU[DT,
                 .(airport = i.airport,
                   tid = i.t_id,
                   tTime = i.thisTime,
                   fTime = thisTime[i.thisTime < thisTime],
                   fid = f_id[i.thisTime < thisTime]),
                 allow.cartesian = TRUE]

which gives the same result:

    > identical(DT1, DT2)
    [1] TRUE
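In data.table versions that support non-equi joins, the same "all future flights" result can probably be written without changing the keys. The following is a sketch under assumptions, not the method used in this answer. It relies on the `on=` argument with an inequality and on the `x.` prefix in j. The last airport in `DT_LU` is taken to be "a" so that the table matches the printed `DT_LU` above.

```r
library(data.table)

DT <- data.table(t_id = c(1, 4, 2, 3, 5),
                 airport = c("a", "a", "d", "a", "d"),
                 thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
                 key = c("airport", "thisTime"))
# Assumption: last airport is "a", matching the DT_LU printout in the answer.
DT_LU <- data.table(f_id = c(rep(1, 4), rep(2, 3)),
                    airport = c("a", "b", "c", "d", "a", "d", "a"),
                    thisTime = c(6, 7, 8, 9, 6, 7, 8),
                    key = c("airport", "thisTime"))

# Same airport, and a flight time strictly later than the transaction time.
# x.thisTime is the flight time (from DT_LU); i.thisTime is the transaction
# time (from DT).
res <- DT_LU[DT,
             on = .(airport, thisTime > thisTime),
             .(airport,
               tid = i.t_id,
               tTime = i.thisTime,
               fTime = x.thisTime,
               fid = f_id),
             nomatch = NULL]
print(res)
```

Each of the three transactions at airport a is paired with its three later flights, and each of the two transactions at airport d with its two later flights, which gives 13 rows in total.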
When you only want to include future values within a certain boundary, you can use:

    DT1 <- DT_LU[DT,
                 {
                   idx = i.thisTime < thisTime & thisTime - i.thisTime < 2
                   .(tid = i.t_id,
                     tTime = i.thisTime,
                     fTime = thisTime[idx],
                     fid = f_id[idx])
                 },
                 by = .EACHI]

which gives:

    > DT1
       airport tid tTime fTime fid
    1:       a   1   5.1     6   1
    2:       a   1   5.1     6   2
    3:       a   4   5.1     6   1
    4:       a   4   5.1     6   2
    5:       a   3   5.1     6   1
    6:       a   3   5.1     6   2
    7:       d   2   6.2     7   2
    8:       d   5   6.2     7   2

When you compare that to the previous result, you see that rows 3, 6, 9, 11 and 13 have now been removed.
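The bounded variant can also be sketched as a non-equi join with two inequalities, again as an assumption-laden alternative rather than the approach above. The helper column `latest` is made up for this example and holds the upper bound of the window for each transaction. The last airport in `DT_LU` is taken to be "a" so that the table matches the printed `DT_LU`.

```r
library(data.table)

DT <- data.table(t_id = c(1, 4, 2, 3, 5),
                 airport = c("a", "a", "d", "a", "d"),
                 thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
                 key = c("airport", "thisTime"))
# Assumption: last airport is "a", matching the DT_LU printout in the answer.
DT_LU <- data.table(f_id = c(rep(1, 4), rep(2, 3)),
                    airport = c("a", "b", "c", "d", "a", "d", "a"),
                    thisTime = c(6, 7, 8, 9, 6, 7, 8),
                    key = c("airport", "thisTime"))

# Upper bound of the window: flights must depart less than 2 time units later.
DT[, latest := thisTime + 2]

# Same airport, flight time after the transaction and before the upper bound.
res <- DT_LU[DT,
             on = .(airport, thisTime > thisTime, thisTime < latest),
             .(airport,
               tid = i.t_id,
               tTime = i.thisTime,
               fTime = x.thisTime,
               fid = f_id),
             nomatch = NULL]
print(res)
```

Only the flights at time 6 (airport a) and time 7 (airport d) fall inside the window, so the result has 8 rows.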
Data:

    DT <- data.table(t_id = c(1,4,2,3,5),
                     airport = c("a","a","d","a","d"),
                     thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
                     key = c("airport","thisTime"))

    # the last airport is "a" (not "e") so that DT_LU matches the table printed above
    DT_LU <- data.table(f_id = c(rep(1,4), rep(2,3)),
                        airport = c("a","b","c","d","a","d","a"),
                        thisTime = c(6,7,8,9,6,7,8),
                        key = c("airport","thisTime"))