Rolling join on data.table with duplicate keys
I'm trying to understand rolling joins in data.table. The data to reproduce this is given at the end.
Given a data.table of transactions at an airport, at a given time:
> dt
t_id airport thisTime
1: 1 a 5.1
2: 3 a 5.1
3: 2 a 6.2
(note that t_ids 1 & 3 have the same airport and time)
and a lookup table of flights departing from airports:
> dt_lookup
f_id airport thisTime
1: 1 a 6
2: 2 a 6
3: 1 b 7
4: 1 c 8
5: 2 d 7
6: 1 d 9
7: 2 e 8
> tables()
NAME NROW NCOL MB COLS KEY
[1,] dt 3 3 1 t_id,airport,thisTime airport,thisTime
[2,] dt_lookup 7 3 1 f_id,airport,thisTime airport,thisTime
I would like to match all the transactions to all the next possible flights departing from that airport, to give:
t_id airport thisTime f_id
1 a 6 1
1 a 6 2
3 a 6 1
3 a 6 2
So I thought this would work:
> dt[dt_lookup, nomatch=0,roll=Inf]
t_id airport thisTime f_id
1: 3 a 6 1
2: 3 a 6 2
But it hasn't returned the transaction with t_id == 1.
The documentation says:
Usually, there should be no duplicates in x’s key,...
However, I do have duplicates in my 'x key' (namely airport & thisTime), and I can't quite understand why t_id = 1 gets removed from the output.
Can anyone shed some light on why t_id = 1 is not returned, and how I can get the join to work when I have duplicates?
Data
library(data.table)
dt <- data.table(t_id = seq(1:3),
airport = c("a","a","a"),
thisTime = c(5.1,6.2, 5.1), key=c( "airport","thisTime"))
dt_lookup <- data.table(f_id = c(rep(1,4),rep(2,3)),
airport = c("a","b","c","d",
"a","d","e"),
thisTime = c(6,7,8,9,
6,7,8), key=c("airport","thisTime"))
t_id = 1 doesn't show up in the output because a rolling join takes only the last row for a duplicated key combination. From the documentation (emphasis mine):
Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF).
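To see the effect in isolation, here is a minimal sketch with made-up tables (`trans` and `flights` are illustrative names, not the question's objects). It assumes a keyed join as in the question: a lookup value that falls in a gap is matched to a single rolled-forward row, while an exact key match still returns every duplicate.

```r
library(data.table)

# Two transactions share the same key (airport "a", time 5.1)
trans <- data.table(t_id = c(1, 3), airport = "a", thisTime = c(5.1, 5.1),
                    key = c("airport", "thisTime"))
# One flight at time 6, which falls in the gap after 5.1
flights <- data.table(f_id = 1, airport = "a", thisTime = 6,
                      key = c("airport", "thisTime"))

# Rolling match: only the last of the duplicated rows is rolled forward
rolled <- trans[flights, roll = Inf, nomatch = 0]

# Exact match on the duplicated key: both rows come back
exact <- trans[.("a", 5.1), roll = Inf, nomatch = 0]

print(rolled)
print(exact)
```

Only the rolled case drops a duplicate, which is why the missing row appears exactly when the lookup time is not itself present in the keyed table.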
Let's consider somewhat larger datasets:
> DT
t_id airport thisTime
1: 1 a 5.1
2: 4 a 5.1
3: 3 a 5.1
4: 2 d 6.2
5: 5 d 6.2
> DT_LU
f_id airport thisTime
1: 1 a 6
2: 2 a 6
3: 2 a 8
4: 1 b 7
5: 1 c 8
6: 2 d 7
7: 1 d 9
When you perform a rolling join just like in your question:
DT[DT_LU, nomatch=0, roll=Inf]
you get:
t_id airport thisTime f_id
1: 3 a 6 1
2: 3 a 6 2
3: 3 a 8 2
4: 5 d 7 2
5: 5 d 9 1
As you can see, for both key combinations (a, 5.1 and d, 6.2) only the last row is used in the joined data.table. Because you use Inf as the roll value, that row is rolled forward without limit, so it is matched to every later flight at the same airport. When you use:
DT[DT_LU, nomatch=0, roll=1]
you see that only flights at most 1 time unit in the future are included:
t_id airport thisTime f_id
1: 3 a 6 1
2: 3 a 6 2
3: 5 d 7 2
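The difference between the two roll values can be checked on a tiny example. This is a sketch with invented data (`x` and `lookup` are hypothetical names): one transaction at 5.1 and two flights, at 6 and at 8.

```r
library(data.table)

x <- data.table(t_id = 1, airport = "a", thisTime = 5.1,
                key = c("airport", "thisTime"))
lookup <- data.table(f_id = c(1, 2), airport = "a", thisTime = c(6, 8),
                     key = c("airport", "thisTime"))

# roll = Inf: the 5.1 row is carried forward to both 6 and 8
roll_inf <- x[lookup, roll = Inf, nomatch = 0]

# roll = 1: the 5.1 row is carried forward at most 1 unit, so only 6 matches
roll_one <- x[lookup, roll = 1, nomatch = 0]

print(roll_inf)
print(roll_one)
```

So a finite roll value acts as a maximum distance between the x row and the lookup value, not as a count of matches.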
If you want the f_ids of the next flights for all combinations of airport & thisTime in DT, including the duplicated ones, you can achieve that by creating a new variable (or replacing the existing thisTime) with the ceiling function. This relies on the flight times being whole numbers, as they are here. An example where I create a new variable thisTime2 and then do a normal join with DT_LU:
DT[, thisTime2 := ceiling(thisTime)]
setkey(DT, airport, thisTime2)[DT_LU, nomatch=0]
which gives:
t_id airport thisTime thisTime2 f_id
1: 1 a 5.1 6 1
2: 4 a 5.1 6 1
3: 3 a 5.1 6 1
4: 1 a 5.1 6 2
5: 4 a 5.1 6 2
6: 3 a 5.1 6 2
7: 2 d 6.2 7 2
8: 5 d 6.2 7 2
Applied to the data you provided:
> dt[, thisTime2 := ceiling(thisTime)]
> setkey(dt, airport, thisTime2)[dt_lookup, nomatch=0]
t_id airport thisTime thisTime2 f_id
1: 1 a 5.1 6 1
2: 3 a 5.1 6 1
3: 1 a 5.1 6 2
4: 3 a 5.1 6 2
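One caveat of the ceiling approach is worth a quick check: it is an exact join on the rounded-up time, so a flight that does not depart on a whole number is never matched. A minimal sketch with invented data (the 6.5 departure is an assumption for illustration, and the `on=` argument needs data.table 1.9.6 or later):

```r
library(data.table)

trans <- data.table(t_id = 1, airport = "a", thisTime = 5.1)
# One flight on a whole number, one at 6.5
flights <- data.table(f_id = c(1, 2), airport = "a", thisTime = c(6, 6.5))

trans[, thisTime2 := ceiling(thisTime)]   # 5.1 becomes 6

# Ordinary (non-rolling) join of the flight time against the rounded time
res <- flights[trans, on = .(airport, thisTime = thisTime2), nomatch = 0]
print(res)
```

Only the flight at 6 is returned; the 6.5 departure is silently skipped even though it is later than the transaction.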
When you want to include all the future values instead of only the first one, you need a somewhat different approach, for which you will need the i.col functionality (which is not documented yet):
1: First set the key to only the airport column:
setkey(DT, airport)
setkey(DT_LU, airport)
2: Use the i.col functionality in j to get what you want as follows:
DT1 <- DT_LU[DT, .(tid = i.t_id,
tTime = i.thisTime,
fTime = thisTime[i.thisTime < thisTime],
fid = f_id[i.thisTime < thisTime]),
by=.EACHI]
this gives you:
> DT1
airport tid tTime fTime fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 1 5.1 8 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
6: a 4 5.1 8 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
9: a 3 5.1 8 2
10: d 2 6.2 7 2
11: d 2 6.2 9 1
12: d 5 6.2 7 2
13: d 5 6.2 9 1
Some explanation: when you join two data.tables that use the same column names, you can refer to the columns of the data.table in i by prefixing the column names with i. (for example i.thisTime). Now it's possible to compare thisTime from DT with thisTime from DT_LU. With by = .EACHI, j is evaluated once for each row of DT, so all combinations for which the condition holds are included in the resulting data.table.
Alternatively, you can achieve the same with:
DT2 <- DT_LU[DT, .(airport=i.airport,
tid=i.t_id,
tTime=i.thisTime,
fTime=thisTime[i.thisTime < thisTime],
fid=f_id[i.thisTime < thisTime]),
allow.cartesian=TRUE]
which gives the same result:
> identical(DT1, DT2)
[1] TRUE
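As a different route to the same kind of result, a non-equi join can put the "flight is later than the transaction" condition directly in `on`. This is only a sketch, not part of the approach above: it assumes data.table 1.9.8 or later (non-equi joins and the `x.` prefix), and the tables and the column names `tTime` and `fTime` are invented so that the two time columns do not clash.

```r
library(data.table)

trans <- data.table(t_id = c(1, 3, 2), airport = "a", tTime = c(5.1, 5.1, 7))
flights <- data.table(f_id = c(1, 2, 2), airport = "a", fTime = c(6, 6, 8))

# Match every transaction to every later flight at the same airport.
# In `on`, the left-hand name is a column of flights (x) and the
# right-hand name is a column of trans (i).
res <- flights[trans,
               on = .(airport, fTime > tTime),
               .(airport, tid = i.t_id, tTime = i.tTime,
                 fTime = x.fTime, fid = x.f_id),
               nomatch = 0]
print(res)
```

The transactions at 5.1 each get three flights and the one at 7 gets only the flight at 8, with no keys to set beforehand.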
When you only want to include future values within a certain boundary, you can use:
DT1 <- DT_LU[DT,
{
idx = i.thisTime < thisTime & thisTime - i.thisTime < 2
.(tid = i.t_id,
tTime = i.thisTime,
fTime = thisTime[idx],
fid = f_id[idx])
},
by=.EACHI]
which gives:
> DT1
airport tid tTime fTime fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 4 5.1 6 1
4: a 4 5.1 6 2
5: a 3 5.1 6 1
6: a 3 5.1 6 2
7: d 2 6.2 7 2
8: d 5 6.2 7 2
When you compare that to the previous result, you see that rows 3, 6, 9, 11 and 13 have now been removed.
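A bounded window can also be written as a non-equi join with two conditions on the flight time. Again this is a hedged sketch on invented data, assuming data.table 1.9.8 or later; `tMax` is a helper column introduced here for illustration.

```r
library(data.table)

trans <- data.table(t_id = 1, airport = "a", tTime = 5.1)
flights <- data.table(f_id = c(1, 2, 2), airport = "a", fTime = c(6, 6, 8))

# Upper bound of the window: less than 2 time units after the transaction
trans[, tMax := tTime + 2]

# Keep flights with tTime < fTime < tMax at the same airport
res <- flights[trans,
               on = .(airport, fTime > tTime, fTime < tMax),
               .(airport, tid = i.t_id, tTime = i.tTime,
                 fTime = x.fTime, fid = x.f_id),
               nomatch = 0]
print(res)
```

Here the two flights at 6 are kept and the flight at 8 falls outside the window.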
Data:
DT <- data.table(t_id = c(1,4,2,3,5),
airport = c("a","a","d","a","d"),
thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
key=c("airport","thisTime"))
# values chosen to match the DT_LU table printed above
DT_LU <- data.table(f_id = c(1,2,2,1,1,2,1),
airport = c("a","a","a","b","c","d","d"),
thisTime = c(6,6,8,7,8,7,9),
key=c("airport","thisTime"))