Inner join using an inequality expression
Background
(Not required for the question, but may be useful to read)
Rolling join on data.table with duplicate keys
Odd behaviour when joining with multiple conditions
Data
library(data.table) ## using version 1.9.6
## arrival timetable
dt_arrive <- structure(list(txn_id = c(1L, 1L, 1L, 1L, 1L), place = c("place_a",
"place_a", "place_a", "place_a", "place_a"), arrival_minutes = c(515,
534, 547, 561, 581), journey_id = 1:5), .Names = c("txn_id",
"place", "arrival_minutes", "journey_id"), class = c("data.table",
"data.frame"), row.names = c(NA, -5L), sorted = c("txn_id",
"place"))
## departure timetable
dt_depart <- structure(list(txn_id = c(1L, 1L, 1L, 1L), place = c("place_a",
"place_a", "place_a", "place_a"), arrival_minutes = c(489, 507,
519, 543), journey_id = 10:13), .Names = c("txn_id", "place",
"arrival_minutes", "journey_id"), sorted = c("txn_id", "place"
), class = c("data.table", "data.frame"), row.names = c(NA, -4L
))
> dt_arrive
txn_id place arrival_minutes journey_id
1: 1 place_a 515 1
2: 1 place_a 534 2
3: 1 place_a 547 3
4: 1 place_a 561 4
5: 1 place_a 581 5
> dt_depart
txn_id place arrival_minutes journey_id
1: 1 place_a 489 10
2: 1 place_a 507 11
3: 1 place_a 519 12
4: 1 place_a 543 13
Question
I would like to join the arrivals to the departures for only those dt_depart$journey_id that occur after dt_arrive$journey_id in terms of arrival_minutes (i.e. an inner join on txn_id & place).
For example, the output I would like is:
txn_id place journey_in_id journey_out_id journey_place_arrive journey_place_depart
1 place_a 1 12 515 519
1 place_a 1 13 515 543
1 place_a 2 13 534 543
Attempts
Using the method from the two linked questions, I have constructed:
setkey(dt_arrive, txn_id, place)
setkey(dt_depart, txn_id, place)
dt_join <- dt_arrive[dt_depart,
                     {
                       idx = (i.arrival_minutes > arrival_minutes)
                       .(journey_in_id = journey_id[idx],
                         journey_out_id = i.journey_id,
                         journey_place_arrive = arrival_minutes[idx],
                         journey_place_depart = i.arrival_minutes
                       )
                     },
                     by = .EACHI]
But this gives everything from dt_depart, so it includes NAs in the result, which suggests a 'right join':
txn_id place journey_in_id journey_out_id journey_place_arrive journey_place_depart
1: 1 place_a NA 10 NA 489
2: 1 place_a NA 11 NA 507
3: 1 place_a 1 12 515 519
4: 1 place_a 1 13 515 543
5: 1 place_a 2 13 534 543
I've tried using nomatch=0 to force it to 'inner join', but this hasn't worked.
I can use complete.cases to remove the NA rows, but I was wondering if there's a way of doing this within the query itself?
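One possible way to keep the filtering inside the query, sketched here under the assumption that by = .EACHI drops any group whose j expression returns NULL: every dt_depart row does match on txn_id and place, so nomatch=0 has nothing to drop, and the NAs appear to come from the zero-length journey_id[idx] being padded out in j. Guarding the .() call with if (any(idx)) avoids that. The data is rebuilt with data.table() for brevity; this is an illustration, not a confirmed answer.

```r
library(data.table)

# Same data as above, rebuilt compactly (length-1 values are recycled)
dt_arrive <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(515, 534, 547, 561, 581),
                        journey_id = 1:5)
dt_depart <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(489, 507, 519, 543),
                        journey_id = 10:13)
setkey(dt_arrive, txn_id, place)
setkey(dt_depart, txn_id, place)

dt_join <- dt_arrive[dt_depart,
                     {
                       # arrivals strictly before this departure
                       idx <- i.arrival_minutes > arrival_minutes
                       # return nothing (NULL) when no arrival qualifies,
                       # so the departure row is left out of the result
                       if (any(idx))
                         .(journey_in_id = journey_id[idx],
                           journey_out_id = i.journey_id,
                           journey_place_arrive = arrival_minutes[idx],
                           journey_place_depart = i.arrival_minutes)
                     },
                     by = .EACHI]
```

This still evaluates j once per dt_depart row, so it only removes the NA rows; it does not reduce the work done by the join on txn_id and place.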
Here's the unclever approach: take the cross/Cartesian join, and then filter.
merge(dt_arrive, dt_depart, allow.cartesian=TRUE)[arrival_minutes.y > arrival_minutes.x]
# txn_id place arrival_minutes.x journey_id.x arrival_minutes.y journey_id.y
# 1: 1 place_a 515 1 519 12
# 2: 1 place_a 515 1 543 13
# 3: 1 place_a 534 2 543 13
By taking the Cartesian join, we're liable to eat up a lot of memory.
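A way to avoid the full Cartesian product, assuming an upgrade from 1.9.6 is possible: data.table 1.9.8 and later support non-equi joins through the on= argument, so the inequality can be part of the join condition itself and nomatch = 0L then behaves as an inner join. The sketch below rebuilds the data with data.table() for brevity; the x. and i. prefixes distinguish the dt_arrive and dt_depart columns in j.

```r
library(data.table)  # assumes data.table >= 1.9.8 (non-equi joins)

dt_arrive <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(515, 534, 547, 561, 581),
                        journey_id = 1:5)
dt_depart <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(489, 507, 519, 543),
                        journey_id = 10:13)

# In on=, the left-hand name is a dt_arrive (x) column and the right-hand
# name is a dt_depart (i) column: keep arrivals earlier than the departure.
res <- dt_arrive[dt_depart,
                 on = .(txn_id, place, arrival_minutes < arrival_minutes),
                 nomatch = 0L,  # drop departures with no earlier arrival
                 .(txn_id, place,
                   journey_in_id        = x.journey_id,
                   journey_out_id       = i.journey_id,
                   journey_place_arrive = x.arrival_minutes,
                   journey_place_depart = i.arrival_minutes)]
```

The explicit x.arrival_minutes is deliberate: in a non-equi join, the unprefixed join column in the result can carry the value from i rather than from x, so prefixing both sides keeps the two times unambiguous.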