Inner join using an inequality expression

Problem description

Background

(Not required for the question, but may be useful to read)

Rolling join on data.table with duplicate keys

Odd behaviour when joining with multiple conditions

Data

library(data.table)   ## using version 1.9.6
## arrival timetable
dt_arrive <- structure(list(txn_id = c(1L, 1L, 1L, 1L, 1L), place = c("place_a", 
"place_a", "place_a", "place_a", "place_a"), arrival_minutes = c(515, 
534, 547, 561, 581), journey_id = 1:5), .Names = c("txn_id", 
"place", "arrival_minutes", "journey_id"), class = c("data.table", 
"data.frame"), row.names = c(NA, -5L), sorted = c("txn_id", 
"place"))

## departure timetable
dt_depart <- structure(list(txn_id = c(1L, 1L, 1L, 1L), place = c("place_a", 
"place_a", "place_a", "place_a"), arrival_minutes = c(489, 507, 
519, 543), journey_id = 10:13), .Names = c("txn_id", "place", 
"arrival_minutes", "journey_id"), sorted = c("txn_id", "place"
), class = c("data.table", "data.frame"), row.names = c(NA, -4L
))

> dt_arrive
   txn_id   place arrival_minutes journey_id
1:      1 place_a             515          1
2:      1 place_a             534          2
3:      1 place_a             547          3
4:      1 place_a             561          4
5:      1 place_a             581          5

> dt_depart
   txn_id   place arrival_minutes journey_id
1:      1 place_a             489         10
2:      1 place_a             507         11
3:      1 place_a             519         12
4:      1 place_a             543         13

Question

I would like to join the arrivals to the departures, keeping only those pairs where the departure (dt_depart$journey_id) occurs after the arrival (dt_arrive$journey_id) in terms of arrival_minutes (i.e. an inner join on txn_id & place, restricted by an inequality on arrival_minutes).

For example, the output I would like is:

   txn_id   place journey_in_id journey_out_id journey_place_arrive journey_place_depart
      1     place_a          1             12                  515                  519
      1     place_a          1             13                  515                  543
      1     place_a          2             13                  534                  543

Attempts

Using the method from the two linked questions, I have constructed:

setkey(dt_arrive, txn_id, place)
setkey(dt_depart, txn_id, place)

dt_join <- dt_arrive[dt_depart,
            {
              idx = (i.arrival_minutes > arrival_minutes)
              .(journey_in_id = journey_id[idx],
                journey_out_id = i.journey_id,
                journey_place_arrive = arrival_minutes[idx],
                journey_place_depart = i.arrival_minutes
              )
            },
            by=.EACHI]

But this gives everything from dt_depart, so includes NAs in the result - which suggests a 'right join':

   txn_id   place journey_in_id journey_out_id journey_place_arrive journey_place_depart
1:      1  place_a         NA             10                   NA                  489
2:      1  place_a         NA             11                   NA                  507
3:      1  place_a          1             12                  515                  519
4:      1  place_a          1             13                  515                  543
5:      1  place_a          2             13                  534                  543

I've tried using nomatch=0 to force it to 'inner join', but this hasn't worked.

I can use complete.cases to remove the NA rows, but I was wondering if there's a way of doing this within the query itself?
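A possible explanation, offered as a sketch rather than a definitive account: nomatch only governs rows of dt_depart that find no match on the join keys (txn_id, place). Every departure row matches on those keys, so nothing is dropped; the NAs appear to come later, when idx is all FALSE and the empty columns are padded next to the length-one i. columns. Under that assumption, one way to keep the filtering inside the query is to return NULL from j for such groups, since data.table omits groups whose j evaluates to NULL. The tables are rebuilt with data.table() here purely for brevity, and on= is used in place of setkey.

```r
library(data.table)

# Same values as the question's data, rebuilt with data.table() for brevity
dt_arrive <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(515, 534, 547, 561, 581),
                        journey_id = 1:5)
dt_depart <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(489, 507, 519, 543),
                        journey_id = 10:13)

dt_join <- dt_arrive[dt_depart,
  {
    # arrivals that happen before the current departure row
    idx <- i.arrival_minutes > arrival_minutes
    # no qualifying arrival: the if() yields NULL and the group is dropped
    if (any(idx)) {
      list(journey_in_id        = journey_id[idx],
           journey_out_id       = i.journey_id,
           journey_place_arrive = arrival_minutes[idx],
           journey_place_depart = i.arrival_minutes)
    }
  },
  on = c("txn_id", "place"), by = .EACHI]
```

This keeps the per-row evaluation of by = .EACHI, so no full cross product is built, at the cost of evaluating j once per departure row.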

Solution

Here's the unclever approach: take the cross/Cartesian join, and then filter.

merge(dt_arrive, dt_depart, allow.cartesian=TRUE)[arrival_minutes.y > arrival_minutes.x]

#    txn_id   place arrival_minutes.x journey_id.x arrival_minutes.y journey_id.y
# 1:      1 place_a               515            1               519           12
# 2:      1 place_a               515            1               543           13
# 3:      1 place_a               534            2               543           13
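The call above relies on merge() picking up the shared key columns. A slightly more explicit variant, sketched here with the join columns and suffixes spelled out, also renames the result to the column names requested in the question. The suffix strings and the data.table() reconstruction of the inputs are choices made for this example, not part of the original answer.

```r
library(data.table)

# Same values as the question's data, rebuilt with data.table() for brevity
dt_arrive <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(515, 534, 547, 561, 581),
                        journey_id = 1:5)
dt_depart <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(489, 507, 519, 543),
                        journey_id = 10:13)

# Cross join within each txn_id/place group, then keep departures after arrivals
res <- merge(dt_arrive, dt_depart,
             by = c("txn_id", "place"),
             suffixes = c("_arrive", "_depart"),
             allow.cartesian = TRUE)[arrival_minutes_depart > arrival_minutes_arrive]

# Rename to the column names used in the desired output
setnames(res,
         old = c("journey_id_arrive", "journey_id_depart",
                 "arrival_minutes_arrive", "arrival_minutes_depart"),
         new = c("journey_in_id", "journey_out_id",
                 "journey_place_arrive", "journey_place_depart"))
```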

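Because the Cartesian join materialises every arrival/departure pair within each txn_id/place group before filtering, it is liable to eat up a lot of memory on large tables.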
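If upgrading is an option, a non-equi join may avoid building the cross product. This is a different technique from the answer above, and it assumes data.table 1.9.8 or later (the question uses 1.9.6, which predates non-equi joins). The sketch below puts the inequality directly in on= and uses nomatch = 0L to drop departures with no earlier arrival; the x. and i. prefixes select the arrival and departure columns respectively.

```r
library(data.table)  # assumes version >= 1.9.8 for non-equi joins

# Same values as the question's data, rebuilt with data.table() for brevity
dt_arrive <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(515, 534, 547, 561, 581),
                        journey_id = 1:5)
dt_depart <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(489, 507, 519, 543),
                        journey_id = 10:13)

# In on=, the left side of each condition is a dt_arrive (x) column and the
# right side is a dt_depart (i) column: arrival time < departure time
res <- dt_arrive[dt_depart,
                 on = .(txn_id, place, arrival_minutes < arrival_minutes),
                 nomatch = 0L,
                 .(txn_id, place,
                   journey_in_id        = x.journey_id,
                   journey_out_id       = i.journey_id,
                   journey_place_arrive = x.arrival_minutes,
                   journey_place_depart = i.arrival_minutes)]
```

The explicit x. and i. prefixes are used because both tables share the column names arrival_minutes and journey_id, so unprefixed names would be ambiguous to a reader.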
