Inner join using an inequality expression

Published: 2017/3/12 10:47:08. Tags: r, data.table


Background

(Not required for the question, but may be useful to read)

Rolling join on data.table with duplicate keys

Odd behaviour when joining with multiple conditions

Data
    library(data.table) ## using version 1.9.6

    ## arrival timetable
    dt_arrive <- structure(list(txn_id = c(1L, 1L, 1L, 1L, 1L),
                                place = c("place_a", "place_a", "place_a", "place_a", "place_a"),
                                arrival_minutes = c(515, 534, 547, 561, 581),
                                journey_id = 1:5),
                           .Names = c("txn_id", "place", "arrival_minutes", "journey_id"),
                           class = c("data.table", "data.frame"),
                           row.names = c(NA, -5L),
                           sorted = c("txn_id", "place"))

    ## departure timetable
    dt_depart <- structure(list(txn_id = c(1L, 1L, 1L, 1L),
                                place = c("place_a", "place_a", "place_a", "place_a"),
                                arrival_minutes = c(489, 507, 519, 543),
                                journey_id = 10:13),
                           .Names = c("txn_id", "place", "arrival_minutes", "journey_id"),
                           sorted = c("txn_id", "place"),
                           class = c("data.table", "data.frame"),
                           row.names = c(NA, -4L))

    > dt_arrive
       txn_id   place arrival_minutes journey_id
    1:      1 place_a             515          1
    2:      1 place_a             534          2
    3:      1 place_a             547          3
    4:      1 place_a             561          4
    5:      1 place_a             581          5

    > dt_depart
       txn_id   place arrival_minutes journey_id
    1:      1 place_a             489         10
    2:      1 place_a             507         11
    3:      1 place_a             519         12
    4:      1 place_a             543         13

Question

I would like to join the arrivals to the departures for only those dt_depart$journey_id that occur after dt_arrive$journey_id in terms of arrival_minutes (i.e. an inner join on txn_id & place)

For example, the output I would like is:

    txn_id   place journey_in_id journey_out_id journey_place_arrive journey_place_depart
         1 place_a             1             12                  515                  519
         1 place_a             1             13                  515                  543
         1 place_a             2             13                  534                  543

Attempts

Using the method from the two linked questions I have constructed

    setkey(dt_arrive, txn_id, place)
    setkey(dt_depart, txn_id, place)

    dt_join <- dt_arrive[dt_depart,
                         {
                           idx = (i.arrival_minutes > arrival_minutes)
                           .(journey_in_id = journey_id[idx],
                             journey_out_id = i.journey_id,
                             journey_place_arrive = arrival_minutes[idx],
                             journey_place_depart = i.arrival_minutes
                           )
                         },
                         by=.EACHI]

But this gives everything from dt_depart, so includes NAs in the result - which suggests a 'right join':

       txn_id   place journey_in_id journey_out_id journey_place_arrive journey_place_depart
    1:      1 place_a            NA             10                   NA                  489
    2:      1 place_a            NA             11                   NA                  507
    3:      1 place_a             1             12                  515                  519
    4:      1 place_a             1             13                  515                  543
    5:      1 place_a             2             13                  534                  543

I've tried using nomatch=0 to force it to 'inner join', but this hasn't worked.

I can use complete.cases to remove the NA rows, but I was wondering if there's a way of doing this within the query itself?

Solution

Here's the unclever approach: take the cross/Cartesian join, and then filter.

    merge(dt_arrive, dt_depart, allow.cartesian=TRUE)[arrival_minutes.y > arrival_minutes.x]

    #    txn_id   place arrival_minutes.x journey_id.x arrival_minutes.y journey_id.y
    # 1:      1 place_a               515            1               519           12
    # 2:      1 place_a               515            1               543           13
    # 3:      1 place_a               534            2               543           13

By taking the Cartesian join, we're liable to eat up a lot of memory.
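
As a possible alternative that avoids materialising the full Cartesian product, here is a minimal sketch using a non-equi join. It assumes data.table 1.9.8 or later, where `on=` accepts inequality conditions and the `x.` prefix is available in `j`; the question itself uses 1.9.6, so this is not something the original poster could have run as-is. The tables are rebuilt with `data.table()` purely for self-containment, and the name `res` is illustrative.

```r
library(data.table)  # assumption: version >= 1.9.8 (non-equi joins)

# Same data as in the question, rebuilt with data.table() for brevity
dt_arrive <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(515, 534, 547, 561, 581),
                        journey_id = 1:5)
dt_depart <- data.table(txn_id = 1L, place = "place_a",
                        arrival_minutes = c(489, 507, 519, 543),
                        journey_id = 10:13)

# Equi-join on txn_id and place; keep only arrivals strictly before the departure.
# In on=, the left-hand side names a column of x (dt_arrive) and the
# right-hand side names a column of i (dt_depart).
res <- dt_arrive[dt_depart,
                 on = .(txn_id, place, arrival_minutes < arrival_minutes),
                 nomatch = 0L,  # drop departures with no earlier arrival
                 .(txn_id, place,
                   journey_in_id        = x.journey_id,
                   journey_out_id       = i.journey_id,
                   journey_place_arrive = x.arrival_minutes,
                   journey_place_depart = i.arrival_minutes)]
print(res)
```

The `x.` and `i.` prefixes are used explicitly because both tables share the column names `arrival_minutes` and `journey_id`. Here `nomatch = 0L` does remove the unmatched departures, since the inequality is part of the join condition rather than a filter applied inside `j`.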
